Tuesday, July 27, 2010

Huffman encoding tree. SICP 2.68 - 2.72

 I can't help but notice that the further I get into sicp, the more of the blogs chronicling other peoples progress on it seem to just drop off, which is unfortunate. Honestly, if I was to post detailed solutions to everything, I doubt I would keep at it myself either.

 Anyway, the huffman encoding trees stuff was pretty cool, so I'll post my solutions here.

 First, some general implementation details for the trees(as always looking in the text for the explanations):

 This is all straight out of sicp.

(define (make-leaf symbol weight)
(list 'leaf symbol weight))

(define (make-leaf symbol weight)
  (list 'leaf symbol weight))
        
(define (leaf? object)
  (eq? (car object) 'leaf))

(define (symbol-leaf x) (cadr x))

(define (weight-leaf x) (caddr x))

(define (make-code-tree left right)
  (list left
        right
        (append (symbols left) (symbols right))
        (+ (weight left) (weight right))))

(define (left-branch tree)
  (car tree))

(define (right-branch tree)
  (cadr tree))

(define (symbols tree)
  (if (leaf? tree)
      (list (symbol-leaf tree))
      (caddr tree)))

(define (weight tree)
  (if (leaf? tree)
      (weight-leaf tree)
      (cadddr tree)))

(define (decode bits tree)
  (define (decode-1 bits current-branch)
    (if (null? bits)
        '()
        (let ((next-branch
               (choose-branch (car bits) current-branch)))
          (if (leaf? next-branch)
              (cons (symbol-leaf next-branch)
                    (decode-1 (cdr bits) tree))
              (decode-1 (cdr bits) next-branch)))))          
  (decode-1 bits tree))

(define (choose-branch bit branch)
  (cond ((= bit 0) (left-branch branch))
        ((= bit 1) (right-branch branch))
        (else (error "bad bit -- CHOOSE-BRANCH" bit))))

Also, as a "setup" sicp supplies a couple of functions for working with sets, to which I add one(element-of-set, which checks if a pair corresponding to a symbol is present)

(define (adjoin-set x set)
  (cond ((null? set) (list x))
        ((< (weight x) (weight (car set))) (cons x set))
        (else (cons (car set)
                    (adjoin-set x (cdr set))))))

(define (element-of-set? x set)
  (cond ((null? set) #f)
        ((eq? (symbol x) (symbol (car set))) #t)
        (else (element-of-set? x (cdr set)))))

(define (make-leaf-set pairs)
  (if (null? pairs)
      '()
      (let ((pair (car pairs)))
        (adjoin-set (make-leaf (car pair)
                               (cadr pair))
                    (make-leaf-set (cdr pairs))))))

SICP 2.68 encoding a message:

 Idea here is to encode one symbol at a time from the tree. SICP supplies us with encode:

(define (encode message tree)
  (if (null? message)
      '()
      (append (encode-symbol (car message) tree)
              (encode (cdr message) tree))))

And here's my implementation of encode-symbol:

(define (encode-symbol sym tree)
  (define (get-encoding branch)
    (cond ((leaf? branch) '())
          ((element-of-set? sym
                            (symbols (left-branch branch)))
           (cons 0 (get-encoding (left-branch branch))))
          ((element-of-set? sym
                            (symbols (right-branch branch)))
           (cons 1 (get-encoding (right-branch branch))))))
  (if (element-of-set? sym (symbols tree))
      (get-encoding tree)
      (error "Symbol not part of encoding set")))
 I start out with a simple check that the symbol is part of the set. From there we check which branch the element is, and work down the tree, consing on the relevant bits until we hit a leaf which returns a null and completes the encoded symbol.

  SICP 2.69 generating a huffman tree:

  Here's the real meat, and a surprisingly simple case once you get to it. First as part of the problem definition we're given generate-huffman-tree and told to implement successive-merge.

(define (generate-huffman-tree pairs)
  (successive-merge (make-leaf-set pairs)))
Here's my take on successive-merge:

(define (successive-merge pairs)
  (if (= (length pairs) 1)
      (car pairs)
      (successive-merge
       (adjoin-set (make-code-tree (cadr pairs)
                                   (car pairs))
                   (cddr pairs)))))
Note that as the pairs are sorted, we need only grab the smallest(first two) pairs, join them together into a new sub-tree, and then place it in the correct position in the list(according to its weight. This is achieved with the modified adjoin-set detailed earlier).

SICP 2.70 a huffman song:

 Not much to this question. Apply the routines to some sample data.

(define song-tree
  (generate-huffman-tree
   '((A 2) (BOOM 1) (GET 2) (JOB 2) (NA 16) (SHA 3) (YIP 9) (WAH 1))))

(length (encode '(get a job
                      sha na na na na na na na na
                      get a job 
                      sha na na na na na na na na
                      wah yip yip yip yip yip yip yip yip yip
                      sha boom) song-tree))
;Value: 84
As we can see it comes out to 84 bits. As a fixed-length code each word would be 3 bits, and with 36 words, we would get a grand total of 108 bits, which of course is longer.

SICP 2.71 and 2.72 some theory

 It should be quite apparent that if we form a tree in a pattern of 1,2,4,8,16... that each sub-tree formed would be merged one at a time with the next item, giving a tree of n-1 levels. The most frequent symbol would be a single access away, while the least would be n-1 levels away(since the final level has 2 leaves, not 1). All remaining items would be on the nth level.

 The interesting thing though is, that as we're dealing with a sorted list, to get the first(most frequent) item, it is right at the end of the list. We thus end up going through every entry of (symbols tree) to find it, but only once, giving a complexity of n. The final entry on the other hand is at the very head of the list, and thus immediately found each time. However it has to go through n levels of the tree, also giving an order of growth n, though we can assume there would be more overhead from cons and whatnot.

No comments: