205 Encyclopedia Construction


This week’s quiz comes from Tim Hunter. Thank you Tim for the great write up!

When I was a kid we had an encyclopedia. It had thousands of articles in alphabetical order, divided into about 25 volumes. Sometimes there were so many articles with the same initial letter (”M”, for example) that an entire volume was dedicated to articles with that initial letter. In some cases, though, there weren’t enough articles with the same initial letter to devote an entire volume to them, so those articles were grouped in with alphabetically adjacent articles in the same volume. For example, all the articles starting with W, X, Y, and Z were in the same volume. Of course, all the articles that started with the same initial letter were always in the same volume. Each volume was labeled with the initial character of the articles it contained. For example, the “M” volume was simply labeled “M”. If the articles were spread over more than one initial character, the label displayed the first and last characters in the range. The W, X, Y, and Z volume was labeled “W-Z”.

The quiz is to devise a way of grouping a list of encyclopedia articles into volumes. Since this is Ruby Quiz, we’ll say that for “articles” we have a list of about 100 words composed of alphabetical characters, a-z. A “volume” is an array.

Sort the list and then put the words into arrays using the following rules.

  1. All the words with the same initial character must be in the same array.
  2. If an array contains words with different initial characters, the words must be alphabetically adjacent to each other. That is, if “cat”, “dog”, and “elephant”, are in the list of articles, and you put “cat” and “elephant” in the same array, then “dog” must also be in that array.
  3. Without breaking rules 1 and 2, divide the words into about 10 arrays.
  4. Without breaking rules 1 and 2, each of the arrays should contain about the same number of words.

To display the encyclopedia, make a hash with one element for each array. The element value is the array. The element key is a single letter (”M”) when the array only contains words with the same initial character, and a range (“W-Z”) when the array contains words with more than one initial character. Print the hash with the “pp” command.

Here’s the list of words:

ape apple apricot aquamarine architect artichoke azure baboon badger banana bat battleship beeswax beetles black blogger blue bricklayer Brigadoon card cat cherry chokeberry coconut coral crimson currant dam debate deliberations demagogue dog durian elephant empathy encyclopedia fig fox gazelle giraffe goldenrod gray hippopotamus huckleberry ibex imprint indigo ivory jab jackrabbit jake khaki kiwi kudzu lavender lemming leopard lime loganberry mango maroon memory mole monkey mouse muskox Nigeria Navajo olive opera panther party peach pear penguin persimmon plum poet politician privy programmer rabbit red research rhino Rumpelstiltskin salmon silver snow soldier songsmith squirrel starship steel strawberry student tan tapir theater thesaurus tree turquoise umbrella warthog waterloo wildebeest wolf woodchuck xylophone yahoo yellow zebra


There were three submitted solutions to this week’s quiz. Srijayanth Sridhar provided the following:

#! /usr/bin/ruby

require 'pp'

words=File.readlines("../words.txt").map { |word| word.chomp.capitalize }
dict = {}
words.each do |word|
   dict[word[0..0]] ||= []
   dict[word[0..0]] << word

sub_array = []
main_dict =  {}
dict.keys.sort.each do |alphawords|
   sub_array += dict[alphawords]
   count += sub_array.flatten.size
   if ( count >= 10 )
      alpha=sub_array.flatten.map { |x| x[0..0] }.uniq
      alpha.size > 1 ? key=Range.new(alpha.min,alpha.max) : alpha[0]
      main_dict[key] = sub_array.flatten
      count = 0

# Attach last remaining words
alpha=sub_array.flatten.map { |x| x[0..0] }.uniq
alpha.size > 1 ? key=Range.new(alpha.min,alpha.max) : alpha[0]
main_dict[key] = sub_array.flatten
count = 0

pp main_dict

Srijayanth’s solution works by building a dictionary of all the words grouped by first letter. Starting from the beginning the words in the next letter are added until the total number of words in the current volume is greater than or equal to count (10 by default).

The advantage of this solution is that it is simple to implement and only needs to make one pass through the lists of words. One potential weakness is that if the words in the second to last letter added and it goes above the threshold, then the words beginning with the last letter won’t have any possibility to be merged in and could be left as a very small volume.

Luke Cowell’s solution repeatedly merges the smallest adjacent pairs until the requested number of volumes is reached. This is a simpler technique and produces pretty good results as long as the input data is well behaved.

Frank Fischer’s solution uses dynamic programming. Dynamic programming is a method of solving complex problems by breaking them down into simpler steps. There are two key attributes that a problem must have in order for dynamic programming to be applicable: optimal substructure and overlapping subproblems. Optimal substructure means that the solution to a given optimization problem can be obtained by the combination of optimal solutions to its subproblems. Overlapping subproblems means that the space of subproblems must be small, that is, any recursive algorithm solving the problem should solve the same subproblems over and over, rather than generating new subproblems.

Frank’s solution provides different methods to optimize the arrangement of the volumes.

# divide into +nbooks+ books by minimizing the maximal number
# of words in one book
def min_maximum( nbooks )
    dyn_min( nbooks ) { |s, rest| [s, rest || 0].max }


# divide into +nbooks+ books by minimizing the quadratic deviation
# from the average number of words in one book
def min_deviat2( nbooks )
    mean = sum( 0 ... @nletters.size ).to_f / nbooks
    dyn_min( nbooks ) { |s, rest| (rest || 0) + (s - mean) ** 2 }

Each of these methods passes a different block to the dyn_min method depending on what objective you wish to optimize. The dyn_min method iterates through the possible solutions and stores the best combinations of volumes according to the metric passed in.

Thank you Srijayanth, Frank, and Luke for your contributions this week.

Encyclopedia Construction (#205) - Solutions

Sunday, May 17, 2009