171 hexdump

Description

Quiz idea provided by Robert Dober.

This week’s quiz should be quick and easy for experienced Rubyists, and a good lesson for beginners. Your task this week is to write a utility that outputs a hex dump of the input.

There are a number of hex dump utilities in existence that go by the names hd, od, hexdump… I’m sure there are more. Pick one you’d like to reproduce. If you’re on any variety of Unix or BSD (including Mac OS X), you can get man pages from the command-line to see how they work. On Windows, if you don’t have one installed, you can check out this man page for hexdump and use that as a model.

You are not required to implement all the various command-line switches, but I should be able to run your script on a file and, as a minimum, see output resembling this:

0000000 6573 2074 6c68 0a73 7973 746e 7861 6f20
0000010 0a6e 6f63 6f6c 7372 6863 6d65 2065 6564
0000020 6573 7472 0a0a 6573 2074 7865 6170 646e
0000030 6174 0a62 6573 2074 6174 7362 6f74 3d70
0000040 0a32 6573 2074 6873 6669 7774 6469 6874
0000050 323d 220a 6573 2074 6574 7478 6977 7464
0000060 3d68 3836 0a0a 2022 2051 6f63 6d6d 6e61
0000070 2064 6f74 7220 6665 726f 616d 2074 6170
0000080 6172 7267 7061 7368 6120 646e 6c20 7369
0000090 2e74 6e0a 6f6e 6572 616d 2070 2051 7167
00000a0 0a7d 0a0a                              
00000a4

Your submission should accept input either from a named file (part of the command-line arguments) or from standard input if no filename is provided.

Finally, when submitting, make sure to describe what existing hex dump program you are emulating/reproducing (if any), and what arguments to your script are needed, if any, to produce the basic output above.

Summary

When learning a new programming language, the first thing many coders do is write the traditional “Hello, world!” program. This generally provides the bare minimum needed for coding: base program structure, compilation if needed… In Ruby, this is very bare, as puts "Hello, world!" is sufficient. (See quiz #158 for some non-traditional versions.)

What also seems a tradition is the question, “What should I program now?” after “Hello, world!” is output to the console. New coders are looking for something to try, to expand their skills, without becoming overwhelmed. Often, I find, the easiest way to do this is to reproduce an existing program. You can focus on learning the new language and implementing an existing design, rather than coming up with something novel.

This week’s quiz was chosen with this in mind; it is a good project for new Rubyists, to dive into the language a bit without drowning. Hex dump utilities have been around for ages, and there are plenty of them, so we don’t have to think about implementing anything new; rather, we can focus on learning Ruby. And writing a hex dump program lets you deal with files, strings, arrays and output: some of the basics of any code.

I’m going to look at parts from each of the few solutions, to highlight some of the things you should know as a Rubyist. If you’re new to Ruby, you might consider trying the quiz first before reading this summary and the submissions. Then, after reading this summary, revise and refactor your solution to be leaner and cleaner.

First, let’s look at the non-golfed (and slightly modified) submission from Mikael Hoilund. It’s short, but dense with good Ruby-isms.

i = 0
ARGF.read.scan(/.{0,16}/m) { |match|
  puts(("%08x " % i) + match.unpack('H4'*8).join(' '))
  i += 16
}

ARGF is a special constant. It isn’t a file, but can be treated as such (as seen above, via the call to the IO#read method). It will sequentially read through all files provided on the command-line or, if none are provided, will read from standard input. It works together with ARGV, the array of arguments provided to your program, expecting that all values in ARGV are filenames. If you happen to have a script that also expects command-line options (such as --help), just make sure to process and/or remove them from ARGV before using ARGF.
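As a minimal sketch of that last point, the helper below strips recognized flags out of an argument list so that only filenames are left for ARGF. The name extract_flags! and the flags --help and --verbose are hypothetical, chosen only for illustration; they are not taken from any of the submissions.

```ruby
# Hypothetical helper: remove recognized flags from an argument list,
# leaving only the entries that ARGF should treat as filenames.
def extract_flags!(argv, known = %w[--help --verbose])
  flags, rest = argv.partition { |arg| known.include?(arg) }
  argv.replace(rest) # mutate in place so the caller's array (e.g. ARGV) is updated
  flags
end

# Intended use in a script (left as comments so nothing reads stdin here):
#   flags = extract_flags!(ARGV)
#   data  = ARGF.read   # remaining ARGV entries are files; stdin if none
```

Because the flags are removed before the first read, ARGF never tries to open something like --verbose as a file.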

String#scan finds instances of the provided pattern in the source string. In this case, Mikael is using a regular expression that grabs up to 16 characters (i.e. bytes) at a time, including newlines. (The m in the regular expression indicates a multi-line match, in which newline characters are treated like any other character, rather than terminators.)

String#scan can return an array of matches, but it can also be used in block form, as shown above: the block is called once per match, with the matched text passed in the argument match.
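The two forms of String#scan can be compared side by side in a small sketch. The sample string is made up, and the pattern here uses {1,16} rather than Mikael’s {0,16}, an assumption on my part so that no empty match is produced at the end of the input.

```ruby
data = "abcdefghij\nklmnopqrstuvwxyz" # 27 bytes, with a newline in the middle

# Array form: every match is collected and returned.
chunks = data.scan(/.{1,16}/m)

# Block form: each match is yielded in turn, so a running offset can be kept.
offsets = []
offset = 0
data.scan(/.{1,16}/m) do |chunk|
  offsets << offset
  offset += chunk.size
end

p chunks  # ["abcdefghij\nklmno", "pqrstuvwxyz"]
p offsets # [0, 16]
```

Without the m flag, the dot would stop at the newline and the first chunk would be cut short.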

Another trick here is replication. These aren’t really “tricks”, as they are standard functions defined on the class, but they can certainly save typing and keep the code clearer. Try these in irb:

> 'H4' * 8
=> "H4H4H4H4H4H4H4H4"

> [1, 2, 3] * 2
=> [1, 2, 3, 1, 2, 3]

String#unpack is a powerful function for handling raw data. It uses a format string (e.g. “H4H4H4H4H4H4H4H4”) to decode the raw data. In this case, H4 indicates that four nybbles (i.e. two bytes) should be decoded from the string as hex digits. Doing that eight times decodes 16 bytes, which is how much we are reading at a time in Mikael’s code above.

String#unpack (and the reverse Array#pack) can do a lot of work in short order. It just takes a bit of practice to understand, and easy access to the formats table. (On the command-line, type: ri String#unpack.)
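To get a feel for the format strings, here is a small sketch using the first eight bytes of the sample input from the quiz description (“set hls” plus a newline). It also tries the v directive (16-bit little-endian integers), which is not part of Mikael’s code; I include it only because it happens to reproduce the byte-swapped word order seen in the sample output.

```ruby
chunk = "set hls\n" # first eight bytes of the sample dump

# H4: four hex digits (two bytes), high nybble first, in file order.
in_order = chunk.unpack('H4' * 4)

# v: 16-bit little-endian words, formatted back to hex by hand.
swapped = chunk.unpack('v*').map { |word| "%04x" % word }

# Array#pack reverses the process.
round_trip = in_order.pack('H4' * 4)

p in_order   # ["7365", "7420", "686c", "730a"]
p swapped    # ["6573", "2074", "6c68", "0a73"]
p round_trip # "set hls\n"
```

Note that a chunk with an odd number of bytes would need separate handling in the v variant.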

Finally, take a quick look at Mikael’s golfed solution. Aside from squeezing everything together, it makes use of some special globals: $< (equivalent to ARGF) and $& (evaluates to the current match from scan, eliminating the need for the match parameter to the block). Globals like this can certainly make it more fun to “golf” (i.e. the deliberate shrinking and obfuscation of a program), but aren’t recommended for clarity.

Robert Dober provides a clean, straightforward solution that needs little explanation. Make sure to look at the whole of it, while I briefly examine his output method.

require 'enumerator'

BYTES_PER_LINE = 0x10

def output address, line
  e = line.enum_for :each_byte
  puts "%04x %-#{BYTES_PER_LINE*3+1}s %s" % [ address, 
    e.map{ |b|  "%02x" % b }.join(" "),
    e.map{ |b|  
      0x20 > b || 0x7f < b  ? "." : b.chr
    }.join ]
end

The most useful bit here is the enumerator module and its enum_for method, which returns an Enumerable::Enumerator object. This object provides a number of ways to access the data. Here, Robert accesses it one byte at a time, having passed the argument :each_byte. Enumerators, of course, are not required to process each byte of the source string: a couple of calls to each_byte could have done that as well. But the enumerator is a convenient package: it can be used multiple times, it can be used as an Enumerable, and it removes redundancy, all shown above.
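For readers on Ruby 1.9 or later, the same idea can be sketched without the require: there, each_byte called without a block returns an Enumerator directly. The sample string is made up; the printable-character test is copied from Robert’s method.

```ruby
line = "Hi!\n"

# No block given, so each_byte hands back an Enumerator (Ruby 1.9+).
bytes = line.each_byte

# The same enumerator is reused for both columns.
hex   = bytes.map { |b| "%02x" % b }.join(" ")
ascii = bytes.map { |b| 0x20 > b || 0x7f < b ? "." : b.chr }.join

puts "#{hex}  #{ascii}" # 48 69 21 0a  Hi!.
```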

Enumerators also give access to other ways to enumerate. What if you want to get three objects at a time from a collection, either disjoint or overlapping? You can use :each_slice for disjoint groups and :each_cons for overlapping ones.

> x = [1, 2, 3, 4, 5]
=> [1, 2, 3, 4, 5]

> x.enum_for(:each_cons, 3).to_a
=> [ [1, 2, 3], [2, 3, 4], [3, 4, 5] ]

> x.enum_for(:each_slice, 3).to_a
=> [ [1, 2, 3], [4, 5] ]

(Note that there are some changes going on with enumerators between Ruby 1.8.6 and 1.9; here is some good information on the changes in Ruby 1.9).
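Slicing applies naturally to the quiz itself. The following is a rough sketch, assuming Ruby 2.0 or later (where String#bytes returns an array and each_slice without a block returns an enumerator); dump_lines is a hypothetical name, not code from any submission.

```ruby
# Hypothetical helper: split the input into rows of `width` bytes and
# format each row as an offset followed by two-digit hex values.
def dump_lines(data, width = 16)
  data.bytes.each_slice(width).with_index.map do |slice, row|
    "%07x %s" % [row * width, slice.map { |b| "%02x" % b }.join(" ")]
  end
end

puts dump_lines("ABC")
# 0000000 41 42 43
```

Each row’s offset falls out of the row index, so no separate counter is needed.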

Now we look briefly at Adam Shelly’s solution, in particular his command-line option handling.

width = 16
group = 2
skip = 0
length = Float::MAX
do_ascii = true
file = $stdin

while (opt = ARGV.shift)
  if opt[0] == ?-
    case opt[1]
    when ?n
      length = ARGV.shift.to_i
    when ?s
      skip = ARGV.shift.to_i
    when ?g
      group = ARGV.shift.to_i
    when ?w
      width = ARGV.shift.to_i
    when ?a
      do_ascii = false
    else
      raise ArgumentError, "invalid Option #{opt}"
    end
  else
    file = File.new(opt)
  end
end

ARGV.shift is a common pattern. It removes the first item from ARGV and returns it. Doing the assignment and while-loop test in one motion with ARGV.shift is a simple way to look at all the command-line arguments.

Adam’s arguments to his hexdump program are expected to be a single character preceded by a single dash. The question-mark notation (e.g. ?n) returns the integer ASCII value of the character immediately following. Single-character array access (e.g. opt[1]) likewise returns an integer ASCII value. (Note: this also differs in 1.9, where both return a one-character string instead.) So by checking the first two characters of an argument pulled from ARGV against the dash character and the various options implemented, Adam can replace the default values provided at the top.
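If you would rather not depend on which of the two behaviours your Ruby has, one option is the two-argument form of String#[], which returns a substring in both 1.8 and 1.9. This is only a sketch, and flag_letter is a hypothetical helper name, not part of Adam’s solution.

```ruby
# Hypothetical helper: return the option letter of a "-x" style argument
# as a one-character string, or nil if the argument is not an option.
def flag_letter(opt)
  return nil unless opt[0, 1] == "-" && opt.size > 1
  opt[1, 1] # str[start, length] always yields a String
end

p flag_letter("-n")       # "n"
p flag_letter("file.txt") # nil
```

The when clauses would then compare against plain strings such as "n" rather than ?n.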

For a quick-and-dirty script, handling options in such a way is simple and convenient. For more complex option handling, you would do well to make use of the standard optparse module, or the third-party main library.
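As a rough sketch of what the optparse route might look like, here are the same five single-letter options expressed with OptionParser. The method name parse_options, the banner and the option descriptions are my own guesses at intent, not taken from Adam’s code; length defaults to nil here (meaning no limit) instead of Float::MAX.

```ruby
require 'optparse'

# Hypothetical helper: parse hexdump-style options from an argument list.
# Returns the option hash plus whatever non-option arguments remain.
def parse_options(argv)
  opts = { width: 16, group: 2, skip: 0, length: nil, ascii: true }
  parser = OptionParser.new do |o|
    o.banner = "Usage: hexdump.rb [options] [file]" # script name is assumed
    o.on("-n LENGTH", Integer, "number of bytes to dump") { |v| opts[:length] = v }
    o.on("-s SKIP",   Integer, "number of bytes to skip") { |v| opts[:skip] = v }
    o.on("-g GROUP",  Integer, "bytes per group")         { |v| opts[:group] = v }
    o.on("-w WIDTH",  Integer, "bytes per line")          { |v| opts[:width] = v }
    o.on("-a", "turn off the ASCII column")               { opts[:ascii] = false }
  end
  files = parser.parse(argv) # non-option arguments are returned
  [opts, files]
end

opts, files = parse_options(%w[-w 8 -a input.bin -n 32])
p opts  # {:width=>8, :group=>2, :skip=>0, :length=>32, :ascii=>false} (newer Rubies print this as {width: 8, ...})
p files # ["input.bin"]
```

You also get integer conversion, a generated --help listing, and an OptionParser::InvalidOption error for unknown switches without writing any of it yourself.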

That’s it for this week! Thanks for the submissions; I certainly learned a few things myself. (I can’t believe I didn’t know about ARGF…)


Wednesday, February 04, 2009