172 Code Heuristics

Description

This week, your task is to make my job simpler.

Each week, coders send in their submissions to Ruby Quiz problems, usually as a mix of quiz discussion and actual code. Your task, then, is to take a submission as input, and generate output that is the extracted code. Every line of the input that isn’t code should be prefixed with the comment marker, #.

Take the following submission to quiz #158:

I may have missed it, but I didn't see this solution:

require 'enumerator'
"Hello world!\n".enum_for(:each_byte).inject($stdout) { |res, b|
  res << b.chr
}

Which in 1.9 looks much better:

"Hello world!\n".each_char.inject($stdout) { |res, c|
  res << c
}

Stefano

Your code should take that as input and generate this output:

# I may have missed it, but I didn't see this solution:

require 'enumerator'
"Hello world!\n".enum_for(:each_byte).inject($stdout) { |res, b|
  res << b.chr
}

# Which in 1.9 looks much better:

"Hello world!\n".each_char.inject($stdout) { |res, c|
  res << c
}

# Stefano

Assuming you knew for certain which lines should be commented and which should not, your submission would be little more than a glorified cat. This quiz is more about identifying Ruby-looking text: what heuristics or patterns will you use to determine what lines or groups of lines are Ruby code or not.

Fortunately, there is a large sample set from which you can experiment: all existing Ruby Quiz submissions. Unfortunately – for this quiz, not in general – Ruby is very dynamic and flexible, which can make code identification a difficult problem. So I am not expecting perfection here, but I want to see what tricks, or general process, you can come up with for this problem.

Extra credit: Provide an optional command-line argument to “split” the output into multiple files. Each separate file would be a different chunk of the code, where code is separated by discussion. Commented discussion should be kept with the code that follows it.

Summary

To begin evaluating the solutions for this quiz, I ran each solution over the original submission text of each submission. In a way, this would probably be the most difficult test, seeing as how the solutions discussed themselves, with sample inputs and outputs. What parts should be considered code and which text? If I can’t figure it out, I can’t really expect an algorithm to do that. Still, it was interesting to see the results; in particular, what I did want to see, if nothing else, was that the actual code for the submission was identified as code.

And, except for one line, both worked well at identifying the main body of code as code.. (The solution provided by Jesse Brown identified it’s own last line – out.close – as text rather than code.) The results for the discussion portions of the submissions were mixed, mostly from false positives. After a quick examination, though, I believe Frederick Cheung was more consistent in differentiating code from commentary.

Perhaps the reason was the approach. As Frederick described:

I got lazy and thought I’d let ruby do the hard work. Given some text, I feed it through eval and see if an exception is raised or not (this does have limitations).

This seems to be a generally good approach. Most discussion will contain words or punctuation that I would not expect to pass safely through Ruby’s parser, and so the number of incorrect categorizations of text should be minimal. Frederick also notes the sorts of cases that are identified incorrectly as code; a more complete solution, perhaps integrating some ideas from solutions such as Jesse’s, would accommodate those classes. (Anyone want to write a meta-classifier?)

Let’s take a look at Frederick’s solution. We start with initialization:

class CodeExtractor
   attr_reader :lines, :output

   def initialize(text)
     @output = []
     @lines = text.split(/[\r\n]/)
   end

   def extract
     while lines.any?
       process lines
     end
   end

   # ...

   def self.extract(text)
     c= CodeExtractor.new text
     c.extract
     c.format_output
     nil
   end
end

# Here's how I used Frederick's CodeExtractor class:
CodeExtractor.extract(ARGF)

Pretty straightforward. A new instance of CodeExtractor is created via the extract class method, the input text is split on newline and carriage-return characters and saved in @lines. Then the instance method extract is called, which repeatedly calls process on all remaining lines (via the accessor created by attr_reader) until none remain.

So what does it mean to process?

   def process lines
     if (prefix_length = valid_code_prefix_length lines) > 0
       prefix_length.times { append_output lines.shift, true }
     else
       append_output lines.shift, false
     end
   end

   def append_output(line, code)
     @output << Struct::Line.new(line, code)
   end

A test is performed: valid_code_prefix_length (we’ll come back to that). Assuming the test result is positive, we remove that many lines from @lines and append them to @output (via lines.shift and the append_output method). Each line of text moved in this process is wrapped in the Line structure, which includes a flag indicating that these lines are code.

When valid_code_prefix_length returns zero, only the first line is moved as such, the flag set to indicate that it is not code. Eventually, all lines will have been moved from @lines to @output, identified either as code or not.

Let’s continue, looking next at valid_code_prefix_length:

   # returns the maximum number of lines (contiguous from
   # the start) that are valid ruby
   def valid_code_prefix_length lines
     max_valid_lines = 0
     code = ""
     lines.each_with_index do |line, index|
       code << line
       code << "\n"
       if valid_syntax? code
         max_valid_lines = index + 1
       end
     end
     return max_valid_lines
   end

valid_code_prefix_length takes as input an array of lines; in this case, the input data we are processing. Adding a line at a time to local variable code, it calls valid_syntax? to make a guess as to whether code is truly code. The loop (each_with_index) continues in this fashion, increasing max_valid_lines each time around, until valid_syntax? returns false. At that time, max_valid_lines is returned, indicating how many adjacent lines of code were found.

Finally, we get to valid_syntax?, the core of Frederick’s code identification.

   def valid_syntax?(code)
     io = StringIO.new
     original_err, $stderr= $stderr, io
     eval("BEGIN {return true}\n#{code}")
     raise 'here'
   rescue Exception
     false
   ensure
     $stderr = original_err
     return false if io.length > 0
   end

The first bit here is to create a StringIO object: an object that acts like an IO stream, but holds a string. This is assigned to $stderr (while the old value is simultaneously remembered in original_err). Reassigning $stderr allows us to examine any errors that occur when we call eval on the next line. You can also see, at the end of valid_syntax?, that the original value of $stderr is restored, whether any error occurs or not.

So what is going on in that eval? Obviously, the code to test is included, but so is BEGIN { return true }. I admit… I had to search the manual for this one. What BEGIN does is to evaluate its block before anything else in the file (or, in this case, the string) is evaluated. Since the block is simply return true, the evaluation exits before the argument is run.

What this means is that code is parsed enough to be syntax checked, but not actually executed. This, of course, matches Frederick’s description in his submission. If the parse is successful, no exceptions will be raised, and the { return true } block will be executed, causing valid_syntax? to return true. If the parse fails, an exception is thrown, which we see provides a return value of false.

(From what I understand, it seems there is some redundancy here… both in the call to raise and the check of io.length… since the evaluation should either immediately return true from a successful parse, or it is not successful. Still, perhaps there are cases that Frederick ran across that necessitates this pattern?)


Wednesday, February 04, 2009