167 Statistician I


This week begins a three-part quiz, whose final goal is to provide a little library for parsing and analyzing line-based data. Hopefully, each portion of the larger problem is interesting enough on its own, without being too difficult to attempt. The first part – this week’s quiz – will focus on pattern matching.

Let’s look at a bit of example input:

You wound Perl for 15 points of Readability damage.
You wound Perl with Metaprogramming for 23 points of Usability damage.
Your mighty blow defeated Perl.
C++ walks into the arena.
C++ wounds you with Compiled Code for 37 points of Speed damage.
You wound C++ for 52 points of Usability damage.

Okay, it’s silly, but it is similar to a much larger data file I’ll provide at the end for testing.

You should definitely note the repetitiveness: just the sort of thing that we can automate. In fact, I’ve examined the input above and created three rules (a.k.a. patterns) that match (most of) the data:

[The ]<name> wounds you[ with <attack>] for <amount> point[s] of <kind>[ damage].
You wound[ the] <name>[ with <attack>] for <amount> point[s] of <kind>[ damage].
Your mighty blow defeated[ the] <name>.

There are a few guidelines about these rules:

  1. Text contained within square brackets is optional.
  2. A word contained in angle brackets represents a field; not a literal match, but data to be remembered.
  3. Fields are valid within optional portions.
  4. You may assume that both the rules and the input lines are stripped of excess whitespace on both ends.
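To make the guidelines concrete, here is the second rule translated by hand into a Ruby regular expression (this hand translation is mine, not part of the quiz):

```ruby
# "You wound[ the] <name>[ with <attack>] for <amount> point[s]
#  of <kind>[ damage]." translated by hand: optional text becomes an
# optional group, and each <field> becomes a capturing group.
rule = /^You wound(?: the)? (.+?)(?: with (.+?))? for (.+?) point(?:s)? of (.+?)(?: damage)?\.$/

line = "You wound Perl with Metaprogramming for 23 points of Usability damage."
fields = rule.match(line).captures.compact
puts "Rule 1: #{fields.join(', ')}"   # → Rule 1: Perl, Metaprogramming, 23, Usability
```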

Assuming the rules are in rules.txt and the input is in data.txt, running your Ruby script like this:

> ruby stats.rb rules.txt data.txt

should generate the following output:

Rule 1: Perl, 15, Readability
Rule 1: Perl, Metaprogramming, 23, Usability
Rule 2: Perl
# No Match
Rule 0: C++, Compiled Code, 37, Speed
Rule 1: C++, 52, Usability

Unmatched input:
C++ walks into the arena.

Each line of the output corresponds to a line of the input; it indicates which rule was matched (as a zero-based index) and outputs the matched fields’ values. Any input line that cannot be matched against one of the rules should output a “# No Match” comment, with all the unmatched input records printed in the “Unmatched input” section at the end (so the author of the rules can extend them appropriately).

One thing you should keep in mind while working on this week’s quiz is that you want to be flexible; followup quizzes will require that you modify things a bit.

For testing, I am providing two larger datasets: combat logs taken from Lord of the Rings Online gameplay. There is data for a Guardian and a Hunter; unzip before use. Both use the same ruleset:

[The ]<name> wounds you[ with <attack>] for <amount> point[s] of <kind>[ damage].
You are wounded for <amount> point[s] of <kind> damage.
You wound[ the] <name>[ with <attack>] for <amount> point[s] of <kind>[ damage].
You reflect <amount> point[s] of <kind> damage to[ the] <name>.
You succumb to your wounds.
Your mighty blow defeated[ the] <name>.


The heart of this problem, as suggested in the quiz description, is pattern matching. Essentially, we want to turn user-created rules into Ruby regular expressions, or at least some other method for comparing data against those rules.

I’ll come back to that in a minute. If we assume, for the moment, that we have the pattern matching in place, the rest of the code is pretty trivial. Read and parse the rules; then, read and parse the data according to those rules. Most of the submissions were pretty similar in this respect. Here’s the main loop from the solution of Benjamin Billian, as an example.

rules = [] # Set of Rules

# get Rules
File.open ARGV[0], 'r' do |file|
  file.each_line { |line| rules << create_rule(line) }
end

All lines of the first file (containing the rules) are read and passed through the function create_rule. Those rules are then stored in an array to be used in the next section.

unmatched = [] # Array of unmatched lines

# get Data
File.open ARGV[1], 'r' do |file|
  file.each_line do |line|
    matched = false
    rules.each_with_index do |rule, i|
      if (match = rule.match(line)) != nil
        l = match.length
        a = match[1...l]
        a.delete nil
        puts "Rule #{i}: #{a.join ', '}"
        matched = true
      end
    end
    unless matched
      unmatched << line
      puts '# No Match'
    end
  end
end

Now, all lines of the second file (containing the data) are read and compared against each of the rules. When a match is found, the field data is output along with the rule’s index. If no match is found, the input line is stored in the unmatched array, to be used at the end:

puts '-----'
puts 'Unmatched input:'
unmatched.each { |line| puts line }

That completes the quiz’s specification. There are a couple of minor revisions I’d make to Benjamin’s inner loop, my revision shown here:

    rules.each_with_index do |rule, i|
      if match = rule.match(line)
        a = match.captures
        a.delete nil
        puts "Rule #{i}: #{a.join ', '}"
        matched = true
      end
    end

First, the match != nil test is redundant, since nil already evaluates as false in a boolean context; removing the comparison removes clutter. Second, the MatchData method captures returns exactly the values wanted in a single method call. Another alternative would have been match[1..-1], which avoids having to query the match length.
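A quick irb-style check shows the two approaches are equivalent (the example regex here is mine):

```ruby
# A pattern with two capturing groups and one optional capturing group.
m = /(\w+) (\w+)(?: (\w+))?/.match("Hello world")

p m.captures          # → ["Hello", "world", nil]  (unmatched group is nil)
p m[1..-1]            # → ["Hello", "world", nil]  (same, via a range index)
p m.captures.compact  # → ["Hello", "world"]       (compact drops the nils)
```

Note that Array#compact returns a new array, while Benjamin’s a.delete nil mutates in place; either removes the nils left by unmatched optional groups.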

Finally, one alternative to using each_with_index (which requires keeping some sort of tracking variable to know whether a match was found) is using Enumerable#find, as shown in Matthew Moss’s submission.
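A sketch of that idea (not Matthew’s actual code; the method name and sample rules are mine):

```ruby
# find returns the index of the first matching rule, or nil if none match;
# the match variable is captured by the closure for later use.
def report(line, rules)
  match = nil
  index = rules.each_index.find { |i| match = rules[i].match(line) }
  index ? "Rule #{index}: #{match.captures.compact.join(', ')}" : '# No Match'
end

rules = [/^(\w+) flees\.$/, /^You dodge (\w+)\.$/]
puts report("You dodge Perl.", rules)   # → Rule 1: Perl
puts report("C++ walks in.", rules)     # → # No Match
```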

Now, back to pattern matching. While I commend the efforts of those who wrote rule parsers, I would also recommend to the same that they learn to use regular expressions, or at least review their knowledge. The rule format as specified by the quiz is intentionally simple; it takes little effort to create a regular expression that is compact and almost certainly more efficient than parsing by hand.

As a first example of how to create an appropriate regular expression from an input rule, let’s look again at Benjamin’s solution, his create_rule method:

def create_rule(str)
  str.gsub! '.', '\\.'
  str.gsub! '[', '(?:'  # change [text] into (?:text)?
  str.gsub! ']', ')?'
  str.gsub! /<.+?>/, '(.+?)'  # change <text> into (.+?)
  Regexp.new "^#{str}$"
end

Each substitution here turns some portion of the rule into regular expression syntax. First, periods are escaped, since they have special meaning in a Regexp and we want a literal period.

Next, as indicated in the comment, square brackets surrounding text are changed to (?:text)?. This is a regular expression group that is optional (from the trailing ?) and non-capturing (from the ?: after the opening parenthesis). Using a non-capturing group allows us to operate on the group as a whole (e.g. making it optional) while avoiding remembering the content of that group.

Benjamin’s next substitution changes <text> to (.+?), a non-optional, capturing group. These are the field values we want to remember later. The .+? pattern matches a string of any characters with length of at least one. The more familiar .+ does the same thing, but greedily, and would fail for any rule with more than one field (since a single greedy match would span from the first field’s < to the last field’s >). The question mark of .+? turns this pattern into a non-greedy match.
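The difference between the greedy and non-greedy forms is easy to see on a template with two fields (this tiny demonstration is mine):

```ruby
tmpl = "<name> wounds <target>"
p tmpl.gsub(/<.+?>/, 'F')  # → "F wounds F"  (non-greedy: one match per field)
p tmpl.gsub(/<.+>/,  'F')  # → "F"           (greedy: one match spans both fields)
```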

Finally, the new string, after these substitutions, is prefixed with ^ and suffixed with $ to indicate the beginning and end of the line, and a new Regexp object is created.
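Tracing a short rule through those substitutions makes the whole process visible (the trace itself is mine):

```ruby
str = "Your mighty blow defeated[ the] <name>."
str = str.gsub('.', '\\.')        # "Your mighty blow defeated[ the] <name>\."
str = str.gsub('[', '(?:')
str = str.gsub(']', ')?')         # "Your mighty blow defeated(?: the)? <name>\."
str = str.gsub(/<.+?>/, '(.+?)')  # "Your mighty blow defeated(?: the)? (.+?)\."

re = Regexp.new("^#{str}$")
p re.match("Your mighty blow defeated the Perl.")[1]  # → "Perl"
```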

Now, Benjamin’s solution for creating regular expressions is a start, but incomplete. It will work with the sample rules and data provided, but to make this a more general solution, we need to consider other special characters. What if a rule contained a question mark, asterisk, or any other character special to regular expressions? We’d have to escape each of those. Meanwhile, we don’t want to escape square brackets (special characters both in our rules and in Ruby regular expressions), but instead transform them into a completely different pattern.

The answer to this problem can be found in a couple of submissions, particularly those of Jesus Gabriel and Sandro Paganotti. Both made use of the method Regexp.escape which does exactly the sort of escaping that Benjamin accomplishes with his first gsub! above. The difference between Jesus’s and Sandro’s solutions, then, is in the handling of square brackets.

Jesus deals with square brackets by replacing them with repeated underscore characters (i.e. some uncommon text), calling Regexp.escape, then changing the underscores back to brackets. While easy, I’m not fond of this particular approach in general situations, since, as uncommon as the temporary text might be, it could still show up in some user’s data set.
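In outline, the placeholder idea looks like this (the placeholder text and method name here are hypothetical, not Jesus’s actual choices):

```ruby
# Hypothetical placeholders; any text unlikely to appear in a rule works.
OPEN  = "___OB___"
CLOSE = "___CB___"

def escape_keeping_brackets(rule)
  # Hide the brackets, escape everything else, then restore the brackets.
  tmp = rule.gsub("[", OPEN).gsub("]", CLOSE)
  Regexp.escape(tmp).gsub(OPEN, "[").gsub(CLOSE, "]")
end

p escape_keeping_brackets("point[s].")  # brackets survive; the period is escaped
```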

Sandro takes a different approach, allowing the square brackets to be escaped and dealing with the results.

Regexp.escape(r.chomp).gsub("\\\\[", "[").gsub("\\\\]", "]")

(Sandro’s solution actually had an extra \\, which does work but is unnecessary, since strings are being passed to gsub here, not regular expressions). The benefit of this technique is that it cannot be fooled by uncommon or rare input from the user’s rules or data set.

Sandro’s solution also contains other calls to substitute appropriate regular expression groups for the square and angle brackets. This, I believe, is the best solution for the transformation of our statistician rules into Ruby regular expressions.
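Putting the pieces together, the whole transformation might look like this (a sketch assembled from the techniques above, not Sandro’s verbatim code):

```ruby
# Turn a statistician rule into an anchored Regexp: escape everything,
# then substitute the real patterns for the escaped brackets and fields.
def create_rule(rule)
  pat = Regexp.escape(rule.strip)                 # escape all special characters
  pat = pat.gsub("\\[", "(?:").gsub("\\]", ")?")  # [text]  -> (?:text)?
  pat = pat.gsub(/<.+?>/, "(.+?)")                # <field> -> (.+?)
  Regexp.new("^#{pat}$")
end

rule = create_rule("You wound[ the] <name> for <amount> point[s] of <kind>[ damage].")
p rule.match("You wound C++ for 52 points of Usability damage.").captures.compact
# → ["C++", "52", "Usability"]
```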

Tomorrow, we will continue developing our Statistician, delving into modules and meta.

Wednesday, February 04, 2009