170 Records and Arrays

Description

In a typical application, you might use what is commonly referred to as an “array of records.” Essentially, this is a single array containing multiple objects that represent the same kind of structured data, either as defined class or OpenStruct instances. Take, for example, this bit of code that reads in some data regarding people: their names, ages and favorite colors.

require 'ostruct'
data = []
File.open("data.txt").each do |line|
  name, age, color = line.chomp.split(/,/)
  d = OpenStruct.new
  d.name = name
  d.age = age
  d.color = color
  data << d
end

The variable data is considered an “array of records”, since to get any particular piece of data, you must first access it as an array (to get a particular record), then access it as a record (to get a particular field).

> p data[3].name
=> "Matthew"

> p data[3].age
=> 36

However, at times, it is more convenient to store data as a “record of arrays”. Instead of one array containing multiple records, you have one object (i.e. record) containing multiple, parallel arrays. Access to data then is done first as a record, then as an array.

> p data.name[3]
=> "Matthew"

> p data.age[3]
=> 36

This sort of arrangement is useful when you want to access series of data at a time. For example, if I have a graphing component that takes two arrays – one for the domain axis, and another for the range axis – a “record of arrays” will make accessing that data trivial.

Your task this week is to define two functions that move data between “array of records” storage and “record of arrays” storage.

def aor_to_roa(arr)
  # This method accepts an array of records, and
  # should return a single record of arrays.
  #
  # This is your task!
end

def roa_to_aor(rec)
  # This method accepts a record of arrays, and
  # should return a single array of records.
  #
  # This is also your task!
end

You should make this work with OpenStruct; do not limit yourself to the example records shown above.

There are two, optional extra-credits for this week.

  1. Extend these two functions to accept arbitrary classes.

  2. As an alternative to these two functions, create an adapter class that can wrap around “array of records” data to provide a “record of arrays” interface without actually moving data about.

Summary

It is worthwhile to note that, for most real situations, this week’s quiz problem isn’t necessary. From a human perspective, we naturally organize things in aggregate, as structured records. Whether it be object-oriented data, rows in a SQL table, XML records or other, that’s how we naturally organize things. Except for a few specialized conditions, collections of records are sufficient for our organizational needs as either programmers or users.

What are those specialized conditions? I can think of a couple situations:

  1. Performance. When dealing with older hardware, or perhaps specialized hardware (e.g. vector units on some processors or GPUs), it can be quite beneficial to “de-organize” data. Tight, little loops can then quickly process individual bits of data, and not be required to either process nor hold the whole data record.

  2. Interface. If the data must be provided to another library or component that expects parallel bits of data (e.g. a graphing component as suggested by the quiz), then obviously we need to provide the data as requested.

In the quiz, I asked that first for a data moving solution. This is often appropriate for performance situations, since an adapter does not provide the compact data that performance may require. (For example, some vector units may have limited memory, so it’s important to extract only the data needed for vector processing.)

The extra-credit adapter (wrapper) solution is more appropriate for interface situations. Here, especially with Ruby, we can easily wrap a new interface around the old, and make one set of data work in numerous situations.

Let’s start by looking at a solution that moves data into a new “record of arrays” structure, submitted by Andrea Fazzi.

def aor_to_roa(arr)
  hash = { }
  arr.each do |record|
    record.marshal_dump.each_key { |field|
      (hash[field] ||= []) << record.send(field)
    }
  end
  OpenStruct.new(hash)
end

This function accepts arrays of OpenStruct instances, as specified by the quiz. The outer loop enumerates over each record. For each record, marshal_dump is called: this is an instance method of OpenStruct and generates a hash consisting of field names associated to the field values.

For each key in that hash, Andrea first ensures that the new hash is either already initialized, or gets initialized to an empty array. The old record is sent the field name as a message (which returns the field value), which is appended to the new hash’s array for that field. Finally, a new OpenStruct instance is returned, which takes the new hash as its content.

(On a side note… Technically, for vector units and similar situations, you’d need to move the actual data into a memory block, rather than mere copying of object references as done above… but the code is sufficient for displaying the basic idea.)

I’m going to pass on talking about the opposite function: that is, moving data from a “record of arrays” to an “array of records.” Given that we usually store data as an “array of records,” this is less important since we can refer to the original data. Still, take a look at Andrea’s roa_to_aor method, which maintains a similar approach and simplicity.

I’m also not going to talk about pushing updates back to the original data for this case. Generally, this isn’t desireable for the typical cases (i.e. performance) that move/duplicate data. Either the data isn’t being modified, or if being modified, it gets pushed out the end of the pipe and intentionally lost. (Had I considered this when writing the quiz, I would have skipped my request for the roa_to_aor method.)

Situations that focus on interface, however, will prefer to work with the original data. In fact, it’s important that the adapter allow the original data to be modified to the extent it was originally modifiable: read-only data should not suddenly become read-write. (Another thing I should have put into the original quiz description… sorry.)

First, I want to look at the adapter from James Gray, which is quite simple and satifies the quiz description as was written. (I’ve left out the method __enum__, since it only existed to support roa_to_aor.)

class MapToEnum
   instance_methods.each { |meth|
       undef_method(meth) unless meth =~ /\A__/
   }

   def initialize(enum)
     @enum = enum
   end

   def method_missing(meth, *args, &block)
     @enum.map { |o| o.send(meth, *args, &block) }
   end
end

The first line undefines almost all the instance methods of this class (excepting those that start with two underscores), and does so immediately. To see what this leaves behind, try the following in irb:

> class WhatIsLeft
>   instance_methods.each { |m| undef_method(m) unless m =~ /\A__/ }
>   instance_methods
> end

=> ["__id__", "__send__"]

Basically, the intent here is that MapToEnum do very little on its own: most work will be passed onto the data, which we’ll see in a moment. Handly little thing… at least when using the class, but better to comment it out while debugging it.

The MapToEnum initializer simply remembers the array passed in, which is then used in method_missing. Here, any methods called except those above are sent to each of the objects in @enum. Using @enum.map means that arrays will be returned. And because all instance methods were cleared out first, you can do some other things besides getting the field names.

# Assuming people is an array of Person structs...

> peeps = MapToEnum.new(people)
> peeps.name
=> ["James", "Dana", "Matthew"]

> peeps.name[2]
=> "Matthew"

> peeps.class
=> [Person, Person, Person]

What doesn’t work with James’ solution is modifying the data.

> peeps.name[2] = "Bob"
=> "Bob"

> peeps.name
=> ["James", "Dana", "Matthew"]

> people.map { |x| x.name }
=> ["James", "Dana", "Matthew"]

This is because the attempt to modify peeps.name[2] actually changes the reference contained in the third slot of the array generated by peeps.name. To change the source data, peeps.name[2] would need to represent the name field of the original record.

Let’s look at one more solution, this one from Frederick Cheung.

class ArrayOfRecordsWrapper
   def initialize(array)
     @array = array
   end

   def method_missing(name, *args)
     FieldProxy.new(@array, name)
   end

   class FieldProxy
     instance_methods.each { |m| undef_method m unless m =~ /(^__)/ }

     def initialize array, name
       @array, @name = array, name
     end

     def [](index)
       @array[index].send @name
     end

     def []=(index,value)
       @array[index].send(@name.to_s + '=', value)
     end

     def to_a
       @field_array = @array.collect {|a| a.send @name}
     end

     def method_missing(*args, &block)
       to_a.send *args, &block
     end
   end
end

Frederick’s solution is, in many ways, similar to James’ solution, though Frederick adds in a proxy object to manage named fields. In fact, if you remove the square bracket methods [] and []= in the proxy, simplify and refactor, you’d essentially have James’ solution.

Yet, Frederick’s solution does work for manipulating the data.

> people[2].name
=> "Matthew"

> peeps = ArrayOfRecordsWrapper.new(people)

> peeps.name
=> ["James", "Dana", "Matthew"]

> peeps.name[2] = "Bob"
=> "Bob"

> peeps.name
=> ["James", "Dana", "Bob"]

> people[2].name
=> "Bob"

This works specifically because Frederick provided the square bracket methods, which manipulate the original data directly, rather than through the array generated from to_a and method_missing. If we attempt to change the original data through those (rather than the proxy supplied square bracket methods), it won’t work.

> peeps.name
=> ["James", "Dana", "Bob"]

> peeps.name.to_a[2] = "Matthew"
=> "Matthew"

> peeps.name
=> ["James", "Dana", "Bob"]     # not changed!

It might be worthwhile to hide to_a, though I expect that’s not the only way to circumvent the goal of modifying the original data. (And with a dynamic language such as Ruby, I’m pretty sure of it.) If we set a goal of defining and supporting common usage, then Frederick’s solution is a pretty satisfactory one.


Wednesday, February 04, 2009