22 October 2011

Freeze Custom Ruby Strings When Used as Keys in Hash

Last week I spent quite some time chasing a single issue in my JavaClass Ruby gem. It really annoyed me and I could not find anything useful even using Google. I had to dig deep. Read what happened: I began with some kind of rich string, quite similar to the following class:
class RichString < String
  def initialize(string)
    super(string)
    @data = string[0..0] # some manipulation here
  end
  def data
    @data
  end
end

word = RichString.new('word')
puts word               # => word
puts word.data          # => w
That was not special and worked as expected.

Lost ... !!Then I happened to use instances of RichString as keys in a hash. Why shouldn't I? They were still normal Strings and their data should be ignored when used in the hash.
map = {}
map[word] = :anything

word_key = map.keys[0]
puts word_key           # => word
puts word_key.data      # => nil
The last line warned me "instance variable @data not initialized". Oops, my little @data went missing indicated by the bold nil in the last line. First I did not know what was causing the problems. I was baffled as all tests were green and had a good coverage. I spent some time digging and rewriting a lot of functionality until I found that Hash#keys() caused the trouble when given my RichStrings as hash keys.
puts word == word_key   # => true
puts word.object_id == word_key.object_id  # => false
Aha, Hash changed the keys. It's reasonable to prohibit key changes, so a String passed as a key will be duplicated and frozen. (RTFM always helps ;-) But how did it do that? It did not call dup() on the RichString. As Hash is natively implemented, I ended up in the C source hash.c.
/*
*  call-seq:
*     hsh[key] = value        => value
*     hsh.store(key, value)   => value
*/

VALUE
rb_hash_aset(hash, key, val)
  VALUE hash, key, val;
{
  rb_hash_modify(hash);
  if (TYPE(key) != T_STRING || st_lookup(RHASH(hash)->tbl, key, 0)) {
    st_insert(RHASH(hash)->tbl, key, val);
  }
  else {
    st_add_direct(RHASH(hash)->tbl, rb_str_new4(key), val);
  }
  return val;
}
So when the key is a String and not already included in the hash, then rb_str_new4 is called. (I just love descriptive names ;-) Furthermore string.c revealed some fiddling with the original key.
VALUE
rb_str_new4(orig)
  VALUE orig;
{
  VALUE klass, str;

  if (OBJ_FROZEN(orig)) return orig;
  klass = rb_obj_class(orig);
  if (FL_TEST(orig, ELTS_SHARED) &&
      (str = RSTRING(orig)->aux.shared) &&
      klass == RBASIC(str)->klass) {
    long ofs;
    ofs = RSTRING(str)->len - RSTRING(orig)->len;
    if ((ofs > 0) || (!OBJ_TAINTED(str) && OBJ_TAINTED(orig))) {
      str = str_new3(klass, str);
      RSTRING(str)->ptr += ofs;
      RSTRING(str)->len -= ofs;
    }
  }
  else if (FL_TEST(orig, STR_ASSOC)) {
    str = str_new(klass, RSTRING(orig)->ptr, RSTRING(orig)->len);
  }
  else {
    str = str_new4(klass, orig);
  }
  OBJ_INFECT(str, orig);
  OBJ_FREEZE(str);
  return str;
}
Frozen StringI didn't quite understand what was going on in rb_str_new4(), but it was sufficient to read a few lines: If the original string was frozen, then it was used directly. I verified that.
map = {}
map[word.freeze] = :anything

word_key = map.keys[0]
puts word_key           # => word
puts word_key.data      # => w
Excellent, finally my @data showed up as expected. Fixing the problem added some complexity dealing with frozen values, but it worked.

Freeze your custom Ruby strings when you use them as keys in a hash (and want to retrieve them with Hash#keys())

No comments: