12 August 2011

Word Wrap Kata Variants

It's time for some exercise - time for a code kata. I like the simple ones because they don't take much time and still provide a certain amount of training. After having done the Prime Factors Kata more than 20 times using Java, Ruby, C#, Turbo Pascal, BASIC and even Forth, I feel like trying a new one for a change. I choose the Word Wrap Kata, also by Uncle Bob. Unlike Prime Factors, Word Wrap seems to be less popular, there are only a few experiences with it published, e.g. Word Wrap using Python.

Recursive
The kata's task is to write a function that, like a word processor, breaks the line by replacing the last space in a line with a newline. The most straight forward solution to this is IMHO the recursive one. I only need to consider the first blank or forced break and call myself with the remaining text.
public String wrap(String line, int maxLineLen) {
   if (line.length() <= maxLineLen) {
      return line;
   }
   int indexOfBlank = line.lastIndexOf(BLANK, maxLineLen);
   int split;
   int offset;
   if (indexOfBlank > -1) {
      split = indexOfBlank;
      offset = 1;
   } else {
      split = maxLineLen;
      offset = 0;
   }
   return line.substring(0, split) + NEWLINE +
      wrap(line.substring(split + offset), maxLineLen);
}
Usually kata is not about the final solution, but about the process to get there. Still I want to compare different solutions, so let's analyse this one. If the line needs to be split n times (into n+1 shorter lines) then this solution creates 3*n String objects and additional n StringBuilders for string concatenation. ... 4*n objects are created.

Exercise Time at Karate DojoTail Recursive
To be tail recursive a function's last statement must be the recursive call.
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   wrap(line, maxLineLen, accumulator);
   return accumulator.toString();
}
private void wrap(String remainingLine, int maxLineLen, StringBuilder accumulator) {
   if (remainingLine.length() <= maxLineLen) {
      accumulator.append(remainingLine);
      return;
   }
   int indexOfBlank = remainingLine.lastIndexOf(BLANK, maxLineLen);
   int split;
   int offset;
   if (indexOfBlank > -1) {
      split = indexOfBlank;
      offset = 1;
   } else {
      split = maxLineLen;
      offset = 0;
   }
   accumulator.append(remainingLine.substring(0, split));
   accumulator.append(NEWLINE);
   wrap(remainingLine.substring(split + offset), maxLineLen, accumulator);
}
This solution creates the new String in the very end. It needs 2 Strings per new line and only one StringBuilder and a final String to return it. ... 2*n+2 objects are created.

Loop
A tail recursive function can be rewritten to reuse the stack frame transforming it into a plain loop. As Java does not support that optimisation, I do it by hand.
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   String remainingLine = line;
   while (remainingLine.length() > maxLineLen) {
      int indexOfBlank = remainingLine.lastIndexOf(BLANK, maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > -1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = maxLineLen;
         offset = 0;
      }
      accumulator.append(remainingLine.substring(0, split));
      accumulator.append(NEWLINE);
      remainingLine = remainingLine.substring(split + offset);
   }
   accumulator.append(remainingLine);
   return accumulator.toString();
}
The loopy :-) solution creates the same number of objects as the tail recursive one, but has reduced call overhead. ... Still 2*n+2 objects are created.

Optimised Loop
Let's optimise away the splitting of the remaining line because it gets split again in the next call, so all these Strings are only temporarily used.
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   int pos = 0;
   while (pos + maxLineLen < line.length()) {
      int indexOfBlank = line.lastIndexOf(BLANK, pos + maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > pos - 1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = pos + maxLineLen;
         offset = 0;
      }
      accumulator.append(line.substring(pos, split));
      accumulator.append(NEWLINE);
      pos = split + offset;
   }
   accumulator.append(line.substring(pos));
   return accumulator.toString();
}
Now only one String is created per new line, one for the last remaining part and one after the final concatenation. ... n+3 objects are created.

wrappedUsing a Buffer
If I could get rid of half of the String splitting, why not drop the other half too?
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   accumulator.append(line);
   int pos = 0;
   int inserted = 0;
   while (pos + maxLineLen < line.length()) {
      int indexOfBlank = line.lastIndexOf(BLANK, pos + maxLineLen);
      if (indexOfBlank > pos - 1) {
         accumulator.setCharAt(inserted + indexOfBlank, NEWLINE);
         pos = indexOfBlank + 1;
      } else {
         accumulator.insert(inserted + pos + maxLineLen, NEWLINE);
         pos = pos + maxLineLen;
         inserted++;
      }
   }
   return accumulator.toString();
}
Only one StringBuilder and one String are created. ... 2 objects are created. This is definitely the most garbage collector friendly solution because it creates only one temporary object, the StringBuilder.

Copying Characters
If there are blanks in line then all characters get copied once in append() and all solutions behave similar. (The method substring() does not copy characters, it just creates a new String with different pointers in the character array of the original String.) But if there are no blanks in line then the last solution copies all characters after the split point around one or more times. Copying large numbers of characters might be slower than allocating an object, especially as the JVM/Hotspot is optimised for short lived, small objects.

Resizing the StringBuilder
For all solutions using an explicit StringBuilder another optimisation is to avoid automatic resizing. The size of accumulator must be large enough to contain all additional newlines. If a line has a blank then it's replaced, so no new character is added. Only when a line contains no blank, then a newline is inserted. This can happen up to lineLen / maxLineLen times. Flooring (rounding down in integer division) is the right thing because after the last part, e.g. remainingLine, nothing is added.
...
   StringBuilder accumulator =
      new StringBuilder(calcMaxSize(line.length(), maxLineLen));
...
private int calcMaxSize(int lineLen, int maxLineLen) {
   int maxCharsAdded = lineLen / maxLineLen;
   return lineLen + maxCharsAdded;
}
Note that all these optimisations are highly theoretical. In a typical web or database application it doesn't matter at all if a few temporary objects are created or not. Remember that "Early optimisation is the root of all evil" (Donald Knuth). I would stick with the first solution because it's the shortest and easy to understand.

Bonus Round
public String wrap(String line, int maxLineLen) {
   return line.replaceAll("([^ ]{" + maxLineLen + "})" + // 1
      "(?=[^ ])" +                                       // 2
      "|" +                                              // 3
      "(.{1," + maxLineLen + "})" +                      // 4
      " ",                                               // 5
      "$1$2" + NEWLINE);
}
This solution is even shorter, just one line using a Regular Expression. It replaces the split points with a newline or adds one. The pattern matches areas of line which (1) contain exact maxLineLen characters that are not blanks, (2) which must be followed by a character that's not a blank (and which is not consumed) (3) or it matches areas of (4) one up to maxLineLen characters (5) which are followed by a blank. The match is replaced with the first (1) and second (3) groups together with a newline. The single blank (5) is not a member of any group and is dropped.

(Get the source.)

3 comments:

Chuck Krutsinger said...
This comment has been removed by the author.
Peter Kofler said...

Chuck,
thank you for the comment on the Bonus Round solution. Yes there is a problem if maxLineLen is greater than the length of line, then no newline should be inserted.

I had a look and fixed the regex, which makes it even more incomprehensible ;-) Between the lines // 3 and // 4 I added

"(?=.{" + maxLineLen + "}.)" +

which by negative look-ahead makes sure that // 4 and // 5 only match if there are more than maxLineLen characters in the line. Thank you for pointing it out. I update the source.

Peter Kofler said...

I compare the different implementations of Word Wrap in my post Absolute Priority Premise, an Example.