12 August 2011

Word Wrap Kata Variants

It's time for some exercise - time for a code kata. I like the simple ones because they don't take much time and still provide a certain amount of training. After having done the Prime Factors Kata more than 20 times using Java, Ruby, C#, Turbo Pascal, BASIC and even Forth, I feel like trying a new one for a change. I choose the Word Wrap Kata, also by Uncle Bob. Unlike Prime Factors, Word Wrap seems to be less popular, there are only a few experiences with it published, e.g. Word Wrap using Python.

Recursive
The kata's task is to write a function that, like a word processor, breaks the line by replacing the last space in a line with a newline. The most straight forward solution to this is IMHO the recursive one. I only need to consider the first blank or forced break and call myself with the remaining text.
public String wrap(String line, int maxLineLen) {
   if (line.length() <= maxLineLen) {
      return line;
   }
   int indexOfBlank = line.lastIndexOf(BLANK, maxLineLen);
   int split;
   int offset;
   if (indexOfBlank > -1) {
      split = indexOfBlank;
      offset = 1;
   } else {
      split = maxLineLen;
      offset = 0;
   }
   return line.substring(0, split) + NEWLINE +
      wrap(line.substring(split + offset), maxLineLen);
}
Usually kata is not about the final solution, but about the process to get there. Still I want to compare different solutions, so let's analyse this one. If the line needs to be split n times (into n+1 shorter lines) then this solution creates 3*n String objects and additional n StringBuilders for string concatenation. ... 4*n objects are created.

Exercise Time at Karate DojoTail Recursive
To be tail recursive a function's last statement must be the recursive call.
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   wrap(line, maxLineLen, accumulator);
   return accumulator.toString();
}
private void wrap(String remainingLine, int maxLineLen, StringBuilder accumulator) {
   if (remainingLine.length() <= maxLineLen) {
      accumulator.append(remainingLine);
      return;
   }
   int indexOfBlank = remainingLine.lastIndexOf(BLANK, maxLineLen);
   int split;
   int offset;
   if (indexOfBlank > -1) {
      split = indexOfBlank;
      offset = 1;
   } else {
      split = maxLineLen;
      offset = 0;
   }
   accumulator.append(remainingLine.substring(0, split));
   accumulator.append(NEWLINE);
   wrap(remainingLine.substring(split + offset), maxLineLen, accumulator);
}
This solution creates the new String in the very end. It needs 2 Strings per new line and only one StringBuilder and a final String to return it. ... 2*n+2 objects are created.

Loop
A tail recursive function can be rewritten to reuse the stack frame transforming it into a plain loop. As Java does not support that optimisation, I do it by hand.
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   String remainingLine = line;
   while (remainingLine.length() > maxLineLen) {
      int indexOfBlank = remainingLine.lastIndexOf(BLANK, maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > -1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = maxLineLen;
         offset = 0;
      }
      accumulator.append(remainingLine.substring(0, split));
      accumulator.append(NEWLINE);
      remainingLine = remainingLine.substring(split + offset);
   }
   accumulator.append(remainingLine);
   return accumulator.toString();
}
The loopy :-) solution creates the same number of objects as the tail recursive one, but has reduced call overhead. ... Still 2*n+2 objects are created.

Optimised Loop
Let's optimise away the splitting of the remaining line because it gets split again in the next call, so all these Strings are only temporarily used.
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   int pos = 0;
   while (pos + maxLineLen < line.length()) {
      int indexOfBlank = line.lastIndexOf(BLANK, pos + maxLineLen);
      int split;
      int offset;
      if (indexOfBlank > pos - 1) {
         split = indexOfBlank;
         offset = 1;
      } else {
         split = pos + maxLineLen;
         offset = 0;
      }
      accumulator.append(line.substring(pos, split));
      accumulator.append(NEWLINE);
      pos = split + offset;
   }
   accumulator.append(line.substring(pos));
   return accumulator.toString();
}
Now only one String is created per new line, one for the last remaining part and one after the final concatenation. ... n+3 objects are created.

wrappedUsing a Buffer
If I could get rid of half of the String splitting, why not drop the other half too?
public String wrap(String line, int maxLineLen) {
   StringBuilder accumulator = new StringBuilder();
   accumulator.append(line);
   int pos = 0;
   int inserted = 0;
   while (pos + maxLineLen < line.length()) {
      int indexOfBlank = line.lastIndexOf(BLANK, pos + maxLineLen);
      if (indexOfBlank > pos - 1) {
         accumulator.setCharAt(inserted + indexOfBlank, NEWLINE);
         pos = indexOfBlank + 1;
      } else {
         accumulator.insert(inserted + pos + maxLineLen, NEWLINE);
         pos = pos + maxLineLen;
         inserted++;
      }
   }
   return accumulator.toString();
}
Only one StringBuilder and one String are created. ... 2 objects are created. This is definitely the most garbage collector friendly solution because it creates only one temporary object, the StringBuilder.

Copying Characters
If there are blanks in line then all characters get copied once in append() and all solutions behave similar. (The method substring() does not copy characters, it just creates a new String with different pointers in the character array of the original String.) But if there are no blanks in line then the last solution copies all characters after the split point around one or more times. Copying large numbers of characters might be slower than allocating an object, especially as the JVM/Hotspot is optimised for short lived, small objects.

Resizing the StringBuilder
For all solutions using an explicit StringBuilder another optimisation is to avoid automatic resizing. The size of accumulator must be large enough to contain all additional newlines. If a line has a blank then it's replaced, so no new character is added. Only when a line contains no blank, then a newline is inserted. This can happen up to lineLen / maxLineLen times. Flooring (rounding down in integer division) is the right thing because after the last part, e.g. remainingLine, nothing is added.
...
   StringBuilder accumulator =
      new StringBuilder(calcMaxSize(line.length(), maxLineLen));
...
private int calcMaxSize(int lineLen, int maxLineLen) {
   int maxCharsAdded = lineLen / maxLineLen;
   return lineLen + maxCharsAdded;
}
Note that all these optimisations are highly theoretical. In a typical web or database application it doesn't matter at all if a few temporary objects are created or not. Remember that "Early optimisation is the root of all evil" (Donald Knuth). I would stick with the first solution because it's the shortest and easy to understand.

Bonus Round
public String wrap(String line, int maxLineLen) {
   return line.replaceAll("([^ ]{" + maxLineLen + "})" + // 1
      "(?=[^ ])" +                                       // 2
      "|" +                                              // 3
      "(.{1," + maxLineLen + "})" +                      // 4
      " ",                                               // 5
      "$1$2" + NEWLINE);
}
This solution is even shorter, just one line using a Regular Expression. It replaces the split points with a newline or adds one. The pattern matches areas of line which (1) contain exact maxLineLen characters that are not blanks, (2) which must be followed by a character that's not a blank (and which is not consumed) (3) or it matches areas of (4) one up to maxLineLen characters (5) which are followed by a blank. The match is replaced with the first (1) and second (3) groups together with a newline. The single blank (5) is not a member of any group and is dropped.

(Get the source.)

8 August 2011

Maven Plugin Harness Woes

Last year I figured out how to use the Maven Plugin Harness and started using it. I added it to my Maven plugin projects. Recently I started using DEV@cloud. DEV@cloud is a new service which contains Jenkins and Maven source repositories. CloudBees, the company behind DEV@cloud, offers a free subscription with reduced capabilities which are more than enough for small projects. I set up all my projects there in no time, but had serious problems with the integration tests.

Using a local Maven repository other than ~/.m2/repository
Repository (unnamed)You don't have to use the default repository location. It's possible to define your own in the user's settings.xml or even in the global settings. But I guess most people just use the default. On the other hand in an environment like DEV@cloud, all the different builds from different users must be separated. So CloudBees decided that each Jenkins job has its own Maven repository inside the job's workspace. That is good because the repository is deleted together with the project.

Problem
The Testing Harness embeds Maven, i.e. forks a new Maven instance. It fails to relay the modified settings to this new process. During the execution of the integration test a new local repository is created and the original local one is used as a remote one (called "local-as-remote"). But without any hints, Maven uses ~/.m2/repository. So the true local repository is not found and all needed artefacts are downloaded again. This takes a lot of time (and wastes bandwidth). Dependencies that exist only in the local repository, e.g. snapshots of dependent projects, are not found and the integration test fails.

Solution
RepositoryTool.findLocalRepositoryDirectoy() uses an instance of MavenSettingsBuilder to get the settings. Its only implementing class is DefaultMavenSettingsBuilder and it tries to determine the repository location from the value of the system property maven.repo.local. Then it reads the user settings and in the end it uses ~/.m2/repository. The solution is to set the maven.repo.local system property whenever the local repository is not under ~/.m2/repository. Add -Dmaven.repo.local=.repository into the field for Goals and Options of the Jenkins job configuration.

Using additional dependencies while building integration test projects
indirectionAfter the plugin under test is built and installed into the new local repository the Maven Plugin Harness runs Maven against the integration test projects inside the src/test/resources/it folder. The approach which I described last year forks a Maven with pom, properties and goals defined by the JUnit test method.

Problem
The integration tests know the location of the new local repository (because it is set explicitly) and are able to access the plugin under test. But they know nothing about the local-as-remote repository. They can only access all artefacts which have been "downloaded" from the local-as-remote repository during the build of the plugin under test. So the problem is similar to the previous problem but occurs only when an integration test project needs additional artefacts. For example a global ruleset Maven module might consists of XML ruleset configuration files. The test module depends on the Checkstyle plugin and executes it using the newly build rulesets. So the object under test (the rules XML) is tested indirectly through the invocation of Checkstyle but the ruleset module itself does not depend on Checkstyle.

Solution
All POMs used during integration test have to be "mangled", not just the POM of the plugin under test. The method manglePomForTestModule(pom) is defined in the ProjectTool but it's protected and not accessible. So I copied it to AbstractPluginITCase and applied it to the integration test POMs.

Using settings other than ~/.m2/repository/settings.xml
Cluster ConfigurationIf you need artefacts from repositories other than Maven Central you usually add them to your settings.xml. Then you refer to them in the Jenkins job configuration. Behind the scenes Jenkins calls Maven with the parameter -s custom_settings.xml.

Problem
Similar to the repository location, the custom settings' path is not propagated to the embedded Maven and it uses the default settings. This causes no problems if all needed artefacts are either in the local-as-remote repository or can be downloaded from Maven Central. For example a Global Ruleset might contain some Macker Architecture Rules. The snapshot of the Macker Maven Plugin is deployed by another build job into the CloudBees snapshot repository. The test module depends on this Macker plugin and runs it using the newly built rulesets.

Solution
AbstractPluginITCase calls BuildTool's createBasicInvocationRequest() to get an InvocationRequest and subsequently executes this request. Using any system property the InvocationRequest can be customised:
if (System.getProperty(ALT_USER_SETTINGS_XML_LOCATION) != null) {
   File settings =
      new File(System.getProperty(ALT_USER_SETTINGS_XML_LOCATION));
   if (settings.exists()) {
      request.setUserSettingsFile(settings);
   }
}
Then the value of the used system property is added into the field for Jenkins' Goals and Options: -s custom_settings.xml -Dorg.apache.maven.user-settings=custom_settings.xml.

Alas I'm not a Maven expert and it took me quite some time to solve these problems. They are not specific to CloudBees but they result from using non default settings. Other plugins that fork Maven have similar problems.

6 August 2011

Diaries of a New Employment

SkyscraperI have got a new job. My last employer, sort of a startup, went bankrupt. That's particularly sad, because they offered "Research Days". Similar to Google Friday, you were allowed to work on a random topic one day in a month. One day is not much, still it's so much more than other companies offer. The new one is big. After some nasty times at big companies, I'm not sure if big is good for me, but still I have to try.

(Sort of a) Diary
Since first of July I've been working there. Some friends asked me how am I doing. So here is my diary. It's biased because I'm too strict. To be fair - everything looks good. I'm not dying from excitement, but it's ok. I have to get used to it.
  • Day 1 - Shocked. It's my first day and I'm already totally shocked. A senior team member commits production code containing System.out.println because he doesn't bother to remove them. Later another team member doesn't know Eclipse's "Compare With/Each Other" function. I feel like running away.

  • Day 4 - Déjà vu. I'm the new guy in the team and the new guy is supposed to update the development setup documentation.

  • Day 6 - First Blood. Out of curiosity I take a brief look at the existing code base: There are more than 7000 classes with about 600000 non commenting source statements. There is all the classic legacy stuff you would expect to be hidden in such a large code base, e.g. methods containing up to 700 lines, many empty catch blocks and much more. But my favourite findings are the 60 Java source files which have been commented out completely. (I didn't know the Java compiler allowed empty files without any class declaration.)

  • Day 8 - Deadlock. I'm supposed to add information about myself to the internal directory. To edit my entry, I need to fill in a valid phone extension, which I don't have yet. So I try to get one but the phone extension registry would not give me one because my directory entry is incomplete.

  • Day 11. For two hours I'm unable to start my e-mail client. I'm trying all kind of workarounds but it just wouldn't work. I'm getting desperate. In the end it turns out that there is an internal application to clean up and restart the mail client which finally solves my problem.

  • DiaryDay 12. Today I saw the first person wearing shorts in the office. I was already getting stressed by being the only one wearing t-shirts and shorts among many people wearing suit.

  • Day 14 - Hope. We are preparing a list of refactorings that would improve maintainability of the code. I stay quiet because I'm the new one and don't have all the information. I'm glad when another team member proposes what I wanted to say. There is hope.

  • Day 15. The project manager wants to give me responsibilty for the continuous integration build. I cringe because setting up and maintaining the build is cumbersome and boring. Fortunately my manager objects that "there are other important things for Peter to do". +1 for saving me.

  • Day 18 - Public Relations. I know it's a waste of time, but I'm stubborn and ask the person responsible for PR/communications about the mode and support for publishing and presenting. I never got an answer.

  • Day 20 - Coder from Hell. A senior colleague does not care for compiler warnings because "he knows what he is doing". Well he should know better, especially as we are tasked to clean up some legacy mess that exists because of people like him. Anyway he doesn't give a shit about clean code. I haven't met such a kind of a developer before.

  • Day 22 - Print a Page. I need to print a page. This is usually not a problem. I install the printer driver, print the page, go to the machine and find that the device is out of order. Then a colleague tells me that it's broken all the time. Where is the next printer? So I find the printer on another floor, install, print, go there and find that the room containing the printer is locked. Ok, I repeat the whole schema with the printer of another floor, go there - but it's not "there". I can't find this bloody printer. I repeat with the printer on yet another floor, go there - and finally I'm able to collect my printout.

  • Day 28 - Long Line. I just found a single line of code containing 2881 characters. Yes, a single line. That's by far the longest line I have ever seen and it's even way beyond any long lines recorded by fellow craftsmen.