30 September 2009

Slacking Off

... or why automatic checks are necessary.

Human Factors
I must confess, I'm a slacker. For example, I've been writing this post for three months and still haven't finished it. I skip my workouts again and again. More important things just pop up all the time. Concepts like interesting or important are subjective, and priorities differ between individuals and change over time. So everybody has his or her sweet spot of slacking. It's impossible (and probably also unwise) to work hard on all aspects of life. When everything runs smoothly, people get sloppy. (Again, that might be something good for boring, repetitive tasks - except when a surgeon performs his 1,000th appendix removal.) When things work out great, we might even get delusions of grandeur and bathe in the glow of our own greatness. Everybody does it, you do it, I do it. Only Chuck Norris does not.

Hmm. I'm mixing different behaviours here: slacking, sloppiness, laziness, lack of motivation, doing things half-heartedly, leaving things unfinished. I use all these words synonymously. I know that's not entirely correct. (Probably that's the reason I can't get this post into proper shape. I've already rewritten it five times. I know that I must not ship shit, but I'm getting tired. So I will have to live with it. I'm sloppy myself ;-)

There are several causes for these behaviours, e.g. lack of interest (I don't care), boredom (I'm doing it for the hundredth time), distraction (I'm not able to concentrate on it - I just love cubicle spaces.), lack of background information (why do I do this crap), fear of wasted effort (I might not need it later) and time pressure (I have no time to do it properly).

Oh My!
What implications do these factors have for code quality? (By code quality I mean the internal code quality, maintained by the developer day after day.) Consider a product 'A'. Features have been added to it for the last five years. The natural laziness of all developers has taken its toll. The code is a mess. Maintenance costs go up. Suddenly code quality gets important. Suddenly management is interested in coding conventions and development processes. Suddenly people are aware of the need for an architecture. Suddenly people want to stop slackerism. But when the product is in trouble it's too late. Not really too late - software is soft and can be changed at any time - but it's just much more expensive. All these things are not new. It's well known that software erodes over time. Slacking developers may just be one of its causes.

Check What?
After this lengthy introduction, here is my point: the need for automatic checks. Checks are good for you. (Like daily sit-ups.) Do them. Even better, set them up so you don't have to do them yourself. (Somebody does all the sit-ups for you. Every day. Isn't it great ;-) Remember: if it's not checked, it's not there. Paper is patient, automatic checks are not. Really, make your checks and reviews automatic. It's important, like your daily vitamins.

Automated testing is only one aspect of checking your code, albeit the most popular one. The test-infected community already knows that if it's not tested, it's broken. So next to testing you need to check other aspects of your code, like coding conventions. Usually these include whitespace policy, formatting, naming and other design idioms. Coding conventions cover a much broader area than most people think. They are not only about naming. They are also about higher-level boilerplate code, e.g. how to handle transactions, how to access the database, how to log, how to handle exceptions, etc. These things are project specific and depend on the overall architecture.
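
Such convention checks can be automated with surprisingly little effort. As a minimal sketch (not taken from any particular tool), a plain JUnit test can scan the source tree and fail the build whenever a rule is violated - here a made-up "no tab characters" rule over a hypothetical src directory:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

// Minimal sketch of an automatic convention check: a JUnit test that fails
// the build when a source file violates the rule. The "no tabs" rule and
// the "src" directory are just assumptions for illustration.
public class NoTabsConventionTest extends TestCase {

    public void testNoTabsInJavaSources() throws IOException {
        List<File> offenders = new ArrayList<File>();
        collectOffenders(new File("src"), offenders);
        assertTrue("files containing tabs: " + offenders, offenders.isEmpty());
    }

    private void collectOffenders(File dir, List<File> offenders) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) {
            return;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                collectOffenders(file, offenders);
            } else if (file.getName().endsWith(".java") && containsTab(file)) {
                offenders.add(file);
            }
        }
    }

    private boolean containsTab(File file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        try {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                if (line.indexOf('\t') >= 0) {
                    return true;
                }
            }
            return false;
        } finally {
            reader.close();
        }
    }
}
Hooked into the daily build, such a test turns a documented rule into a check that cannot be skipped.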

Check It!
All projects have some sort of coding conventions. But are they complete? Are they documented? Do developers comply with them? Unlikely. They need to be documented and, even more importantly, they need to be checked automatically. Probably most of your rules are not checked. It's time to write them down and define concrete checks for them. Most tools and even some IDEs ship with basic rules for simple things like whitespace, naming or common coding idioms. These are perfect for a start. Start small. Use a few rules. You can always add more later. The limit of what you can check depends entirely on your determination: design rules, layering, modularity, architecture, code coverage, documentation and much more.

The problem is that rule enforcement provokes opposition. People don't want to leave their cosy comfort zone. Discussing and agreeing on a new coding convention is not a problem. But adding a new rule to an already checked coding convention might be a fight. You have to convince developers to accept it. You have to argue with management for time to remove rule violations in legacy code. You have to struggle through, especially when you're only a grunt. Small steps are crucial. Don't push too hard. If there is opposition, offer to drop the new rule. Make it look like there is the option of not having it. This enables discussion. (Of course that's not an option and you are not really offering it, but people like to have options to discuss.) As soon as some rules have shown their value, developers will vote to keep them even if you offer to drop them - play the devil's advocate.

Automatic!
So let's finish this rant about human nature. I'm a slacker. Most likely there are some more in our trade. We must accept that: we are lazy. We make mistakes. Sometimes we are weak. That's normal. We just have to be aware of it. So be paranoid. Don't trust anyone. Automate anything that you might screw up. (Robustness #2) Automatic checks are your safety net. They help you avoid making the same mistake twice. If there is a bug in your code, create a unit test to ensure the bug stays fixed. If you have inconsistent formatting, add format checks to your daily build. If you notice wrong usage of a design idiom during a review, create a custom rule to enforce proper usage. If ... well, you get the point.
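
To illustrate the first of these points: a regression test that pins down a fixed bug is usually only a few lines. A minimal sketch, where DiscountCalculator and its former capping bug are made-up examples:
import junit.framework.TestCase;

// Minimal sketch of a regression test that keeps a fixed bug fixed.
// DiscountCalculator and its capping rule are hypothetical examples.
public class DiscountRegressionTest extends TestCase {

    static class DiscountCalculator {
        // The fixed bug: discounts above 100 percent used to produce negative prices.
        double discountedPrice(double price, double discountPercent) {
            double cappedPercent = Math.min(discountPercent, 100.0);
            return price - price * cappedPercent / 100.0;
        }
    }

    public void testDiscountIsCappedAtFullPrice() {
        DiscountCalculator calculator = new DiscountCalculator();
        assertEquals(0.0, calculator.discountedPrice(50.0, 150.0), 0.001);
    }
}
Once such a test is part of the automatic build, the bug cannot sneak back in unnoticed.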

All this leads to the 2nd Law of Code Quality - Automatic Checks to fight slackerism.

4 September 2009

Running JUnit in Parallel

Back in 2008 I had to speed up our daily build. (I should have posted about it long ago, but I just didn't get around to it. Recently, when I saw a related post on a similar topic, my bad conscience overwhelmed me.) The first thing was to get a faster machine, something with four 3 GHz cores. It worked excellently! All file-based operations like compilation performed 3 times faster just out of the box, thanks to the included RAID 0+1 disk array. As our automated tests took half of the total build time, I dealt with them first: I applied the usual optimisations as described in the tuning section of my talk about practical JUnit testing. So I managed to halve the JUnit execution time.

Good, but still not fast enough. The problem was how to utilise all the shiny new cores during one build to speed it up as much as possible. So test execution needed to run in parallel. Some commercial build servers promised to be able to spread build targets over several agents. Unfortunately I had no opportunity to check them out; they cost well beyond my budget. The only free distributed JUnit runner I found was one using ComputeFarm JINI in a research project, which did not look mature enough for production use. Worth mentioning is GridGain’s JunitSuiteAdapter. It's able to distribute JUnit tests across a cluster of nodes. GridGain is a free cloud implementation, really hot stuff. But it's not a build solution, so integrating it into the existing build would have been difficult.

As I did not find anything useful, I had to come up with a minimalist home-grown solution. I started with a plain JUnit target junitSequential which ran all tests in sequence:
<target name="junitSequential">

<junit fork="yes" failureproperty="failed"
haltonfailure="false" forkmode="perBatch">

<classpath>
<fileset dir="${lib.dir}" includes="*.jar" />
<pathelement location="${classes.dir}" />
</classpath>

<batchtest>
<fileset dir="${classes.dir}"
includes="**/*Test.class" />
</batchtest>
</junit>

<fail message="JUnit test FAILED" if="failed" />

</target>
I used haltonfailure="false" to execute all tests regardless of whether some failed or not. Otherwise <batchtest> would stop after the first broken test. With failureproperty="failed" and <fail if="failed" /> the build still failed if necessary. There is nothing special here.

Ant is able to run tasks in parallel using the <parallel> tag. (See my related post about forking several Ant calls in parallel.) A parallel running target would look like this:
<target name="junitParallelIdea">
<parallel>
<antcall target="testSomeJUnit" />
<antcall target="testOtherJUnit" />
</parallel>
</target>
Good, but how to split the set of tests into Some and Other? My first idea was to separate them by their names, i.e. by the first letter of the test's class name, using the inclusion pattern **/${junit.letter}*Test.class in the <batchtest>'s fileset. So I got 26 groups of tests running in parallel.
<target name="junitParallelNamedGroups">
<parallel>

<antcall target="-junitForLetter">
<param name="junit.letter" value="A" />
</antcall>
<antcall target="-junitForLetter">
<param name="junit.letter" value="B" />
</antcall>
<antcall target="-junitForLetter">
<param name="junit.letter" value="C" />
</antcall>
<!-- continue with D until Z -->

</parallel>
</target>

<target name="-junitForLetter">

<junit fork="yes" forkmode="perBatch">

<!-- classpath as above -->

<batchtest>
<fileset dir="${classes.dir}"
includes="**/${junit.letter}*Test.class" />
</batchtest>
</junit>

</target>
forkmode="perBatch" created a new JVM for each group. Without forking each test class would get it's own class loader, filling up the perm space. Setting reloading="false" made things even worse. All those singletons started clashing even without considering race conditions. So I took the overhead of creating additional Java processes.

Unfortunately the grouping-by-letter approach had some problems. First, the number of threads needed to be specified with <parallel>'s threadsperprocessor or threadcount attribute, otherwise there would be 26 parallel processes competing for four cores. My experiments showed that two threads per processor performed best for the given set of JUnit tests. (Those JUnit tests were not "strictly unit"; some tests called the database or web services, freeing the CPU while blocking. For tests with very little IO the result might have looked different.)

Also my haltonfailure approach did not work because <antcall> does not return any properties set inside the called -junitForLetter target. No built-in Ant task supported that. But AntCallBack from the Antelope Ant extensions was able to do the trick: after registering the custom task with name="antcallback", I replaced the plain <antcall>s with <antcallback target="..." return="failed">.

Separating JUnit test cases by their names produced unbalanced and therefore unpredictable results regarding overall execution time. Depending on naming conventions, some groups would run much longer than others. Ant's Custom Selectors are a much better way to split a fileset into a given number of parts, producing balanced filesets with roughly the same number of test classes each.
import java.io.File;

import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.types.Parameter;
import org.apache.tools.ant.types.selectors.BaseExtendSelector;

public class DividingSelector extends BaseExtendSelector {

    /** Cycles from 1 to divisor, one step per file seen. */
    private int counter;
    /** Number of total parts to split. */
    private int divisor;
    /** Current part to accept. */
    private int part;

    public void setParameters(Parameter[] pParameters) {
        super.setParameters(pParameters);
        for (int j = 0; j < pParameters.length; j++) {
            Parameter p = pParameters[j];
            if ("divisor".equalsIgnoreCase(p.getName())) {
                divisor = Integer.parseInt(p.getValue());
            }
            else if ("part".equalsIgnoreCase(p.getName())) {
                part = Integer.parseInt(p.getValue());
            }
            else {
                throw new BuildException("unknown " + p.getName());
            }
        }
    }

    public void verifySettings() {
        super.verifySettings();
        if (divisor <= 0 || part <= 0) {
            throw new BuildException("part or divisor not set");
        }
        if (part > divisor) {
            throw new BuildException("part must be <= divisor");
        }
    }

    public boolean isSelected(File dir, String name, File path) {
        // Round-robin distribution: files are dealt out to parts 1..divisor
        // in turn, so each part gets roughly the same number of test classes.
        counter = counter % divisor + 1;
        return counter == part;
    }
}
One of the four available cores was used for static code analysis, which was very CPU intensive, and one was used for integration testing. The remaining two cores were dedicated to unit tests. Using 4 balanced groups of tests executing in parallel, the time spent on JUnit tests was halved again. Yippee!
<target name="junitParallel4Groups">
<parallel threadcount="4">
<antcallback target="-junitForDivided" return="failed">
<param name="junit.division.total" value="4" />
<param name="junit.division.num" value="1" />
</antcallback>
<antcallback target="-junitForDivided" return="failed">
<param name="junit.division.total" value="4" />
<param name="junit.division.num" value="2" />
</antcallback>
<antcallback target="-junitForDivided" return="failed">
<param name="junit.division.total" value="4" />
<param name="junit.division.num" value="3" />
</antcallback>
<antcallback target="-junitForDivided" return="failed">
<param name="junit.division.total" value="4" />
<param name="junit.division.num" value="4" />
</antcallback>
</parallel>
<fail message="JUnit test FAILED" if="failed" />
</target>

<target name="-junitForDivided">

<junit fork="true" failureproperty="failed"
haltonfailure="false" forkmode="perBatch">

<!-- classpath as above -->

<batchtest>
<fileset dir="${classes.dir}">
<include name="**/*Test.class" />
<custom classname="DividingSelector" classpath="classes">
<param name="divisor" value="${junit.division.total}" />
<param name="part" value="${junit.division.num}" />
</custom>
</fileset>
</batchtest>

</junit>

</target>
(Download source code of DividingSelector.)

Epilogue
Using this approach I kept the option to execute the tests one after another with num=1 of total=1, providing an easy way to switch between sequential and parallel execution. This was useful when debugging the build script...
<target name="junitSequential">
<antcallback target="-junitForDivided" return="failed">
<param name="junit.division.total" value="1" />
<param name="junit.division.num" value="1" />
</antcallback>
<fail message="JUnit test FAILED" if="failed" />
</target>