18 August 2007

Java Unicode Constants

Java has had strong Unicode support from the very beginning. That's nice and is supposed to save us some headaches with encodings and code pages, as well as allowing us to write real i18n applications (and use fancy symbols). So let's imagine you are working on your revolutionary new application which does some symbolic computations and you need to display an arrow. Maybe you know that it's just '\u2192', or you found it in the tables of the Unicode Database. However, by putting it into your code you introduce a 'magic' character code. Magic numbers are a coding flaw and should not occur in your code. They need to be defined in one place under a reasonable name. So you end up defining all kinds of Unicode letters and symbols you need.
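To make the point concrete, here is a minimal sketch of replacing the magic '\u2192' with a named constant. The class ArrowDemo and the helper implies are made up for this illustration; the constant name follows the official Unicode character name of U+2192, RIGHTWARDS ARROW.

```java
public class ArrowDemo {
    // Named after the official Unicode character name of U+2192.
    // The constant lives here only for the sake of the example.
    static final char RIGHTWARDS_ARROW = '\u2192';

    // Hypothetical helper: renders "premise -> conclusion" with a real arrow.
    static String implies(String premise, String conclusion) {
        return premise + " " + RIGHTWARDS_ARROW + " " + conclusion;
    }

    public static void main(String[] args) {
        // The call site no longer contains a magic character code.
        System.out.println(implies("a", "b"));
    }
}
```

The character code now appears in exactly one place, and every use site reads as what it means.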

Instead you might want to use these Java UniCode Constants (UCC). The constants were derived directly from the textual representation of the Unicode Database using a small Ruby script. For every character there is a constant with its official name and the corresponding char or int value. All characters of Unicode version 4.2.0 up to U+1FFFF are covered, except CJK Ideographs. For each Unicode block, e.g. Basic Latin (U+0000..U+007F) or Aegean Numbers (U+10100..U+1013F), there is a separate interface named after the block that defines all code points in that block. First you need to import the blocks, e.g. import unicode.AegeanNumbers. Then you can use the constants in your code like this:
Character.charCount(BasicLatin.DIGIT_NINE) // 1
Character.getNumericValue(BasicLatin.DIGIT_NINE) // 9
Character.charCount(NumberForms.ROMAN_NUMERAL_FIVE_HUNDRED) // 1
Character.getNumericValue(NumberForms.ROMAN_NUMERAL_FIVE_HUNDRED) // 500
Character.charCount(AegeanNumbers.NUMBER_EIGHT) // 2
Character.getNumericValue(AegeanNumbers.NUMBER_EIGHT) // 8
(And yes, I know, interfaces are a poor place for constants. They should only be used to model a behaviour of a class. See the AvoidConstantsInterface rule. But I was young and needed the money... ;-)
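If you want to try the calls above without the jar, here is a self-contained sketch. The nested interfaces are stand-ins written for this example, not the generated UCC sources: they only mirror the three constants used above, with a char for the BMP characters and an int for the Aegean number, as described in the text. The class UccSketch and the helper describe are assumptions too.

```java
public class UccSketch {
    // Stand-ins for the generated block interfaces (assumed layout, not the real UCC code).
    public interface BasicLatin {
        char DIGIT_NINE = '9'; // U+0039
    }

    public interface NumberForms {
        char ROMAN_NUMERAL_FIVE_HUNDRED = '\u216E';
    }

    public interface AegeanNumbers {
        int NUMBER_EIGHT = 0x1010E; // does not fit into a char, so it is an int
    }

    // Hypothetical helper: summarises a code point using the Character API (Java 1.5+).
    public static String describe(int codePoint) {
        return String.format("U+%04X chars=%d value=%d",
                codePoint,
                Character.charCount(codePoint),
                Character.getNumericValue(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(BasicLatin.DIGIT_NINE));
        System.out.println(describe(NumberForms.ROMAN_NUMERAL_FIVE_HUNDRED));
        System.out.println(describe(AegeanNumbers.NUMBER_EIGHT));
    }
}
```

The char constants are widened to int when passed to describe, so one method covers both kinds of constant.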

Download and Installation
Download UCC 1.00 (330 KB), together with source. Extract the ucc-*.zip and add ucc.jar to your classpath. UCC is JDK 1.1 compliant and does not depend on any other libraries. To use the characters from U+10000 upwards, which do not fit into a char and are defined as int code points, you need Java 1.5 or newer. UCC is Open Source under the GPL.
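Why Java 1.5 for those characters? An int code point from U+10000 upwards has to be turned into a surrogate pair before it can go into a String, and the methods for that arrived with Java 1.5. A small sketch, using a local constant instead of the UCC one so that it runs without the jar (the class SupplementaryDemo and the helper asString are made up for this example):

```java
public class SupplementaryDemo {
    // Local copy of the code point U+1010E (AEGEAN NUMBER EIGHT), for illustration only.
    static final int AEGEAN_NUMBER_EIGHT = 0x1010E;

    // Converts any code point into a String; supplementary ones become two chars.
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = asString(AEGEAN_NUMBER_EIGHT);
        System.out.println(s.length());                            // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));       // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1010e
    }
}
```

So length() counts chars, not characters. Use the codePoint methods of String and Character whenever such constants are involved.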

2 comments:

Olivier said...

Great and useful work for JUnit tests. Thanks a lot.
Did you ever think of creating an artifact on mvnrepository.com? Not that there should be numerous additional versions, but to make integration in line with current practices?

Peter Kofler said...

Thank you, Olivier.

Getting it into mvnrepository.com is some work, esp. for such small open source projects.

But I quickly released it to my own Maven repository. Just add the repository to your pom. See the project's generated site for more information.