Code Cop: Unicode

6 May 2010

Umlaut Fail

I enjoy reading, and my commute gives me time to do a lot of it. I save web pages to my phone, print articles or carry magazines around. Yesterday I read an issue of IEEE Computer (March 2009). It was quite good, but I spotted a mistake on the very first page.

Advertisement for Thinking on the Web: Berners-Lee, Gödel and Turing

Well, who is Gdel supposed to be? Come on, IEEE, who is supposed to get this encoding stuff right if you guys can't? ;-)
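How the ö got lost in the ad is anyone's guess. One plausible route, and this is purely an assumption, is a conversion step to an ASCII-only encoding that silently drops every character it cannot map. A minimal Java sketch of that failure mode (the class and method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from any real publishing pipeline.
final class UmlautFail {

    /** Encodes text as US-ASCII and silently drops everything that does not fit. */
    static String toAsciiDroppingRest(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but encode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+00F6 is the o with diaeresis; it has no ASCII counterpart.
        System.out.println(toAsciiDroppingRest("G\u00F6del")); // Gdel
    }
}
```

Choosing CodingErrorAction.REPORT instead would have made the conversion fail loudly, which is usually what you want.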

18 August 2007

Java Unicode Constants

Java has had strong Unicode support from the very beginning. That's nice and is supposed to save us some headaches with encodings and code pages, as well as allowing us to write real i18n applications (and use fancy symbols). So let's imagine you are working on your revolutionary new application, which does some symbolic computations, and you need to display an arrow. Maybe you know that it's just '\u2192', or you found it in the tables of the Unicode Database.

However, by putting it into your code you introduce a 'magic' character code. Magic numbers are a coding flaw and should not occur in your code. They need to be defined in one place under a reasonable name. So you end up defining all kinds of Unicode letters and symbols you need.
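A minimal sketch of what such a hand-written definition might look like. The class and method names are hypothetical and not part of any library; only the code point U+2192 and its official name RIGHTWARDS ARROW come from the Unicode Database.

```java
// Hypothetical holder class for the symbols an application needs.
final class Symbols {

    /** U+2192 RIGHTWARDS ARROW, named after its official Unicode name. */
    static final char RIGHTWARDS_ARROW = '\u2192';

    private Symbols() {
        // constants only, no instances
    }

    /** Renders a rewrite step, i.e. both terms separated by the arrow. */
    static String step(String from, String to) {
        return from + ' ' + RIGHTWARDS_ARROW + ' ' + to;
    }

    public static void main(String[] args) {
        System.out.println(step("x", "y"));
    }
}
```

The escape now appears exactly once, and every use site reads RIGHTWARDS_ARROW instead of a bare code.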

Instead you might want to use these Java UniCode Constants (UCC). The constants were generated by a small Ruby script directly from the textual representation of the Unicode Database. For every character there is a constant with its official name and the corresponding char or int value. All characters of the Unicode version 4.2.0 up to U+1FFFF are covered except CJK Ideographs. For each Unicode block, e.g. Basic Latin (U+0000..U+007F) or Aegean Numbers (U+10100..U+1013F), there is a separate interface named after the block that defines all code points of that block. First you need to import the blocks, e.g. import unicode.AegeanNumbers. Then you can use the constants in your code like this:

System.out.println(Character.charCount(BasicLatin.DIGIT_NINE)); // 1
System.out.println(Character.getNumericValue(BasicLatin.DIGIT_NINE)); // 9
System.out.println(Character.charCount(NumberForms.ROMAN_NUMERAL_FIVE_HUNDRED)); // 1
System.out.println(Character.getNumericValue(NumberForms.ROMAN_NUMERAL_FIVE_HUNDRED)); // 500
System.out.println(Character.charCount(AegeanNumbers.NUMBER_EIGHT)); // 2
System.out.println(Character.getNumericValue(AegeanNumbers.NUMBER_EIGHT)); // 8
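If you want to try these calls without ucc.jar on the classpath, the following sketch declares local stand-ins for the three constants. The class name and the describe helper are made up for illustration; the code points are the ones listed in the Unicode charts (U+0039, U+216E and U+1010E), and only methods of java.lang.Character are used.

```java
// Hypothetical demo class; the constants only mimic the UCC ones.
final class NumericValueDemo {

    static final char DIGIT_NINE = '9';                      // U+0039
    static final char ROMAN_NUMERAL_FIVE_HUNDRED = '\u216E'; // fits into one char
    static final int AEGEAN_NUMBER_EIGHT = 0x1010E;          // above U+FFFF, so an int

    /** Reports how many chars a code point needs and its numeric value. */
    static String describe(int codePoint) {
        return Character.charCount(codePoint) + " char(s), value "
                + Character.getNumericValue(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(describe(DIGIT_NINE));
        System.out.println(describe(ROMAN_NUMERAL_FIVE_HUNDRED));
        System.out.println(describe(AEGEAN_NUMBER_EIGHT));
    }
}
```

The int overloads of charCount and getNumericValue accept both kinds of constants, because a char is widened to an int automatically.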

(And yes, I know, interfaces are a poor place for constants. They should only be used to model a behaviour of a class. See the AvoidConstantsInterface rule. But I was young and needed the money... ;-)

Download and Installation
Download UCC 1.00 (330 KB), together with source. Extract the ucc-*.zip and add ucc.jar to your classpath. UCC is JDK 1.1 compliant and does not depend on any other libraries. To use characters at U+10000 and above, the supplementary characters that do not fit into a single char and are handled as int code points, you need Java 1.5 or newer. UCC is Open Source under the GPL.
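The reason for the Java 1.5 requirement is that a supplementary character occupies two chars (a surrogate pair) in a String, and the code point methods to deal with that only arrived in that release. A small sketch, assuming nothing but the JDK; the class and helper names are hypothetical, and the constant is a local stand-in for the UCC one:

```java
// Hypothetical demo class for working with a supplementary code point.
final class SupplementaryDemo {

    static final int AEGEAN_NUMBER_EIGHT = 0x1010E;

    /** Converts a single code point into a String of one or two chars. */
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = asString(AEGEAN_NUMBER_EIGHT);
        System.out.println(s.length());                           // 2
        System.out.println(s.codePointCount(0, s.length()));      // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1010e
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```

Code that iterates with charAt sees the two surrogates separately, so use codePointAt and Character.charCount when supplementary characters may occur.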
