Tuesday, January 31, 2012

Kakasi-java: born again

Japanese is written with several scripts: kanji, plus the hiragana and katakana syllabaries. In software, we sometimes need to convert text from one script to another, and that is difficult.

Kakasi and MeCab are open source tools that can convert kanji to hiragana or katakana. For instance, they can transform "国際財務報告基準" to "こくさいざいむほうこくきじゅん" or even to the romanized "kokusaizaimuhoukokukijun". In other words, they transform logograms (symbols with multiple possible readings) into syllabic characters.
That is very tricky, because for instance "経緯" can be read "keii", but also "ikisatsu", depending on the context or the speaker. Kakasi sometimes gets it wrong, but it usually does reasonably well. MeCab is actually better at that.
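
To make the difficulty concrete, here is a minimal sketch of a dictionary-based conversion using greedy longest-match lookup. It is only an illustration under stated assumptions: the class name ToyKanjiConverter and its five-entry dictionary are hypothetical, and this is not the actual algorithm of Kakasi or MeCab. Note that each word maps to a single reading, which is exactly why a naive approach cannot tell "keii" from "ikisatsu".

```java
import java.util.HashMap;
import java.util.Map;

/** Toy kanji-to-hiragana converter (hypothetical; not Kakasi's or MeCab's algorithm). */
class ToyKanjiConverter {
    // Hypothetical mini-dictionary: each word has exactly one reading.
    private static final Map<String, String> READINGS = new HashMap<>();
    private static int maxWordLength = 0;

    private static void add(String word, String reading) {
        READINGS.put(word, reading);
        maxWordLength = Math.max(maxWordLength, word.length());
    }

    static {
        add("国際", "こくさい");
        add("財務", "ざいむ");
        add("報告", "ほうこく");
        add("基準", "きじゅん");
        add("経緯", "けいい"); // could also be "いきさつ"; this toy cannot choose by context
    }

    /** Replaces every dictionary word by its reading, trying the longest candidate first. */
    static String toHiragana(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            boolean matched = false;
            int longest = Math.min(maxWordLength, text.length() - i);
            for (int len = longest; len >= 1; len--) {
                String reading = READINGS.get(text.substring(i, i + len));
                if (reading != null) {
                    out.append(reading);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Unknown character (kana, punctuation, unlisted kanji): copy as-is.
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHiragana("国際財務報告基準"));
    }
}
```

A real converter needs a much larger dictionary and some way to pick between competing readings, which is where the tools differ in quality.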

Yesterday I decided to add a "furigana" feature to my Android flashcards app. Furigana are small kana printed next to kanji to show their reading. They help people read difficult kanji and are used a lot in mass media: books, newspapers, signs, advertisements.
Kakasi and MeCab are both conversion tools, but their internal algorithms are very different, leading to different speed/quality/simplicity characteristics. Before rushing to MeCab, I decided to also give Kakasi a try.
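
As a rough idea of what the furigana feature needs once a reading is known, here is a small sketch that wraps a kanji word and its reading in HTML ruby markup. The class and method names are hypothetical, and displaying the result in something that renders HTML ruby is an assumption, not a description of how the app actually does it.

```java
/** Hypothetical helper that builds HTML ruby markup for furigana. */
class FuriganaMarkup {
    /**
     * Returns markup that shows the reading above the base text
     * in renderers that support the HTML ruby element.
     */
    static String ruby(String kanji, String reading) {
        return "<ruby>" + kanji + "<rt>" + reading + "</rt></ruby>";
    }

    public static void main(String[] args) {
        System.out.println(ruby("経緯", "いきさつ"));
    }
}
```

The reading passed in is whatever the conversion tool produced, so a wrong reading from the converter shows up directly as wrong furigana.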

Unfortunately, Kakasi is written in C, and thus not easy to run on Android. Porting it from C to Java would be possible, but before doing that I had to make sure nobody had ported it already. After multiple searches, I finally found a tar file on the blog of Kenichi Maehashi, with the note "現在どこからも入手できないようです" ("it seems it currently cannot be obtained from anywhere"). In other words: Kakasi-java could no longer be found on the Internet, so he uploaded the 0.4 version he miraculously found in his backups.

To make improvements and fixes possible, I took the source, compiled and tested it, wrote a short README file, and created a project for it on GitHub. Code contributions are welcome :-)

The ideal would be a pure-Java port of MeCab, but none seems to exist. MeCab has a Java binding, but it is not 100% Java: it requires JNI calls, which is not a great idea on Android.
Nicolas Raoul
