Hangul spellcheck for Vim
Power Overwhelming 2024-11-01
There’s got to be a better way to do this. Someone please enlighten me.
Modern Korean is written in 한글 (Hangul), which uses a syllabic alphabet. It includes spaces between words, unlike Chinese or Japanese, which means that it’s possible to have meaningful spellchecking.
So of course one day I decided I wanted to configure Vim to support spellchecking Hangul. Unfortunately, there’s no file ko.utf-8.spl at ftp.vim.org, and in a cursory search I couldn’t find one.
On the other hand, the hunspell tool does have a Korean dictionary, and there’s a PKGBUILD provided for Arch Linux, so by running pikaur -S hunspell-ko I was able to obtain the files /usr/share/hunspell/ko_KR.aff and /usr/share/hunspell/ko_KR.dic.
In theory, if you then run the Vim command
:mkspell /tmp/ko /usr/share/hunspell/ko_KR
then Vim would create the file /tmp/ko.utf-8.spl, which you can then place into ~/.vim/spell and get spellchecking.
Unfortunately, theory is not the same as practice, by which I mean that the process threw a bunch of warnings and the resulting file totally didn’t work: every word was being marked as misspelled.
So I spent a bit of time trying to debug it, and being like, “how come these two words that look exactly the same are showing up as different?”
It turns out there are (at least) two different ways to encode Hangul in Unicode, namely NFD (decomposed) and NFC (composed). The input program I’m using produces NFC text, where each Hangul syllable block is a single code point; but the spellcheck file /usr/share/hunspell/ko_KR.dic instead has entries in NFD, where each atom (jamo) within the syllable block is its own code point.
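To see the difference concretely, here is a minimal sketch using Python's standard unicodedata module. The sample word is my own choice for illustration; it is not taken from the dictionary file.

```python
import unicodedata

word = "한글"

# Normalize explicitly so the result does not depend on how this file was typed.
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# NFC: one code point per syllable block; NFD: one code point per jamo.
print(len(nfc), [f"U+{ord(c):04X}" for c in nfc])
print(len(nfd), [f"U+{ord(c):04X}" for c in nfd])

# The two strings render identically but compare unequal until normalized.
print(nfc == nfd)
print(nfc == unicodedata.normalize("NFC", nfd))
```

This is exactly the "two words that look the same but aren't" situation: a plain string comparison between the typed word and the dictionary entry fails unless both sides use the same normalization form.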
Actually, NFD makes sense for a spellchecker, because you’d want something like 한 (Romanized han) and 항 (Romanized hang) to have an edit distance of 1/3 (one jamo out of three) rather than 1 (the whole syllable). Then, in order to deal with NFC inputs, /usr/share/hunspell/ko_KR.aff provides many ICONV and OCONV directives that tell the spellchecker how to convert NFC input into NFD and then back. So hunspell works well.
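As a rough illustration of why decomposition helps, the sketch below splits both syllables into jamo and counts how many positions differ. The helper name jamo is hypothetical, and this only mimics the idea behind the conversion; it is not how hunspell itself processes ICONV rules.

```python
import unicodedata

def jamo(text: str) -> list[str]:
    """Return the NFD code points (individual jamo) of a Hangul string."""
    return list(unicodedata.normalize("NFD", text))

han, hang = jamo("한"), jamo("항")

# Both syllables decompose into three jamo; count the positions that differ.
differing = sum(a != b for a, b in zip(han, hang))
print(len(han), len(hang), differing)
```

Only the final consonant differs, so at the jamo level the two syllables are one small edit apart, whereas as composed code points they are simply two different characters.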
The problem is that Vim’s :mkspell command apparently doesn’t actually support ICONV and OCONV. In order to force it to work, I ended up just writing a Python script that stripped all the unsupported commands from ko_KR.aff, and converted ko_KR.dic into NFC format.
import unicodedata

UNSUPPORTED_WORDS = (
    "LANG",
    "WORDCHARS",
    "ICONV",
    "OCONV",
    "AF",
    "MAXCPDSUGS",
    "MAXNGRAMSUGS",
    "MAXDIFF",
    "COMPOUNDMORESUFFIXES",
)

# Make the aff file but take out things unsupported by Vim
with open("/usr/share/hunspell/ko_KR.aff", "r", encoding="utf-8") as infile, open(
    "ko_KR.aff", "w", encoding="utf-8"
) as outfile:
    for line in infile:
        if not any(line.startswith(word) for word in UNSUPPORTED_WORDS):
            print(line.strip(), file=outfile)

# Make the dic file but re-normalize it to NFC
with open("/usr/share/hunspell/ko_KR.dic", "r", encoding="utf-8") as infile:
    content = infile.read()
content = unicodedata.normalize("NFC", content)
with open("ko_KR.dic", "w", encoding="utf-8") as outfile:
    print(content, file=outfile)
After storing these mutilated files into ~/dotfiles/vim/spell/korean-setup/ko_KR and running :mkspell /tmp/ko ~/dotfiles/vim/spell/korean-setup/ko_KR, the resulting /tmp/ko.utf-8.spl can now check for spelling errors (at least if the words are in NFC format).
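Since the rebuilt dictionary only matches NFC text, anything that arrives decomposed would need to be re-composed before it can be checked. A minimal sketch of that step, assuming Python 3.8+ for unicodedata.is_normalized; the helper name to_nfc is hypothetical and not part of my setup:

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in NFC, leaving already-composed text untouched."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

# Simulate decomposed input, then compose it back into syllable blocks.
decomposed = unicodedata.normalize("NFD", "한글")
print(len(decomposed), len(to_nfc(decomposed)))
```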
A summary of this process is posted on my dotfiles GitHub. To actually use this, just download the ko.utf-8.spl file directly; there’s no need to re-follow the steps.
The issue is that because the spellcheck dictionary now uses NFC, it can highlight misspelled words in red, but the “suggestions” it provides aren’t particularly good. For the typo’d word 항글, the top 10 suggestions are:
"한글", "고글", "궁글", "담글", "답글", "댓글", "덧글", "동글", "둥글", "빙글"
The problem is that because NFC encodes the characters by block, any change to the first syllable, however small, causes an equally bad edit distance. So while I’ve managed to get quick highlighting of mistakes, the autocorrection of those mistakes isn’t really there.
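Here is a small sketch of that effect. It uses plain Levenshtein distance over code points as a stand-in; I'm assuming Vim's real suggestion scoring is more involved, so treat this as an illustration of the encoding issue only. The function names are my own.

```python
import unicodedata

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def distance(a: str, b: str, form: str) -> int:
    """Edit distance after normalizing both strings to the given form."""
    return levenshtein(unicodedata.normalize(form, a),
                       unicodedata.normalize(form, b))

typo = "항글"
for candidate in ("한글", "고글"):
    print(candidate,
          distance(typo, candidate, "NFC"),
          distance(typo, candidate, "NFD"))
```

In NFC both candidates are one substitution away from the typo, so nothing distinguishes the obvious fix from an unrelated word. In NFD the intended word stays one jamo away while the unrelated one moves further out, which is the ranking you'd actually want.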
I feel like this whole process was a convoluted hack. Is there a better way to do this?