Hangul spellcheck for Vim

Power Overwhelming 2024-11-01

There’s got to be a better way to do this. Someone please enlighten me.

Modern Korean is written in 한글 (Hangul), which uses a syllabic alphabet. It includes spaces between words, unlike Chinese or Japanese, which means that it’s possible to have meaningful spellchecking.

So of course one day I decided I wanted to configure Vim to support spellchecking Hangul. Unfortunately, there’s no file ko.utf-8.spl at ftp.vim.org, and in a cursory search I couldn’t find one elsewhere.

On the other hand, the hunspell tool does have a Korean dictionary, and there’s a PKGBUILD provided for Arch Linux, so by running pikaur -S hunspell-ko I was able to obtain the files

/usr/share/hunspell/ko_KR.aff
/usr/share/hunspell/ko_KR.dic.

In theory, if you then run the Vim command

:mkspell /tmp/ko /usr/share/hunspell/ko_KR

then Vim would create the file /tmp/ko.utf-8.spl, which you can then place into ~/.vim/spell and get spellchecking.

Unfortunately, theory is not the same as practice, by which I mean that the process threw a bunch of warnings and the resulting file totally didn’t work: every word was being marked as misspelled.

So I spent a bit of time trying to debug it, and being like, “how come these two words that look exactly the same are showing up as different?”

It turns out there are actually (at least) two different ways to encode Hangul in Unicode, namely NFD (decomposed) and NFC (composed). The input program I’m using produces NFC text, where each Hangul syllable block is a single code point; but the spellcheck file /usr/share/hunspell/ko_KR.dic instead has entries in NFD, where each atom (jamo) within the syllable block is its own code point.
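To see the difference concretely, here is a minimal Python sketch using only the standard library's unicodedata module. The helper name code_points is just something made up for illustration.

```python
import unicodedata

syllable = "한"

# Normalize explicitly so the result does not depend on how this file was typed.
nfc = unicodedata.normalize("NFC", syllable)
nfd = unicodedata.normalize("NFD", syllable)


def code_points(text):
    """Return the code points of text as U+XXXX strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(nfc))  # a single precomposed syllable block
print(code_points(nfd))  # leading consonant, vowel, trailing consonant
print(nfc == nfd)        # False, even though both render as the same glyph
```

This is why two words that look identical on screen can fail a byte-for-byte comparison against the dictionary.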

That diff is spooky.

Actually, NFD makes sense for a spellchecker, because you’d want something like 한 (Romanized han) and 항 (Romanized hang) to have an edit distance of 1/3 rather than 1. Then, in order to deal with NFC inputs, the /usr/share/hunspell/ko_KR.aff provides many ICONV and OCONV directives that tell the spellchecker how to convert from NFC input into NFD and then back. So hunspell works well.
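The "1/3 rather than 1" claim can be checked with a small sketch. It simply counts how many code points differ between 한 and 항 under each normalization form; the helper names units and mismatches are hypothetical, and this is only an illustration, not how hunspell measures distance internally.

```python
import unicodedata


def units(word, form):
    """Return the code points of word under the given normalization form."""
    return list(unicodedata.normalize(form, word))


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


han, hang = "한", "항"

for form in ("NFC", "NFD"):
    a, b = units(han, form), units(hang, form)
    print(f"{form}: {mismatches(a, b)} of {len(a)} code points differ")
```

In NFC the two syllables are entirely different code points, while in NFD only the final consonant differs.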

The problem is that Vim’s :mkspell command apparently doesn’t actually support ICONV and OCONV. In order to force it to work, I ended up just writing a Python script that stripped all the unsupported commands from ko_KR.aff, and converted ko_KR.dic into NFC format.

import unicodedata

UNSUPPORTED_WORDS = (
    "LANG",
    "WORDCHARS",
    "ICONV",
    "OCONV",
    "AF",
    "MAXCPDSUGS",
    "MAXNGRAMSUGS",
    "MAXDIFF",
    "COMPOUNDMORESUFFIXES",
)

# Make the aff file but take out things unsupported by Vim
with open("/usr/share/hunspell/ko_KR.aff", "r", encoding="utf-8") as infile, open(
    "ko_KR.aff", "w", encoding="utf-8"
) as outfile:
    for line in infile:
        if not any(line.startswith(word) for word in UNSUPPORTED_WORDS):
            print(line.strip(), file=outfile)

# Make the dic file but re-normalize it to NFC
with open("/usr/share/hunspell/ko_KR.dic", "r", encoding="utf-8") as infile:
    content = infile.read()
content = unicodedata.normalize("NFC", content)
with open("ko_KR.dic", "w", encoding="utf-8") as outfile:
    print(content, file=outfile)

After storing these mutilated files in ~/dotfiles/vim/spell/korean-setup/ko_KR and running :mkspell /tmp/ko ~/dotfiles/vim/spell/korean-setup/ko_KR, the resulting /tmp/ko.utf-8.spl can now check for spelling errors (at least if the words are in NFC format).
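Before feeding a converted dictionary to :mkspell, it may be worth confirming that nothing decomposed slipped through. The sketch below is one way to do that, assuming Python 3.8 or later (for unicodedata.is_normalized); the function name non_nfc_lines and the in-memory sample are made up for illustration.

```python
import unicodedata


def non_nfc_lines(lines):
    """Yield (line_number, text) for each line that is not NFC-normalized."""
    for number, line in enumerate(lines, start=1):
        if not unicodedata.is_normalized("NFC", line):
            yield number, line.rstrip("\n")


# In-memory stand-in for a dictionary file: a word count, then one word per line.
sample = [
    "2\n",
    unicodedata.normalize("NFC", "한글") + "\n",
    unicodedata.normalize("NFD", "한글") + "\n",  # deliberately left decomposed
]

bad_line_numbers = [number for number, _ in non_nfc_lines(sample)]
print(bad_line_numbers)
```

For a real file, the same function can be given an open file object, for example open("ko_KR.dic", encoding="utf-8"), since iterating over a file yields its lines.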

It’s working. Sorta.

A summary of this process is posted on my dotfiles GitHub. To actually use this, just download the ko.utf-8.spl file directly; there’s no need to repeat the steps.

The issue is that because the spellcheck dictionary is using NFC now, while it can highlight misspelled words in red, the “suggestions” provided aren’t particularly good. If you look at the suggestions for the spellcheck for the typo’d word 항글, the top 10 are:

"한글", "고글", "궁글", "담글", "답글", "댓글", "덧글", "동글", "둥글", "빙글"

The problem is that because NFC encodes the characters by block, any change to the first syllable, however small, replaces the entire block and so causes an equally bad edit distance. So while I’ve managed to get quick highlighting of mistakes, the autocorrection of those mistakes isn’t really there.
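To make the edit-distance point concrete, here is a rough sketch that scores the ten suggestions above against 항글 using plain Levenshtein distance, once on composed (NFC) strings and once on decomposed (NFD) strings. This is only an illustration: Vim's actual suggestion scoring is more involved, and the helper names levenshtein and distance are my own.

```python
import unicodedata


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def distance(a, b, form):
    """Edit distance after normalizing both words to the given Unicode form."""
    return levenshtein(unicodedata.normalize(form, a),
                       unicodedata.normalize(form, b))


typo = "항글"
candidates = ["한글", "고글", "궁글", "담글", "답글",
              "댓글", "덧글", "동글", "둥글", "빙글"]

for word in candidates:
    print(word, distance(typo, word, "NFC"), distance(typo, word, "NFD"))

best_nfd = min(candidates, key=lambda w: distance(typo, w, "NFD"))
print("closest by NFD distance:", best_nfd)
```

On composed strings every candidate is exactly one substitution away, so the distance cannot separate them; on decomposed strings only 한글 is a single jamo away from the typo. Re-ranking Vim's suggestions by a jamo-level distance like this might be one workaround, though I haven't tried wiring it in.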

I feel like this whole process was a convoluted hack. Is there a better way to do this?