Saturday, January 18, 2014

Using the Ispell dictionary in PostgreSQL Text Search

You can use Ispell-compatible dictionaries in Postgres Text Search to get the infinitive form of words. For example, it will convert "tallest" to "tall". It doesn't do spell-checking.

This example uses the Canadian English dictionary, but you can try it with others as well. These steps are for Windows.

Outline

  1. Combine en_CA.dic with en_CA.dic_delta, sort them, and save them as en_ca.dict with UTF-8 encoding to the right folder
  2. Save en_CA.aff as en_ca.affix to the right folder
  3. Create the Ispell dictionary in Postgres

Steps

  1. Open http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_CA.dic
  2. Select all of the text, copy it, and paste it to Text Mechanic: http://textmechanic.com/Sort-Text-Lines.html. Add a line break at the end
  3. Open http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_CA.dic_delta
  4. Select all of the text, copy it, and paste it below the previously pasted text in Text Mechanic.
  5. Scroll to the top, and get rid of the first line. It should be a 5-digit number
  6. Click the Alphabetical button, and wait for the text to sort
  7. Select all of the text and copy it to the clipboard
  8. Open Windows Notepad as an administrator
  9. Paste the text from Step 7 into Notepad
  10. Save the file as en_ca.dict (with UTF-8 encoding) to your Postgres text search folder. Mine is C:\Program Files\PostgreSQL\9.3\share\tsearch_data .
  11. Open http://src.chromium.org/svn/trunk/deps/third_party/hunspell_dictionaries/en_CA.aff, and copy it to Notepad. Save the file as en_ca.affix to your Postgres text search folder.
In PgAdmin, run the following SQL:
create text search dictionary ispell_en_ca (
  template = ispell,
  dictfile = en_ca,    --uses en_ca.dict
  afffile = en_ca,     --uses en_ca.affix
  stopwords = english  --uses english.stop
);

--make sure it works:
select * from ts_lexize('ispell_en_ca', 'colours');

/*
result:
ts_lexize
text[]
{coloured,colour}
*/

You will need to create a new text search configuration to use the dictionary.

No comments: