I decided to try Peter Norvig's spelling corrector directly in the database.
I got pretty good results: the same accuracy as Norvig's Python version, and only marginally slower (13.5 Hz vs. 17 Hz). It started out about 10x slower, but a few hours of tweaking closed most of the gap.
Response time for a single correctly spelled word is < 10ms. For a word with edit distance 1, the response time is also < 10ms. For edit distance 2, the response time is around 100ms.
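If you want to reproduce those timings, psql's \timing meta-command works well. A sketch, using the correct() function defined below (the misspellings are just illustrative and assume "united" is in the corpus):

\timing on
select correct('united');    -- already in the corpus: should return immediately
select correct('unitedd');   -- one edit away
select correct('uniteddd');  -- two edits away, exercises the slower edits2 path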
I used PL/V8 for most of it, as it seems to be the fastest of the PL languages, for calculations at least.
Here's the code:
This function finds all strings within an edit distance of 1 of the target word: n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for 54*length+25 candidates in total. For example, if the target word is "united" (length 6), then 349 strings are returned. Some will be nonsense words, so we match them against our corpus later.
/* inspired by http://blog.astithas.com/2009/08/spell-checking-in-javascript.html */
CREATE OR REPLACE FUNCTION edits1(text) RETURNS text[]
LANGUAGE plv8 IMMUTABLE STRICT AS '
  var i, results = [],
      letters = "abcdefghijklmnopqrstuvwxyz".split("");

  // deletion
  for (i=0; i < $1.length; i++)
    results.push($1.slice(0, i) + $1.slice(i+1));

  // transposition
  for (i=0; i < $1.length-1; i++)
    results.push($1.slice(0, i) + $1.slice(i+1, i+2) + $1.slice(i, i+1) + $1.slice(i+2));

  // alteration
  for (i=0; i < $1.length; i++)
    letters.forEach(function (l) {
      results.push($1.slice(0, i) + l + $1.slice(i+1));
    });

  // insertion
  for (i=0; i <= $1.length; i++)
    letters.forEach(function (l) {
      results.push($1.slice(0, i) + l + $1.slice(i));
    });

  return results;
';
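A quick sanity check of the 54*length+25 count, using the same example word:

select array_length( edits1('united'), 1 );  -- 349 = 54*6 + 25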
This one finds all the strings with an edit distance of 2, by applying edits1 to each of the 1-distance strings, so about 100,000 unique strings are returned. Most are nonsense, so we join them against the corpus later.
CREATE OR REPLACE FUNCTION edits2(text[]) RETURNS text[]
LANGUAGE plv8 IMMUTABLE STRICT AS '
  var edits1 = plv8.find_function("edits1");
  var words = {};  // using object keys avoids duplicates, dropping edits2 from ~200k to ~100k

  for (var i = 0; i < $1.length; i++) {
    var e2 = edits1($1[i]);
    for (var j = 0; j < e2.length; j++) {
      words[e2[j]] = null;
    }
  }

  return Object.keys(words);
';
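Since edits2() takes the array produced by edits1(), a quick check of the candidate count looks like this (the exact number varies by word):

select array_length( edits2( edits1('united') ), 1 );  -- on the order of 100k unique strings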
And this one does the actual correcting. It has to be marked volatile because it creates a temp table. Create the temp table outside the function and you can mark it stable, which will let Postgres cache results within the same transaction.
"big" is a table with unique terms and their counts, based on the large corpus here http://norvig.com/big.txt (6M chars, ~30,000 unique words).
CREATE OR REPLACE FUNCTION correct(text) RETURNS text
LANGUAGE plpgsql VOLATILE STRICT AS $$
declare
  tmp text;
  e1 text[];
begin
  --if the word is in the corpus, return it:
  if exists( select 1 from big where word = $1 ) then
    return $1;
  end if;

  --get all the 1-distance words:
  e1 = edits1($1);

  select word from big
  where word = any ( select unnest( e1 ) )
  order by nentry desc
  limit 1
  into tmp;

  --if there are 1-distance words that match, return the most common one:
  if found then
    return tmp;
  end if;

  --using a session temp table is much faster than comparing to an array
  create temp table if not exists edits2temp ( word text );
  truncate table edits2temp;

  --get all the 2-distance edits (~100k of them) and put them in the temp table:
  insert into edits2temp select unnest( edits2(e1) );

  select big.word from big
  inner join edits2temp edits on big.word = edits.word
  order by big.nentry desc
  limit 1
  into tmp;

  --if there are 2-distance words that match, return the most common one:
  if found then
    return tmp;
  end if;

  truncate table edits2temp;

  --nothing found, return the original word:
  return $1;
end
$$;
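Once everything is in place, usage is a plain function call. The expected outputs below are taken from Norvig's own examples and assume his corpus:

select correct('speling');    -- spelling  (edit distance 1)
select correct('korrecter');  -- corrector (edit distance 2)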