Sunday, August 3, 2014

Norvig-Style Spelling Correction in Postgres. It's reasonably fast.

I wanted to do some data matching but the input data had errors.

I decided to try Peter Norvig's spelling corrector directly in the database.

I got pretty good results. The same accuracy as Norvig, and only marginally slower (13.5 Hz vs 17 Hz). It started out about 10x slower, but I was able to speed it up after a few hours of tweaking.

Response time for a single correctly spelled word is < 10ms. For a word with edit distance 1, the response time is also < 10ms. For edit distance 2, the response time is around 100ms.

I used PL/V8 for most of it, as it seems to be the fastest of the PL languages, for calculations at least.

Here's the code:


This function finds all words with an edit distance of 1 of the target word. There will be 54*length+25 words. For example, if the target word is "united" (length 6), then there will be 349 words returned. Some will be nonsense words, so we match it against our corpus later.

/* inspired by http://blog.astithas.com/2009/08/spell-checking-in-javascript.html */

CREATE OR REPLACE FUNCTION edits1(text) RETURNS text[] LANGUAGE plv8 IMMUTABLE STRICT AS '
 var i, results = [], letters = "abcdefghijklmnopqrstuvwxyz".split("");

 // deletion
 for (i=0; i < $1.length; i++)
     results.push($1.slice(0, i) + $1.slice(i+1));

 // transposition
 for (i=0; i < $1.length-1; i++)
     results.push($1.slice(0, i) + $1.slice(i+1, i+2) + $1.slice(i, i+1) + $1.slice(i+2));
  
 // alteration
 for (i=0; i < $1.length; i++)
     letters.forEach(function (l) {
         results.push($1.slice(0, i) + l + $1.slice(i+1));
 });

 // insertion
 for (i=0; i <= $1.length; i++)
     letters.forEach(function (l) {
         results.push($1.slice(0, i) + l + $1.slice(i));
 });

 return results;
';


This one finds all the words with an edit distance of 2. We find the 2-distance words of all the 1-distance words, so there are about 100,000 unique words returned. Most are nonsense, so we join them against the corpus later.

CREATE OR REPLACE FUNCTION edits2(text[]) RETURNS text[] LANGUAGE plv8 IMMUTABLE STRICT AS '
 
 var edits1 = plv8.find_function("edits1");

 var words = {}; //using object keys avoids duplicates, dropping edits2 from ~200k to 100k

 for(var i = 0; i < $1.length; i++) {

  var e2 = edits1($1[i]); 

  for(var j = 0; j < e2.length; j++) {
   words[e2[j]] = null;
  }
 }

 return Object.keys(words);

';


And this one does the actual correcting. It has to be volatile due to the temp table creation. Create the temp table outside the function and you can mark it as stable, which will let Postgres cache results in the same transaction.

"big" is a table with unique terms and their counts, based on the large corpus here http://norvig.com/big.txt (6M chars, ~30,000 unique words).

CREATE OR REPLACE FUNCTION correct(text) RETURNS text LANGUAGE plpgsql VOLATILE STRICT AS $$
declare
 tmp text;
 e1 text[];
begin

  --if the word is in the corpus, return it:
  if exists( select 1 from big where word = $1 ) then 
    return $1; 
  end if;

  --get all the 1-distance words:
  e1 = edits1($1);
 
  select 
    word 
  from big 
  where 
    word = any ( select unnest( e1 ) ) 
  order by 
    nentry desc 
  limit 1 
  into tmp;
 
  --if there are 1-distance words that match, return the most common one:
  if found then 
    return tmp; 
  end if;

  --using a session temp table is much faster than comparing to an array
  create temp table if not exists edits2temp ( word text );
  truncate table edits2temp;

  --get all the 2-distance edits (~100k of them) and put in temp table:
  insert into edits2temp select unnest( edits2(e1) );

  select
    big.word
  from 
    big
    inner join edits2temp edits on big.word = edits.word
  order by big.nentry desc
  limit 1
  into tmp;

  --if there are 2-distance words that match, return the most common one:
  if found then 
    return tmp; 
  end if;

  truncate table edits2temp; 

  --nothing found, return the original word:
 return $1;

end $$;
  

1 comment:

aliyaa said...

Thank you for every other magnificent article. Where else may just anyone get that kind of information in such a perfect manner of writing? I have a presentation next week, and I'm at the search for such information. The sentence corrector generator is online site.