If you have a web site with a search function, you will rapidly realize that most mortals are terrible typists. Many searches contain mispelled words, and users will expect these searches to magically work. This magic is often done using levenshtein distance. In this article, I'll compare two ways of finding the closest matching word in a large dictionary. I'll describe how I use it on rhymebrain.com not for corrections, but to search 2.6 million words for rhymes, for every request, with no caching, on my super-powerful sock-drawer datacenter:
Algorithm #1
The levenshtein function take two words and returns how far apart they are. It's an O(N*M) algorithm, where N is the length of one word, and M is the length of the other. If you want to know how it works, go to this wikipedia page.
But comparing two words at a time isn't useful. Usually you want to find the closest matching words in a whole dictionary, possibly with many thousands of words. Here's a quick python program to do that, using the straightforward, but slow way. It uses the file /usr/share/dict/words. The first argument is the misspelled word, and the second argument is the maximum distance. It will print out all the words with that distance, as well as the time spent actually searching. For example:
smhanov@ubuntu1004:~$ ./method1.py goober 1 ('goober', 0) ('goobers', 1) ('gooier', 1) Search took 4.5575 s
Here's the program:
#!/usr/bin/python #By Steve Hanov, 2011. Released to the public domain import time import sys DICTIONARY = "/usr/share/dict/words"; TARGET = sys.argv[1] MAX_COST = int(sys.argv[2]) # read dictionary file words = open(DICTIONARY, "rt").read().split(); # for brevity, we omit transposing two characters. Only inserts, # removals, and substitutions are considered here. def levenshtein( word1, word2 ): columns = len(word1) + 1 rows = len(word2) + 1 # build first row currentRow = [0] for column in xrange( 1, columns ): currentRow.append( currentRow[column - 1] + 1 ) for row in xrange( 1, rows ): previousRow = currentRow currentRow = [ previousRow[0] + 1 ] for column in xrange( 1, columns ): insertCost = currentRow[column - 1] + 1 deleteCost = previousRow[column] + 1 if word1[column - 1] != word2[row - 1]: replaceCost = previousRow[ column - 1 ] + 1 else: replaceCost = previousRow[ column - 1 ] currentRow.append( min( insertCost, deleteCost, replaceCost ) ) return currentRow[-1] def search( word, maxCost ): results = [] for word in words: cost = levenshtein( TARGET, word ) if cost <= maxCost: results.append( (word, cost) ) return results start = time.time() results = search( TARGET, MAX_COST ) end = time.time() for result in results: print result print "Search took %g s" % (end - start)
Runtime
Improving it
k a t e 0 1 2 3 4 c 1 1 2 3 4 a 2 2 1 2 3 t 3 3 2 1 2
... continue reading