
A few days ago Ian Barber wrote an article about the automated spelling correction.
Today I had the time to read it. Good quality and great presentation as always. However, the first thing I noticed is that the solution presented by Ian was not making use of the Soundex algorithm at all, which seemed slightly strange to me according to my experience. So, I have quickly refined that solution using the standard soundex() PHP function.
What is Soundex?
Although I linked the Soundex page on Wikipedia above, I would like to quote for you the summary explanation of what Soundex does.
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation (called soundex key) so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter.
Let’s use it
NOTE: To understand the following code, don’t forget to read Ian’s article first.
The new train() function
If you compare this function to Ian’s, you will notice that my dictionary is now indexed by soundex keys. For each key, we have a list of words and their frequency in the dictionary.
The new correct() function
What are the differences between this and the original function from Ian? Let’s see.
The first one is related, again, to the use of Soundex. The function takes into account only the words with the same soundex key as the input word in order to create a first set of candidates at the correction. This way, we can neglect all those words that might have a relevant Levenshtein distance but are very likely to be a wrong correction anyway, because they have a different soundex key.
Then, once we have our set of potential candidates, we look for words with a relevant Levenshtein distance and weigh their frequency with their distance value. The words we haven’t found a relevant Levenshtein distance for are removed from the set.
Finally, this set of candidates is reversely ordered by the weighed frequency and the first one is chosen as best correction.
If you now run the very same test Ian has written in his article, you will get an accuracy of 83%, noticeably better than the 71% achieved by the Norvig’s solution presented by Ian.
Further improvements
Both Ian and myself have relied on a limited dictionary and have not performed any kind of text preprocessing (such as stemming, etc.). With a larger dictionary and all the necessary preprocessing, I am confident to say you can expect this solution to perform even better.



Inspiring:
http://ao2.it/en/blog/2009/11/06/ironing-delicious
Depending on the frequency of application of this solution, the usage of Memcaching here and there could speed up stuff.
For example, saving and memcaching results so that next time the computation is O(1).
Even though, I didn’t do a Space analysis to see if saving the results is actually feasible.
Well, performance was actually out of scope in this article. But yes, it could be worth it using caching strategies, especially if your name is Google.
However, the computational complexity of this solution is not a real problem. The space complexity (for storing the dictionary) is.
Any algo of soundex for the italian language ?
@Luca: it’s possible to adapt the Soundex algorithm changing the weights for the consonants. I have no reference at hand right now. I’ll get back to you.
Incidentally, performance is about 500 times improved with this algorithm over the pure levenshtein version.
I fail to see how this works. Soundex assumes you spell a word as it sounds. While this would improve words where someone mixes up a vowel or swaps a q for a k, etc.,
… it would totally destroy matching for a letter transposition, and that’s one of the most common type of mistakes.
bkaer vs. baker, etc.
I think something that has the same soundex and a low edit distance is a better match, but I do not think you can toss candidates based on it.
Oops. My example was poor: try caress v. acress
We are talking about an algorithm for automatic spelling correction. Acress IS NOT the mispelling of caress. It is a typo, if you meant to write caress. In fact, if you type acress, this will very likely be intended as the mispelling of actress.
That’s a pretty big assumption right there. In general, your approach will break on many letter transpositons, though — and all words where the first letter changes.
Acress is a letter transposition. Swap a for c. It *could* be actress, but it’s just as likely it’s caress. In fact, you’d probably have to use word bigrams to solve that one.
No big assumption. Letter transposition is not spelling mistake. Different problem, different algorithm. As a matter of fact, even Google suggests «actress» as correction for acress. Quite obviously, I’d say.
So, like I said, you could tackle the transposition problem with an additional algorithm, however this leads you to another problem: which of two algorithms should we apply first? How do I know it was a typo and not a genuine mispelling? You might find some tradeoff, but there isn’t going to be any exact approach.
Generally, no one cares about transposition.
I won’t argue with you, but you’re just plain wrong IMO. I’m sure anyone who wants to write a credible spell checker would have to consider both transposition, other typos, as well as “sounds-like” errors. And, worse, you can’t consider each one in a vacuum. That’s silly. See the “Editex” algorithm.
They’re all spelling mistakes. Semantics aside, nobody cares how it happened. There are numerous gray areas esp. when you consider doubled letters, etc. Edit distance and typos are the same thing to me.
I don’t think we are understanding each other. I might be wrong saying that no one cares about transpositions (I don’t, personally and that always panned out). But the point stands: transposition *is not* a spelling mistake (I am not talking from an end-user point of view, here). And it requires to be specifically considered. Which is what Editex does.
All I’m saying is, if you have an eCommerce site:
User types “cameras,” “camras,” “camrea,” “kamera,” etc. He doesn’t care if he typod or spelt it wrong. He wants cameras.
I think other web-site proprietors should take this site as an model, very clean and excellent user genial style and design, let alone the content. You are an expert in this topic!
I’ve just found this, and love the idea, I have just written a system(entering testing tomorrow) that stores client details, I think it would be a great idea when updating the client name to store a soundex of the first and last names.
When entering a new name I can use these to match against the possible name matches as efficiently as possible, and I can combine this with the transpose comments, and match the soundex profile for different first characters(most likely to cause problems).
should be
if (array_key_exists($word, $dic[soundex($word)])) {
return $word;
}
Then Jaimie you just need to deal with transpositions also, and not just mispelling. Two different problems, two different techniques to tackle them and you implement them both if you want the behaviour you are describing. That’s it. Too much fuss over a simple matter.
At most, the retailer could supply you with a brand new vacuum, most
almost certainly the identical brand and model as the a person you’re preparing to dispose of. You are able to spend some time with the trainers on the standard basis for modifications in skilled coaching of incoming product sales.
Beach photography is not only for the kids but also
an excellent setting for wedding photography.
Prefolds are soft cotton diapers that should be folded in thirds to fit inside
the cover.