Spelling correction with Soundex

A few days ago Ian Barber wrote an article about the automated spelling correction.

Today I had the time to read it. Good quality and great presentation as always. However, the first thing I noticed is that the solution presented by Ian was not making use of the Soundex algorithm at all, which seemed slightly strange to me according to my experience. So, I have quickly refined that solution using the standard soundex() PHP function.

What is Soundex?

Although I linked the Soundex page on Wikipedia above, I would like to quote for you the summary explanation of what Soundex does.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation (called soundex key) so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter.

Let’s use it

NOTE: To understand the following code, don’t forget to read Ian’s article first.

The new train() function

If you compare this function to Ian’s, you will notice that my dictionary is now indexed by soundex keys. For each key, we have a list of words and their frequency in the dictionary.

The new correct() function

What are the differences between this and the original function from Ian? Let’s see.
The first one is related, again, to the use of Soundex. The function takes into account only the words with the same soundex key as the input word in order to create a first set of candidates at the correction. This way, we can neglect all those words that might have a relevant Levenshtein distance but are very likely to be a wrong correction anyway, because they have a different soundex key.
Then, once we have our set of potential candidates, we look for words with a relevant Levenshtein distance and weigh their frequency with their distance value. The words we haven’t found a relevant Levenshtein distance for are removed from the set.
Finally, this set of candidates is reversely ordered by the weighed frequency and the first one is chosen as best correction.

If you now run the very same test Ian has written in his article, you will get an accuracy of 83%, noticeably better than the 71% achieved by the Norvig’s solution presented by Ian.

Further improvements

Both Ian and myself have relied on a limited dictionary and have not performed any kind of text preprocessing (such as stemming, etc.). With a larger dictionary and all the necessary preprocessing, I am confident to say you can expect this solution to perform even better.

Tags: ,

22 Responses to “Spelling correction with Soundex”

  1. ao2
    6 November 2009 at 11:46 pm #

    Inspiring:
    http://ao2.it/en/blog/2009/11/06/ironing-delicious

  2. Detro
    7 November 2009 at 1:35 pm #

    Depending on the frequency of application of this solution, the usage of Memcaching here and there could speed up stuff.
    For example, saving and memcaching results so that next time the computation is O(1).
    Even though, I didn’t do a Space analysis to see if saving the results is actually feasible.

  3. Vincenzo
    7 November 2009 at 7:45 pm #

    Well, performance was actually out of scope in this article. But yes, it could be worth it using caching strategies, especially if your name is Google.

    However, the computational complexity of this solution is not a real problem. The space complexity (for storing the dictionary) is.

  4. Luca →
    11 November 2009 at 3:56 pm #

    Any algo of soundex for the italian language ?

  5. Vincenzo
    12 November 2009 at 10:36 am #

    @Luca: it’s possible to adapt the Soundex algorithm changing the weights for the consonants. I have no reference at hand right now. I’ll get back to you.

  6. Rob Young
    2 December 2009 at 4:18 pm #

    Incidentally, performance is about 500 times improved with this algorithm over the pure levenshtein version.

  7. Jaimie Sirovich
    9 April 2011 at 7:41 am #

    I fail to see how this works. Soundex assumes you spell a word as it sounds. While this would improve words where someone mixes up a vowel or swaps a q for a k, etc.,

    … it would totally destroy matching for a letter transposition, and that’s one of the most common type of mistakes.

    bkaer vs. baker, etc.

    I think something that has the same soundex and a low edit distance is a better match, but I do not think you can toss candidates based on it.

  8. Jaimie Sirovich
    9 April 2011 at 7:45 am #

    Oops. My example was poor: try caress v. acress

  9. Vincenzo Russo
    16 April 2011 at 10:35 am #

    We are talking about an algorithm for automatic spelling correction. Acress IS NOT the mispelling of caress. It is a typo, if you meant to write caress. In fact, if you type acress, this will very likely be intended as the mispelling of actress.

  10. Jaimie Sirovich
    16 April 2011 at 5:46 pm #

    That’s a pretty big assumption right there. In general, your approach will break on many letter transpositons, though — and all words where the first letter changes.

    Acress is a letter transposition. Swap a for c. It *could* be actress, but it’s just as likely it’s caress. In fact, you’d probably have to use word bigrams to solve that one.

  11. Vincenzo
    10 May 2011 at 7:31 am #

    No big assumption. Letter transposition is not spelling mistake. Different problem, different algorithm. As a matter of fact, even Google suggests «actress» as correction for acress. Quite obviously, I’d say.

    So, like I said, you could tackle the transposition problem with an additional algorithm, however this leads you to another problem: which of two algorithms should we apply first? How do I know it was a typo and not a genuine mispelling? You might find some tradeoff, but there isn’t going to be any exact approach.

    Generally, no one cares about transposition.

  12. Jaimie Sirovich
    17 June 2011 at 3:09 am #

    I won’t argue with you, but you’re just plain wrong IMO. I’m sure anyone who wants to write a credible spell checker would have to consider both transposition, other typos, as well as “sounds-like” errors. And, worse, you can’t consider each one in a vacuum. That’s silly. See the “Editex” algorithm.

    They’re all spelling mistakes. Semantics aside, nobody cares how it happened. There are numerous gray areas esp. when you consider doubled letters, etc. Edit distance and typos are the same thing to me.

  13. Vincenzo Russo
    17 June 2011 at 6:45 am #

    I don’t think we are understanding each other. I might be wrong saying that no one cares about transpositions (I don’t, personally and that always panned out). But the point stands: transposition *is not* a spelling mistake (I am not talking from an end-user point of view, here). And it requires to be specifically considered. Which is what Editex does.

  14. Jaimie Sirovich
    17 June 2011 at 5:13 pm #

    All I’m saying is, if you have an eCommerce site:

    User types “cameras,” “camras,” “camrea,” “kamera,” etc. He doesn’t care if he typod or spelt it wrong. He wants cameras.

  15. antique furniture restoration
    19 June 2011 at 10:03 am #

    I think other web-site proprietors should take this site as an model, very clean and excellent user genial style and design, let alone the content. You are an expert in this topic!

  16. scragar →
    3 August 2011 at 7:55 pm #

    I’ve just found this, and love the idea, I have just written a system(entering testing tomorrow) that stores client details, I think it would be a great idea when updating the client name to store a soundex of the first and last names.
    When entering a new name I can use these to match against the possible name matches as efficiently as possible, and I can combine this with the transpose comments, and match the soundex profile for different first characters(most likely to cause problems).

  17. anonymous →
    8 August 2011 at 10:28 pm #

    should be

    if (array_key_exists($word, $dic[soundex($word)])) {
    return $word;
    }

  18. Vincenzo
    3 March 2012 at 3:17 pm #

    Then Jaimie you just need to deal with transpositions also, and not just mispelling. Two different problems, two different techniques to tackle them and you implement them both if you want the behaviour you are describing. That’s it. Too much fuss over a simple matter.

  19. Merri
    21 February 2014 at 8:17 pm #

    I’m not sure exactly why but this web site is loading very slow for me.
    Is anyone else having this problem or is it a problem on
    my end? I’ll check back later on and see if the problem still
    exists.

  20. Best Electric Shaver
    8 April 2014 at 7:03 am #

    Some archaeologists think it was to reduce the budget of
    your remodel. Result – razor burn if you’re not getting underarm hair
    a close shave because your razor isn’t getting close enough
    to the root of the hair can actually penetrate the skin.
    Prevention And TreatmentThe easiest way to do this. With the
    sun recently making an unexpected but very welcome appearance in our
    skies this week, the store will be holding another barber
    event June 18-19. Keep in mind that you will have the right idea.

    my web blog Best Electric Shaver

Trackbacks/Pingbacks

  1. Spelling Correction - PHP/ir - 6 November 2009

    [...] [...]

  2. Spelling correction with Soundex | Artetecha - 9 September 2013

    […] PS: this was originally written on Vincenzo’s personal blog […]