Wednesday, September 10, 2014

Eine Kleine Porter Stemmer

One of the things I have been looking at recently is automatic text summarization and extraction. This set me off down the massive rabbit hole that is natural language processing or NLP for short. NLP is a vast subject dealing with taking language the way we humans actually speak it and trying to work with it in a way that computers can understand. One of the first questions you can ask is whether two sentences are talking about the same topic. Sometimes, this is very easy:

  1. Bill went fishing Yesterday.
  2. Bill is going fishing Today.

Obviously, those two sentences are about Bill fishing. But if we take the next two sentences, we run into the problem that is the topic of this post.

  1. I feel happy.
  2. I don't know what happiness is.

A human would definitely say that these two sentences are both about happiness, but happy and happiness are not the same word so the computer would be confused. In any natural language, a word can have multiple forms, e.g. plurals of nouns, tenses of verbs, etc. English isn't the most regular of languages, but must of these words come from appending suffixes to the base word following some general rules. The process of "pruning" off these suffixes to get at the base is known as stemming. One of the original, yet still popular, stemming algorithms as developed by Martin Porter, so is known as the Porter stemmer.

Porter published the details of his algorithm back in 1980: Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137 and maintains an excellent reference website covering the original stemming algorithm and enhancements, as well as extensions of the algorithm to many other languages (e.g. French, Spanish, etc.). The original algorithm is fairly simple, consisting of just five steps of stripping/replacement rules. I won't go over all the steps here, but just to give you a taste for what the rules are here is an example of the first suffix reduction step dealing with plurals:

Step 1a
SSES    ->      SS     |     caresses    ->      caress
IES     ->      I      |     ponies      ->      poni
                       |     ties        ->      ti
SS      ->      SS     |     caress      ->      caress
S       ->             |     cats        ->      cat

In general, the algorithm deals with a replacement pattern on the left that may just be a literal (e.g. SSES) or may contain requirements of the preceding characters as in Step 1c:

Step 1c
(*v*) Y     ->      I     |    happy     ->     happi
                          |    sky       ->     sky

The (*v*) requirement means that preceding the Y there must be at least one vowel. This is denoted by v for a single vowel and the * corresponds to a wildcard (i.e. any or no other letters can surround the vowel, including more vowels). That is why happy is mapped to happi, but sky is not affected by this rule because no vowel precedes the y

On the official Porter stemmer website, he also provides implementation code in several different programming languages, including JavaScript with free license for use. So rather than re-invent the wheel, I grabbed the code to use in my summarization project. Below is a form that will take a word you input and produce the stem via the Porter algorithm. Start typing and give it a go!

Input word:
Output stem:

One thing to note is that the stem itself is not necessarily a word. In the happy/happiness example, both stem to happi. This isn't a problem because the goal of the algorithm wasn't to produce the "base" word, it was to produce a common stem that to computer could recognize between the two forms. In a future post, we'll look at how using the stemmed words can allow us to compare how similar sentences are and then rank them in importance for summarizing a text.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.