Wednesday, September 10, 2014

Eine Kleine Porter Stemmer

One of the things I have been looking at recently is automatic text summarization and extraction. This set me off down the massive rabbit hole that is natural language processing or NLP for short. NLP is a vast subject dealing with taking language the way we humans actually speak it and trying to work with it in a way that computers can understand. One of the first questions you can ask is whether two sentences are talking about the same topic. Sometimes, this is very easy:

  1. Bill went fishing Yesterday.
  2. Bill is going fishing Today.

Obviously, those two sentences are about Bill fishing. But if we take the next two sentences, we run into the problem that is the topic of this post.

  1. I feel happy.
  2. I don't know what happiness is.

A human would definitely say that these two sentences are both about happiness, but happy and happiness are not the same word so the computer would be confused. In any natural language, a word can have multiple forms, e.g. plurals of nouns, tenses of verbs, etc. English isn't the most regular of languages, but must of these words come from appending suffixes to the base word following some general rules. The process of "pruning" off these suffixes to get at the base is known as stemming. One of the original, yet still popular, stemming algorithms as developed by Martin Porter, so is known as the Porter stemmer.

Porter published the details of his algorithm back in 1980: Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137 and maintains an excellent reference website covering the original stemming algorithm and enhancements, as well as extensions of the algorithm to many other languages (e.g. French, Spanish, etc.). The original algorithm is fairly simple, consisting of just five steps of stripping/replacement rules. I won't go over all the steps here, but just to give you a taste for what the rules are here is an example of the first suffix reduction step dealing with plurals:

Step 1a
SSES    ->      SS     |     caresses    ->      caress
IES     ->      I      |     ponies      ->      poni
                       |     ties        ->      ti
SS      ->      SS     |     caress      ->      caress
S       ->             |     cats        ->      cat

In general, the algorithm deals with a replacement pattern on the left that may just be a literal (e.g. SSES) or may contain requirements of the preceding characters as in Step 1c:

Step 1c
(*v*) Y     ->      I     |    happy     ->     happi
                          |    sky       ->     sky

The (*v*) requirement means that preceding the Y there must be at least one vowel. This is denoted by v for a single vowel and the * corresponds to a wildcard (i.e. any or no other letters can surround the vowel, including more vowels). That is why happy is mapped to happi, but sky is not affected by this rule because no vowel precedes the y

On the official Porter stemmer website, he also provides implementation code in several different programming languages, including JavaScript with free license for use. So rather than re-invent the wheel, I grabbed the code to use in my summarization project. Below is a form that will take a word you input and produce the stem via the Porter algorithm. Start typing and give it a go!

Input word:
Output stem:

One thing to note is that the stem itself is not necessarily a word. In the happy/happiness example, both stem to happi. This isn't a problem because the goal of the algorithm wasn't to produce the "base" word, it was to produce a common stem that to computer could recognize between the two forms. In a future post, we'll look at how using the stemmed words can allow us to compare how similar sentences are and then rank them in importance for summarizing a text.

Saturday, September 6, 2014

Lehmer Codes and the kth Permutation

First a little background on why I was interested in indexing the kth permutation (skip to the 4th paragraph to cut straight to the chase). I've been reading up a bit on information theory recently and that got me into information encoding, which naturally led to cryptography. I started fooling around with one of the simplest type of cipher, the substitution cipher. You've probably seen a substitution cipher before since it is what is used in those "decode a famous quote" puzzles (though I guess the last time I saw one of those was in a newspaper and you don't really see newspapers anymore...) The way a substitution cipher works is you take each symbol in the original message (usually letters, but could be words or something else as your "symbol") and replace it with a new symbol. In principle, the new symbols could be anything, e.g. replace "a" with the Chinese character for dog, "b" with the "poo emoji" etc. In practice, the new symbols don't actually matter because they don't make the new cipher any more of less secure (hopefully more on that later).

Since changing the symbols doesn't really improve anything, you can just use the original set of symbols as the coded symbols where each symbol in the plaintext is mapped to a different symbol from the same set in the ciphertext, e.g. "a" goes to "h", "b" goes to "m", etc. Now you need some way to communicate what particular mapping (the so called "key") you have chosen to the person who needs to decode the ciphertext. A simplified version of a substitution cipher uses a shift of all the letters for the key. A key of 3 would mean shift every letter by 3 down the alphabet wrapping around at the end, e.g. "a" goes to "d", "b" goes to "e", "c" goes to "f" ... "x" goes to "a", "y" goes to "b", "z" goes to "c" This so-called Caesar cipher (because Julius himself supposedly used it) is actually slightly weaker than a general substitution cipher because once we know one of the symbol mappings, say "g" to "j" then we can get all of the other mappings. The nice property that it has is the simple key that gives the shift of the letters. However, the number of possible keys is limited to N, the number of symbols we have in the message, e.g. 26 for purely lowercase letters, which would make our message easily decoded by brute-force attack.

We need to have a key for our general substitution that can allow for any mapping we'd like so we can use codes that don't allow the whole message to be decoded if a single symbol mapping is discovered. The general substitution cipher is a permutation of symbols in the messages and from combinatorics we can count all of the possible permutations. We have N choices for the first symbol, N-1 choices for the second symbol, etc. giving N! possible permutations so we will need this many different keys to identify our encoding. Note 26! ~ 4 x 1026 so this will vastly increase the number of possible keys from 26 for the rotation cipher and makes it more from a naive brute-force key attack (though this cipher is still horrendously weak and hopefully in a later post I'll talk about breaking substitution ciphers with letter frequency attacks). So now we have to figure out how to take one of our 4 x 1026 keys and figure out which symbol mapping to use.

That is where the actual topic of this post comes in, Lehmer codes and the factorial number system. The first problem we'll tackle is how to identify a permutation and then we'll figure out how to convert that to a number. To uniquely identify a permutation we'll need an ordering for our elements. Then we can just write down which element we choose at each point. Note that the ordering doesn't change as we choose elements, but the number identifying the element may. If we choose an element from the middle for our first choice we have to shift all the elements after it over one once it is removed.

An example with 'abcd' permuted to 'cadb' will hopefully make it clear. 'abcd' is alphabetically numbered and we can assign the ordering '0123' 'c' is the first choice and that is the '2' element. After removing 'c' we have 'abd' which is now numbered '012' The next choice was 'a' which is the '0' element leaving 'bd' numbered '01' Now, we choose 'd' which is '1' element and are left with 'b' as the '0' element. Putting all the choices together, that gives us (2,0,1,0) which we can use to label the permutation 'cadb' (2,0,1,0) is the Lehmer code for this permutation.

Now that could work fine as a key, but it would be better to have one number rather than a bunch of numbers in a list. It turns out that there is a direct correspondence between the Lehmer code a single number via the factorial number system. In the normal decimal system, the number '2010' actually means 2*103 + 0*102 + 1*101 + 0*100. In the factorial number system, instead of multiplying by the base raised to the nth power for the nth digit, we multiply by n!. So, '2010' in the factorial number system would be 2*3! + 0*2! + 1*1! + 0*0!, which is 13 in the decimal number system. It turns out that this is exactly the number of ordered permutations to get to 'cadb' starting with 'abcd' = 1, 'abdc' = 2, etc.

To see this we can start with the 'c' in the first position. To get to 'c' in the first position we would have to go through all the permutations with 'a' in the first spot and then all the permutations with 'b' in the first spot. If 'a' is in the first spot then we have 'bcd' left which has 3! permutations. So there are 3! permutations with 'a' in the first spot. Then with 'b' in the first spot we would have 'acd' left which also has 3! permutations, so just to get to 'c' in the first spot we would have 3! + 3! permutations, or 2*3! permutations. Now, once 'c' is in the first spot we need to figure out how many more permutations to get from 'cabd' to 'cadb'. It takes 0 extra permutations to get 'a' in the next spot since 'a' is the lowest in order. So that gives us an additional 0*2! permutations. Now moving on to the last two letters we see that to get from 'bd' to 'db' is 1 permutation or 1*1! Finally the last letter is always in the correct position so results in 0*0! Adding it all up we get 13, the same answer as the Lehmer code expressed in the factorial number system. You can see that what we are doing is counting the permutations of lower-ranked elements at each step to get the final result.

Combining these two results give us an easy way to generate the kth permutation of a set. You just take the number k, convert it to the factorial number system and then grab each element from your set by the digit in the factorial number. Below is a simple bit of Python code to kth permutation of some items using this method.


from math import factorial

def kth_permutation(items, k):
    _items = list(items)
    if(k>=factorial(len(_items))):
        raise ValueError('k must be less than factorial(len(items)))')
    result = []
    for n in reversed(range(len(_items))):
        base =  factorial(n)
        digit = k / base
        result.append(_items.pop(digit))
        k %= base
    return result

The first thing this does is put the items into a new list. There are two reasons for that. In Python, lists are mutable so if we passed in a list we would mess with the actual list inside our function, which is poor form. The other reason is that this automatically puts an ordering on the input items So if items was something like a set originally now it is an ordered list. That way we have a unique way to choose the permutation order. The next thing is we catch the value error. We can't generate a permutation that is more than the total number of permutations. Note, this would also throw an out of bounds error later, but that would be confusing as to the root cause. After creating an output list, we get into the actual loop. The loop counts down from n = N-1, N-2 ... 1, 0 and determines the digit by dividing by n! It pops the item given by digit off our input list and into the result. Finally, it takes the remainder using the modulo % and continues the loop. At the end of the loop, our result has all of the incoming data ordered as the kth permutation.

For use of this snippet in the actual substitution cipher code see the subcipher git. In a future post we can quickly go over how to reverse the process and compute which kth permutation you have given an ordering of the items.

Thursday, August 21, 2014

Welcome! Add-One!

Welcome to F-plus! I started this to post my various stumbles as I try to learn about various topics such as data analysis, visualization, general programming, machine learning, cryptography, information theory, etc. and apply the ideas in various ways. If you stumbled into here, I hope that you will also find something useful, amusing or interesting in my stumbles too.

The first thing I thought I would put up is a simple JavaScript implementation of the "add-one" exercise from "Thinking Fast and Slow" by Daniel Kahneman. Kahneman discusses this deceptively simple exercise in his book as the classic example of a task that overloads the brain's heavy thinking circuitry, what he calls System 2. Apparently, immediately after starting the exercise, you'll be at your maximum mental exertion (sort of a brain sprint) with some associated physiological effects, one of the more interesting being pupil dilation in proportion to your mental exertion.

So, on to the actual exercise. The task is started by viewing a random string of four digits and then verbally reciting them back with each digit incremented by one (modulo 10, so 9 goes to 0), i.e. 3974 should be read back as 4085. Kahneman suggests a cadence three seconds for viewing the digits, then read back incremented by one, before viewing the next set of four digits. This exercise should be enough to challenge most people, but to increase the difficulty it can be extended to "add-three" where each digit should be incremented by three or the cadence can be sped up as well.

Add-One!

And the code for this little exercise is pretty simple. We just need an html element with id="add-one" that we will update. Here, I choose an h1 tag to make it easy to read. And then we have a little bit of JavaScript to attach an event listener to the page loading window.addEventListener that calls setInterval to call our defined function named addOne() to update that h1 element at set intervals. The addOne() function just loops through and appends four digits to the string that will be inserted into the id="add-one" element. The full code is below with both the cadence and number of digits as variables:

<h1 align="center" id="add-one">Add-One!</h1>

<script type="text/javascript">
    var count_time = 3000
    var num_digits = 4
        
    function addOne() {
        var new_string='';
        for(var i=0;i<num_digits;i++) { 
            new_string+=String(Math.floor(Math.random()*10));
        }
        document.getElementById('add-one').innerHTML = new_string;
    }      
    window.addEventListener('load', 
        setInterval('addOne()', count_time),false);
</script>