Saturday, September 6, 2014

Lehmer Codes and the kth Permutation

First a little background on why I was interested in indexing the kth permutation (skip to the 4th paragraph to cut straight to the chase). I've been reading up a bit on information theory recently and that got me into information encoding, which naturally led to cryptography. I started fooling around with one of the simplest type of cipher, the substitution cipher. You've probably seen a substitution cipher before since it is what is used in those "decode a famous quote" puzzles (though I guess the last time I saw one of those was in a newspaper and you don't really see newspapers anymore...) The way a substitution cipher works is you take each symbol in the original message (usually letters, but could be words or something else as your "symbol") and replace it with a new symbol. In principle, the new symbols could be anything, e.g. replace "a" with the Chinese character for dog, "b" with the "poo emoji" etc. In practice, the new symbols don't actually matter because they don't make the new cipher any more of less secure (hopefully more on that later).

Since changing the symbols doesn't really improve anything, you can just use the original set of symbols as the coded symbols where each symbol in the plaintext is mapped to a different symbol from the same set in the ciphertext, e.g. "a" goes to "h", "b" goes to "m", etc. Now you need some way to communicate what particular mapping (the so called "key") you have chosen to the person who needs to decode the ciphertext. A simplified version of a substitution cipher uses a shift of all the letters for the key. A key of 3 would mean shift every letter by 3 down the alphabet wrapping around at the end, e.g. "a" goes to "d", "b" goes to "e", "c" goes to "f" ... "x" goes to "a", "y" goes to "b", "z" goes to "c" This so-called Caesar cipher (because Julius himself supposedly used it) is actually slightly weaker than a general substitution cipher because once we know one of the symbol mappings, say "g" to "j" then we can get all of the other mappings. The nice property that it has is the simple key that gives the shift of the letters. However, the number of possible keys is limited to N, the number of symbols we have in the message, e.g. 26 for purely lowercase letters, which would make our message easily decoded by brute-force attack.

We need to have a key for our general substitution that can allow for any mapping we'd like so we can use codes that don't allow the whole message to be decoded if a single symbol mapping is discovered. The general substitution cipher is a permutation of symbols in the messages and from combinatorics we can count all of the possible permutations. We have N choices for the first symbol, N-1 choices for the second symbol, etc. giving N! possible permutations so we will need this many different keys to identify our encoding. Note 26! ~ 4 x 1026 so this will vastly increase the number of possible keys from 26 for the rotation cipher and makes it more from a naive brute-force key attack (though this cipher is still horrendously weak and hopefully in a later post I'll talk about breaking substitution ciphers with letter frequency attacks). So now we have to figure out how to take one of our 4 x 1026 keys and figure out which symbol mapping to use.

That is where the actual topic of this post comes in, Lehmer codes and the factorial number system. The first problem we'll tackle is how to identify a permutation and then we'll figure out how to convert that to a number. To uniquely identify a permutation we'll need an ordering for our elements. Then we can just write down which element we choose at each point. Note that the ordering doesn't change as we choose elements, but the number identifying the element may. If we choose an element from the middle for our first choice we have to shift all the elements after it over one once it is removed.

An example with 'abcd' permuted to 'cadb' will hopefully make it clear. 'abcd' is alphabetically numbered and we can assign the ordering '0123' 'c' is the first choice and that is the '2' element. After removing 'c' we have 'abd' which is now numbered '012' The next choice was 'a' which is the '0' element leaving 'bd' numbered '01' Now, we choose 'd' which is '1' element and are left with 'b' as the '0' element. Putting all the choices together, that gives us (2,0,1,0) which we can use to label the permutation 'cadb' (2,0,1,0) is the Lehmer code for this permutation.

Now that could work fine as a key, but it would be better to have one number rather than a bunch of numbers in a list. It turns out that there is a direct correspondence between the Lehmer code a single number via the factorial number system. In the normal decimal system, the number '2010' actually means 2*103 + 0*102 + 1*101 + 0*100. In the factorial number system, instead of multiplying by the base raised to the nth power for the nth digit, we multiply by n!. So, '2010' in the factorial number system would be 2*3! + 0*2! + 1*1! + 0*0!, which is 13 in the decimal number system. It turns out that this is exactly the number of ordered permutations to get to 'cadb' starting with 'abcd' = 1, 'abdc' = 2, etc.

To see this we can start with the 'c' in the first position. To get to 'c' in the first position we would have to go through all the permutations with 'a' in the first spot and then all the permutations with 'b' in the first spot. If 'a' is in the first spot then we have 'bcd' left which has 3! permutations. So there are 3! permutations with 'a' in the first spot. Then with 'b' in the first spot we would have 'acd' left which also has 3! permutations, so just to get to 'c' in the first spot we would have 3! + 3! permutations, or 2*3! permutations. Now, once 'c' is in the first spot we need to figure out how many more permutations to get from 'cabd' to 'cadb'. It takes 0 extra permutations to get 'a' in the next spot since 'a' is the lowest in order. So that gives us an additional 0*2! permutations. Now moving on to the last two letters we see that to get from 'bd' to 'db' is 1 permutation or 1*1! Finally the last letter is always in the correct position so results in 0*0! Adding it all up we get 13, the same answer as the Lehmer code expressed in the factorial number system. You can see that what we are doing is counting the permutations of lower-ranked elements at each step to get the final result.

Combining these two results give us an easy way to generate the kth permutation of a set. You just take the number k, convert it to the factorial number system and then grab each element from your set by the digit in the factorial number. Below is a simple bit of Python code to kth permutation of some items using this method.


from math import factorial

def kth_permutation(items, k):
    _items = list(items)
    if(k>=factorial(len(_items))):
        raise ValueError('k must be less than factorial(len(items)))')
    result = []
    for n in reversed(range(len(_items))):
        base =  factorial(n)
        digit = k / base
        result.append(_items.pop(digit))
        k %= base
    return result

The first thing this does is put the items into a new list. There are two reasons for that. In Python, lists are mutable so if we passed in a list we would mess with the actual list inside our function, which is poor form. The other reason is that this automatically puts an ordering on the input items So if items was something like a set originally now it is an ordered list. That way we have a unique way to choose the permutation order. The next thing is we catch the value error. We can't generate a permutation that is more than the total number of permutations. Note, this would also throw an out of bounds error later, but that would be confusing as to the root cause. After creating an output list, we get into the actual loop. The loop counts down from n = N-1, N-2 ... 1, 0 and determines the digit by dividing by n! It pops the item given by digit off our input list and into the result. Finally, it takes the remainder using the modulo % and continues the loop. At the end of the loop, our result has all of the incoming data ordered as the kth permutation.

For use of this snippet in the actual substitution cipher code see the subcipher git. In a future post we can quickly go over how to reverse the process and compute which kth permutation you have given an ordering of the items.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.