August 8, 2012

Symmetric Crypto with PyCrypto, Part 3

Now that we have a grasp on how encryption works (from a high-level black box perspective, at least), we concern ourselves with how to generate cryptographically strong keys. I highly recommend reading Part 1 and Part 2 first.

Generating a Key

Before we get to our first semi-practical file encryption script, one more concept needs an introduction. Our cipher of choice in this tutorial, AES, accepts keys in these lengths:

cipher key size (bytes in ASCII)
AES-128 128 bits (16 bytes)
AES-192 192 bits (24 bytes)
AES-256 256 bits (32 bytes)

However, it's fairly unreasonable to expect the casual user to create a password that is memorable, secure, and exactly one of 16, 24, or 32 characters long. We need a way to convert our standard idea of a password into a fixed-size key.

One ill-conceived strategy might be to simply pad the user's password out to the desired key size. So if they used a password of “fluffy”, for example, and used the character “0” for padding, the key would be “fluffy0000000000”. Aside from the obvious problem of using an easily-guessable password, such a key drops the effective level of cryptographic security to practically zero. Earlier I mentioned that we should build an encryption system as if every part (except the user's key itself) will be publicly known. If an attacker knew that this is how we determine the key in this particular application, she would start searching the key space with the key “0000000000000000” and try alphanumeric combinations with the left-most characters first. Finding the key this way would be absolutely trivial, even if the password itself was relatively strong.

Another problem: What happens when the user's password is longer than the chosen key size? We could truncate it down to the maximum length, but we want to reward password complexity if at all possible. We need a way to derive a fixed-size key from a password of arbitrary length.

This sounds like it could be a job for a hash function. If you haven't heard of a cryptographic hash function before, here's the short version: For a given chunk of data, a hash function yields a fixed-length string called a hash. (The terms message digest and checksum are also often used interchangeably.) The hash is something like a fingerprint of the data. Every time you generate a hash from the same chunk of data, the resulting hash should be the same. If even one bit is different–regardless of the size of the chunk–the hash will come out completely different.

Two popular examples of hash functions are MD5 and the SHA family of functions. Some required characteristics of a cryptographically-secure hash function are:

  • The output is a fixed-length (128, 160, and 256 bits are common)
  • The output should be indistinguishable from random data, no matter the input
  • It should be impossible to guess anything about the data from the output
  • It should be impossible to craft a set of data that will yield a particular output (within a reasonable amount of time)
  • It should be incredibly rare to find two sets of data which yield the same output

If you've ever downloaded a CD or DVD image of a Linux distribution, you may have noticed that they often provide a an MD5 or SHA hash string for each ISO image. If you generate the hash of the ISO file after you've downloaded it, you can compare it to the hash provided by the site that you downloaded it from. If they match, then you have proven that your copy of the ISO file is identical to the one on the site that you downloaded it from. The benefit here is that you didn't have to download the full ISO file twice in order to be sure. If you have a Linux system handy, you can tinker with hash functions on the command line easily. See the man pages for the utilities md5sum and sha256sum.

Circling back a bit, we started this discussion with the goal of producing a random-looking fixed-length string from an arbitrary variable-length string. (In particular, a password.) Here is one way we could do it in Python:

import hashlib
 
# An example of a strong password
password = 'uSh{ei3aiV'
 
print 'SHA256 hash (in hex) of password:'
print hashlib.sha256(password).hexdigest()

Example 5: Generating an SHA256 hash from a string on the command line

This shouldn't need much explanation, we simply use Python's own SHA256 implementation (from hashlib) to derive a key from a password of arbitrary length. It should be noted that in general, this doesn't offer much actual security. The next section explains why.

Key Stretching Techniques

Security and IT professionals are always harping on users to choose strong passwords for a reason: most of us don't. Even computer experts who should know better will use weak passwords from time to time. If an attacker wanted to get at the contents of an encrypted file, the first thing she might try is a list of dictionary passwords. Followed by names, places, jargon, words in foreign languages, patterns on the keyboard, and so on. Then multiple variations and combinations of all of those are tried next. If, after all this searching, the password has still not been discovered, then the user might have chosen a fairly strong password. Finding it might take a few minutes.

Ciphers like AES are designed to be able to withstand direct brute-force attacks well into the distant future, so one of the easiest attack vectors is simply to try to guess the password used to create the key. Even if you have a fairly good password, poor design of the software itself can allow an attacker to guess the key much more quickly than they can brute-force the encrypted data itself.

Take our example above. If an attacker knows that the password was hashed with MD5 before being used as the key, all she has to do is add an MD5 function to the password-guessing routine. MD5 and other hash functions are designed to be fast, so adding one to the attack program is really not much bother at all. What we need is a way to make large-scale password guessing a slow and downright tedious process. This is called key stretching.

Key stretching functions make it harder for an attacker to generate keys from a long list of passwords. We set up an obstacle that provides minimal inconvenience to legitimate users of the system while inflicting maximum inconvenience to attackers. In this case, the procedure that converts a plaintext password into a stretched encryption key needs to be somewhat computationally expensive. In an ideal case, we want to make it so that the password-to-key conversion for a single key takes a second or two on moderately powerful hardware.

The upshot is that when the user enters his password, the computer will have to spend a couple seconds computing the encryption key. Thus an attacker on similar hardware has to spend the same amount of time computing a key. (Relative to differences in hardware, of course.) But in the attacker's case, she has to try to guess thousands or millions of passwords to make her attack worthwhile. As you can imagine, having to wait a few seconds between each guess slows down her search for a valid decryption key tremendously to the point that the $5 wrench option starts looking like a viable alternative.

One of the simplest ways to stretch a key is to simply run the password through a hash function some pre-determined number of times. While a single execution of a hash function is quite fast, thousands of sequential executions of a hash function will not be. We take advantage of this for key stretching.

import hashlib
 
password = 'uSh{ei3aiV'
iterations = 100000
key = ''
 
for i in xrange(iterations):
    m = hashlib.sha256()
    m.update(key + password)
    key = m.digest()
 
print 'SHA256 hash (in hex) of password after %d iterations:' % iterations
print m.digest().encode('hex')

Example 6: Basic key stretching with a hash function

In this example, the variable key holds the current digest throughout the iterations and is initially set to an empty string. The for loop creates a new hashlib object (using the sha1() constructor) each iteration of the loop. Each iteration creates a new hash out of the concatenation of the previous key and the password. Once the specified number of iterations is reached, the final digest is printed as hex. If this were part of a real-world encryption system, the final digest might then be utilized as the user's encryption key.

Better Key Stretching

One downfall of the above technique is that an attacker can, at least theoretically, precompute a large number of keys based on common passwords. (These are often called rainbow tables.) It would take a lot of time and a lot of computing power, but it could be done. If we consider that a certain percentage of users will inevitably use weak passwords, a large enough list will give the attacker the means to decrypt a certain percentage of data–something we clearly wish to avoid if possible.

The best way to guard against this is to concatenate a string of random data (a salt) to the password before running it through the key strengthening function. The benefit to us is that the attacker then has to spend time computing the keys for a password with a specific salt attached, meaning that a precomputed table of keys is worthless. Additionally, if we use a different salt for each object that we encrypt (say, each file in a directory) and by some miracle the attacker stumbles upon the correct key for one file without guessing the password, the other objects are still as strongly protected as before. If the salt is different for each object, it follows that the key is therefore different as well, even if the password is the same.

import hashlib
import os
 
password = 'uSh{ei3aiV'
iterations = 100000
key = ''
salt = os.urandom(32)
 
for i in xrange(iterations):
    m = hashlib.sha256()
    m.update(key + password + salt)
    key = m.digest()
 
print 'Random salt (in hex):'
print salt.encode('hex')
print 'SHA256 hash (in hex) of password after %d iterations with salt: ' % iterations
print m.digest().encode('hex')

Example 7: Key stretching with a salt

This example functions similarly to the previous one except that we now concatenate a random 32-byte (256-bit) salt on each iteration as well. (The size of the salt is somewhat arbitrary, although you won't go wrong making it the same length as the key you're generating.) One thing to notice is that thanks to the addition of the salt, each invocation of this program results in an entirely different key whereas the previous example generated the same key every time. This is what gives strength to the improved approach. It also carries the consequence that we must store the salt with the key so that we can reproduce the key later when it comes time to compute the hash from the user's password as entered.

PBKDF2 is Your Friend

Example 7 would probably work reasonably well in many situations, but it's a naive approach. Luckily for us, some very smart people have spent a good deal of time thinking about the problem and have come up with some solutions. One of the most well-known is PBKDF2, which stands for Password-Based Key Derivation Function, version 2. How it works internally is beyond the scope of this tutorial (a primary benefit of using functions) but should you get curious you can always peruse section 5.2 of RFC 2898.

import os
 
from Crypto.Protocol.KDF import PBKDF2
 
password = 'uSh{ei3aiV'
iterations = 5000
key = ''
salt = os.urandom(32)
 
key = PBKDF2(password, salt, dkLen=32, count=iterations)
 
print 'Random salt (in hex):'
print salt.encode('hex')
print 'PBKDF2-derived key (in hex) of password after %d iterations: ' % iterations
print key.encode('hex')

Example 8: Key stretching with PBKDF2

The example above requires PyCrypto version 2.5 or greater. The PBKDF2 function is remarkably easy to use–just pass it a password, salt, the length of the key it should return (dkLen), and how many iterations to perform (count). It returns a string of bytes suitable for use as a symmetric cipher key. Note that we've significantly reduced the number of iterations in this example because PBKDF2 does more “work” per iteration as compared to our naive examples above.

Besides tying up the CPU for seconds at a time, there are other approaches to key derivation as well. The scrypt function, for example, makes it a memory-bound problem instead.

You Have the Tools, Now Build

By this point, you should have a reasonable grasp of the basics. Enough to be dangerous, they say. I highly encourage you to experiment with all of the tools I have shown you so far. In the next article, we'll utilize these tools to make a basic file encryption utility.

3 comments:

Anthony Wong said...

This is fantastic article! I've learnt a lot reading through Part 1 to Part 3.

Not sure if you ever followed up on your file encryption utility example you mentioned at the end?

charles said...

Hi Anthony, thanks for the kind words. I had part four sitting in a half-done state for ages but haven't quite gotten around to finishing it yet. I'll get it done eventually, it's just rather low on my very long list of programming/writing projects. Thanks again!

Anonymous said...

thanks Anthony; very interesting set of articles.

one tiny thing: example 7 code would match your excellent explanation if the salt was set within the loop rather than ahead of it.