Before we dive right in, a disclaimer: I am writing these articles in the hope that they will be educational and useful to others like myself and to solidify the concepts in my own mind. Crypto is hard to do right. Please don't use these articles as a replacement for a professional security consultant.
We are going to write a Python program that will take an unencrypted file, prompt for a password, and spit out an encrypted version of the file. It will, of course, also be quite capable of the reverse: read an encrypted file, prompt for a password, and give you the unencrypted file back. This is called symmetric encryption because one key allows you to perform both the encryption and the decryption operations.
Before we can talk semi-intelligently about crypto, we need to know a few basic definitions. More than almost any other technical field, crypto is rife with metaphors. The jargon, therefore, is generally pretty easy to comprehend.
key: The piece of information that allows you to either encrypt or decrypt your data. Although it's tempting to think of a crypto key as being similar to a physical key, a slightly better real-world analogy is that of the combination to a combination lock. A combination is a random-looking series of numbers which can be memorized, transferred, and transformed easily, rather like an encryption key.
plaintext: The information that you want to keep hidden, in its unencrypted form. The information does not actually have to be text; "plaintext" is just the term that cryptographers use. The plaintext can be any data at all: a picture, a spreadsheet, or even a whole hard disk. The synonym "message" is sometimes used as well, especially when encrypted data is being passed from one party or computer to another.
ciphertext: The information in encrypted form. If the encryption software is written correctly and all of the proper procedures are followed, the ciphertext is completely unreadable and unbreakable without the key. Although it's generally considered good security practice to reasonably protect your ciphertext from third parties, the theory goes that you could print it in a national newspaper if you wanted to and nobody would be able to decode it to plaintext without knowing the correct key.
cipher: The algorithm that converts plaintext to ciphertext and vice-versa. We won't go into any details about how the internals of cipher algorithms work, as it would be a very technical subject well beyond the scope of this introduction. Fortunately, ciphers tend to be simple enough to use that a high-level overview of the basic concepts is sufficient to make full use of them. An analogy might be that you don't have to know how a compression algorithm works in order to "zip" a file.
A First Example
To visualize how these parts work together, consider this example code for an encryption operation:
plaintext = 'I say, mater, cabbage crates coming over the briny.'
key = 'ocelot'
ciphertext = encrypt(plaintext, key)
# The variable 'ciphertext' now contains the string:
# 'Qng5tnrf97OG4pHooBCT96aSykSsdAiHb92RPXQPVDfdAKapuX4'
And the reverse operation, decryption:
ciphertext = 'Qng5tnrf97OG4pHooBCT96aSykSsdAiHb92RPXQPVDfdAKapuX4'
key = 'ocelot'
plaintext = decrypt(ciphertext, key)
# The variable 'plaintext' now contains the string:
# 'I say, mater, cabbage crates coming over the briny.'
Note that the key is the same in both operations. That's what makes this symmetric encryption. If we were to leave the first listing alone but change the key to "caribou" in the second listing, the decrypt() function would return either gibberish or nothing at all, depending on the implementation.
Doing Encryption Right
In order to stand up against a focused attack, the encryption system as a whole must be well designed and well implemented. There are certain properties of each part of the encryption system that--when put together--make it secure. We'll briefly mention some of these properties here.
The complete workings of a cryptographic system should be open, public, and available for analysis by expert cryptographers. Many ciphers have been invented, and many have been subverted due to various design flaws. A cipher is deemed secure by experts only after it has withstood sufficient scrutiny from the public and experts in the field of cryptography. A programmer implementing an encryption system should never rely on some hidden feature or obfuscation for the security of the system. Even if the system is never intended to be released publicly, it should be designed as if it will be. The only component that should ever be treated as truly secret is the user's encryption key itself.
The ciphertext produced by a cipher should be completely indistinguishable from random data. If the ciphertext had any observable pattern at all, an adversary could theoretically use that information to make a guess about the nature of the plaintext or use whatever is known about the pattern to more easily crack the cipher. The reverse is not necessarily true, however: random-looking cipher output is not by itself any indication of a secure cipher.
Another closely related principle is that a small change in either the plaintext or the key should cause a dramatic change in the ciphertext as a whole. If either input is off by even a single bit, the ciphertext should come out completely different.
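This property is easy to observe in practice. The sketch below uses SHA-256 as a stand-in, which is an assumption made for convenience: it is a hash function rather than a cipher, but it ships with Python and is designed to show the same sensitivity to small input changes. The helper name bit_difference is made up for this example.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


message = b'I say, mater, cabbage crates coming over the briny.'
# Flip only the lowest bit of the first byte.
flipped = bytes([message[0] ^ 0x01]) + message[1:]

digest_a = hashlib.sha256(message).digest()
digest_b = hashlib.sha256(flipped).digest()

changed = bit_difference(digest_a, digest_b)
print(f"input bits changed:  {bit_difference(message, flipped)}")
print(f"output bits changed: {changed} of {len(digest_a) * 8}")
```

A well-behaved function should change roughly half of the output bits, so a count somewhere near 128 out of 256 is what you would expect to see.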
The key should be as secure as possible. In many implementations, users will choose a password which is then converted into a key. The problem with passwords is that they can be easy to guess. If an adversary wants access to your encrypted data and you have an easily guessable key, they don't need millions of computers brute-forcing the password for millions of years. They can just try a few hundred thousand obvious keys until they find the one that unlocks your data. Using a weak password nullifies any benefit obtained from using encryption. In fact, it can make things worse on the whole, because it could lull you into a false sense of security.
Key-strengthening techniques have been devised to counter these types of brute-force attacks. They are no replacement for a strong password, but they make it harder to pre-compute the keys from a password list. We'll discuss these techniques later.
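As a preview, here is a minimal sketch of one widely used key-strengthening function, PBKDF2, which is available in Python's standard library as hashlib.pbkdf2_hmac. Choosing PBKDF2 here is an assumption made for illustration, and it is not necessarily the technique used later in this series. The function name derive_key, the iteration count, and the 16-byte salt are likewise illustrative choices rather than recommendations.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Turn a password into a 128-bit key using PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256",                  # underlying hash function
        password.encode("utf-8"),  # the password, as bytes
        salt,                      # random per-file value, stored with the ciphertext
        iterations,                # work factor: more iterations means slower guessing
        dklen=16,                  # 16 bytes = 128 bits
    )


salt = os.urandom(16)
key = derive_key("ocelot", salt)
print(key.hex())
```

The salt means an attacker cannot reuse one pre-computed table of keys across many files, and the iteration count makes every individual guess more expensive. Neither helps much if the password itself is among the first few thousand guesses.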
An encryption key has to be large enough that simply trying every possible combination of bits would be an insurmountable task. Imagine, again, the key as the combination to a lock on a safe. The fewer numbers in the combination, the easier it is to find the right one to open the safe simply by incrementing the combination for each try. In cryptography, a small key can be easily found with the same method. For example, an attacker trying to find an 8-bit key (one byte) only has to try all 256 (2^8) possible combinations.
For each bit that is added to the key, the number of combinations to try is doubled. So, a 9-bit key takes up to 512 guesses, a 10-bit key up to 1024 guesses, and so on. The standard in symmetric cryptography is 128 bits. If you enter 2 to the power of 128 into a calculator, you'll see an extremely large number as the result. Experts have theorized that sequentially searching a key space of this size would take more than 10 trillion years. So if an attacker had been able to set up a device to brute-force a 128-bit key at the moment the universe was created, they would have had only about a 1 in 1000 chance of finding it by now. (Source.)
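The arithmetic is easy to check with Python's arbitrary-precision integers. In the sketch below, the guessing rate of 10^18 keys per second is purely an assumption, picked because it lands close to the figure quoted above. It is not necessarily the rate the original estimate was based on. The function names keyspace and years_to_exhaust are made up for this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
GUESSES_PER_SECOND = 10**18  # assumed attacker speed, for illustration only


def keyspace(bits: int) -> int:
    """Number of distinct keys of the given length."""
    return 2 ** bits


def years_to_exhaust(bits: int, rate: int = GUESSES_PER_SECOND) -> float:
    """Years needed to try every key at `rate` guesses per second."""
    return keyspace(bits) / rate / SECONDS_PER_YEAR


for bits in (8, 9, 10, 128):
    print(f"{bits:>3}-bit key: {keyspace(bits)} possible keys")

print(f"128-bit search at the assumed rate: {years_to_exhaust(128):.2e} years")
```

At that assumed rate, exhausting a 128-bit key space takes a little over 10 trillion years. On average an attacker would find the key after searching half the space, which does not change the conclusion.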
Barring any unforeseen fantastic breakthroughs in mathematics or physics, 128 bits is more than plenty for a good long time. Many ciphers support key sizes of up to 256 bits to make them stronger against unforeseen attacks. The extra 128 bits provide an enormous margin of safety in case weaknesses are later discovered in the ciphers or (more likely) their implementations.
Modern ciphers tend to be very well engineered. They are invented by people with a firm grasp of the mathematical theories behind cryptography, and they tend to be peer-reviewed, openly discussed, and publicly available. Researchers are constantly on the lookout for ways to break popular ciphers, or at least to attack them with far less effort than an ordinary brute-force search would take. Those that withstand such scrutiny over time become trusted.
History is littered with ciphers that were once thought to be secure but have since been proven to have serious flaws. Some symmetric ciphers that are considered secure (as far as we know) are AES (a.k.a. Rijndael), Blowfish, Serpent, and Twofish.
In Part 2, we cover AES, write some code to test PyCrypto functionality, and discuss modes of operation.