Saturday, July 10, 2010

Compilation and Compression

Quick, which is bigger in general: source code or the resulting compiled executable?

Try to answer that before reading on. If you are being nit picky consider source code that has no comments and minimal whitespace. And consider *any* sort of language compilation process: python to bytecode, c to machine code, etc.

Recently while reading "What is Thought?" by Eric Baum (a great mix of AI, computational theory, cognitive science, evolutionary theory) he stated matter-of-factly that when you compile a program it gets bigger. He was using this in analogy with how DNA is (relatively) concise but the things it builds (e.g. all the connections in the human brain) are much, much bigger (in the Shannon information sense).

That seemed completely wrong to me. Humans are inefficient and verbose. Machines are efficient and concise. Surely some optimizing compiler (or even just a regular compiler) will show my code to be a joke and transform it into a small jewel-like essence that I wasn't capable of at my high level of abstraction.

And so I started looking at examples. .py to .pyc files. c to binaries. .hs to binaries. And wouldn't you know it, more often then not the resulting "compiled" entity was larger.

I was surprised that my intuition was so far off on this. I asked around with my co-workers and more often than not they had the same faulty intuition as me. So at least I'm not alone. But it's strange that my intuition was so far off on this.

Now I think my intuition is properly adjusted and here's how I think about it now.

Humans aren't inefficient and verbose. Quite the opposite. Saying something like:
print "hello world"

as a simple python program is a very terse way of making *lots* of low level details happen. In fact all programming languages (no matter how high or low you think of them) are very abstract (ie concise) compared to what is really going on. Even assembler is a massive collection of shortcuts and abstractions compared to the actual physics going on.

So no matter what language you are working in you are essentially working with shortcuts that need to actually be fleshed out into more precise and verbosely specific instructions.

What's really funny is that from reading "What Is Thought" I actually see a deep relationship between compilation and compression that I had never considered before. A little Information Theory is a dangerous thing...

No comments: