Howard Bloom, The God Problem: How A Godless Cosmos Creates

I think it was Paul Halmos who said that, for a book to be about something, it must NOT be about a great many other things.  Howard Bloom seems unable to follow this prescription.  The God Problem contains many interesting facts, most of them completely unrelated to its main quest of explaining how a universe can create complexity.  In Bloom’s chatty, verbose, and peripatetic peregrinations, the very first sentence which sheds any light on the book’s nominal topic occurs on page 431.

Bloom’s choice of writing the entire book in the 2nd-person-as-substitute-for-1st-person (“you are Jewish”) is odd, at best confusing and at worst irritating.  The book is repetitive.  Like the movie The Hobbit, it could have been made half as long and twice as good with some decent editing.

What makes this especially frustrating is that Bloom does, occasionally, as if by accident, touch on some important and difficult topics, like the difference between “information” and “meaning”.  But he seems unable to stay focused on them long enough to make much progress.  He also seems unaware of much previous work.

I suppose I should mention that the book contains 4 false statements.  I leave it to the interested reader to find the 4.

Using the Mandelbrot set as a model for how complexity arises has some merits, but also some drawbacks. Yes it’s a simple iterated rule that creates immense amounts of detail.  But it doesn’t create any meaning, or even information: the Kolmogorov complexity of the whole thing is no greater than that of the equation generating it.  If you want to explain the complexity of, say, a eukaryotic genome, you have to look elsewhere.
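
To make the “simple iterated rule” point concrete, here is a minimal escape-time sketch in Python (my own illustration, not anything from the book): the rule is just z → z*z + c, repeated, and this handful of lines pins down the entire infinitely detailed set, which is exactly why its Kolmogorov complexity is so small.

    # Minimal escape-time sketch of the Mandelbrot rule z -> z*z + c.
    # The whole infinitely detailed set is specified by these few lines,
    # which is why its Kolmogorov complexity is no greater than that of the rule.
    def in_mandelbrot(c, max_iter=100):
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:        # once |z| exceeds 2, the orbit escapes
                return False
        return True

    # Crude ASCII rendering of the set
    for im in range(12, -13, -1):
        print("".join("#" if in_mandelbrot(complex(re / 30, im / 15)) else " "
                      for re in range(-60, 21)))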

The kind of complexity we are interested in requires both nonlinearity (gain, chaos, solitons) and entropy creation (non-equilibrium thermodynamics, metabolism).  But Bloom is prevented from understanding any of this by his insistence that the law of entropy is simply wrong.  Life and evolution climb upstream against a constant flow of degradation; how they manage to do that is one of the key components of the answer Bloom purports to seek, but refuses to see.

Shannon entropy is not the best measure for attacking this problem; this has been well-known for some time.  The state of maximum entropy, total randomness, is dead because it has no structure.  The state of minimum entropy, a perfect crystal close to absolute zero, is dead because it has no variety.  Life, and all complexity generation, has to exist in between order and chaos.  Bloom spends so much time flogging Shannon’s dead horse that he is unable to say much about what alternative he prefers.  He seems unaware of Fisher Information, and makes little or no use of Kolmogorov complexity.  We could use a workable theory of meaning.  Bloom is probably right that any such theory has to be receiver-dependent, but he fails to actually propose one.  This makes his contribution eerily parallel to, and about as useless as, the creationist information theory of Dr. Werner Gitt (In the Beginning was Information).

This book bills itself as a rocket to new heights of understanding, but in the end it feels more like a bunch of firecrackers going off on the ground: lots of little pyrotechnics, but no real progress.

James Gleick, The Information: A History, A Theory, A Flood

Anyone familiar with James Gleick’s earlier book Chaos needs no introduction to one of the finest science writers of our generation.  In his new book, The Information: A History, A Theory, A Flood, Gleick tackles an even slipperier subject.

The depth of historical detail is impressive, and includes some surprising topics like how African talking drums work.  It is amusing to see quotes of people complaining of information overload in the 1600s (due to the printing press).

Following Walter Ong, he explores how reading and writing changed the way we think; the reasoning of preliterate oral peoples is substantially different from what we now think of as normal:

A typical question:
     In the Far North, where there is snow, all bears are white.
     Novaya Zembla is in the Far North, and there is always snow there.
     What color are the bears?
Typical response: "I don't know.  I've seen a black bear.  I've never seen any others. ... Each locality has its own animals."
By contrast, a man who has just learned to read and write responds, "To go by your words, they should all be white."

Only with writing does information become detached from specific things and experiences, so that logic and reasoning become possible.

We delve into cuneiform, including Don Knuth’s amazing 1972 discovery that a broken Old Babylonian tablet, lying partly in the British Museum and partly in Berlin (with part missing), held an algorithm for taking square roots.  It ends “This is the procedure.”

Gleick spends several chapters on organizing information.  The first attempt at an alphabetical dictionary was published in 1604; before that, most educated intelligent people had never seen any kind of alphabetized or numerically sorted list in their entire lives.  Library catalogs were in shelf order.  Street addresses did not exist; people had to say things like “at London by Thomas Vautroullier dwelling in the blak-friers by Lud-gate” to specify a location.

Charles Babbage and Ada Lovelace get a thorough treatment, moving through a crowd which included Boole and De Morgan.  Ada solves Peg Solitaire by hand and wonders (to Babbage)

... if the problem admits of being put into a mathematical formula, & solved in this manner. ... There must be a definite principle, a compound I imagine of numerical & geometrical properties, on which the solution depends, & which can be put into symbolic language.

Not long after, she was writing recursive algorithms to evaluate Taylor series like

e^x = 1 + x + x^2/2 + x^3/6 + … + x^n/n! + …

on a computer that existed only on paper, and speculating about how it might be programmed to play chess or compose music.
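
As a modern illustration (mine, not a transcription of anything in Lovelace’s notes), the series can be evaluated by deriving each term from the previous one, which is exactly the kind of step-by-step recurrence a mechanical engine could grind through:

    import math

    def exp_series(x, n_terms=20):
        """Evaluate e^x = 1 + x + x^2/2! + ... , each term built from the last."""
        total, term = 1.0, 1.0
        for n in range(1, n_terms):
            term *= x / n          # x^n/n! = (x^(n-1)/(n-1)!) * (x/n)
            total += term
        return total

    print(exp_series(1.0), math.exp(1.0))   # both ~2.718281828...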

Telegraphs (including pre-electric mechanical systems like Napoleon’s) get detailed coverage, and drove the first frenzy of data compression: when each word costs you money, how much can you say in how few?  Codes and codebooks abounded both for compression and for secrecy; cryptography became a public fad.  Mathematician John Wilkins published a book of codes in 1641, one of which used 2 letters in groups of 5 to encode a 32-character alphabet, probably the first such use of binary, and the last for several centuries.  Babbage, Poe, Verne and Balzac were all amateur cryptographers.
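
The two-letter scheme is easy to reconstruct: groups of five symbols drawn from an alphabet of two give 2^5 = 32 distinct groups, one per character.  A sketch in Python (my reconstruction of the idea, not Wilkins’s actual table):

    # Two symbols ('a' and 'b') in groups of five: 2^5 = 32 code groups,
    # enough for a 32-character alphabet.  My illustration, not Wilkins's table.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz .,;:?"   # 32 characters

    def encode(text):
        groups = []
        for ch in text.lower():
            n = ALPHABET.index(ch)                  # 0..31
            bits = format(n, "05b")                 # five binary digits
            groups.append(bits.replace("0", "a").replace("1", "b"))
        return " ".join(groups)

    def decode(groups):
        return "".join(ALPHABET[int(g.replace("a", "0").replace("b", "1"), 2)]
                       for g in groups.split())

    message = encode("fly at once")
    print(message)
    print(decode(message))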

After a chapter on telephones, we start hitting the technical meat with Shannon, Godel, and Turing.  There won’t be too much that’s surprising here for people who have already been over that ground, but the coverage includes Russell and Whitehead, Berry’s Paradox, Shannon’s thesis on relay circuits, and so on.  It covers the conventional topics well, but fails to push into less conventional areas nearby.  There is no mention of paraconsistent logics, within some of which the Godel incompleteness proof fails (so while any fully consistent theory of arithmetic must be incomplete, a paraconsistent one might not have to be).  There is no mention of Fisher Information, which predates Shannon Information by two decades and has powerful physical implications (more on that in my previous note).  Gleick is also a bit muddy on the confusion between “information = entropy” (Shannon’s original way of putting it) and “information = negative entropy” (the viewpoint of Wiener and Schrodinger, which I think is correct and clearer).  Shannon himself knew what he was doing and never let this lead him into error, but several generations of subsequent physicists have been sloppier and not so fortunate.  Gleick uses both viewpoints somewhat interchangeably, which is unhelpful for beginners and annoying for experts.  The minus sign matters.

The coverage of Maxwell’s Demon is extensive, but again, it does not embody the fastest route to true understanding.  Most physicists (including Von Neumann) thought Brillouin had exorcised the demon by showing that measurement takes work, but we now know this is incorrect.  Measurement can be done reversibly; even quantum systems can do something like measurement by entanglement (“measurement” has in QM a very specific meaning, different from the general one), which does not collapse the wavefunction and is reversible.  What does generate entropy is any irreversible action, like erasing a bit of memory, so the real reason the demon fails is that it can’t go on recording results (even reversibly) into a finite memory forever, and eventually has to start erasing memory to make room for new ones.  Pressing the reset button – going from one of a collection of unknown states to a single known state – must generate entropy.  Gleick covers Brillouin in chapter 9 as though he were right, and doesn’t correct the misconception until chapter 13 (on Bennett and Landauer and quantum computing).
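
Landauer’s principle puts a number on that reset cost: erasing one bit must dissipate at least kT ln 2 of energy as heat.  A quick back-of-the-envelope check (mine) at room temperature:

    import math

    k_B = 1.380649e-23            # Boltzmann constant, J/K
    T = 300.0                     # room temperature, K

    landauer = k_B * T * math.log(2)   # minimum energy to erase one bit
    print(f"kT ln 2 at 300 K = {landauer:.2e} J per bit erased")
    # ~2.9e-21 J: tiny, but strictly nonzero, which is all the demon argument needs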

The coverage of biological information is brief, given the vast scale of the subject, and mostly focuses on the discovery of DNA and the working out of the triplet code.  The former part is covered better in The Double Helix, and the latter is too vague to understand any of the details.  There is almost no reference to any modern (post-1960) topic.  (We now know, for example, that there are over 13 different genetic codes operating on the planet (including two inside your own body, nuclear and mitochondrial), and that they form a nested family tree.  We know, in other words, that the genetic code evolved over time, and can even begin to guess which amino acids were missing in earlier versions.)

A chapter on memes covers chain letters and hula hoops before Dawkins and internet viruses.

The chapter on Chaitin-Kolmogorov complexity is very nice, more lucid than Chaitin’s own popular writing which tends to get bogged down in technical details.  It manages to relate computability, compressibility, decidability, incompleteness, randomness, and inductive inference without a single equation.  This is Gleick in top form.

Elsewhere in the book we peer into the inner workings of the Oxford English Dictionary and of Wikipedia, consider the historical impact of printing, and follow the origins of words like network and ba-da-bing.  All in all, a fine read.  Because of omissions noted above, this is nothing like “the last book you’ll ever need” on information theory, but it’s a great place for a beginner to start.

Information and Energy

The late physicist John Archibald Wheeler had a vision of something he called “It From Bit” … a theory of the universe where everything would be based on information.  A lot of little things are pointing in that direction.  For example, Holevo’s Theorem says that you can’t get more than one bit of classical information about the state of a quantum bit (like, say, the spin of an electron).  Even though a qubit is something that can take on a continuous range of states, when you ask it what state it’s in you can only get a yes/no answer, and after you get that answer the original state is totally destroyed and there’s no way to get any further information about it.

Most people working in this area assume that the information that will be the basis of this grand theory is classical information, also known as Shannon information.  It was crisply defined in Claude Shannon’s 1948 paper and has well-understood ties to physical entropy.  It is the only kind of information most physicists know.

But there is another, older kind of information.  Building on work by others from as early as 1898, the great statistician R.A. Fisher codified it in the 1920s, and it has come to be called Fisher information, or FI for short.

In statistics, FI was initially used to measure how much information about an unknown parameter you can get from measurements of a random variable whose probability distribution depends on that parameter.  The key to this use is the Cramer-Rao inequality, which says that the variance (mean squared error) of any unbiased estimate has to be greater than or equal to 1 over the FI.  The more FI you get, the tighter your estimate can be.

The formal definition of the intrinsic FI in a probability distribution is “the expectation of the square of the gradient of the natural log of the probability density”.  That’s a mouthful, but in practice it’s usually not too hard to compute.  For a 1-dimensional distribution, the gradient is just the derivative.  One thing to note is that FI is dimensioned, unlike Shannon information which is always dimensionless.  If you’re measuring position, then the variance has dimensions of length squared (because the error is a length, and the variance is the mean squared error), so the (positional) FI has to have dimensions of 1 over length squared for Cramer-Rao to make sense.
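
Here is a small worked example (a toy case of my own, just to make the definition concrete): for a Gaussian of known width sigma and unknown center theta, the score d/dtheta ln p is just (x - theta)/sigma^2, so the FI per observation is 1/sigma^2, with dimensions of 1 over length squared as promised; and the sample mean of n observations sits right at the Cramer-Rao bound 1/(n * FI).

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma, n = 3.0, 2.0, 50        # true center, known width, samples per trial

    # FI per observation, straight from the definition E[(d/dtheta ln p)^2].
    x = rng.normal(theta, sigma, size=1_000_000)
    score = (x - theta) / sigma**2
    fi = np.mean(score**2)
    print(f"FI per observation ~ {fi:.4f}   (analytic 1/sigma^2 = {1/sigma**2:.4f})")

    # Cramer-Rao: variance of an unbiased estimate >= 1/(n * FI).
    means = rng.normal(theta, sigma, size=(100_000, n)).mean(axis=1)
    print(f"Var(sample mean) ~ {means.var():.4f}   bound 1/(n*FI) = {1/(n*fi):.4f}")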

Roy Frieden, an emeritus physicist at U Arizona, thinks FI can be the information basis of Wheeler’s dream.  His book Science From Fisher Information argues that many physical laws, including not just things like Statistical Mechanics but also Quantum Mechanics and classical Electromagnetism, can be derived directly from FI considerations.

I’ve been working my way through Frieden’s books, and decided I needed to take a few simple concrete examples and compute the FI, just to make sure I knew how it worked in practice.  The first example I chose was the simplest imaginable quantum system, the 1-D “particle in a box” or “particle in an infinite square well”.  It is well known (and easy for students to prove for themselves) that the energy eigenfunctions are just sine waves, with the nth wavefunction in a box of length L being sqrt(2/L) sin(n pi x/L).  The energy levels are E_n = hbar^2 pi^2 n^2 / (2 m L^2), so they depend not just on L, but also on the mass m of the particle.  All the energy here is kinetic energy (since the potential is zero everywhere inside the box), and heavier particles move slower.

Cranking through the FI calculation is pretty straightforward (skip this paragraph if the details are boring).  The probability density is the square of the wavefunction: P_n = (2/L) sin^2(n pi x/L).  The natural log of this is log(P_n) = log(2) – log(L) + 2 log(sin(n pi x/L)).  Taking the derivative with respect to x, the first two terms disappear (they are constants, so their derivative is zero); the last term can be handled by the chain rule df(g(x))/dx = df/dg * dg/dx.  Setting f(g) = log(g), g(h) = sin(h), and h(x) = n pi x/L, and applying the chain rule twice, we get d/dx log(P_n(x)) = 2 * (1/sin(h)) * cos(h) * (n pi / L) = (2 n pi / L) cot(n pi x/L).  The expectation of the square of this is just the integral of the square times the probability density over the length of the box, which (skipping over the tedious stuff) evaluates to FI = 4 pi^2 n^2 / L^2.  Note that this has dimensions 1/L^2, as expected.
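
As a sanity check on that algebra (my own, with L set to 1), a few lines of numerical integration reproduce FI = 4 pi^2 n^2 / L^2:

    import numpy as np

    L = 1.0
    x = np.linspace(0.0, L, 200_001)[1:-1]   # open interval: P vanishes at the walls
    dx = x[1] - x[0]

    for n in (1, 2, 3):
        P = (2 / L) * np.sin(n * np.pi * x / L) ** 2
        score = (2 * n * np.pi / L) / np.tan(n * np.pi * x / L)   # d/dx ln P_n
        fi = np.sum(score**2 * P) * dx                            # E[(d/dx ln P_n)^2]
        print(f"n={n}: numeric FI = {fi:.2f}, "
              f"analytic 4 pi^2 n^2 / L^2 = {4 * np.pi**2 * n**2 / L**2:.2f}")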

The interesting thing here is that the energy levels are proportional to n^2 / m L^2, and the FI is proportional to n^2 / L^2.  The kinetic energy is proportional to the (positional) Fisher information divided by the mass; in fact, comparing the two formulas gives E_n = hbar^2 FI_n / (8 m) exactly.  This is despite the fact that the energy and the FI are computed by entirely different procedures.  One uses the Schrodinger Equation, the other a purely statistical definition of information that predates the SE.

If your jaw hasn’t hit the floor yet, just wait a few seconds for that to sink in.  But wait, it gets better.

Remember the Heisenberg uncertainty relation?  For position and momentum it says that the product of the uncertainties of position and momentum must be at least a certain constant (hbar/2, as it happens), i.e.

delta x * delta p >= C

Divide both sides by delta p and then square both sides:

(delta x)^2 >= C^2 / (delta p)^2

but since (delta x)^2 is (crudely speaking) the variance of x, we also have the Cramer-Rao inequality

(delta x)^2 >= 1/FI

which suggests that 1/FI plays the same role as C^2 / (delta p)^2, i.e. that the Fisher information and the momentum-uncertainty-squared are closely related.  In other words, the Heisenberg uncertainty principle can be viewed as just a simple application of Fisher information.
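
A quick numeric check (mine, in units where hbar = L = 1) for the ground state of the box from above: both the Cramer-Rao product (delta x)^2 * FI and the Heisenberg product delta x * delta p clear their respective bounds, and for these standing-wave states the FI comes out equal to (2 delta p / hbar)^2.

    import numpy as np

    # Ground state (n = 1) of the box, in units where hbar = L = 1.
    L, n = 1.0, 1
    x = np.linspace(0.0, L, 200_001)[1:-1]
    dx = x[1] - x[0]
    P = (2 / L) * np.sin(n * np.pi * x / L) ** 2

    mean_x = np.sum(x * P) * dx
    var_x = np.sum(x**2 * P) * dx - mean_x**2      # (delta x)^2
    fi = 4 * np.pi**2 * n**2 / L**2                # from the calculation above
    dp = n * np.pi / L                             # delta p = hbar pi n / L

    print(f"Cramer-Rao:  (dx)^2 * FI = {var_x * fi:.3f}  (must be >= 1)")
    print(f"Heisenberg:  dx * dp     = {np.sqrt(var_x) * dp:.3f}  (must be >= 0.5)")
    print(f"FI = {fi:.2f}  vs  (2 * dp / hbar)^2 = {(2 * dp)**2:.2f}")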

I’ve exchanged emails with Frieden; he confirms that the linear relation of kinetic energy to FI holds throughout non-relativistic quantum mechanics.  When you go relativistically covariant (i.e. start using the Dirac equation instead of the Schrodinger equation), things change a bit, but there’s still a relationship.

Pretty heady stuff.  Hopefully, I’ve been able to convey a bit of my excitement along with some feel for the mechanics.

By the way, remember my comment a few days ago about finding a web page about Common integrals in quantum field theory useful?  A bunch of those come up when you try to find the FI for the Quantum Simple Harmonic Oscillator …