Lecture 2: Optimization principles for information flow
(last updated 27 March 2008)
Much of biological function is about the transmission and processing of information. In our sensory systems, information about the outside world is encoded into sequences of discrete, identical electrical pulses called action potentials or spikes. In bacteria, information about the availability of different nutrients is translated into the activity of particular proteins (transcription factors) which regulate the expression of specific genes required to exploit these different nutrients. In the developing embryo, cells acquire information about their position, and hence their fate in the adult organism, by responding to spatial variations in the concentration of specific "morphogen" molecules; in many cases these morphogens are again transcription factors. Given the importance of these different signaling systems in the life of the organism, it is tempting to suggest that Nature may have selected mechanisms which maximize the information which can be transmitted given the physical constraints. For the case of the neural code, this is an old idea, with some important successes (and failures). For the regulation of gene expression by transcription factors, we have just begun to explore the problem, but already there are some exciting results. I'll outline how maximizing information transmission provides a theory (rather than a highly parameterized model) for small networks of gene regulation, such as those relevant in embryonic development, in which most of the behavior of the network should be predictable once we know how many molecules of each kind the cell is willing to "spend," since the counting of these molecules sets the overall scale for information transmission. Even simple versions of this problem make successful experimental predictions.
Once again, a central issue in this lecture is how we build bridges from general principles (optimizing information transmission) to the details of particular systems (e.g., the expression of specific genes in the fruit fly embryo). But I hope to show explicitly how the same general principles are being used to think about very different biological systems, from bacteria to brains, making concrete our physicists' hope that there are concepts which can unify our thinking about these complex systems.
These notes are less complete than for the first lecture, in part because the phenomenological background is simpler. Also, much of what I talked about is in relatively recent papers, and so it doesn’t seem so essential to recapitulate the arguments here … you can read the originals. The outline is similarly compact:
I have tried to add a few things which, in retrospect, I probably should have told you in the lecture, but didn’t.
The generation of physicists who turned to biological phenomena in the wake of quantum mechanics noted that to understand life one has to understand not just the flow of energy (as in inanimate systems) but also the flow of information. There is, of course, some difficulty in translating the colloquial notion of information into something mathematically precise. Indeed, almost all statistical mechanics textbooks note that the entropy of a gas measures our lack of information about the microscopic state of the molecules, but often this connection is left a bit vague or qualitative. Shannon proved a theorem that makes the connection precise: entropy is the unique measure of available information consistent with certain simple and plausible requirements. Further, entropy also answers the practical question of how much space we need to use in writing down a description of the signals or states that we observe. This leads to a notion of efficient representation, and in this section of the course we'll explore the possibility that biological systems in fact form efficient representations of relevant information, or, more generally, that these systems maximize information flow subject to some simple physical constraints.
In order to get started you really need to know a little bit about information theory itself, and in particular to understand the uniqueness of the entropic measures as the mathematically precise version of our intuition about “information” as used in common language. I strongly recommend that you read Shannon’s original work, which is beautiful.
A mathematical theory of communication. CE Shannon, Bell Sys Tech J 27, 379-423 & 623-656 (1948).
I’ve also tried to give a summary of the key ideas with at least a start on the problems of relevance to biological systems, in a set of lecture notes: Some background on information theory. I hope this is useful. The standard modern textbook on the subject is
Elements of Information Theory. TM Cover & JA Thomas (Wiley, New York, 1991).
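To make the entropic measures concrete, here is a minimal sketch in Python. It computes the entropy S = -sum_i p_i log2 p_i of a discrete distribution and the mutual information I(X;Y) = S(X) + S(Y) - S(X,Y) from a joint probability table. The function names and the example "noisy channel" (a fair bit that is flipped with probability 0.1) are illustrative choices of mine, not something taken from the references above.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution (list of probabilities)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information_bits(joint):
    """Mutual information, in bits, from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal distribution of x
    py = [sum(col) for col in zip(*joint)]      # marginal distribution of y
    flat = [p for row in joint for p in row]    # joint distribution, flattened
    return entropy_bits(px) + entropy_bits(py) - entropy_bits(flat)

fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])

# Hypothetical example: a fair input bit that is flipped with probability 0.1.
noisy_channel = [[0.45, 0.05],
                 [0.05, 0.45]]
info = mutual_information_bits(noisy_channel)

print("entropy of a fair coin (bits):", fair_coin)
print("entropy of a 90/10 coin (bits):", biased_coin)
print("information through the noisy channel (bits):", info)
```

For this symmetric channel the information is one bit minus the entropy of the flip, so the second and third numbers printed should add up to one.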
The idea that information theory might be useful for thinking about the brain is something that occurred to people almost immediately. Indeed, Shannon himself used language as an example in his original work, and went on to do a wonderful experiment using the knowledge of native speakers in a “prediction game” to estimate the entropy of English text. This is another paper worth reading.
Prediction and entropy of written English. CE Shannon, Bell Sys Tech J 30, 50-64 (1951).
It is also interesting to note that Chomsky’s violent reaction to Shannon sits at the foundation of nearly fifty years of modern theoretical work in linguistics:
Three models for the description of language. N Chomsky, IRE Trans Inf Theory IT-2, 113-124 (1956).
We won’t enter into these controversies about language, although there are many interesting issues. A modern perspective, with which I have some sympathy, is presented by Pereira:
Formal grammar and information theory: Together again?. F Pereira, Phil Trans R Soc Lond 358, 1239-1253 (2000).
Starting in the mid 1950s, Attneave, Barlow and others discussed the possibility that the computations done by the brain, in particular the visual system, might be such as to capture the maximum amount of information, or to provide a more efficient representation of the (vast) incoming data. Most of this discussion was in words, with little formalism, but still very influential.
Some informational aspects of visual perception. F Attneave, Psych Rev 61,183-193 (1954).
Sensory mechanisms, the reduction of redundancy and intelligence. HB Barlow, in Proceedings of the Symposium on the Mechanization of Thought Processes, volume 2, DV Blake & AM Uttley, eds, pp 537-574 (HM Stationery Office, London, 1959).
Possible principles underlying the transformation of sensory messages. HB Barlow, in Sensory Communication, W Rosenblith, ed, pp 217-234 (MIT Press, Cambridge, 1961).
One more thing before we go on: While for physicists the idea of maximizing information transmission might seem natural, many biologists react violently against this. Separating out sociological factors, I think the real issue is how the abstract quantities defined by Shannon relate to the concrete problems of survival, rewards and punishments. This isn’t a new problem, and not long after Shannon’s original papers there was a paper establishing the quite astonishing connection between information and rewards in a gambling game.
A new interpretation of information rate. JL Kelly, Jr, Bell Sys Tech J 35, 917-926 (1956).
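A small numerical sketch of the simplest version of Kelly's setting may help. Assume a gambler bets at even odds on a binary outcome and receives a tip that is correct with probability p. If a fraction f of the capital is wagered on each bet, the expected growth rate is p log2(1+f) + (1-p) log2(1-f) bits per bet. The code scans over f and compares the best growth rate with the information carried by the tip, which is one bit minus the binary entropy of p. The value p = 0.75 and the variable names are illustrative assumptions.

```python
import math

def binary_entropy_bits(p):
    """Entropy, in bits, of a binary variable with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def growth_rate_bits(f, p):
    """Expected log2 growth of capital per bet: fraction f wagered at even odds,
    on a tip that is correct with probability p."""
    return p * math.log2(1 + f) + (1 - p) * math.log2(1 - f)

p_correct = 0.75                                   # assumed reliability of the tip
fractions = [i / 1000 for i in range(0, 1000)]     # candidate fractions 0 ... 0.999
best_f = max(fractions, key=lambda f: growth_rate_bits(f, p_correct))
best_growth = growth_rate_bits(best_f, p_correct)
channel_info = 1.0 - binary_entropy_bits(p_correct)

print("best fraction to bet:", best_f)
print("maximum growth rate (bits per bet):", best_growth)
print("information in the tip (bits):", channel_info)
```

In this example the optimal fraction comes out as 2p - 1, and the maximum growth rate equals the information in the tip: bits translate directly into the rate at which capital doubles.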
In Cover and Thomas you can read about generalizations of this idea, including the more dignified applications to portfolio management. More recently, several groups have tried to build on these ideas to make the optimization of information theoretic quantities a more plausible principle in the context of biological systems.
The fitness value of information. CT Bergstrom & M Lachmann, arXiv:q-bio.PE/0510007 (2005).
Phenotypic diversity, population growth, and information in fluctuating environments. E Kussell & S Leibler, Science 309, 2075-2078 (2005).
Information and fitness. SF Taylor, N Tishby & W Bialek, arXiv:0712.4382 [q-bio.PE] (2007).
I believe this problem—does biology care about bits?—is something we will see more about in the near future.
One of the first attempts to formalize the optimization of information transmission was by Laughlin. The problem he considered was relatively simple: if a cell in the retina turns light intensity into a voltage difference across the cell membrane, how should this transformation be chosen to maximize the amount of information that this cell can provide about the visual world?
A simple coding procedure enhances a neuron's information capacity. SB Laughlin, Z Naturforsch 36c, 910-912 (1981).
The essential idea is that information transmission is maximized when the input/output relation of the neuron is matched to the distribution of inputs. In the context of vision, we should think of the inputs as coming from the outside world, and the neuron adapts to match these. There are two really important ideas here. First is that the characteristics of a biological system should be derivable from some general principle (in this case, maximizing information transmission). Second is that the function of such systems only makes sense in their natural context, and so we need to understand this context before we can understand the system.
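The matching idea can be sketched numerically in an idealized limit. Assume the output has a fixed number of equally reliable, distinguishable levels; then transmitting maximal information means using all levels equally often, which happens when the input/output relation is the cumulative distribution of the inputs. The code below compares this matched relation with a simple linear one. The log-normal input ensemble, the 16 output levels, and all the names are assumptions for illustration, not measured quantities.

```python
import bisect
import math
import random

random.seed(0)

# Assumed input ensemble: a skewed (log-normal) stand-in for natural contrasts.
inputs = [random.lognormvariate(0.0, 1.0) for _ in range(20000)]
sorted_inputs = sorted(inputs)
lo, hi = sorted_inputs[0], sorted_inputs[-1]

def matched_response(x):
    """Input/output relation equal to the empirical cumulative distribution."""
    return bisect.bisect_right(sorted_inputs, x) / len(sorted_inputs)

def linear_response(x):
    """Input/output relation that spreads the full input range linearly over [0, 1]."""
    return (x - lo) / (hi - lo)

def output_entropy_bits(response, n_levels=16):
    """Entropy of the output when the response is read out in n_levels equal steps."""
    counts = [0] * n_levels
    for x in inputs:
        level = min(int(response(x) * n_levels), n_levels - 1)
        counts[level] += 1
    total = len(inputs)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

h_matched = output_entropy_bits(matched_response)
h_linear = output_entropy_bits(linear_response)

print("output entropy, matched relation (bits):", h_matched)
print("output entropy, linear relation (bits):", h_linear)
print("maximum possible (bits):", math.log2(16))
```

The linear relation wastes most of its levels on the rare large inputs in the tail, while the matched relation uses every level equally and so comes close to the maximum of log2(16) = 4 bits.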
One of many questions left open in Laughlin’s discussion is the time scale on which the matching should occur. One could imagine that there is a well defined distribution of input signals, stable on very long time scales, in which case the matching could occur through evolution. Another possibility is that the distribution is learned during the lifetime of the individual organism, perhaps largely during the development of the brain to adulthood. Finally one could think about mechanisms of adaptation that would allow neurons to adjust their input/output relations in real time, tracking changes in the input distribution. It seems likely that the correct answer is “all of the above.” But the last possibility, real time tracking of the input distribution, is interesting because it opens the possibility for new experimental tests.
We know that some level of real time matching occurs, as in the example of light and dark adaptation in the visual system. We can think of this as neurons adjusting their input/output relations to match the mean of the input distribution. The real question, then, is whether there is adaptation to the distribution, or just to the mean. Actually, there is also a question about the world we live in, which is whether there are other features of the distribution that change slowly enough to be worth tracking in this sense.
As an example, we know that many signals that reach our sensory systems come from distributions that have long tails. In some cases (e.g., in olfaction, where the signal—odorant concentration—is a passive tracer of a turbulent flow) there are clear physical reasons for these tails, and indeed it’s been an important theoretical physics problem to understand this behavior quantitatively. In most cases, the tails arise through some form of intermittency. Thus, we can think of the distribution of signals as being approximately Gaussian, but the variance of this Gaussian itself fluctuates; samples from the tail of the distribution arise in places where the variance is large. It turns out that this scenario also holds for images of the natural world, so that there are regions of high variance and regions of low variance.
Statistics of natural images: Scaling in the woods. DL Ruderman & W Bialek, Phys Rev Lett 73, 814-817 (1994).
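The picture of a Gaussian with fluctuating variance is easy to simulate. In this sketch each sample is Gaussian, but its standard deviation is itself drawn from a broad distribution; I assume a log-normal here purely for illustration. The mixture has long tails (large excess kurtosis), and dividing each sample by its local standard deviation brings back Gaussian statistics. Note that the code divides by the true local standard deviation, which a real system would have to estimate from nearby samples.

```python
import random

random.seed(1)

def excess_kurtosis(samples):
    """Fourth moment over squared variance, minus 3 (zero for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / m2 ** 2 - 3.0

n = 50000
# Assumed model: the local standard deviation fluctuates with a log-normal distribution.
local_sigma = [random.lognormvariate(0.0, 0.7) for _ in range(n)]
signal = [random.gauss(0.0, s) for s in local_sigma]

# "Variance normalization": express each sample in units of its local standard deviation.
normalized = [x / s for x, s in zip(signal, local_sigma)]

k_raw = excess_kurtosis(signal)
k_norm = excess_kurtosis(normalized)

print("excess kurtosis of raw signal:", k_raw)
print("excess kurtosis after variance normalization:", k_norm)
```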
As explained in this paper, the possibility of “variance normalization” in images suggests that the visual system could code more efficiently by adapting to the local variance, in addition to the local mean (light and dark adaptation). Fortunately, we found a group of experimentalists willing to search for this effect in the retina, and we found it.
Adaptation of retinal processing to image contrast and spatial scale. S Smirnakis, MJ Berry II, DK Warland, W Bialek & M Meister, Nature 386, 69-73 (1997).
The situation is complicated, because the retina exhibits many different adaptation effects on different time scales, and historically these have been uncovered by experiments using very different kinds of stimuli. Amusingly, the submitted version of Smirnakis et al (1997) was very explicit in asking whether we could observe adaptation to the distribution of inputs, but some combination of referees and editors took this out. Nonetheless, the idea that the retina adapts to the statistics of its inputs seems to have taken hold. Here are some more recent papers on this theme, including the exploration of mechanisms:
Temporal contrast adaptation in the input and output signals of salamander retinal ganglion cells. KJ Kim & F Rieke, J Neurosci 21, 287-299 (2001).
Fast and slow contrast adaptation in retinal circuitry. SA Baccus & M Meister, Neuron 36, 909-919 (2002).
Slow Na+ inactivation and variance adaptation in salamander retinal ganglion cells. KJ Kim & F Rieke, J Neurosci 23, 1506-1516 (2003).
Dynamic predictive coding by the retina. T Hosoya, SA Baccus & M Meister, Nature 436, 71-77 (2005).
I don’t think we have seen the end of this subject. It is reasonable to expect that adaptation in the retina (which, after all, is only the first stage of visual processing!) is sensitive to many higher order statistical structures characterizing the distribution of inputs, and that we haven’t yet found the right language for enumerating these structures.
The original work demonstrating the adaptation to variance in the retina didn’t look closely at the form of the input/output relations. Instead, the idea was that when we change the distribution of inputs, the mean rate at which spikes are generated will change, and if there is adaptation we should be able to see this as a (slower) relaxation of this mean rate to some new steady state. [Notice that, in contrast to the case discussed by Laughlin, these experiments look at neurons that generate spikes (action potentials).] But since the theory is really about the input/output relations, we should look more directly at these.
There is a substantial technical problem here. The usual way that we measure an input/output relation is to pick an input, give it to the system, and then measure the output; then we pick another input, and so on. But what happens if the system is keeping track of the distribution out of which we are choosing the inputs? The problem is actually more serious because inputs are not discrete (like the spikes!), but rather continuous functions of time (such as light intensity, or, in the case we’ll discuss next, the angular velocity of motion across the visual field). So we need to characterize the transformation from a continuous function of time to discrete spikes in a context where the continuous functions are drawn out of some distribution functional, and then change this functional and repeat the process. We can imagine many ways to do this, at varying levels of approximation. It’s actually important that we be able to assess the completeness of our description, since when we change input distributions the quality of our approximations might change, and this could be confused with genuine changes in the input/output relation.
To be blunt, if the mapping from continuous, dynamic input signals to spikes is arbitrarily complex, then no reasonable experiment will be able to “measure” the input/output relation in any meaningful sense. In particular, if the neuron is sensitive to even a small window in the history of its continuous dynamic inputs, then the relevant input can easily require hundreds of numbers for its description. The neuron then implements a function on this high dimensional space. Since we can’t do experiments that fully sample this large space (and one might start to wonder if the brain itself could do such a sampling …), all progress depends on simplifying hypotheses. Let me emphasize that this problem exists whether or not you find the theoretical ideas about optimization interesting, but certainly those ideas provide a clear motivation.
One simplifying hypothesis is that neurons are sensitive not to the full stimulus space, but rather to some low dimensional subspace. This idea has many origins, starting with the earliest work on receptive fields in the visual system by Barlow, Hartline and Kuffler in the 1950s. In the context of continuous dynamic inputs, there was an initial hint about the utility of this idea from early experiments on the motion sensitive neurons in the fly visual system.
Real-time performance of a movement sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. R de Ruyter van Steveninck & W Bialek, Proc R Soc London Ser B 234, 379-414 (1988).
I’ll admit, however, that it took us a long time to fully understand what was going on (we did other things in between). The basic idea is that we want to look at every single spike and ask what happened to “cause” that event. If we think of the inputs as vectors in some space of high dimensionality, then by collecting many spikes we have many samples of these vectors. We can calculate the mean vector (the “spike triggered average,” which has been widely used in neuroscience), but to ask about sensitivity to multiple dimensions we need an object with more indices, so we try the covariance matrix of these vectors. If the distribution of inputs is Gaussian, then one can show that the difference between this “spike triggered covariance” and the full covariance matrix of the stimuli will be of low rank. The rank counts the dimensionality of the relevant subspace, and the eigenvectors of this low rank matrix provide (after a rotation to remove the stimulus correlations) a coordinate system on this space. If the dimensionality is sufficiently low, then we can just go back over the data and sample all the relevant distributions, measuring the probability of spike generation as a function of the (relevant) stimulus coordinates with a dynamic range related to the number of spikes that we collected. All of the details of this approach (many of which could be decoded from an appendix in Brenner et al 2000; see below) are explained here:
Features and dimensions: Motion estimation in fly vision. W Bialek & RR de Ruyter van Steveninck, q-bio/0505003 (2005).
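Here is a minimal sketch of the spike triggered covariance idea on simulated data; it is not the analysis pipeline of the papers above. I assume white Gaussian stimuli, so no rotation is needed to remove stimulus correlations. The hypothetical model neuron responds to the "energy" in a two dimensional subspace. Because this response is symmetric, the spike triggered average is close to zero and reveals nothing, while the difference between the spike triggered covariance and the stimulus covariance has two large eigenvalues whose eigenvectors span the relevant subspace. All parameters (dimension, threshold, sample size) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 20, 100_000

# White Gaussian stimuli: each row is one stimulus vector.
stimuli = rng.standard_normal((n_samples, dim))

# Hypothetical model neuron: two orthonormal relevant directions (columns of basis).
basis, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
projections = stimuli @ basis
energy = np.sum(projections ** 2, axis=1)
spike_prob = 1.0 / (1.0 + np.exp(-(energy - 3.0) / 0.5))
spikes = rng.random(n_samples) < spike_prob

# Collect the stimuli that "caused" spikes and compute their first two moments.
triggered = stimuli[spikes]
sta = triggered.mean(axis=0)                              # spike triggered average
delta_cov = np.cov(triggered, rowvar=False) - np.cov(stimuli, rowvar=False)

# The change in covariance should be (approximately) of low rank.
eigvals, eigvecs = np.linalg.eigh(delta_cov)
order = np.argsort(np.abs(eigvals))[::-1]                 # largest |eigenvalue| first
top_vals = eigvals[order[:2]]
top_vecs = eigvecs[:, order[:2]]

# Fraction of the recovered 2D subspace lying in the true one (1 means identical).
overlap = np.linalg.norm(basis.T @ top_vecs) ** 2 / 2.0

print("norm of spike triggered average:", np.linalg.norm(sta))
print("two largest eigenvalues of covariance change:", top_vals)
print("third largest (in magnitude):", eigvals[order[2]])
print("subspace overlap:", overlap)
```

With the relevant subspace in hand, one could go back over the data and histogram the spike probability as a function of the two projections, as described in the text.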
The spike triggered covariance matrix has become a fairly widespread tool in the analysis of neural responses to complex sensory inputs. Some examples:
Two dimensional time coding in the auditory brainstem. SJ Slee, MH Higgs, AL Fairhall & WJ Spain, J Neurosci 25, 9978-9988 (2005).
Spatiotemporal elements of macaque V1 receptive fields. NC Rust, O Schwartz, JA Movshon & EP Simoncelli, Neuron 46, 945-956 (2005).
Selectivity for multiple stimulus features in retinal ganglion cells. AL Fairhall, CA Burlingame, R Narasimhan, RA Harris, JL Puchalla & MJ Berry II, J Neurophysiol 96, 2724-2738 (2006).
Excitatory and suppressive receptive field subunits in awake monkey primary visual cortex (V1). X Chen, F Han, MM Poo & Y Dan, Proc Nat’l Acad Sci (USA) 104, 19120-19125 (2007).
This method remains limited, however, to the case of inputs drawn from a Gaussian distribution. Of course, natural signals aren’t Gaussian, and we would like to be able to do all of this in the fully natural context (we’re not there yet!). One idea, which seems like a good one in principle but might be unmanageable in practice, is to look for a limited set of dimensions that preserve the mutual information between (sensory) input and (spike) output. Perhaps surprisingly, this idea can be turned into a practical algorithm,
Analyzing neural responses to natural signals: Maximally informative dimensions. T Sharpee, NC Rust & W Bialek, Neural Comp 16, 223-250 (2004); physics/0212110.
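To give a feeling for the quantity being optimized, the sketch below evaluates the information per spike carried by the projection of the stimulus onto a single direction, I = sum_x P(x|spike) log2[P(x|spike)/P(x)], estimated with histograms. This is only the objective function; the search over directions in the paper is not reproduced here. The non-Gaussian (Laplace) stimuli, the threshold-like model neuron, and the binning are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, n_samples = 10, 100_000

# Assumed non-Gaussian stimulus ensemble: independent Laplace-distributed components.
stimuli = rng.laplace(size=(n_samples, dim))

# Hypothetical model neuron: sensitive to one direction, with a soft threshold.
true_dir = np.ones(dim) / np.sqrt(dim)
drive = stimuli @ true_dir
spikes = rng.random(n_samples) < 1.0 / (1.0 + np.exp(-(drive - 1.5) / 0.3))

def info_per_spike_bits(direction, n_bins=25):
    """Histogram estimate of the information per spike carried by one projection."""
    direction = direction / np.linalg.norm(direction)
    x = stimuli @ direction
    # Equal-occupancy bins, so every bin has prior probability greater than zero.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    prior, _ = np.histogram(x, bins=edges)
    cond, _ = np.histogram(x[spikes], bins=edges)
    p_prior = prior / prior.sum()
    p_cond = cond / cond.sum()
    ok = p_cond > 0
    return float(np.sum(p_cond[ok] * np.log2(p_cond[ok] / p_prior[ok])))

# Compare the relevant direction with a random direction orthogonal to it.
other_dir = rng.standard_normal(dim)
other_dir -= (other_dir @ true_dir) * true_dir

info_true = info_per_spike_bits(true_dir)
info_other = info_per_spike_bits(other_dir)

print("information per spike along the relevant direction (bits):", info_true)
print("information per spike along an orthogonal direction (bits):", info_other)
```

A search for maximally informative dimensions would adjust the direction to climb this function; here we only check that the relevant direction scores far higher than an irrelevant one.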
Despite the intuition that information requires knowing a full distribution rather than just moments, it turns out that the search for maximally informative dimensions is no more “hungry” for data than are more conventional statistical methods:
Comparison of objective functions for estimating linear-nonlinear models. TO Sharpee, arXiv:0801.0311 [q-bio.NC] (2008).
There is much more to do here.
One interesting example of natural signals with a drifting distribution is the angular motion of flying insects. If you watch a fly, you can see it going nearly straight, you can see it on a windy day making a more meandering flight, and you can also see (although these are fast!) dramatic acrobatics. “Straight” is an approximate statement, and if you look closely you’ll see that trajectories wander by a few degrees, with corrections occurring on the time scale of 1/10 of a second, so typical velocities are ~50 deg/s. At the opposite extreme, during acrobatic flight the angular velocities can be thousands of degrees per second.
Flies, like us, have neurons in their visual systems which are sensitive to motion across the retina. Also like us, they use the output of these neurons to guide their motion through the world. Some of these cells, which integrate information across most of the visual field, are huge, and thus one can make very stable recordings of their activity under a wide range of conditions. One has clear links between the activity of these cells and the motor behavior of the fly (for example, producing a torque during flight to correct for deviations from a straight course), and one can also characterize very completely the signals and noise in the photoreceptors of the compound eye that provide the input for the computation of motion. For these reasons, the fly’s motion sensitive neurons have been an important testing ground for theories of coding and computation in the brain. Some of this work was reviewed in
Spikes: Exploring the Neural Code. F Rieke, D Warland, R de Ruyter van Steveninck & W Bialek (MIT Press, Cambridge, 1997),
although much has been done in the decade since that book was published. One aspect of this newer work is to measure the input/output relation of these cells under conditions where the angular velocity of visual motion is fluctuating with Gaussian statistics, and then see how things change when we change the variance of these inputs, as might happen in the transition from straight to acrobatic flight, or from a quiet to a windy day.
Adaptive rescaling optimizes information transmission. N Brenner, W Bialek & R de Ruyter van Steveninck, Neuron 26, 695-702 (2000).
The results of such experiments show that, at least over some range of conditions, the relationship between angular velocity and the probability of generating an action potential depends strongly on the distribution of inputs. As we increase the variance of the input signals, the input/output relation becomes more shallow, so that the dynamic range over which the neuron’s response is modulated becomes larger. In fact, there is an almost perfect rescaling, so that the output across a range of input distributions is a function not of the absolute angular velocity, but of the velocity in units of its standard deviation.
If we imagine trying to derive the input/output relation as the solution to an optimization problem, then so long as there is no intrinsic scale for the stimulus (e.g., stimuli are not so small as to be confused with internal noise, or so large that they cannot be effectively transduced) the only scale along the stimulus axis comes from the input distribution itself. In this sense, scaling is evidence that the system is optimizing something, trying to match the input/output relation to the distribution of inputs. But can we show that it really is information that is being maximized?
The statement that input/output relations exhibit scaling still leaves one dimensionless parameter unspecified. We can think of this as being the dynamic range of the neuron (for example, the range of angular velocities needed to modulate the response between 10% and 90% of the maximum) divided by the standard deviation of the inputs. We can imagine that the system could choose anything for this dimensionless ratio, and we can calculate (using only measured properties of the input/output relation and noise in the system) how much information would be transferred from the input signal to the spikes as a function of this parameter. In fact, the real system operates at a value of this ratio that maximizes the information.
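The logic of this calculation can be illustrated with a toy model; the real analysis uses the measured input/output relation and noise, which I do not reproduce here. Assume that in each small time bin the neuron either spikes or not, with a probability that is a sigmoidal function of the stimulus z measured in units of its standard deviation. A single "stretch" factor rescales the stimulus axis, changing threshold and dynamic range together. The information between a Gaussian stimulus and the binary output is then the entropy of the output minus the average noise entropy. The sigmoid shape and every parameter value (p_max, z0, width) are hypothetical.

```python
import math

def binary_entropy_bits(p):
    """Entropy, in bits, of a binary variable with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def spike_probability(z, stretch, p_max=0.1, z0=1.0, width=0.1):
    """Toy input/output relation; z is the stimulus in units of its standard deviation,
    and stretch rescales the whole stimulus axis. Written with tanh to avoid overflow."""
    u = (z / stretch - z0) / width
    return p_max * 0.5 * (1.0 + math.tanh(0.5 * u))

def information_bits(stretch, n_grid=4001, z_max=6.0):
    """Information (bits per time bin) between a unit-variance Gaussian stimulus
    and the binary spike/no-spike output, by numerical integration over z."""
    dz = 2 * z_max / (n_grid - 1)
    mean_p = 0.0
    noise_entropy = 0.0
    for i in range(n_grid):
        z = -z_max + i * dz
        weight = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi) * dz
        p = spike_probability(z, stretch)
        mean_p += weight * p
        noise_entropy += weight * binary_entropy_bits(p)
    return binary_entropy_bits(mean_p) - noise_entropy

stretches = [0.05 * k for k in range(1, 61)]      # stretch factors 0.05 ... 3.0
infos = [information_bits(s) for s in stretches]
best = max(range(len(stretches)), key=lambda k: infos[k])

print("stretch factor that maximizes information:", stretches[best])
print("information at the optimum (bits per bin):", infos[best])
```

In this toy model the information has a maximum at an intermediate stretch: if the relation is stretched too much the neuron almost never responds, and if it is compressed too much spikes become less selective. The experimental question is whether the real neuron sits at the corresponding optimum.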
If this form of adaptive rescaling serves to optimize information transmission, then immediately after we change the distribution of inputs, the input/output relation should be suboptimally matched, and we should be able to “catch” the system transmitting information less efficiently. This actually works:
Efficiency and ambiguity in an adaptive neural code. AL Fairhall, GD Lewen, W Bialek & RR de Ruyter van Steveninck, Nature 412, 787-792 (2001).
What one sees, though, is that the window of time during which the match is less than optimal is quite short, on the order of 100 milliseconds or less. This makes sense since this is about the minimum time required to be statistically sure that the distribution has actually changed (!). On the other hand, there are dynamics in response to the switch on longer time scales, and these may serve to resolve the ambiguities created by the rescaling. Another remarkable observation is that there doesn’t seem to be any single time scale associated with the responses to changing the distribution, so that one can see signatures of fractional differentiation and power-law decays. Finally, one might worry that since some things change so quickly, you really shouldn’t think of this as an input/output relation plus a separate adaptation mechanism, but rather as one big nonlinear system that looks like these two parts when we probe it with simplified stimuli. There are major open questions here.
Closely related phenomena of adaptation to the distribution have been seen in other systems, too (this is surely incomplete):
Shifts in coding properties and maintenance of information transmission during adaptation in barrel cortex. M Maravall, RS Petersen, AL Fairhall, E Arabzadeh & ME Diamond, PLoS Biol 5, e19 (2007).
Adaptive filtering enhances information transmission in visual cortex. TO Sharpee, H Sugihara, AV Kurgansky, SP Rebik, MP Stryker & KD Miller, Nature 439, 936-942 (2006).
Let me also add some other references to experiments that I think are connected to these ideas of matching and optimization, but maybe not quite so directly. One theme is that many different neural systems seem “more efficient” (in some sense) when probed with stimuli that incorporate more of the statistical structures that occur in the natural world.
Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory neurons. F Rieke, DA Bodnar & W Bialek, Proc R Soc Lond Ser B 262, 259-265 (1995).
Neural coding of naturalistic motion stimuli. GD Lewen, W Bialek & RR de Ruyter van Steveninck, Network 12, 317-329 (2001); physics/0103088.
Natural stimulation of the nonclassical receptive field increases information transmission efficiency in V1. WE Vinje & JL Gallant, J Neurosci 22, 2904-2915 (2002).
Processing of low probability sounds by cortical neurons. N Ulanovsky, L Las & I Nelken, Nature Neurosci 6, 391-398 (2003).
Neural coding of a natural stimulus ensemble: Information at sub-millisecond resolution. I Nemenman, GD Lewen, W Bialek & RR de Ruyter van Steveninck, PLoS Comp Bio 4, e1000025 (2008); q-bio.NC/0612050 (2006).
Information flow and optimization in transcriptional regulation. G Tkacik, CG Callan Jr & W Bialek, arXiv:0705.0313 [q-bio.MN] (2007).
Information capacity of genetic regulatory elements. G Tkacik, CG Callan Jr & W Bialek, arXiv:0709.4209 [q-bio.MN] (2007).