**Lecture 2: Optimization principles for information flow**

(last updated 27 March
2008)

Much
of biological function is about the transmission and processing of information.
In our sensory systems, information about the outside world is encoded
into sequences of discrete, identical electrical pulses called action
potentials or spikes. In bacteria, information about the availability of
different nutrients is translated into the activity of particular proteins
(transcription factors) which regulate the expression of specific genes
required to exploit these different nutrients. In the developing embryo,
cells acquire information about their position, and hence their fate in the
adult organism, by responding to spatial variations in the concentration of
specific "morphogen" molecules; in many cases these morphogens are
again transcription factors. Given the importance of these different
signaling systems in the life of the organism, it is tempting to suggest that
Nature may have selected mechanisms which maximize the information which can be
transmitted given the physical constraints. For the case of the neural
code, this is an old idea, with some important successes (and failures).
For the regulation of gene expression by transcription factors, we have
just begun to explore the problem, but already there are some exciting results.
I'll outline how maximizing information transmission provides a theory
(rather than a highly parameterized model) for small networks of gene
regulation, such as those relevant in embryonic development, in which most of
the behavior of the network should be predictable once we know how many
molecules of each kind the cell is willing to "spend," since the
counting of these molecules sets the overall scale for information
transmission. Even simple versions of this problem make successful
experimental predictions.

Once again, a central issue in this lecture is how we
build bridges from general principles (optimizing information transmission) to
the details of particular systems (e.g., the expression of specific genes in
the fruit fly embryo). But I hope to show explicitly how the same general
principles are being used to think about very different biological systems, from
bacteria to brains, making concrete our physicists' hope that there are
concepts which can unify our thinking about these complex systems.

These notes are less complete than for the first
lecture, in part because the phenomenological background is simpler. Also, much of what I talked about is in
relatively recent papers, and so it doesn't seem so essential to recapitulate
the arguments here ... you can read the originals. The outline is similarly compact:

**Input/output relations and information flow in neural coding, part I**

**Input/output relations and information flow in neural coding, part II**

**Information flow in genetic control**

I have tried to add a few things which, in retrospect, I probably should have told you
in the lecture, but didn't.

The generation of physicists who turned to biological
phenomena in the wake of quantum mechanics noted that to understand life one
has to understand not just the flow of energy (as in inanimate systems) but
also the flow of information.
There is, of course, some difficulty in translating the colloquial
notion of information into something mathematically precise. Indeed, almost all statistical
mechanics textbooks note that the entropy of a gas measures our lack of
information about the microscopic state of the molecules, but often this
connection is left a bit vague or qualitative. Shannon proved a theorem that
makes the connection precise:
entropy is the unique measure of available information consistent with
certain simple and plausible requirements. Further, entropy also answers the
practical question of how much space we need to use in writing down a
description of the signals or states that we observe. This leads to a notion of *efficient representation*, and in this section of the
course we'll explore the possibility that biological systems in fact form
efficient representations of relevant information, or more generally
that these systems maximize information flow subject to some simple physical
constraints.
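As a small numerical illustration of entropy as "the space needed to write things down," the sketch below compares the entropy of a four-symbol source with the average length of a binary code matched to it. The two distributions and the code lengths are illustrative assumptions, chosen so that the probabilities are powers of 1/2 and the match is exact.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two hypothetical four-symbol sources (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]

h_uniform = entropy_bits(uniform)
h_skewed = entropy_bits(skewed)

# A prefix code matched to the skewed source: code words 0, 10, 110, 111.
# Frequent symbols get short words, rare symbols get long ones.
code_lengths = [1, 2, 3, 3]
mean_length = sum(p * n for p, n in zip(skewed, code_lengths))

print(f"uniform source: {h_uniform:.3f} bits per symbol")
print(f"skewed source:  {h_skewed:.3f} bits per symbol")
print(f"mean code length for skewed source: {mean_length:.3f} bits")
```

For the skewed source the average code length equals the entropy, which is less than the two bits per symbol that a fixed-length code would use; this is the sense in which the matched code is an efficient representation.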

In order to get started you really need to know a little
bit about information theory itself, and in particular to understand the
uniqueness of the entropic measures as the mathematically precise version of
our intuition about "information" as used in common language. I strongly recommend that you read Shannon's
original work, which is beautiful.

A mathematical theory of communication. CE Shannon, *Bell Sys Tech J* **27,** 379-423 & 623-656 (1948).

I've also tried to give a summary of the key ideas with
at least a start on the problems of relevance to biological systems, in a set
of lecture notes: Some background on information
theory. I hope this is
useful. The standard modern
textbook on the subject is

*Elements of Information
Theory*. TM Cover & JA Thomas (Wiley, New
York, 1991).

The idea that information theory might be useful for
thinking about the brain is something that occurred to people almost
immediately. Indeed, Shannon
himself used language as an example in his original work, and went on to do a
wonderful experiment using the knowledge of native speakers in a "prediction
game" to estimate the entropy of English text. This is another paper worth reading.

Prediction and entropy of written English. CE Shannon, *Bell Sys Tech J* **30,** 50-64 (1951).

It is also interesting to note that Chomsky's violent
reaction to Shannon sits at the foundation of nearly fifty years of modern
theoretical work in linguistics:

Three models for the description of language.
N Chomsky, *IRE Trans Inf Theory* **IT-2,** 113-124 (1956).

We won't enter into these controversies about language,
although there are many interesting issues. A modern perspective, with which I have some sympathy, is
presented by Pereira:

Formal grammar and information theory: Together again? F Pereira, *Phil Trans R Soc Lond* **358,** 1239-1253 (2000).

Starting in the mid 1950s, Attneave, Barlow and others
discussed the possibility that the computations done by the brain, in
particular the visual system, might be such as to capture the maximum amount of
information, or to provide a more efficient representation of the (vast)
incoming data. Most of this
discussion was in words, with little formalism, but still very influential.

Some informational aspects of visual perception. F Attneave, *Psych Rev* **61,** 183-193 (1954).

Sensory mechanisms, the reduction of redundancy and intelligence. HB Barlow, in *Proceedings of the
Symposium on the Mechanization of Thought Processes, volume 2*, DV Blake & AM Uttley, eds,
pp 537-574 (HM Stationery Office, London, 1959).

Possible principles underlying
the transformation of sensory messages. HB Barlow, in *Sensory Communication*, W Rosenblith, ed, pp 217-234
(MIT Press, Cambridge, 1961).


One more thing before we go on: While for physicists the idea of
maximizing information transmission might seem natural, many biologists react
violently against this. Separating
out sociological factors, I think the real issue is how the abstract quantities
defined by Shannon relate to the concrete problems of survival, rewards and
punishments. This isn't a new
problem, and not long after Shannon's original papers there was a paper
establishing the quite astonishing connection between information and rewards
in a gambling game.

A new interpretation of information rate. JL Kelly, Jr, *Bell Sys Tech J* **35,** 917-926 (1956).
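The connection can be checked numerically in the simplest textbook version of the gambling game, which is an assumption made here for illustration: repeated even-odds bets on a binary outcome, where the gambler has side information that is correct with probability p. The function names, the value p = 0.75, and the grid of betting fractions are all illustrative choices. The sketch scans the fraction of capital staked on each bet and compares the best achievable growth rate with the information carried by the side information, 1 - H(p) bits per bet.

```python
import math

def growth_rate(f, p):
    """Expected log2 growth of capital per bet when a fraction f is staked
    at even odds on an outcome that is predicted correctly with probability p."""
    return p * math.log2(1 + f) + (1 - p) * math.log2(1 - f)

def binary_entropy(p):
    """Entropy in bits of a binary variable with probabilities p and 1 - p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.75                                     # assumed reliability of the side information
fractions = [i / 1000 for i in range(1000)]  # candidate stakes f in [0, 0.999]

best_f = max(fractions, key=lambda f: growth_rate(f, p))
best_growth = growth_rate(best_f, p)
information_rate = 1 - binary_entropy(p)

print(f"best fraction to stake: {best_f:.3f} (compare 2p - 1 = {2 * p - 1:.3f})")
print(f"maximum growth rate:    {best_growth:.5f} bits per bet")
print(f"information rate:       {information_rate:.5f} bits per bet")
```

In this toy version the maximum growth rate of the gambler's capital coincides with the information rate, so that bits translate directly into rewards.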

In Cover and Thomas you can read about
generalizations of this idea, including the more dignified applications to
portfolio management. More
recently, several groups have tried to build on these ideas to make the
optimization of information theoretic quantities a more plausible principle in
the context of biological systems.

The fitness value of information. CT Bergstrom & M Lachmann, arXiv:q-bio.PE/0510007 (2005).

Phenotypic diversity, population growth, and information in fluctuating environments. E Kussell
& S Leibler, *Science* **309,** 2075-2078 (2005).

Information and fitness. SF Taylor, N Tishby & W Bialek, arXiv:0712.4382 [q-bio.PE] (2007).

I believe this problem—does biology care about
bits?—is something we will see more about in the near future.


**Input/output relations and information flow in
neural coding, part I**

One of the first attempts to formalize the optimization
of information transmission was by Laughlin. The problem he considered was relatively simple: if a cell in the retina turns light
intensity into a voltage difference across the cell membrane, how should this
transformation be chosen to maximize the amount of information that this cell
can provide about the visual world?

A simple coding procedure enhances a neuron's information capacity. SB Laughlin, *Z Naturforsch* **36c,** 910-912 (1981).

The essential idea is that information transmission is
maximized when the input/output relation of the neuron is matched to the
distribution of inputs. In the
context of vision, we should think of the inputs as coming from the outside
world, and the neuron adapts to match these. There are two really important ideas here. First is that the characteristics of a
biological system should be derivable from some general principle (in this
case, maximizing information transmission). Second is that the function of such systems only makes sense
in their natural context, and so we need to understand this context before we
can understand the system.
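To make the matching idea concrete, here is a minimal sketch under simplifying assumptions: the output has a fixed range with a fixed number of equally distinguishable levels, and the input ensemble is Gaussian (an illustrative stand-in, not the measured distribution of natural contrasts). In this setting the standard statement of the result is that the best input/output relation is the cumulative distribution of the inputs, which uses all output levels equally often. The sketch compares such a matched relation with a hypothetical linear one; all names and parameter values are illustrative.

```python
import bisect
import math
import random

def entropy_bits(counts):
    """Entropy in bits of the empirical distribution given by bin counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def histogram(values, n_bins):
    """Count values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts

rng = random.Random(0)
# Assumed input ensemble (illustrative): Gaussian signals with unit variance.
training = sorted(rng.gauss(0.0, 1.0) for _ in range(20000))
test_inputs = [rng.gauss(0.0, 1.0) for _ in range(20000)]

def matched_response(s):
    """Normalized output equal to the empirical cumulative distribution of inputs."""
    return bisect.bisect_right(training, s) / len(training)

def linear_response(s, s_max=4.0):
    """Mismatched alternative: linear in s over [-s_max, s_max], clipped to [0, 1]."""
    return min(max((s + s_max) / (2 * s_max), 0.0), 1.0)

n_levels = 10  # assumed number of distinguishable output levels
h_matched = entropy_bits(histogram([matched_response(s) for s in test_inputs], n_levels))
h_linear = entropy_bits(histogram([linear_response(s) for s in test_inputs], n_levels))
h_max = math.log2(n_levels)

print(f"matched relation: {h_matched:.3f} bits (maximum possible {h_max:.3f})")
print(f"linear relation:  {h_linear:.3f} bits")
```

The matched relation comes close to the maximum output entropy, while the linear relation wastes most of its output levels on inputs that almost never occur.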

One of many questions left open in Laughlin's discussion
is the time scale on which the matching should occur. One could imagine that there is a well defined distribution
of input signals, stable on very long time scales, in which case the matching
could occur through evolution.
Another possibility is that the distribution is learned during the
lifetime of the individual organism, perhaps largely during the development of
the brain to adulthood. Finally
one could think about mechanisms of adaptation that would allow neurons to
adjust their input/output relations in real time, tracking changes in the input
distribution. It seems likely that
the correct answer is "all of the above."
But the last possibility, real time tracking of the input distribution,
is interesting because it opens the possibility for new experimental
tests.

We know that some level of real time matching occurs, as
in the example of light and dark adaptation in the visual system. We can think of this as neurons
adjusting their input/output relations to match the mean of the input
distribution. The real question,
then, is whether there is adaptation to the distribution, or just to the
mean. Actually, there is also a
question about the world we live in, which is whether there are other features
of the distribution that change slowly enough to be worth tracking in this
sense.

As an example, we know that many signals that reach our
sensory systems come from distributions that have long tails. In some cases (e.g., in olfaction,
where the signal—odorant concentration—is a passive tracer of a
turbulent flow) there are clear physical reasons for these tails, and indeed
it's been an important theoretical physics problem to understand this behavior
quantitatively. In most cases, the
tails arise through some form of intermittency. Thus, we can think of the distribution of signals as being
approximately Gaussian, but the variance of this Gaussian itself fluctuates;
samples from the tail of the distribution arise in places where the variance is
large. It turns out that this
scenario also holds for images of the natural world, so that there are regions
of high variance and regions of low variance.
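The picture of a Gaussian whose variance itself fluctuates can be checked with a short simulation. This is a sketch under an assumed model: the local standard deviation is drawn from a log-normal distribution (an illustrative choice, as are the width 0.5 and the sample size), and each signal is a Gaussian sample scaled by it. The kurtosis, which equals 3 for a Gaussian, serves as a simple measure of how heavy the tails are.

```python
import math
import random

def kurtosis(xs):
    """Fourth central moment divided by the squared variance (3 for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2

rng = random.Random(1)
n_samples = 50000

# Assumed model: the local standard deviation fluctuates (log-normal here).
sigmas = [math.exp(rng.gauss(0.0, 0.5)) for _ in range(n_samples)]
# Each signal is Gaussian given its local standard deviation.
signals = [s * rng.gauss(0.0, 1.0) for s in sigmas]
# Dividing by the local standard deviation undoes the intermittency.
normalized = [x / s for x, s in zip(signals, sigmas)]

k_mixture = kurtosis(signals)
k_normalized = kurtosis(normalized)

print(f"kurtosis of raw signals:        {k_mixture:.2f}")
print(f"kurtosis after normalization:   {k_normalized:.2f}")
```

The raw signals have a kurtosis well above 3, reflecting the long tails, while normalizing each sample by its local standard deviation brings the distribution back to Gaussian form.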

Statistics of natural images: Scaling in the woods. DL Ruderman & W Bialek, *Phys Rev Lett* **73,** 814-817 (1994).

As explained in this paper, the possibility of "variance
normalization" in images suggests that the visual system could code more
efficiently by adapting to the local variance, in addition to the local mean
(light and dark adaptation).
Fortunately, we found a group of experimentalists willing to search for
this effect in the retina, and we found it.

Adaptation of retinal processing to image
contrast and spatial scale. S
Smirnakis, MJ Berry II, DK Warland, W Bialek & M Meister, *Nature* **386,** 69-73 (1997).

The situation is complicated, because the retina
exhibits many different adaptation effects on different time scales, and
historically these have been uncovered by experiments using very different
kinds of stimuli. Amusingly, the
submitted version of Smirnakis et al (1997) was very explicit in asking whether
we could observe adaptation to the __distribution__ of inputs, but some
combination of referees and editors took this out. Nonetheless, the idea that the retina adapts to the
statistics of its inputs seems to have taken hold. Here are some more recent papers on this theme, including
the exploration of mechanisms:

Temporal contrast adaptation in the input
and output signals of salamander retinal ganglion cells. KJ Kim & F Rieke, *J Neurosci* **21,** 287-299 (2001).

Fast and slow contrast adaptation in retinal circuitry. SA Baccus & M Meister, *Neuron* **36,** 909-919 (2002).

Slow Na^{+} inactivation and variance adaptation in
salamander retinal ganglion cells. KJ Kim & F Rieke, *J Neurosci* **23,** 1506-1516 (2003).

Dynamic predictive coding by the retina. T Hosoya, SA Baccus & M Meister, *Nature* **436,** 71-77 (2005).

I don't think we have seen the end of this subject. It is reasonable to expect that
adaptation in the retina (which, after all, is only the first stage of visual
processing!) is sensitive to many higher order statistical structures
characterizing the distribution of inputs, and that we haven't yet found the
right language for enumerating these structures.


The original work demonstrating the adaptation to
variance in the retina didn't look closely at the form of the input/output
relations. Instead, the idea was
that when we change the distribution of inputs, the mean rate at which spikes
are generated will change, and if there is adaptation we should be able to see
this as a (slower) relaxation of this mean rate to some new steady state.
[Notice that, in contrast to the case discussed by Laughlin, these experiments
look at neurons that generate spikes (action potentials).] But since the theory is really
about the input/output relations, we should look more directly at these.

There is a substantial technical problem here. The usual way that we measure an
input/output relation is to pick an input, give it to the system, and then
measure the output; then we pick another input, and so on. But what happens if the system is
keeping track of the distribution out of which we are choosing the inputs? The problem is actually more serious
because inputs are not discrete (like the spikes!), but rather continuous
functions of time (such as light intensity, or, in the case we'll discuss next,
the angular velocity of motion across the visual field). So we need to
characterize the transformation from a continuous function of time to discrete
spikes in a context where the continuous functions are drawn out of some
distribution functional, and then change this functional and repeat the
process. We can imagine many ways
to do this, at varying levels of approximation. It's actually important that we be able to assess the
completeness of our description, since when we change input distributions the
quality of our approximations might change, and this could be confused with genuine
changes in the input/output relation.

To be blunt, if the mapping from continuous, dynamic
input signals to spikes is arbitrarily complex, then no reasonable experiment
will be able to "measure" the input/output relation in any meaningful
sense. In particular, if the
neuron is sensitive to even a small window in the history of its continuous
dynamic inputs, then the relevant input can easily require hundreds of numbers
for its description. The neuron
then implements a function on this high dimensional space. Since we can't do experiments that
fully sample this large space (and one might start to wonder if the brain
itself could do such a sampling ...), all progress depends on simplifying
hypotheses. Let me emphasize that
this problem exists whether or not you find the theoretical ideas about optimization
interesting, but certainly those ideas provide a clear motivation.

One simplifying hypothesis is that neurons are sensitive
not to the full stimulus space, but rather to some low dimensional
subspace. This idea has many
origins, starting with the earliest work on receptive fields in the visual
system by Barlow, Hartline and Kuffler in the 1950s. In the context of continuous dynamic inputs, there was an
initial hint about the utility of this idea from early experiments on the
motion sensitive neurons in the fly visual system.

Real-time performance of a movement
sensitive neuron in the blowfly visual system: Coding and information transfer
in short spike sequences. R de Ruyter van Steveninck & W Bialek, *Proc R Soc London Ser B* **234,** 379-414 (1988).

I'll admit, however, that it took us a long time to
fully understand what was going on (we did other things in between). The basic
idea is that we want to look at every single spike and ask what happened to "cause"
that event. If we think of the
inputs as vectors in some space of high dimensionality, then by collecting many
spikes we have many samples of these vectors. We can calculate the mean vector (the "spike triggered
average," which has been widely used in neuroscience), but to ask about
sensitivity to multiple dimensions we need an object with more indices, so we
try the covariance matrix of these vectors. If the distribution of inputs is Gaussian, then one can show
that the difference between this "spike triggered covariance" and the full
covariance matrix of the stimuli will be of low rank. The rank counts the dimensionality of the relevant subspace,
and the eigenvectors of this low rank matrix provide (after a rotation to
remove the stimulus correlations) a coordinate system on this space. If the dimensionality is sufficiently
low, then we can just go back over the data and sample all the relevant
distributions, measuring the probability of spike generation as a function of
the (relevant) stimulus coordinates with a dynamic range related to the number
of spikes that we collected. All
of the details of this approach (many of which could be decoded from an
appendix in Brenner et al 2000; see below) are explained here:

Features and dimensions: Motion estimation in fly vision. W Bialek & RR de Ruyter
van Steveninck, q-bio/0505003 (2005).
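The logic of the covariance analysis can be seen in a deliberately tiny toy problem, which is a sketch and not the procedure of the paper: the stimulus space has only two dimensions, the stimuli are uncorrelated Gaussian samples (so no rotation is needed to remove stimulus correlations), and a hypothetical model neuron spikes whenever the projection of the stimulus onto one hidden direction is large in magnitude. Because this response is symmetric, the spike triggered average is close to zero and carries no information about the relevant direction, but the covariance difference does. All names and parameter values are illustrative, and the 2x2 eigenvalue problem is solved in closed form.

```python
import math
import random

rng = random.Random(2)
theta_true = 0.6  # hypothetical relevant direction, as an angle in the 2D stimulus space
e = (math.cos(theta_true), math.sin(theta_true))

def mean_and_cov(samples):
    """Mean vector and covariance entries (a, b, c) for the matrix [[a, b], [b, c]]."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    a = sum((x - mx) ** 2 for x, _ in samples) / n
    b = sum((x - mx) * (y - my) for x, y in samples) / n
    c = sum((y - my) ** 2 for _, y in samples) / n
    return (mx, my), (a, b, c)

# Uncorrelated Gaussian stimuli with unit variance in each dimension.
stimuli = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(40000)]
# Toy neuron (an assumption): spikes when the projection onto e is large in magnitude.
spike_stimuli = [s for s in stimuli if abs(s[0] * e[0] + s[1] * e[1]) > 1.5]

sta, (a1, b1, c1) = mean_and_cov(spike_stimuli)  # spike triggered average and covariance
_, (a0, b0, c0) = mean_and_cov(stimuli)          # covariance of all stimuli
a, b, c = a1 - a0, b1 - b0, c1 - c0              # covariance difference matrix

# Closed-form eigenvalues and principal axis of a symmetric 2x2 matrix.
half_trace = (a + c) / 2
radius = math.hypot((a - c) / 2, b)
eig_large, eig_small = half_trace + radius, half_trace - radius
theta_est = 0.5 * math.atan2(2 * b, a - c)
alignment = abs(math.cos(theta_est - theta_true))

print(f"spike triggered average magnitude: {math.hypot(*sta):.3f}")
print(f"eigenvalues of the difference matrix: {eig_large:.3f}, {eig_small:.3f}")
print(f"overlap of recovered and true direction: {alignment:.4f}")
```

Only one eigenvalue of the difference matrix is clearly different from zero, so the relevant subspace is one dimensional, and the corresponding eigenvector lines up with the hidden direction. In a realistic analysis the stimulus vectors have hundreds of components and the eigenvalue problem is solved numerically.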

The spike triggered covariance matrix has become a
fairly widespread tool in the analysis of neural responses to complex sensory
inputs. Some examples:

Two dimensional time coding in the auditory brainstem. SJ Slee, MH Higgs, AL Fairhall & WJ
Spain, *J Neurosci* **25,** 9978-9988 (2005).

Spatiotemporal elements of macaque V1 receptive fields. NC Rust, O Schwartz, JA Movshon & EP Simoncelli, *Neuron* **46,** 945-956 (2005).

Selectivity for multiple stimulus features in
retinal ganglion cells. AL Fairhall, CA Burlingame, R Narasimhan, RA Harris, JL Puchalla & MJ Berry II,
*J Neurophysiol* **96,** 2724-2738 (2006).

Excitatory and suppressive receptive field subunits in awake monkey primary visual cortex
(V1). X Chen, F Han, MM Poo & Y Dan, *Proc Nat'l Acad Sci (USA)* **104,** 19120-19125 (2007).

This method remains limited, however, to the case of
inputs drawn from a Gaussian distribution. Of course, natural signals aren't Gaussian, and we would
like to be able to do all of this in the fully natural context (we're not there
yet!). One idea, which seems like
a good one in principle but might be unmanageable in practice, is to look for a
limited set of dimensions that preserve the mutual information between
(sensory) input and (spike) output.
Perhaps surprisingly, this idea can be turned into a practical
algorithm,

Analyzing
neural responses to natural signals: Maximally informative dimensions. T Sharpee, NC Rust & W Bialek, *Neural Comp* **16,** 223-250 (2004);
physics/0212110.

Despite the intuition that information requires knowing a
full distribution rather than just moments, it turns out that the search for
maximally informative dimensions is no more "hungry" for data than are more
conventional statistical methods:

Comparison of objective functions for estimating linear-nonlinear models. TO Sharpee, arXiv:0801.0311
[q-bio.NC] (2008).

There is much more to do here.


**Input/output relations and information flow in
neural coding, part II**


One interesting example of natural signals with a
drifting distribution is the angular motion of flying insects. If you watch a fly, you can see it
going nearly straight, you can see it on a windy day making a more meandering
flight, and you can also see (although these are fast!) dramatic
acrobatics. "Straight" is an
approximate statement, and if you look closely you'll see that trajectories
wander by a few degrees, with corrections occurring on the time scale of 1/10
of a second, so typical velocities are ~50 deg/s. At the opposite extreme, during acrobatic flight the angular
velocities can be thousands of degrees per second.

Flies, like us, have neurons in their visual systems
which are sensitive to motion across the retina. Also like us, they use the output of these neurons to guide
their motion through the world.
Some of these cells, which integrate information across most of the
visual field, are huge, and thus one can make very stable recordings of their
activity under a wide range of conditions. One has clear links between the activity of these cells and
the motor behavior of the fly (for example, producing a torque during flight to
correct for deviations from a straight course), and one can also characterize
very completely the signals and noise in the photoreceptors of the compound eye
that provide the input for the computation of motion. For these reasons, the fly's motion sensitive neurons have
been an important testing ground for theories of coding and computation in the
brain. Some of this work was
reviewed in

*Spikes: Exploring the Neural Code*. F Rieke, D Warland, R de Ruyter van Steveninck & W
Bialek (MIT Press, Cambridge, 1997),

although much has been done in the decade since that
book was published. One aspect of
this newer work is to measure the input/output relation of these cells under
conditions where the angular velocity of visual motion is fluctuating with
Gaussian statistics, and then see how things change when we change the variance
of these inputs, as might happen in the transition from straight to
acrobatic flight, or from a quiet to a windy day.

Adaptive rescaling optimizes information
transmission. N Brenner, W Bialek & R de
Ruyter van Steveninck, *Neuron* **26,** 695-702 (2000).

The results of such experiments show that, at least over
some range of conditions, the relationship between angular velocity and the
probability of generating an action potential depends strongly on the distribution
of inputs. As we increase the
variance of the input signals, the input/output relation becomes more shallow,
so that the dynamic range over which the neuron's response is modulated becomes
larger. In fact, there is an
almost perfect rescaling, so that the output across a range of input
distributions is a function not of the absolute angular velocity, but of the
velocity in units of its standard deviation.

If we imagine trying to derive the input/output relation
as the solution to an optimization problem, then so long as there is no __intrinsic__
scale for the stimulus (e.g., stimuli are not so small as to be confused with
internal noise, or so large that they cannot be effectively transduced) the
only scale along the stimulus axis comes from the input distribution
itself. In this sense, scaling is
evidence that the system is optimizing something, trying to match the input/output
relation to the distribution of inputs.
But can we show that it really is information that is being maximized?

The statement that input/output relations exhibit
scaling still leaves one dimensionless parameter unspecified. We can think of this as being the
dynamic range of the neuron—for example, the range of angular velocities
to modulate the response between 10% and 90% of the maximum—divided by
the standard deviation of the inputs.
We can imagine that the system could choose anything for this
dimensionless ratio, and we can calculate (using only measured properties of
the input/output relation and noise in the system) how much information would
be transferred from the input signal to the spikes as a function of this
parameter. In fact, the real
system operates at a value of this ratio that maximizes the information.

If this form of adaptive rescaling serves to optimize
information transmission, then immediately after we change the distribution of
inputs, the input/output relation should be suboptimally matched, and we should
be able to "catch" the system transmitting information less efficiently. This actually works:

Efficiency and ambiguity in an adaptive neural
code. AL Fairhall, GD Lewen, W Bialek
& RR de Ruyter van Steveninck, *Nature* **412,** 787-792 (2001).

What one sees, though, is that the window of time during
which the match is less than optimal is quite short, on the order of 100
milliseconds or less. This makes
sense since this is about the minimum time required to be statistically sure
that the distribution has actually changed (!). On the other hand, there are dynamics in response to the
switch on longer time scales, and these may serve to resolve the ambiguities
created by the rescaling. Another
remarkable observation is that there doesn't seem to be any single time scale
associated with the responses to changing the distribution, so that one can see
signatures of fractional differentiation and power-law decays. Finally, one might worry that since
some things change so quickly, you really shouldn't think of this as an
input/output relation plus a separate adaptation mechanism, but rather as one
big nonlinear system that looks like these two parts when we probe it with
simplified stimuli. There are major open questions here.

Closely related phenomena of adaptation to the
distribution have been seen in other systems, too (this is surely incomplete):

Shifts in coding properties and maintenance of
information transmission during adaptation in barrel cortex. M Maravall, RS Petersen, AL Fairhall, E
Arabzadeh & ME Diamond, *PLoS Biol* **5,** e19 (2007).

Adaptive filtering enhances information transmission in visual cortex. TO Sharpee, H Sugihara, AV Kurgansky,
SP Rebik, MP Stryker & KD Miller, *Nature* **439,** 936-942 (2006).

Let me also add some other references to experiments
that I think are connected to these ideas of matching and optimization, but
maybe not quite so directly. One
theme is that many different neural systems seem "more efficient" (in some sense) when probed with
stimuli that incorporate more of the statistical structures that occur in the
natural world.

Naturalistic stimuli increase the rate and
efficiency of information transmission by primary auditory neurons. F Rieke, DA Bodnar & W Bialek, *Proc
R Soc Lond Ser B* **262,** 259-265 (1995).

Neural coding of naturalistic motion stimuli. GD Lewen, W Bialek & RR de
Ruyter van Steveninck, *Network* **12,** 317-329 (2001); physics/0103088.

Natural stimulation of the nonclassical
receptive field increases information transmission efficiency in V1. WE Vinje & JL Gallant, *J
Neurosci* **22,** 2904-2915 (2002).

Processing of low probability sounds by cortical
neurons. N Ulanovsky, L Las & I Nelken, *Nature Neurosci* **6,** 391-398 (2003).

Neural coding of a natural stimulus
ensemble: Information at sub-millisecond resolution. I Nemenman, GD Lewen,
W Bialek & RR de Ruyter van Steveninck, *PLoS Comp Bio* **4,** e1000025 (2008); q-bio.NC/0612050 (2006).


**Information flow in genetic control**

Information flow and optimization in transcriptional regulation. G Tkacik, CG Callan Jr & W Bialek,
arXiv:0705.0313 [q-bio.MN] (2007).

Information capacity of genetic regulatory elements. G Tkacik, CG Callan Jr & W
Bialek, arXiv:0709.4209 [q-bio.MN] (2007).