There are two kinds of people in the world—those who divide everything in the world into two kinds of things and those who don’t.
Scientists love dividing the world into categories. Whenever we are trying to study more than 1 or 2 things at a time, our first instinct is to sort them into boxes based on their similarities, whether we're looking at animals, rocks, stars, or diseases.
The first group of scene-processing regions (near the back of the brain) care only about the image that is currently coming in through your eyes. They are looking for visual features like walls, landmarks, and architecture that will help you determine the structure of the environment around you. But they don't try to keep track of this information over time - as soon as you move your eyes, they forget all about the last view of the world.
The second group (a bit farther forward) uses the information from the first group to build up a stable model of the world and your place in it. They care less about exactly where your eyes are pointed and more about where you are in the world, creating a 3D model of the room or landscape around you and placing you on a map of what other places are nearby. These regions are strongly linked to your long-term memory system, and show the highest activity in familiar environments.
I am very interested in this second group of regions that integrate information over time - what exactly are they keeping track of, and how do they get information in and out of long-term memory? I have a new manuscript with my collaborators at Princeton (currently working its way through the publication gauntlet) showing that these regions build abstract representations of events in movies and audio narration, and am running a new experiment looking at how event templates we learn over our lifetimes are used to help build these event representations.
Recent AI advances in speech recognition, game-playing, image understanding, and language translation have all been based on a simple concept: multiply some numbers together, set some of them to zero, and then repeat. Since "multiplying and zeroing" doesn't inspire investors to start throwing money at you, these models are instead presented under the much loftier banner of "deep neural networks." Ever since the first versions of these networks were invented by Frank Rosenblatt in 1957, there has been controversy over how "neural" these models are. The New York Times proclaimed these first programs (which could accomplish tasks as astounding as distinguishing shapes on the left side versus shapes on the right side of a paper) to be "the first device to think as the human brain."
Deep neural networks remained mostly a fringe idea for decades, since they typically didn't perform very well, due (in retrospect) to the limited computational power and small dataset sizes of the era. But over the past decade these networks have begun to rival human capabilities on highly complicated tasks, making it more plausible that they could really be emulating human brains. We've also started to get much better data about how the brain itself operates, so we can start to make some comparisons.
At least for visual images, a consensus started to emerge about what these deep neural networks were actually doing, and how it matched up to the brain. These networks operate as a series of "multiply and zero" filters, which build up more and more complicated descriptions of the image. The first filter looks for lines, the second filter combines the lines into corners and curves, the third filter combines the corners into shapes, etc. If we look in the visual system of the brain, we find a similar layered structure, with the early layers of the brain doing something like the early filters of the neural networks, and later layers of the brain looking like the later filters of the neural networks.
Zeiler & Fergus 2014, Güçlü & van Gerven 2015
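To make the "multiply and zero" idea concrete, here is a minimal sketch of a filter chain in plain Python. The function names and the tiny weight values are made up for illustration and are not taken from any real network; the point is only that each filter is a set of weighted sums followed by setting negative results to zero.

```python
def multiply_and_zero(inputs, weights):
    """One filter: weighted sums of the inputs (multiply), negatives set to zero."""
    outputs = []
    for row in weights:
        total = sum(w * x for w, x in zip(row, inputs))
        outputs.append(max(0.0, total))  # the "zeroing" step
    return outputs


def run_network(inputs, layers):
    """Apply each filter in turn, feeding its output to the next one."""
    for weights in layers:
        inputs = multiply_and_zero(inputs, weights)
    return inputs


# Hypothetical weights: a two-output filter followed by a one-output filter.
first_filter = [[1.0, 1.0], [1.0, -1.0]]
second_filter = [[2.0, 0.0]]
print(run_network([1.0, 2.0], [first_filter, second_filter]))
```

In a real network the weights are learned from data rather than written by hand, and each filter has far more than two inputs.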
It seemed like things were mostly making sense, until two recent developments:
1. The best-performing networks started requiring a lot of filters. For example, one of the current state-of-the-art networks uses 1,001 layers. Although we don't know exactly how many layers the brain's visual system has, it is almost certainly fewer than 100.
2. These networks actually don't get that much worse if you randomly remove layers from the middle of the chain. This makes very little sense if you think that each filter is combining shapes from the previous filter - it's like saying that you can skip one step of a recipe and things will still work out fine.
Should we just throw up our hands and say that these networks just have way more layers than the brain (they're "deeper") and we can't understand how they work? Liao and Poggio have a recent preprint that proposes a possible solution to both of these issues: maybe the later layers are all doing the same operation over and over, so that the filter chain looks like this:
Why would you want to repeat the same operation many times? Often it is a lot easier to figure out how to make a small step toward your goal and then repeat, instead of going directly to the goal. For example, imagine you want to set a microwave for twelve minutes, but all the buttons are unlabeled and in random positions. Typing 1-2-0-0-GO is going to take a lot of trial and error, and if you mess up in the middle you have to start from scratch. But if you're able to find the "add 30 seconds" button, you can just hit it 24 times and you'll be set. This also shows why skipping a step isn't a big deal - if you hit the button 23 times instead, it shouldn't cause major issues.
But if the last layers are just the same filter over and over, we can actually just replace them with a single filter in a loop, that takes its output and feeds it back into its input. This will act like a deep network, except that the extra layers are occurring in time:
So Liao and Poggio's hypothesis is that very deep neural networks are like a brain that is moderately deep in both space and time. The true depth of the brain is hidden, since even though it doesn't have a huge number of regions it gets to run these regions in loops over time. Their paper has some experiments to show that this is plausible, but it will take some careful comparisons with neuroscience data to say if they are correct.
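The equivalence between a stack of identical filters and a single filter run in a loop can be sketched in a few lines. This is only an illustration of the general idea, not the architecture from Liao and Poggio's preprint; the weights and function names are hypothetical.

```python
def multiply_and_zero(inputs, weights):
    """One filter: weighted sums of the inputs, with negatives set to zero."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]


def deep_stack(inputs, layers):
    """Depth in space: a chain of separate filters, one after another."""
    for weights in layers:
        inputs = multiply_and_zero(inputs, weights)
    return inputs


def looped_filter(inputs, weights, n_steps):
    """Depth in time: one filter whose output is fed back into its input."""
    for _ in range(n_steps):
        inputs = multiply_and_zero(inputs, weights)
    return inputs


# Hypothetical shared weights, reused at every layer of the "deep" version.
shared = [[0.5, 0.25], [-0.5, 0.75]]
image = [1.0, 2.0]

# Four identical stacked filters give the same answer as one filter looped four times.
assert deep_stack(image, [shared] * 4) == looped_filter(image, shared, 4)
```

The stacked version needs four copies of the filter, while the looped version needs only one plus a feedback connection, which is the sense in which depth can be traded for time.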
Of course, it seems inevitable that at some point in the near future we will in fact start building neural networks that are "deeper" than the brain, in one way or another. Even if we don't discover new models that can learn better than a brain can, computers have lots of unfair advantages - they're not limited to a 1500 cm³ skull, they have direct access to the internet, they can instantly teach each other things they've learned, and they never get bored. Once we have a neural network that is similar in complexity to the human brain but can run on computer hardware, its capabilities might be advanced enough to design an even more intelligent machine on its own, and so on: maybe the "first ultraintelligent machine is the last invention that man need ever make." (I. J. Good, as quoted by Vernor Vinge)
We usually think that our eyes work like a camera, giving us a sharp, colorful picture of the world all the way from left to right and top to bottom. But we actually only get this kind of detail in a tiny window right where our eyes are pointed. If you hold your thumb out at arm's length, the width of your thumbnail is about the size of your most precise central (also called "foveal") vision. Outside of that narrow spotlight, both color perception and sharpness drop off rapidly - doing high-precision tasks like reading a word is almost impossible unless you're looking right at it.
The rest of your visual field is your "peripheral" vision, which has only imprecise information about shape, location, and color. Out here in the corner of your eye you can't be sure of much, which is used as a constant source of fear and uncertainty in horror movies and the occult:
What's that in the mirror, or the corner of your eye?
What's that footstep following, but never passing by?
Perhaps they're all just waiting, perhaps when we're all dead,
Out they'll come a-slithering from underneath the bed....
What does this peripheral information get used for during visual processing? It was shown over a decade ago (by one of my current mentors, Uri Hasson) that flashing pictures in your central and peripheral vision activate different brain regions. The hypothesis is that peripheral information gets used for tasks like determining where you are, learning the layout of the room around you, and planning where to look next. But this experimental setup is pretty unrealistic. In real life we have related information coming into both central and peripheral vision at the same time, which is constantly changing and depends on where we decide to look. Can we track how visual information flows through the brain during natural viewing?
Today a new paper from me and my PhD advisors (Fei-Fei Li and Diane Beck) is out in the Journal of Vision: Pinpointing the peripheral bias in neural scene-processing networks during natural viewing (open access). I looked at fMRI data (collected and shared generously by Mike Arcaro, Sabine Kastner, Janice Chen, and Asieh Zadbood) while people were watching clips from movies and TV shows. They were free to move their eyes around and watch as you normally would, except that they were inside a huge superconducting magnet rather than on the couch (and had less popcorn). We can disentangle central and peripheral information by tracking how these streams flow out of their initial processing centers in visual cortex to regions performing more complicated functions like object recognition and navigation.
We can make maps that show where foveal information ends up (colored orange/red) and where peripheral information ends up (colored blue/purple). I'm showing this on an "inflated" brain surface where we've smoothed out all the wrinkles to make it easier to look at:
This roughly matches what we had previously seen with the simpler experiments: central information heads to regions for recognizing objects, letters, and faces, while peripheral information gets used by areas that process environments and big landmarks. But it also reveals some finer structure we didn't know about before. Some scene processing regions care more about the "near" periphery just outside the fovea and still have access to relatively high-resolution information, while others draw information from the "far" periphery that only provides coarse information about your current location. There are also detectable foveal vs. peripheral differences in the frontal lobe of the brain, which is pretty surprising, since this part of the brain is supposed to be performing abstract reasoning and planning that shouldn't be all that related to where the information is coming from.
This paper was my first foray into the fun world of movie-watching data, which I've become obsessed with during my postdoc. Contrary to what everyone's parents told them, watching TV doesn't turn off your brain - you use almost every part of your brain to understand and follow along with the story, and answering questions about videos is such a challenging problem that even the latest computer AIs are pretty terrible at it (though some of my former labmates have started making them better). We're finding that movies drive much stronger and more complex activity patterns compared to the usual paradigm of flashing individual images, and we're starting to answer questions raised by cognitive scientists in the 1970s about how complicated situations are understood and remembered - stay tuned!
“The love of complexity without reductionism makes art; the love of complexity with reductionism makes science.” — E.O. Wilson
In the 1950s William S. Burroughs popularized an art form called the "cut-up technique." The idea was to take existing stories (in text, audio, or video) and cut them up into pieces, and then recombine them into something new. His creations are a juxtaposition of (often disturbing) imagery, chosen to fit together despite coming from different sources. Here's a sample from The Soft Machine:
Police files of the world spurt out in a blast of bone meal, garden tools and barbecue sets whistle through the air, skewer the spectators - crumpled cloth bodies through dead nitrous streets of an old film set - grey luminous flakes falling softly on Ewyork, Onolulu, Aris, Ome, Osteon - From siren towers the twanging notes of fear - Pan God of Panic piping blue notes through empty streets as the berserk time machine twisted a tornado of years and centuries-
The cut-ups aren't always coherent in the sense of having an understandable plot - sometimes Burroughs was just aiming to convey an emotion. He attributed an almost mystical quality to cut-ups, saying they could help reveal the hidden meanings in text or even serve as prophecy, since "when you cut into the present the future leaks out." His experimental film The Cut-Ups was predictably polarizing, with some people finding it mesmerizing and others demanding their money back.
If you jump through the video a bit you'll see that it isn't quite as repetitive as it seems during the first minute. (I also think Burroughs would heartily approve of jumping through the movie rather than watching it from beginning to end.)
This idea of combining parts to create something new is alive and well on the internet, especially now that we are starting to amass a huge library of video and audio clips. It's painstaking work, but there is a whole genre of videos in which clips from public figures are put together to recreate or parody existing songs, or to create totally original compositions.
Since the whole can have a meaning that is more than the sum of its parts, our brains must be somehow putting these parts together. This process is referred to as "configural processing," since understanding what we're hearing or seeing requires looking not just at the parts but at their configuration. Work from Uri Hasson's lab (before I joined as a postdoc) has looked at how meaning gets pieced together throughout a story, and found a network of brain regions that help join sentences together to understand a narrative. They used stimuli very similar to the cut-ups, in which sentences were cut out and then put back together in a random order, and showed that these brain regions stopped responding consistently when the overall meaning was taken away (even though the parts were the same).
Today I (along with my PhD advisors, Fei-Fei Li and Diane Beck) have a new paper out in Cerebral Cortex, titled Human-object interactions are more than the sum of their parts (free-access link). This paper looks at how things get combined across space (rather than time) in the visual system. We were looking specifically at images containing either a person, an object, or both, and tried to find brain regions where a meaningful human-object interaction looked different from just a sum of person plus object.
In the full paper we look at a number of different brain regions, but some of the most interesting results come from the superior temporal sulcus (an area right behind the top of your ears). This area couldn't care less about objects by themselves, and doesn't even care much about people if they aren't doing anything. But as soon as we put the person and object together in a meaningful way, it starts paying attention, and we can make a better-than-chance guess about what action the person is performing (in the picture you're currently looking at) just by reading your brain activity from this region. Our current theory about this region is that it is involved in understanding the actions and intentions of other people, as I described in a previous post.
Next month I'll be presenting at CEMS 2016 on some new work I've been doing with Uri and Ken Norman, where I'm trying to figure out exactly which pieces of a story end up getting combined together and how these combined representations get stored into memory. Working with real stories (like movies and TV shows) is challenging as a scientist, since usually we like our stimuli to be very tightly controlled, but these kinds of creative, meaningful stimuli can give us a window into the most interesting functions of the brain.
Interviewer: In view of all this, what will happen to fiction in the next twenty-five years?
Burroughs: In the first place, I think there's going to be more and more merging of art and science. Scientists are already studying the creative process, and I think the whole line between art and science will break down and that scientists, I hope, will become more creative and writers more scientific. [...] Science will also discover for us how association blocks actually form.
Interviewer: Do you think this will destroy the magic?
Burroughs: Not at all. I would say it would enhance it.
The Amazing Race is one of the few reality TV shows that managed to survive the bubble of the early 2000s, with good reason. Rather than just trying to play up interpersonal dramas (though there is some of that too), it is set up like a traditional game show with a series of competitions between teams of two, who travel to different cities throughout the world over the course of the show. Eleven teams start out the race, and typically the last team to finish each day's challenges gets booted from the show until only three teams are left. These three teams then have a final day of competition, with the winner being awarded $1 million.
Winning first place on any day before the last one doesn't matter much (though you get a small prize and some bragging rights), which is interesting, since it means that it is possible for the winning team to have never come in first place before the final leg. This got me wondering: if we think of the Race as an experiment which is trying to identify the best team, how good is it? What if we just gave teams a point for every first place win, and then saw which one got the most points, like a baseball series?
Modeling the Race
To try to answer this question, I built a simple model of the Race. I assume that each team has some fixed skill level (sampled from a standard normal distribution), and then on each leg their performance is the sum of this intrinsic skill and some randomness (sampled from another normal with varying width). So every leg, the ranking of the teams will be their true skill ranking, plus some randomness (and there can be a lot of randomness on the Race). Fans of the show will know that this is a very simplified model of the Race (the legs aren't totally independent, the teams can strategically interfere with each other, etc.) but this captures the basic idea. I ran simulated races 10,000 times for each level of randomness.
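For readers who want to play with this, here is a rough Python sketch of the model as described above. It is not the original MATLAB code; the function names, the random seed, and the choice to eliminate exactly one team per leg until three remain are my assumptions for illustration.

```python
import random


def simulate_elimination_race(sigma, n_teams=11, n_finalists=3, rng=random):
    """Run one simulated race and return the winner's true skill rank (1 = best)."""
    # Fixed skill per team, drawn from a standard normal.
    skills = [rng.gauss(0.0, 1.0) for _ in range(n_teams)]
    remaining = list(range(n_teams))

    # Elimination legs: performance = skill + noise, worst performer is dropped.
    while len(remaining) > n_finalists:
        perf = {t: skills[t] + rng.gauss(0.0, sigma) for t in remaining}
        remaining.remove(min(remaining, key=perf.get))

    # Final leg: the best performer among the finalists wins.
    perf = {t: skills[t] + rng.gauss(0.0, sigma) for t in remaining}
    winner = max(remaining, key=perf.get)

    # Skill rank = 1 + number of teams with strictly higher skill.
    return 1 + sum(1 for s in skills if s > skills[winner])


def mean_winner_rank(sigma, n_races=10000, seed=0):
    """Average skill rank of the winner over many simulated races."""
    rng = random.Random(seed)
    total = sum(simulate_elimination_race(sigma, rng=rng) for _ in range(n_races))
    return total / n_races


for sigma in (0.5, 2.0, 10.0):
    print(sigma, mean_winner_rank(sigma, n_races=2000))
```

With no noise the best team always wins, and with overwhelming noise the average winner rank drifts toward the chance level of 6.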
We can measure how effective the Race was at picking a winner, by seeing what true skill rank the winning team had. So if the team with the highest skill (number 1) wins, that means the race did a good job. If a team with a low skill rank (like 10) wins, then the race did a very bad job of picking the million-dollar winner. This plot shows the rank of the winning team, compared to chance ((1+11)/2=6).
This actually looks surprisingly good! Even with lots of leg randomness (more than the actual skill difference between the teams) a team with a relatively high rank tends to win. Once the randomness gets to be an order of magnitude bigger than the differences between teams, the winner starts getting close to random.
Improving the Race
But how good is this relative to a simpler kind of competition, where the winner is the team with the most first-place wins? Rather than eliminating teams, all teams race all 9 legs, and the team coming in first the most wins the prize (ties are broken based on which team won most recently). Would this do better or worse?
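The counting rule described above can be sketched as follows, using the same skill-plus-noise model. This is again an illustrative Python version rather than the original MATLAB code, and the function names are hypothetical.

```python
import random


def first_place_count_winner(leg_performances):
    """Pick the team with the most leg wins; ties go to the most recent winner.

    leg_performances is a list of legs, each a list of per-team scores
    (higher is better).
    """
    n_teams = len(leg_performances[0])
    wins = [0] * n_teams
    last_win = [-1] * n_teams  # index of the latest leg each team won
    for leg, perf in enumerate(leg_performances):
        leader = max(range(n_teams), key=lambda t: perf[t])
        wins[leader] += 1
        last_win[leader] = leg
    return max(range(n_teams), key=lambda t: (wins[t], last_win[t]))


def simulate_counting_race(sigma, n_teams=11, n_legs=9, rng=random):
    """Run one race where every team runs every leg; return the winner's skill rank."""
    skills = [rng.gauss(0.0, 1.0) for _ in range(n_teams)]
    legs = [[s + rng.gauss(0.0, sigma) for s in skills] for _ in range(n_legs)]
    winner = first_place_count_winner(legs)
    return 1 + sum(1 for s in skills if s > skills[winner])


# Three teams, three legs, one win each: the most recent winner (team 2) takes it.
print(first_place_count_winner([[3, 1, 2], [1, 3, 2], [1, 2, 3]]))
```

Averaging `simulate_counting_race` over many runs at each noise level gives a curve that can be compared against the elimination format.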
Turns out this is a little bit better! In general the rank of the winning team tends to be higher, meaning that a "more deserving" team won the money. But the size of the gap depends on how much randomness there is in each leg of the race. Which point along these curves corresponds to the actual TV show?
To answer this, I took the per-leg rankings from the Amazing Race Wikia from the past 10 seasons. Yes, there are people way more obsessed with this show than me, who have put together databases of stats from each season. I measured how consistent the rankings were from each leg of the race. If there wasn't any randomness, we'd expect these to have a perfect (Kendall) correlation, while if each leg is just craziness for all teams then the correlation should be near zero. I found that this correlation varied a bit across seasons, but had a mean of 0.0992. Comparing this to the same calculation from the model, this corresponds to a noise level of about sigma=2.2.
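As a rough illustration of the consistency measure, here is a plain-Python Kendall correlation between two rankings of the same teams. Exactly which legs were compared and how eliminated teams were handled are not spelled out here, so this sketch simply assumes two tie-free rankings of equal length; the function name and the example rankings are made up.

```python
def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation (tau-a) between two rankings of the same teams.

    rank_a[i] and rank_b[i] are team i's finishing positions on two legs.
    Assumes no tied positions.
    """
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Positive product: the two legs agree on which team did better.
            agreement = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if agreement > 0:
                concordant += 1
            elif agreement < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # identical order
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # fully reversed order
print(kendall_tau([1, 2, 3, 4], [2, 1, 4, 3]))  # partly shuffled
```

Running the same function on rankings generated by the model at different noise levels is one way to find the sigma whose average correlation matches the value measured from the show.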
At this level of randomness, there is an advantage of about 9 percentage points for the counting-first-places competition: 37.4% of the time it picks a better team to win the money, while 28.5% of the time the current elimination setup picks a better team (they pick the same team 34.1% of the time).
Of course there are some disadvantages to counting first place wins: it requires all teams to run all legs (which is logistically difficult and means we get to know each team less) and the winner might be locked in before the final leg (ruining the suspense of the grand finale they usually have set up for the final tasks). This is likely a general tradeoff in games like this, between being fair (making the right team more likely to win) and being exciting (keeping the winner more uncertain until the end). As a game show, The Amazing Race probably makes the right choice (entertainment over fairness) but for more serious matters (political debate performance?) maybe we should pay attention to the winner of each round rather than the loser.
All the MATLAB code and ranking data is available on my bitbucket.