1 Preface

This book began in 2005, when I was running a web-based experiment to study social fads. I’ll tell you more about the results of the experiment in Chapter 4, but now I’m going to tell you something that is not in any academic paper. And, it is something that fundamentally changed how I think about research. One morning, when I checked the web server—that was always the first thing I did each morning—I discovered that overnight about 100 people from Brazil had been in my experiment. This experience had a profound impact on me. At that time, I had friends who were running traditional lab experiments, and I knew how hard they had to work to recruit, supervise, and pay people to be in their experiments; if they could run 10 people in a single day, that was good progress. But, with my web-based experiment, 100 people participated while I was sleeping. Doing your research while you are sleeping might sound too good to be true, but it isn’t. Changes in technology—specifically the transition from the analog age to the digital age—mean that we can now collect and analyze social data in new ways. This book is about doing social research in these new ways.

As I am writing this book, there is a lot of hype about “big data” and “computational social science,” and one way that I like to think about this book is through the hype cycle, an empirical pattern that describes the adoption of a new technology (Fenn and Raskino 2008). When a new technology is first introduced, there is initially a rapid increase in excitement leading to the peak of inflated expectations, which is followed by a growing sense of frustration leading to the trough of despair. Finally, the technology becomes normalized and people reach the plateau of productivity (Figure 1). Different people are at different stages in their own personal journey through the hype cycle, and no matter where you are, I hope that you will find the book valuable. My goal with this book is to push down the “peak of inflated expectations”, to pull up the “trough of despair”, and to help us all get to the “plateau of productivity” as quickly as possible.

Figure 1: Hype cycle (Fenn and Raskino 2008). Different people are at different stages in their thinking about social research in the digital age. My goal with this book is to push down the “peak of inflated expectations”, to pull up the “trough of despair”, and help us, as a community, get to the “plateau of productivity” as quickly as possible.

This book is for two different communities. It is for social scientists who want to do more data science, and it is for data scientists who want to do more social science. I spend time in both of these communities, and this book is my attempt to bring their ideas together in a way that avoids the quirks and jargon of either. Given the communities that this book is for, it should go without saying that it is not just for students and professors. I’ve worked in government (at the US Census Bureau) and in the tech industry (at Microsoft Research), and I know that there is lots of exciting research happening outside of universities. So, if you think of what you are doing as social research, then this book is for you, no matter where you work.

We are still in the early days of social research in the digital age, and I’ve seen some misunderstandings that are so fundamental and so common that it makes the most sense for me to address them here, in the preface. From data scientists, I’ve seen two common misunderstandings. The first is thinking that more data automatically solves problems. But, for social research, that has not been my experience. In fact, for social research, new types of data, as opposed to more of the same data, seem to be most helpful. The second misunderstanding that I’ve seen from data scientists is thinking that social science is just a bunch of fancy-talk wrapped around common sense. Of course, as a social scientist—more specifically as a sociologist—I don’t agree with that; I think that social science has a lot to offer. Smart people have been working hard to understand human behavior for a long time, and it would be unwise to ignore the accumulated wisdom from this effort. My hope is that this book will offer you some of that wisdom in a way that is easy to understand.

From social scientists, I’ve also seen two common misunderstandings. First, I’ve seen some people write off the entire idea of social research using the tools of the digital age based on a few bad papers. If you are reading this book now, you have probably already read a bunch of research that uses data from social media in a way that is banal or wrong (or both). I have too. However, it would be a serious mistake to conclude from these examples that all digital age social research is bad. In fact, you’ve probably also read a bunch of research that uses data from surveys in a way that is banal or wrong, but you don’t write off all research using surveys. That’s because you know that there is great research done with data from surveys, and in this book, I’m going to show you that there is also great research done with the tools of the digital age.

The second common misunderstanding that I’ve seen from social scientists is to confuse the present with the future. When assessing social research in the digital age—the research that I’m going to describe in this book—it is important to ask two distinct questions: How well does this style of research work right now? And, how well will it work in the future?

Even though researchers are trained to answer the first question, in this case, I think the second question is more important. That is, even though social research in the digital age has not yet led to massive, paradigm-changing intellectual contributions, the rate of improvement of digital age research is incredibly rapid. It is this rate of change, more than the current level, that makes digital age research so exciting to me.

Even though that last paragraph seemed to offer you potential riches at some unspecified time in the future, my goal in this book is not to sell you on any particular type of research. I don’t personally own shares in Twitter, Facebook, Google, Microsoft, Apple, or any other tech company (although, for the sake of full disclosure, I have worked at or received research funding from Microsoft, Google, and Facebook). If you are happy with the kind of research that you are already doing, great. But, if you have a sense that the digital age means that new and different things are possible, then I’d like to show you those possibilities. Thus, throughout the book my goal is to remain a credible narrator, telling you about all the exciting new stuff that is possible, while guiding you away from a few pitfalls that I’ve seen others fall into.

As you might have noticed already, the tone of this book is a bit different from some other academic books. That’s intentional. In particular, I want this book to be three things: helpful, optimistic, and future-oriented.

Helpful: My goal is to write a book that is helpful for you. Therefore, I’m going to write in an open and informal style. That’s because the most important thing that I want to convey is a certain way of thinking about social research. And, my experience from teaching suggests that the best way to convey this way of thinking is informally and with lots of examples. However, please don’t confuse informality with sloppiness; throughout the book, I’ve chosen my words carefully.

Optimistic: The two communities that this book engages—social scientists and data scientists—have very different styles. Data scientists are generally excited; they tend to see the glass as half full. Social scientists, on the other hand, are generally more critical; they tend to see the glass as half empty. In this book, I’m going to adopt the optimistic tone of a data scientist, even though my training is as a social scientist. So, when I present examples, I’m going to tell you what I love about these examples. And, when I do point out problems with the examples—and I will do this because no research is perfect—I’m going to try to point out these problems in a way that is positive and optimistic. I’m not going to be critical for the sake of being critical. I’m going to be critical so that I can help you create more beautiful research.

Future-oriented: I hope that this book will help you do social research using the digital systems that exist today and the digital systems that will be created in the future. I started doing this kind of research in 2003, and since then I’ve seen a lot of changes. I remember that when I was in graduate school, people were very excited about using MySpace for social research. And, when I taught my first class on what I then called “web-based social research,” people were very excited about virtual worlds such as Second Life. I’m sure that in the future much of what people are talking about today will seem silly. The trick to staying relevant in the face of this rapid change is abstraction. Therefore, this is not going to be a book that teaches you exactly how to use the Twitter API; instead, it is going to be a book that teaches you how to learn from “digital exhaust” (Chapter 2). This is not going to be a book that gives you step-by-step instructions for running experiments on Amazon Mechanical Turk; instead, it is going to teach you how to design and interpret experiments that rely on digital age infrastructure (Chapter 4). Through the use of abstraction, I hope this will be a timeless book on a timely topic; that’s what will be most helpful for you.

I think this is the most exciting time ever to be a social researcher, and I’m going to try to convey that excitement in a way that is precise. That is, it is time to move beyond vague generalities about the magical powers of new data. It is time to get specific.

2 Introduction

2.1 Reader note

I am not able to write the full introduction until the other chapters are complete. So, I’m just going to include an outline of the book.

2.2 Outline of the book

The book is organized around a progression through four broad research designs: observing behavior, asking questions, running experiments, and collaborating. Roughly, as you move along this progression, the amount that you can learn increases, but the logistical challenges also increase. These categories are not mutually exclusive or exhaustive, but in my experience almost all social research deploys one or more of these approaches. These four approaches were all used to some extent 50 years ago, and I’m confident that they will all be used, in some form, 50 years from now.

Chapter 1 (Introduction) explains the goals of the book.

Chapter 2 (Observing) will describe what and how we can learn from observational data, especially digital exhaust. I’ll start by highlighting the distinction between designed data—what social scientists are accustomed to using for research—and found data—what data scientists are accustomed to using for research. Then, I’ll describe nine common features of digital exhaust, three of which are generally good for research (big, always-on, non-reactive) and six of which are generally bad for research (incomplete, non-representative, corporate, drifting, algorithmically confounded, spammy). Given these features, I’ll describe three research designs that can be used to successfully learn from digital exhaust: counting things, forecasting things, and approximating experiments.

Chapter 3 (Asking) will argue that, despite the pessimism that some survey researchers currently feel, the digital age will be the golden age of survey research. The chapter will begin by explaining why digital exhaust will not replace surveys; on the contrary, digital exhaust increases the value of surveys. Digital exhaust and surveys are complements, not substitutes. Next, I will review the total survey error framework, and use it to organize the developments that the digital age enables for survey research. In particular, I will show that it is time for researchers to revisit their reflexive aversion to non-probability sampling. We have learned a lot since the debacles of the 1950s, and the data environment in the digital age means that non-probability sampling now is actually quite different from non-probability sampling in the analog era. Further, I will show that the change from human-administered surveys to computer-administered surveys enables and requires changes in how we interact with our participants, and I’ll show three examples of how we can better tailor our interviewing procedures to computer-administered surveys. Finally, I’ll describe two strategies for combining survey data with digital exhaust: enriched asking and amplified asking.

Chapter 4 (Experiments) will argue that digital age experiments will enable researchers to combine the control of lab experiments with the realism of field experiments, and at a scale not possible previously. Even though social scientists and data scientists both frequently run experiments, these experiments typically have different goals, with data scientists focused on optimizing some outcome and social scientists focused on understanding some outcome. Therefore, I’ll describe some concepts that can help unify these two perspectives. Finally, I’ll describe four main types of experiments (embedded experiments, overlaid experiments, experiments using online labor markets, and macro-sociological experiments), and provide examples of each type.

Chapter 5 (Collaborating) will argue that social researchers can begin using mass collaboration—such as crowdsourcing and citizen science—to conduct social research. Rather than just collaborating with our students and colleagues, we can now collaborate with the billions of people in the world that have an Internet connection. I expect that these new mass collaborations will yield amazing results not just because of the number of people involved but also because of their diverse skills and perspectives. By describing successful mass collaboration projects from other fields—such as astronomy, computer science, and ornithology—and by providing a few key organizing principles, I hope to convince you of two things: first, that mass collaboration can be harnessed for social research, and second, that researchers who use mass collaboration will be able to solve problems that had previously seemed impossible. Although mass collaboration is often promoted as a way to save money, it is much more than that. As I will show, mass collaboration doesn’t just allow us to do research cheaper, it allows us to do research better.

Chapter 6 (Ethics) will address many of the complex ethical questions raised by social research in the digital age. I will argue that researchers seeking generalizable knowledge should move beyond the current rules-based approach to ethics, an approach that fails because of rapidly changing capabilities and that leaves researchers unable to explain their reasoning to each other and the public. Rather, I’ll advocate for a principles-based approach to ethics, and I’ll propose four general principles that should guide decisions about research ethics. Given past experience and likely future trends, I’ll also describe and analyze three specific ethical challenges that I think will continue to confound many researchers: informed consent; understanding and managing informational risk; and making decisions in the presence of uncertainty. Finally, I’ll conclude with three practical tips for working in an area with unsettled ethics.

Chapter 7 (Predicting the future) will predict future trends that have important implications for researchers. In particular, I’ll try to predict which problems will get easier over time and which problems will get harder. Finally, I’ll describe some of what I see as the fundamental problems that need to be addressed in order for social researchers to fully take advantage of the digital age.

3 Observing

3.1 Introduction

Digital exhaust is everywhere.

In the analog age, collecting data about behavior—who does what when—was expensive, and therefore relatively rare. Now, in the digital age, the behaviors of billions of people are automatically recorded. For example, every time you click on a website, make a call on your cell phone, or pay for something with your credit card, a digital record of your behavior is created. Because these automatically created data are a by-product of people’s everyday actions, they are often called digital exhaust. The ever-rising flood of digital exhaust means that we have moved from a world where behavioral data was scarce to a world where behavioral data is abundant. But, because digital exhaust is relatively new, an unfortunate amount of research using it looks like scientists blindly chasing available data. This chapter, instead, offers a principled approach to using digital exhaust for social research.
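
To make this concrete, here is a minimal sketch, in Python, of the kind of simple analysis that digital exhaust makes possible. This illustration is mine, not an example from the literature: the log file, its column names (timestamp, user_id, url), and the function are all assumptions made for the sake of the example.

    # Count page views per person from a hypothetical web-server clickstream log.
    # The file name and column names are assumptions for this illustration.
    import csv
    from collections import Counter

    def count_views(log_path):
        """Return the number of logged page views for each user_id."""
        views = Counter()
        with open(log_path, newline="") as f:
            for row in csv.DictReader(f):
                views[row["user_id"]] += 1
        return views

    # Hypothetical usage:
    # views = count_views("clickstream.csv")
    # print(views.most_common(10))  # the ten most frequently observed users

Even a sketch this simple previews the “counting things” strategy described in Section 3.4: the records already exist as a by-product of behavior, and the researcher’s task is to decide which counts are scientifically interesting.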

A first step to learning from digital exhaust is to realize that it is part of a broader category of data that has been used for social research for many years: observational data. Roughly, observational data is any data that results from observing a social system without intervening in some way. A crude way to think about it is that observational data is everything that does not involve talking to people (e.g., surveys, the topic of Chapter 3) or changing people’s environments (e.g., experiments, the topic of Chapter 4). Thus, in addition to digital exhaust, observational data also includes things like the text of newspaper articles, governmental administrative records, and satellite photos. Most of the observational data used for social research in the digital age will be digital exhaust, so that will be a primary focus of this chapter. But, I will also discuss some examples where researchers actively collect observational data in order to circumvent the limits of digital exhaust.

This chapter has three parts. First, in Section 3.2, I describe the difference between designed data and found data, a difference that clarifies the fundamental challenge and opportunity for researchers using digital exhaust. Social scientists are used to working with designed data: data created for the purpose of research, such as large-scale social surveys and experiments. Digital exhaust, however, was not created for research; it was designed for some other purpose, usually to help a company make money. Therefore, from the perspective of researchers, this digital exhaust should be considered found data. For social scientists accustomed to working with designed data, found data introduces new challenges. And, found data also introduces new opportunities that many researchers do not yet fully appreciate. For data scientists accustomed to working with found data, thinking about designed data helps clarify the weaknesses and strengths of found data.

Second, in Section 3.3, I describe nine common characteristics of digital exhaust: three that are generally good for research (big, always-on, and non-reactive) and six that are generally bad for research (incomplete, corporate, non-representative, drifting, algorithmically confounded, and spammy). Understanding these characteristics enables us to quickly recognize the strengths and weaknesses of existing digital exhaust and will help us harness the new sources of digital exhaust that will be created in the future.

Finally, in Section 3.4, I describe three main research strategies that you can use to learn from observational data: counting things, forecasting things, and approximating an experiment.

3.2 Found data vs designed data

Found data and designed data are different. No matter which you are used to working with, it is important to know about the other.

What is the difference between a sociologist and a historian? Although this might seem like the beginning of a bad joke, it is actually a question that reveals the most important feature of digital exhaust, and it is a question that was posed in 1991 by Sir John Goldthorpe. In fact, had he been writing today, Goldthorpe might have asked: What is the difference between a social scientist and a data scientist?

According to Goldthorpe (1991), the main difference between a sociologist and a historian is control over data collection. Historians are forced to use relics whereas sociologists can tailor their data collection to specific purposes. In other words, historians use found data whereas sociologists can create designed data. The distinction between found and designed data helps us to think clearly about digital exhaust. Although digital exhaust seems shiny, fresh, and powerful—some have said that digital exhaust will do for social research what the telescope did for astronomy or what the microscope did for biology—digital exhaust is actually pretty similar to the debris at an archaeological dig (Figure 2). People working with digital exhaust—just like people working on an archaeological dig—must appreciate that their data, no matter the quantity, are inherently limited for studying a society.