Fragile Families Challenge uses 'big data' to answer big questions
What would happen if hundreds of social scientists and data scientists worked together on a scientific challenge to improve the lives of disadvantaged children in the United States? The Fragile Families Challenge, an ongoing mass research collaboration that uses "big data" collected as part of Princeton University's Fragile Families and Child Wellbeing Study, is attempting to answer just that.
The Fragile Families and Child Wellbeing Study — based at Princeton and Columbia University — has been following a cohort of about 5,000 children born in large U.S. cities at the turn of the 21st century. The study has gathered information on the children’s physical and mental health, cognitive function, social-emotional skills, schooling and living conditions, as well as the makeup, stability and financial resources of their families.
The challenge, which was launched earlier this year, asked participants from around the world to use this trove of data — some 54 million data points — to predict six key outcomes: grade point average (academic achievement) of the children; grit of the children; material hardship of the household, which is a measure of extreme poverty; eviction of the families; layoff of the caregiver; and whether the primary caregiver would participate in a job skills program. The challenge received 400 applications from researchers in at least 68 institutions representing at least seven countries, and over 150 teams submitted final predictions to the challenge.
The submissions — statistical and machine-learning models of the important outcomes in the lives of the children — will be optimally combined into a community model, and will be used to conduct substantive and methodological research. The entries have already offered experts new ways to improve surveys and expand their range of social science theories. Ultimately, the goal of this community-generated model is to provide policymakers with information that could help spark new theories about how to improve the lives of future generations of disadvantaged children.
"It's been very exciting to interact with people from all over the world with very different intellectual backgrounds all working on the same problem," said Matthew Salganik, a professor of sociology at Princeton and co-organizer of the Fragile Families Challenge, which was supported by a grant from the Russell Sage Foundation. The challenge is part of Salganik’s larger research interest in computational social science, a topic on which he has also recently published a book, "Bit by Bit: Social Research in the Digital Age." Participants in the challenge included sociologists, psychologists, economists and demographers, as well computer scientists, statisticians, engineers and data scientists from industry.
"It's also been really rewarding to see how many of [the participants] have been willing to share what they are doing," Salganik said. “Many [teams] open-sourced their work while the challenge was going on. I love the fact that we were able to do something more collaborative than we normally do."
This collaborative approach was also very exciting to Sara McLanahan, the William S. Tod Professor of Sociology and Public Affairs at Princeton, a principal investigator of the Fragile Families and Child Wellbeing Study and a co-organizer of the challenge. "When we started collecting these data almost 20 years ago, we never imagined this kind of research," she said. "I think that having so many different people with such different backgrounds working with the data is going to help us with what we really care about, which is understanding and improving the lives of disadvantaged families."
In August, the Fragile Families Challenge team awarded six prizes to the top-scoring submissions for each outcome, as well as Innovation Awards, for the most novel approaches using ideas from social science and data science, and a Foundational Prize, for a contribution that most helped other participants. The Foundational Prize was won by Gregory Gundersen, a graduate student in computer science at Princeton, for his work building tools to make the data easier to analyze.
The winners and other interested researchers will gather at Princeton on Nov. 16-17 for the Fragile Families Challenge Scientific Workshop, where they will share their methodology and ideas for future projects that may combine predictive modeling, causal inference and in-depth interviews. These models and their potential applications will be published in scientific journals, both individually and collectively.
"The challenge was a fun way for us to work together as an anti-disciplinary team, mix ideas from social science and data science, and compete with people from all over the world," said Abdullah Almaatouq, a graduate student in the Human Dynamics research group at the MIT Media Lab. Almaatouq and his team won first place in three categories in the Fragile Families Challenge. "We learned a great deal from the challenge. It allowed us to compare modeling ideas from social science and data science in terms of their predictive performance. We got the chance to assess the trade-offs between styles of modeling, and also find new ways to combine them."
This predictive modeling is just the beginning, according to Salganik. The next phase of the challenge — which will combine machine learning and in-depth, qualitative interviews — will use the models to help identify and learn from children who are "beating the odds."
"I'm particularly excited about the interviews that are going to come out of the Fragile Families Challenge," said Ian Lundberg, a graduate student in sociology. He, along with Alex Kindel, also a sociology graduate student, are co-organizers of the Fragile Families Challenge. "We are going to interview some kids who are doing worse than expected and better than expected, and try to learn what else is out there that affects achievement and family experiences that we didn't measure," Lundberg said.
The challenge has also found application in the classroom and brought undergraduates into the research process. For example, Barbara Engelhardt, an assistant professor of computer science, used the Fragile Families Challenge in her undergraduate machine learning class. The challenge was also used in classes at Stanford University, Columbia University, the University of Wisconsin and Koc University in Turkey.
Undergraduates Maya Phillips, a senior majoring in computer science and earning a certificate in technology and society with a focus on information technology, and David Liu, a senior computer science major who is earning a certificate in statistics and machine learning, have been tinkering with the data and honing their own academic interests in the process.
"The intersection of social science and data science is something that really interests me, and it's what I've tried to focus on throughout my studies," said Phillips. "The current plan for my project is to build a web-based API (application programming interface), which programmers can use to access the Fragile Families metadata in ways that are beneficial to future research using the information. The Fragile Families data is a huge set of longitudinal data in a collection of files and formats on which machine learning and statistical analysis has been, and can continue to be done."
"I knew I wanted my thesis to focus on machine learning," said Liu, who has embarked on his senior thesis. "Fragile Families stuck out to me — it involved machine learning and real-world uncertainty. I'm interested in seeing to what extent machine learning is a viable tool for analyzing social science data."
"The thing about being open and collaborative is that you sometimes don't know what is going to happen," said Salganik of the scientific workshop and forthcoming research using the Fragile Families data. "You put something out into the world and sometimes the world returns some really cool stuff. Which is what happened here. People did really cool, really creative things that we could have never imagined, and that's really wonderful."