This course is the second in a two-semester sequence for Ph.D. students in Sociology. In this course, students will learn the statistical and computational principles necessary to perform modern, flexible, and creative analysis of quantitative social data. The course aims to transform students from consumers of quantitative research into producers of it.

See the logistics page for more information about time and location, prerequisites, software, code conventions, collaboration policy, Piazza, and inspirations.

- Conduct, interpret, and communicate results from analysis using multiple regression (including dummy variables and interactions).
- Conduct, interpret, and communicate results from analysis using logistic regression (including dummy variables and interactions).
- Describe the relationship between multiple regression, logistic regression, and the generalized linear model.
- Explain the limitations of observational data for making causal claims, and begin to use existing strategies for attempting to make causal claims from observational data.
- Write clean, reusable, and reliable R code.
- Build a solid, reproducible research pipeline to go from raw data to final paper.
- Feel empowered working with data.

- Learn new statistics
- Learn new programming

There are three main types of assignments for students:

- **Preparing for class:** For many classes there will be some reading (or video watching) that you must do before class. I expect you to come to class 100% prepared. I will assign a reasonable amount of material, and you must do it. I will not spend valuable class time summarizing readings that you should have done before class. Rather, we are going to use class time for more valuable learning activities.
- **Weekly homework:** Learning data analysis takes practice. There will be weekly homework assignments, and these assignments are described in detail on the homework page.
- **Replication and extension project:** Students will replicate and extend a published paper. For more information see the project page.

All class materials are available from our class GitHub page.

Materials are marked as either open access or closed access. If you do not have access to a university library, copies of many of the closed access articles can be found through Google Scholar.

- None.

- None.

- None.

- Garrett Grolemund's dplyr course on DataCamp.
- RStudio's data wrangling cheatsheet.
- Hadley Wickham's talk about dplyr at UseR 2014.
- Lohr (2014). For Big-Data Scientists, "Janitor Work" Is Key Hurdle to Insights. *New York Times*.
- Udwin and Baumer (2015). R Markdown. Working paper.
- RStudio's RMarkdown cheatsheet.

- Gentzkow and Shapiro (2014). Code and Data for the Social Sciences: A Practitioner's Guide. Mimeo.
- Tippmann (2014). Programming tools: Adventures with R. *Nature*.
- Google's R Style Guide.

- Wickham (2014). Tidy data. *Journal of Statistical Software*.
- Healy (2014). Plain text, papers, and pandoc. Blog post.
- Murrell (2008). Chapter 5.6: Databases in Introduction to Data Technologies.
- Kent (1983). A Simple Guide to Five Normal Forms in Relational Database Theory. *Communications of the ACM*.
- Slides from Andy Chen's talk about Google's R Style Guidelines at useR 2014.

- Healy and Moody (2014). Data visualization in sociology. *Annual Review of Sociology*.
- Healy (2014). A visualization error. Blog post.
- Tufte (1997). Visual and Statistical Thinking: Displays of Evidence for Making Decisions in Visual Explanations. (Available from Blackboard)
- Wickham (nd). Introduction in ggplot2: Elegant Graphics for Data Analysis.

- Gelman et al (2002). Let's Practice What We Preach: Turning Tables into Graphs. *The American Statistician*.
- Kastellec and Leoni (2007). Using Graphs Instead of Tables in Political Science. *Perspectives on Politics*. [see also the code repository]
- Rougier et al (2014). Ten Simple Rules for Better Figures. *PLOS Computational Biology*.
- Wickham (2010). A layered grammar of graphics. *Journal of Computational and Graphical Statistics*.

- Watch Visualizing Data Using ggplot2 by David Robinson.
  - Introduction (about 3 minutes)
  - Scatter Plots (about 8 minutes)
  - Faceting and Additional Options (about 4 minutes)
  - Histograms and Density Plots (about 4 minutes)
  - Boxplots and Violin Plots (about 3 minutes)
  - Input: Getting Data into the Right Format (about 9 minutes) [Note: we will not use qplot(), but you should know that it exists]
  - Output: Saving Your Plots (about 3 minutes)

- Slides from Dawn Koffman's class on ggplot2.
- R Graph Catalogue by Joanna Zhao and Jennifer Bryan.

- Watch Matthew McCullough's talk: What is version control?
- Watch Mehan Jayasuriya's talk: Introduction to Git and GitHub, Webcast. [Note: We will not be using all of the features of GitHub. The goal of this video is to give you a big picture overview. I don’t expect you to understand everything in the video. In class, we’ll focus on the parts that are most important for us. Also, we will generally use RStudio rather than the terminal to interact with git and GitHub.]
- Hadley Wickham's Webinar: Collaboration and time travel: version control with git, github and RStudio.
- Hadley Wickham. (nd) Chapter on git and github in R Packages.

- Nice R Code's Introduction to version control using git.
- Ram (2013). Git can facilitate greater reproducibility and increased transparency in science. *Source Code for Biology and Medicine*.
- Watch Linus Torvalds talk about git at Google in 2007. [Note: This is about the history and design philosophy of git; it is not about how to use git.]
- Orsini (2013). GitHub For Beginners: Don't Get Scared, Get Started. *ReadWriteWeb*.
- GitHub's git cheat sheet.

- None.

- None.

- None.

- Robinson (2014). broom: An R Package for Converting Statistical Analysis Objects Into Tidy Data Frames. Working paper.

- None.

- None.

- Fox, Chapter 7. (skim 7.2.1).

- None.

- None.

- None.

- Brambor, T., Clark, W.R., and Golder, M. (2006). Understanding Interaction Models: Improving Empirical Analyses. *Political Analysis*.
- Greenman, E. and Xie, Y. (2008). Double Jeopardy? The Interaction of Gender and Race on Earnings in the United States. *Social Forces*.

- None.

- Fox, Chapter 6. (Available from Blackboard)
- Berk, Chapter 4 (skip Section 4.6). (Available from Blackboard)

- None.

- None.

- None.

- Nuzzo, R. (2014). Scientific method: Statistical errors. *Nature*.
- Cohen, J. (1994). The earth is round (p < .05). *American Psychologist*.
- Simmons, J. et al. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. *Psychological Science*.
- King, G., Tomz, M., and Wittenberg, J. (2000). Making the Most of Statistical Analyses: Improving Interpretation and Presentation. *American Journal of Political Science*.
- Ward, M.D., Greenhill, B.D., and Bakke, K.M. (2010). The perils of policy by p-value: Predicting civil conflicts. *Journal of Peace Research*.

- Hauer (2004). The harm done by tests of significance. *Accident Analysis and Prevention*.
- Fidler and Loftus (2009). Why Figures with Error Bars Should Replace p Values: Some Conceptual Arguments and Empirical Demonstrations. *Journal of Psychology*.
- Trafimow and Marks (2015). Editorial. *Basic and Applied Social Psychology*.
- Greenland and Poole (2013). Living with p values: resurrecting a Bayesian perspective on frequentist statistics. *Epidemiology*.
- Gelman (2013). P values and statistical practice. *Epidemiology*.

- Fox, Appendix B.1.1 and Appendix B.1.2: Matrices.
- Fox, Chapter 9, Sections 9.1 - 9.2 (skip Sections 9.1.1 and 9.1.2). (Available from Blackboard)

- Khan Academy videos on matrix multiplication.

- None.

- None.

- Fox, Chapter 9, Sections 9.3 - 9.5. (Available from Blackboard)

- Myung (2003). Tutorial on maximum likelihood estimation. *Journal of Mathematical Psychology*.

- Taubes (2007). "Do we really know what makes us healthy?" *New York Times Magazine*.
- Morgan and Winship (2015). Counterfactuals and Causal Inference: Chapter 1 (Introduction) and Chapter 2 (Counterfactuals and the potential outcomes model). (Available from Blackboard)

- Kastellec and Leoni (2007). Using Graphs Instead of Tables in Political Science. *Perspectives on Politics*. [see also the code repository]

- Morgan and Winship (2015). Counterfactuals and Causal Inference: Chapter 3 (Causal graphs). (Available from Blackboard)

- Elwert and Winship (2014). Endogenous Selection Bias: The Problem of Conditioning on a Collider Variable. *Annual Review of Sociology*.

- Morgan and Winship (2015). Counterfactuals and Causal Inference: Chapter 4 (Models of causal exposure and identification criteria for conditioning estimators) and Chapter 5 (Matching estimators of causal effects). (Available from Blackboard)

- Stuart (2010). Matching Methods for Causal Inference: A Review and a Look Forward. *Statistical Science*.
- Harding (2003). Counterfactual Models of Neighborhood Effects: The Effect of Neighborhood Poverty on Dropping Out and Teenage Pregnancy. *American Journal of Sociology*.
- Arceneaux, Gerber, and Green (2010). A Cautionary Note on the Use of Matching to Estimate Causal Effects: An Empirical Example Comparing Matching Estimates to an Experimental Benchmark. *Sociological Methods & Research*.

- None.

- Morgan and Winship (2015). Counterfactuals and Causal Inference: Chapter 6 (Regression estimators of causal effects). (Available from Blackboard)
- Cornfield, J. et al. (1959). "Smoking and lung cancer: Recent evidence and a discussion of some questions." *Journal of the National Cancer Institute*. (Available from Blackboard) [SKIM]
- Nasar, S. (1993). "David Card and Alan Krueger; Two Economists Catch Clinton's Eye By Bucking the Common Wisdom." *New York Times*.

- Fox, Chapter 14.1
- Hanmer and Kalkan (2013). Behind the Curve: Clarifying the Best Approach to Calculating Predicted Probabilities and Marginal Effects from Limited Dependent Variable Models. *American Journal of Political Science*.
- Greenhill, Ward, and Sacks (2011). The Separation Plot: A new visual method for evaluating the fit of binary models. *American Journal of Political Science*.

- None.

- Mood (2010). Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. *European Sociological Review*, 67-82.
- Berry, DeMeritt, and Esarey (2010). Testing for Interaction in Binary Logit and Probit Models: Is a Product Term Essential? *American Journal of Political Science*, 248-266.

- Karlson, Holm, and Breen (2012). Comparing Regression Coefficients Between Same-sample Nested Models Using Logit and Probit: A New Method. *Sociological Methodology*, 286-313.

- Fox, Chapter 14.2.
- Quillian, L. and Pager, D. (2001). Black Neighbors, Higher Crime? The Role of Racial Stereotypes in Evaluations of Neighborhood Crime. *American Journal of Sociology*.

- None.

- Fox, Chapter 15.1 and 15.2 (stop at Three-Way Tables).
- Press release: Study: Hurricanes with female names more deadly than male-named storms.
- Jung et al. (2014). Female hurricanes are deadlier than male hurricanes. *PNAS*. (We will not discuss the parts of this paper dealing with experiments.)
- Freese (2014). My thoughts on that hurricane study. Blog post.

- Gelman and Hill (2007). Chapter 4 (Linear regression: before and after fitting the model) in *Data analysis using regression and multilevel/hierarchical models*. (Available from Blackboard)
- Data Colada blog: Thirty-somethings are Shrinking and Other U-Shaped Challenges.

- Jung et al. (2014). Female hurricanes are deadlier than male hurricanes. *PNAS*. (We will not discuss the parts of this paper dealing with experiments.)
- Freese (2014). My thoughts on that hurricane study. Blog post.

- Gelman and Hill (2007). Chapter 11 (Multilevel structures) in *Data analysis using regression and multilevel/hierarchical models*. (Available from Blackboard)
- Xie and Hannum (1996). Regional Variation in Earnings Inequality in Reform-Era Urban China. *American Journal of Sociology*. (Note: Focus especially on pages 950-972 and the appendix.)

- Feehan and Salganik (2014). Generalizing the Network Scale-Up Method: A New Estimator for the Size of Hidden Populations. Working paper. [SKIM]
- Salganik et al (2011). Assessing network scale-up estimates for groups most at risk for HIV/AIDS: Evidence from a multiple method study of heavy drug users in Curitiba, Brazil. *American Journal of Epidemiology*. (data and code available) [SKIM]
- Feehan et al (2015). Quality vs quantity: A survey experiment to improve the network scale-up method. Working paper. (Available from Blackboard) [SKIM]

- Korn, E.L. and Graubard, B.I. (1995). "Examples of Differing Weighted and Unweighted Estimates from a Sample Survey." *The American Statistician*, 49(3):291-295.
- Survey analysis in R.
- Kish, L. (1992). "Weighting for Unequal Pi." *Journal of Official Statistics*, 8(2):183-200.

- None.

- Fox, Chapter 1.
- Berk (2003). *Regression Analysis: A Constructive Critique*: Chapter 11 (What to do). (Available from Blackboard)
- Rosenbaum (2002). *Observational Studies*: Chapters 11 (Planning an observational study) and 12 (Some strategic issues). (Available from Blackboard)