## Program in Statistics and Machine Learning

#### Director

Kosuke Imai

#### Executive Committee

Barbara Engelhardt, Computer Science

Jianqing Fan, Operations Research and Financial Engineering

Kosuke Imai, Politics

John D. Storey, Lewis-Sigler Institute for Integrative Genomics

#### Associated Faculty

Yacine Ait-Sahalia, Economics

Sanjeev Arora, Computer Science

Mung Chiang, Electrical Engineering

Jonathan D. Cohen, Psychology, Princeton Neuroscience Institute

Paul W. Cuff, Electrical Engineering

David P. Dobkin, Computer Science

Kirill Evdokimov, Economics

Elad Hazan, Computer Science

Bo E. Honore, Economics

Michal Kolesar, Woodrow Wilson School, Economics

Samory K. Kpotufe, Operations Research and Financial Engineering

Sanjeev R. Kulkarni, Electrical Engineering

Han Liu, Operations Research and Financial Engineering

Ulrich K. Mueller, Economics

Jonathan W. Pillow, Psychology, Princeton Neuroscience Institute

Peter J. Ramadge, Electrical Engineering

Marc Ratkovic, Politics

Matthew J. Salganik, Sociology

H. Sebastian Seung, Computer Science, Princeton Neuroscience Institute

Christopher A. Sims, Economics

Amit Singer, Mathematics, Applied and Computational Mathematics

Mona Singh, Computer Science, Lewis-Sigler Institute for Integrative Genomics

Michael A. Strauss, Astrophysical Sciences

Olga G. Troyanskaya, Computer Science, Lewis-Sigler Institute for Integrative Genomics

Ramon van Handel, Operations Research and Financial Engineering

Robert J. Vanderbei, Operations Research and Financial Engineering

Sergio Verdu, Electrical Engineering

Mark W. Watson, Woodrow Wilson School, Economics

#### Sits with Committee

Germán Rodriguez, Population Research

#### Information and Departmental Plan of Study

The Program in Statistics and Machine Learning is offered by the Center for Statistics and Machine Learning. The program is designed for students, concentrating in any department, who have a strong interest in data analysis and its application across disciplines. Statistics and machine learning -- the academic disciplines centered around developing and understanding data analysis tools -- play an essential role in various scientific fields including biology, engineering, and the social sciences. This new field of "data science" is interdisciplinary, merging contributions from computer science and statistics, and addressing numerous applied problems. Examples of data analysis problems include analyzing massive quantities of text and images, modeling cell-biological processes, pricing financial assets, evaluating the efficacy of public policy programs, and forecasting election outcomes. In addition to its importance in scientific research and policy making, the study of data analysis comes with its own theoretical challenges, such as the development of methods and algorithms for making reliable inferences from high-dimensional and heterogeneous data. This program provides students with a set of tools required for addressing these emerging challenges. Through the program, students will learn basic theoretical frameworks and apply statistics and machine learning methods to many problems of interest.

#### Admission to the Program

Students are admitted to the program after they have chosen a concentration, generally by the beginning of their junior year. At that time, students must have prepared a tentative plan and timeline for completing all of the requirements of the program, including required courses and independent work (as outlined below), as well as any prerequisites for the selected courses. For enrollment or questions, contact Tara Zigler, program manager.

#### Program of Study

Students are required to take a total of five courses and earn at least a B- for each course: one of the "Foundations of Statistics" courses, one of the "Foundations of Machine Learning" courses, and three elective courses. With all necessary permissions, advanced students may also take approved graduate-level courses. Students may count at most two courses from their departmental concentration or another certificate program toward this certificate program.

Students are also required to complete a thesis or at least one term of independent work in their junior or senior year on a topic that makes substantial application or study of machine learning or statistics. This work may be used to satisfy the requirements of both the program and the student's department of concentration. Submission is due on the same date as the student's departmental deadline for the thesis or junior independent work. All work will be reviewed by the Statistics and Machine Learning certificate committee. At the end of each academic year, there will be a public poster session for students to present their work to each other, to other students, and to the faculty.

Finally, students are encouraged to attend one of the Statistics and Machine Learning colloquia on campus. These include the Wilks Statistics Seminar, the Machine Learning Seminar, the Political Methodology Seminar, and the Quantitative and Computational Biology Seminar.

**Courses**

**One of the following courses (Foundations of Statistics)**

ECO 202 Statistics and Data Analysis for Economics

EEB 355/MOL 355 Introduction to Statistics for Biology

ORF 245 Fundamentals of Engineering Statistics

POL 345 Quantitative Analysis and Politics

PSY 251 Quantitative Methods

WWS 200 Statistics for Social Science

**One of the following courses (Foundations of Machine Learning)**

COS 424 Interacting with Data

ORF 350 Analysis of Big Data

**Three of the following courses (including those above, with permission)**

*Machine Learning*

COS 402 Artificial Intelligence

ELE 477 Kernel-Based Machine Learning

ORF 418 Optimal Learning

*Theory*

MAT 385 Probability Theory

ORF 309 Probability and Stochastic Systems

ORF 463/COS 323 Computing and Optimization

*Applied Statistics*

AST 303 Observing and Modeling the Universe

CEE 460 Risk Analysis

ECO 302 Econometrics

ECO 312 Econometrics: A Mathematical Approach

ECO 313 Econometric Applications

ELE 480/NEU 480/PSY 480 FMRI Decoding: Reading Minds Using Brain Scans

GEO 422 Data, Models, and Uncertainty in the Natural Sciences

MOL 436 Statistical Methods for Genomic Data

ORF 405 Regression and Time Series

POL 346 Applied Quantitative Analysis

#### Certificate of Proficiency

Students who fulfill the program requirements receive a certificate upon graduation.

### Courses

SML 101 Reasoning with Data Fall, Spring QR

Data-driven decision-making, research discovery, and technology development are everywhere. It is now more important than ever for individuals to understand how data are used for these purposes. This course will introduce the student to how statistical reasoning and methods are used to learn from and leverage modern data. The emphasis will be on concepts and strategies for learning from data, rather than on sophisticated mathematics. Students will be exposed to the basics of statistics, machine learning, and data science through real world problems and applications. Students will also analyze data sets using the computer.
*Staff*

SML 201 Introduction to Data Science Spring QR

This course provides an introduction to the burgeoning field of data science, which is primarily concerned with data-driven discovery and utilizing data as a research and technology development tool. We cover approaches and techniques for obtaining, organizing, exploring, and analyzing data, as well as creating tools based on data. Elements of statistics, machine learning, and statistical computing form the basis of the course content. We consider applications in the natural sciences, social sciences, and engineering.
*J. Storey*

SML 302 Fundamentals of Machine Learning (see COS 424)