My research focus is on the intersection of machine learning and education.

Selected areas of my work include:
1. Learning and content analytics: analyzing the knowledge state of every student and the quality of every learning resource, e.g., textbooks, lecture videos, and assessment questions.
2. Grading and feedback: automatically grading student responses to open-response questions and providing feedback.
3. Personalization: recommending fail-safe personalized learning actions for every individual student to maximize their future knowledge retention.
4. Behavior analysis: analyzing how learner behavior affects learning outcome.

My analytics and personalization algorithms have been integrated into OpenStax Tutor, OpenStax's personalized learning system. In the 2017–2018 academic year, nearly 1.5 million U.S. college students used OpenStax's collection of 29 free, online textbooks.

I am also broadly interested in other areas of machine learning and their applications to other domains.

Selected works and publications are listed below; see my publications page for a full list.


Learning and Content Analytics

SPARse factor analysis for learning and content analytics (SPARFA)


SPARFA is a purely data-driven framework for learning and content analytics. Based on the observation that a small number of latent factors (which we term "concepts") govern students' performance, SPARFA analyzes binary-valued (correct/incorrect) graded student responses to assessment questions and jointly estimates i) question-concept associations, ii) student concept knowledge, and iii) intrinsic question difficulties. SPARFA performs learning analytics by providing personalized feedback to students on their knowledge of each concept, and performs content analytics by analyzing how every question relates to each concept and how difficult it is. The original SPARFA paper can be found here:

A. S. Lan, A. E. Waters, C. Studer, and R. G. Baraniuk, "Sparse Factor Analysis for Learning and Content Analytics," Journal of Machine Learning Research (JMLR), Vol. 15, pp. 1959–2008, June 2014
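
To make the model concrete, here is a minimal sketch of the SPARFA response model (shown with a logistic link; all variable names and numbers are illustrative): the probability that student j answers question i correctly is a function of w_i . c_j - mu_i, where w_i is a sparse, nonnegative row of question-concept associations, c_j is the student's concept knowledge, and mu_i is the question's intrinsic difficulty.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparfa_correct_prob(W, C, mu):
    """Predicted correctness probabilities for all (question, student) pairs.

    W  : (Q, K) sparse, nonnegative question-concept association matrix
    C  : (K, N) student concept-knowledge matrix
    mu : (Q,)   intrinsic question difficulties
    """
    return sigmoid(W @ C - mu[:, None])

# Toy example: 2 questions, 2 concepts, 3 students.
W = np.array([[1.0, 0.0],    # question 0 tests concept 0 only
              [0.5, 0.8]])   # question 1 tests both concepts
C = np.array([[2.0, 0.0, -2.0],
              [1.0, 0.0, -1.0]])
mu = np.array([0.0, 0.5])
P = sparfa_correct_prob(W, C, mu)   # strong student 0, weak student 2
```

Estimating W, C, and mu from observed graded responses is the actual SPARFA algorithm (sparse factor analysis with nonnegativity constraints), which is beyond this sketch.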

An extension to analyze ordinal responses (partial credits) can be found here:

A. S. Lan, C. Studer, A. E. Waters, and R. G. Baraniuk, "Tag-Aware Ordinal Sparse Factor Analysis for Learning and Content Analytics," Proc. International Conference on Educational Data Mining (EDM), pp. 90–97, July 2013

An extension that jointly analyzes graded response data and question text to interpret the meaning of the latent concepts can be found here:

A. S. Lan, C. Studer, A. E. Waters, and R. G. Baraniuk, "Joint Topic Modeling and Factor Analysis of Textual Information and Graded Response Data," Proc. International Conference on Educational Data Mining (EDM), pp. 324–325, July 2013

An extension that performs time-varying learning analytics by tracing students' knowledge evolution through time and also improves content analytics by analyzing the content and quality of learning resources (e.g., textbooks, lecture videos, etc.) can be found here:

A. S. Lan, C. Studer, and R. G. Baraniuk, "Time-Varying Learning and Content Analytics via Sparse Factor Analysis," Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 452–461, Aug. 2014


Non-linear student-response models: Dealbreaker and BLAh

Most existing student-response models are linear and additive; they achieve good prediction performance but admit limited interpretability. We develop two non-linear student-response models: the Dealbreaker model, which models a student's chance of answering a question correctly as dependent only on their minimum concept knowledge among the concepts the question covers, and the Boolean logic analysis (BLAh) model, which models binary-valued graded student responses as outputs of Boolean logic functions.

Traditional compensatory student-response models, including SPARFA, characterize a student's success probability on a question as dependent on a linear combination of their knowledge of different concepts. Such linear models can be used to predict unobserved responses, but offer limited interpretability, since they allow students to make up for a lack of knowledge of certain concepts with high knowledge of others. In contrast, the Dealbreaker model is a non-linear model that characterizes a student's success probability on a question as dependent only on their weakest knowledge among all concepts tested in the question. The Dealbreaker paper can be found here:

A. S. Lan, T. Goldstein, R. G. Baraniuk, and C. Studer, "Dealbreaker: A Nonlinear Latent Variable Model for Educational Data," Proc. International Conference on Machine Learning (ICML), pp. 266–275, June 2016
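
The min-over-concepts structure can be sketched in a few lines (variable names are illustrative, and a logistic link is assumed here):

```python
import numpy as np

def dealbreaker_correct_prob(C, concept_ids, mu):
    """P(correct) depends only on the weakest tested concept: a logistic
    function of min_k c_{kj} - mu over the concepts k the question covers.

    C           : (K, N) student concept-knowledge matrix
    concept_ids : indices of the concepts the question tests
    mu          : intrinsic difficulty of the question
    """
    weakest = C[concept_ids].min(axis=0)   # per-student minimum over tested concepts
    return 1.0 / (1.0 + np.exp(-(weakest - mu)))

# Two concepts, two students: student 0 is strong on concept 0 but weak on
# concept 1, so their high knowledge on concept 0 cannot compensate.
C = np.array([[3.0, 0.0],
              [-1.0, 2.0]])
p = dealbreaker_correct_prob(C, [0, 1], mu=0.0)
```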

The BLAh model goes beyond the "AND" family of models to which the Dealbreaker model belongs, and characterizes a student's graded response to a question as the output of the Boolean logic function corresponding to that question, making it more flexible and interpretable than the Dealbreaker model. The BLAh paper can be found here:

A. S. Lan, A. E. Waters, C. Studer, and R. G. Baraniuk, "BLAh: Boolean Logic Analysis for Graded Student Response Data," IEEE Journal of Selected Topics in Signal Processing (JSTSP), Vol. 11, Issue 5, pp. 754–764, Aug. 2017
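
As a toy illustration of the representation (not the learning algorithm), a question corresponds to a Boolean function of binary concept-mastery indicators; the function below is hypothetical:

```python
def question_logic(knows_a, knows_b, knows_c):
    # Hypothetical question solvable either by a method requiring concepts
    # a AND b, or by a shortcut requiring concept c alone: the Boolean
    # function (a AND b) OR c, which an AND-only model such as Dealbreaker
    # cannot express.
    return (knows_a and knows_b) or knows_c

# A student who only knows the shortcut concept c still answers correctly.
shortcut_only = question_logic(False, False, True)
```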



Grading and Feedback

Mathematical language processing (MLP)


MLP is a framework for analyzing students' responses to open-response mathematical questions for grading and feedback. We featurize and cluster students' responses to open-ended mathematical questions, e.g., the free-form derivations that are common in science, technology, engineering, and mathematics (STEM) fields. Then, we perform automatic grading and feedback using a small number of instructor-graded responses. The MLP paper can be found here:

A. S. Lan, D. Vats, A. E. Waters, and R. G. Baraniuk, "Mathematical Language Processing: Automatic Grading and Feedback for Open Response Mathematical Questions," Proc. ACM Conference on Learning at Scale (L@S), pp. 167–176, Mar. 2015
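
A minimal sketch of the cluster-then-propagate idea (the featurization below is a toy stand-in for the paper's, and all names are illustrative):

```python
def featurize(response):
    # Toy feature map: the set of "expressions" (whitespace-separated tokens).
    return frozenset(response.split())

def cluster(responses):
    """Group response indices whose feature sets are identical."""
    groups = {}
    for i, r in enumerate(responses):
        groups.setdefault(featurize(r), []).append(i)
    return list(groups.values())

def propagate_grades(responses, instructor_grades):
    """instructor_grades: {response_index: grade} for one member per cluster;
    every member of a cluster inherits that grade."""
    grades = {}
    for members in cluster(responses):
        graded = [i for i in members if i in instructor_grades]
        if graded:
            for i in members:
                grades[i] = instructor_grades[graded[0]]
    return grades

# Responses 0 and 1 contain the same expressions, so grading one of them
# automatically grades the other.
responses = ["x=2 y=3", "y=3 x=2", "x=5"]
grades = propagate_grades(responses, {0: 1.0, 2: 0.0})
```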


Misconception detection


We developed a new natural language processing-based framework to detect the common misconceptions among students' textual responses to short-answer questions. Our framework excels at classifying whether a response exhibits one or more misconceptions. More importantly, it can also automatically detect the common misconceptions exhibited across responses from multiple students to multiple questions; this property is especially important at large scale, since instructors will no longer need to manually specify all possible misconceptions that students might exhibit. The paper can be found here:

J. Michalenko, A. S. Lan, and R. G. Baraniuk, "Data-mining Textual Responses to Uncover Misconception Patterns," Proc. International Conference on Educational Data Mining (EDM), pp. 208–213, June 2017
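
A small illustration of the "common across students and questions" step (the labels below are hypothetical; in the actual framework they are discovered from the response text rather than given):

```python
from collections import Counter

# Each record: (student, question, misconception label assigned to a response).
observations = [
    ("s1", "q1", "confuses_mass_and_weight"),
    ("s2", "q1", "confuses_mass_and_weight"),
    ("s2", "q3", "confuses_mass_and_weight"),
    ("s3", "q2", "sign_error"),
]
counts = Counter(label for _, _, label in observations)
# Misconceptions exhibited in at least two responses count as "common".
common = [m for m, n in counts.most_common() if n >= 2]
```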



Personalization

Personalized learning action selection


We study the problem of turning the insights gained from learning and content analytics into personalization -- providing personalized recommendations for each student on which learning actions (read a section of a textbook, watch a lecture video, work on a practice question, etc.) they should take. We make use of the contextual bandits framework; the papers can be found here:

A. S. Lan and R. G. Baraniuk, "A Contextual Bandits Framework for Personalized Learning Action Selection," Proc. International Conference on Educational Data Mining (EDM), pp. 424–429, June 2016
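
A minimal sketch in the spirit of this framework, using the standard LinUCB algorithm (the paper's exact algorithm and features may differ): the context x could encode a student's estimated knowledge state, each arm a candidate learning action, and the reward a measure of subsequent learning.

```python
import numpy as np

class LinUCB:
    """Per-arm linear reward model with an upper-confidence exploration bonus."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # per-arm reward sums

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # ridge-regression estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # confidence width
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

bandit = LinUCB(n_arms=3, dim=2)       # e.g., 3 candidate learning actions
x = np.array([1.0, 0.5])               # context for the current student
arm = bandit.select(x)
bandit.update(arm, x, reward=1.0)      # observed learning outcome
```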


An extension on taking uncertain context into account can be found here:

I. Manickam, A. S. Lan, and R. G. Baraniuk, "Contextual Multi-armed Bandit Algorithms for Personalized Learning Action Selection," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6344–6348, Mar. 2017 (invited paper)


Safe personalization


We demonstrate that linearizing the probit model in combination with linear estimators performs on par with state-of-the-art nonlinear regression methods, such as posterior mean or maximum a-posteriori estimation. More importantly, we derive exact, closed-form, and nonasymptotic expressions for the mean-squared error of our linearized estimators. Applying our linearization technique to IRT models yields much tighter bounds on learner and question parameter estimates, especially when the numbers of learners and questions are small. Therefore, our analysis has the potential to improve the safety of personalization. The paper can be found here:

A. S. Lan, M. Chiang, and C. Studer, "Linearized Binary Regression," Conference on Information Sciences and Systems (CISS), Mar. 2018, to appear
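
The flavor of the result can be illustrated with a simple (non-optimal) linear estimator for probit-type binary observations; the closed-form optimal linear estimators and their exact MSE expressions are what the paper derives.

```python
import numpy as np

# Binary observations y = sign(A x + noise); for Gaussian A, even the plain
# linear estimate A^T y / m recovers the direction of x.
rng = np.random.default_rng(0)
m, d = 2000, 5
x = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
A = rng.standard_normal((m, d))
y = np.sign(A @ x + 0.1 * rng.standard_normal(m))
x_hat = (A.T @ y) / m

# Cosine similarity between the estimate and the true parameter vector.
cos_sim = (x_hat @ x) / (np.linalg.norm(x_hat) * np.linalg.norm(x))
```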



Behavior Analysis

Measuring engagement from clickstream data


We propose a new model for learning that relates video-watching behavior to engagement level. One of the advantages of our method for determining engagement is that it can be done entirely within standard online learning platforms, serving as a more universal and less invasive alternative to existing measures of engagement that require the use of external devices. We also find that our model identifies key behavioral features (e.g., larger numbers of pauses and rewinds, and smaller numbers of fast forwards) that are correlated with higher learner engagement. The paper can be found here:

A. S. Lan, C. Brinton, T. Yang, and M. Chiang, "Behavior-Based Latent Variable Model for Learner Engagement," Proc. International Conference on Educational Data Mining (EDM), pp. 64–71, June 2017
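
For concreteness, extracting the kind of behavioral counts the model relies on from a raw clickstream of video-player events might look like this (the event names are hypothetical):

```python
from collections import Counter

def behavior_features(events):
    """Count behavioral events from a clickstream of (event, timestamp) pairs."""
    counts = Counter(event for event, _timestamp in events)
    return {
        "n_pauses": counts["pause"],
        "n_rewinds": counts["rewind"],
        "n_fast_forwards": counts["fast_forward"],
    }

clickstream = [("play", 0), ("pause", 31), ("rewind", 40),
               ("play", 41), ("fast_forward", 95), ("pause", 120)]
feats = behavior_features(clickstream)
```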


Instructor preference analysis


We propose a latent factor model that analyzes instructors' preferences in explicitly excluding particular questions from learners' assignments in a particular subject domain. We incorporate expert-labeled Bloom's Taxonomy tags on each question as a factor in our statistical model to improve model interpretability. Our model provides meaningful interpretations that help us understand why instructors exclude certain questions, thus helping automated learning systems to behave more "instructor-like". The paper can be found here:

Z. Wang, A. S. Lan, and R. G. Baraniuk, "A Latent Factor Model For Instructor Content Preference Analysis," Proc. International Conference on Educational Data Mining (EDM), pp. 290–295, June 2017



Non-educational Applications

Learning robust binary hash functions


We propose a new data-dependent method to learn binary hash functions. Inspired by recent progress in robust optimization, we develop a novel hashing algorithm, dubbed RHash, that minimizes the worst-case distortion among pairs of points in a dataset. We show that RHash achieves the same retrieval performance as state-of-the-art algorithms in terms of average precision while using up to 60% fewer bits on several large-scale real-world image datasets. The paper can be found here:

A. Aghazadeh, A. S. Lan, A. Shrivastava, and R. G. Baraniuk, "RHash: Robust Hashing via \ell_{\infty}-norm Distortion," Proc. International Joint Conference on Artificial Intelligence (IJCAI), pp. 1386–1394, Aug. 2017
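
A sketch of the quantity being minimized: the worst-case gap between normalized original distances and Hamming distances over all pairs of points, evaluated here for a fixed random-hyperplane hash rather than RHash's optimized one (all details below are illustrative).

```python
import numpy as np

def hash_codes(X, H):
    """One bit per hyperplane: the sign pattern of X projected onto H."""
    return (X @ H.T > 0).astype(int)

def worst_case_distortion(X, H):
    """Largest pairwise gap between normalized Euclidean and Hamming distances."""
    B = hash_codes(X, H)
    n = len(X)
    D = np.array([[np.linalg.norm(X[i] - X[j]) for j in range(n)]
                  for i in range(n)])
    D = D / D.max()                            # normalize distances to [0, 1]
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_hamm = np.sum(B[i] != B[j]) / H.shape[0]
            worst = max(worst, abs(D[i, j] - d_hamm))
    return worst

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))               # 10 data points in 4 dimensions
H = rng.standard_normal((16, 4))               # 16 random hash hyperplanes
distortion = worst_case_distortion(X, H)
```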


Sensor selection for biosensing and structural health monitoring


We develop a new sensor selection framework for sparse signals that finds a small subset of sensors (fewer than the signal dimension) that best recovers such signals. Our proposed algorithm, Insense, minimizes a coherence-based cost function that is adapted from classical results in sparse recovery theory. Using a range of datasets, including two real-world datasets from microbial diagnostics and structural health monitoring, we demonstrate that Insense significantly outperforms conventional algorithms when the signal is sparse. The paper can be found here:

A. Aghazadeh, M. Golbabaee, A. S. Lan, and R. G. Baraniuk, "Insense: Incoherent Sensor Selection for Sparse Signals," Signal Processing, 2018, to appear
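
A greedy heuristic in the same spirit (not the Insense algorithm itself, which minimizes a smoother coherence-based cost): pick sensors one at a time so that the mutual coherence of the selected submatrix stays small.

```python
import numpy as np

def mutual_coherence(A):
    """Largest normalized inner product between distinct columns of A."""
    G = A.T @ A
    norms = np.linalg.norm(A, axis=0)
    G_norm = np.abs(G) / np.outer(norms, norms)
    np.fill_diagonal(G_norm, 0.0)
    return G_norm.max()

def greedy_select(Phi, k):
    """Greedily pick k rows (sensors) of Phi with low submatrix coherence."""
    selected = []
    remaining = list(range(Phi.shape[0]))
    for _ in range(k):
        best = min(remaining,
                   key=lambda r: mutual_coherence(Phi[selected + [r]]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
Phi = rng.standard_normal((8, 5))   # 8 candidate sensors, 5-dimensional signal
rows = greedy_select(Phi, k=3)      # select 3 sensors (< signal dimension)
```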


Cloud dynamics and bidding strategy


We propose a nonlinear dynamical system model for the time-evolution of the spot price as a function of latent states that characterize user demand in the spot and on-demand markets. This model enables us to adaptively predict future spot prices given past spot price observations, allowing us to derive user bidding strategies for heterogeneous cloud resources that minimize the cost to complete a job with negligible probability of interruption. The paper can be found here:

M. Khodak, L. Zheng, A. S. Lan, C. Joe-Wong, and M. Chiang, "Learning Cloud Dynamics to Optimize Spot Instance Bidding Strategies," IEEE International Conference on Computer Communications (INFOCOM), Apr. 2018, to appear
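
A toy sketch of the predict-then-bid loop (an AR(1) predictor standing in for the paper's latent-state model; all parameters are illustrative):

```python
def predict_next(prices, phi=0.9):
    """AR(1)-style one-step prediction: mean reversion around the sample mean."""
    mean = sum(prices) / len(prices)
    return mean + phi * (prices[-1] - mean)

def choose_bid(prices, margin=1.2):
    """Bid a safety margin above the predicted spot price to keep the
    probability of interruption low."""
    return margin * predict_next(prices)

history = [0.10, 0.12, 0.11, 0.15, 0.14]   # recent spot prices ($/hour)
bid = choose_bid(history)
```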


Phase retrieval


We show that with the availability of an initial guess, phase retrieval can be carried out with an even simpler, linear procedure. Our algorithm, called PhaseLin, is the linear estimator that minimizes the mean squared error (MSE) when applied to the magnitude measurements. We demonstrate that by iteratively using PhaseLin, one arrives at an efficient phase retrieval algorithm that performs on par with existing convex and nonconvex methods on synthetic and real-world data. The paper can be found here:

R. Ghods, A. S. Lan, T. Goldstein, and C. Studer, "PhaseLin: Linear Phase Retrieval," Conference on Information Sciences and Systems (CISS), Mar. 2018, to appear
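
For context, the classical alternating-minimization baseline for this problem (which PhaseLin replaces with a linearized update) can be sketched as follows; the monotone decrease of the magnitude residual is a known property of this scheme, and all dimensions below are illustrative.

```python
import numpy as np

# Recover x from magnitude measurements y = |A x|, starting from an initial
# guess near the truth (matching the "initial guess" setting described above).
rng = np.random.default_rng(3)
m, n = 60, 8
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = np.abs(A @ x_true)

x = x_true + 0.3 * rng.standard_normal(n)    # initial guess
A_pinv = np.linalg.pinv(A)
res0 = np.linalg.norm(np.abs(A @ x) - y)
for _ in range(50):
    phases = np.sign(A @ x)                  # guess the missing signs
    x = A_pinv @ (y * phases)                # least-squares update
res = np.linalg.norm(np.abs(A @ x) - y)      # never larger than res0
```

When the loop converges, x matches x_true up to a global sign, which is the inherent ambiguity of real-valued phase retrieval.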