Microsoft Research Data Science Summer School

Region: North America

Dates: May 27, 2025 – June 20, 2025

Application deadline: Tuesday, April 15, 2025, 5:00pm PT

Projects

2023

Replicating “COVID-19 lockdowns cause global air pollution declines”

Students: Hamidou Ballo, Jiale (Jerry) Chen, Tenzin Chosang, Aleksandra (Aleks) Georgievska, Chaya (Sara) Goldberger, Albina Haque, Alexandra (Alex) Huey, Bibata Rabba Idi, John Jakobsen, Hadassah Krigsman, Elvis Soto, Xiuwen Zhu

This year’s project looked at replicating and extending a paper on the effect of the COVID-19 lockdown on global air pollution. The original paper set out to compare the concentration of various air pollutants during early 2020 to the same time period during the preceding years, finding a substantial reduction in pollution, presumably due to a decrease in economic activity during that time. In attempting to reproduce the original findings using data released by the authors, the students found two potential errors in the authors’ original analysis. We worked with the authors to resolve these discrepancies, which changed the magnitude of the paper’s findings but not the overall conclusions. The authors have since updated the paper and issued a correction detailing the changes.

2022

Replicating “Differential COVID-19 case positivity in New York City”

Students: Warren Ball, Christopher Esquivel, Shashana Farber, Daniel Glick, Navpreet Kaur, Limor Kohanim, Elissa Leung, Abhishek Pokharel, Sushobhan Parajuli, Alexandra Roffe, Nirvi Shah, Christopher Stewart

This project involved replicating and extending an article on zip-code level correlations between COVID-19 caseloads in NYC in early 2020 and various socioeconomic factors along with daily mobility data. Students worked in groups of two and wrote their own code to download COVID caseloads and U.S. Census data, and then conducted analyses to reproduce several figures and regressions presented in the paper. They were able to replicate many of the findings in the paper (e.g., with model fits that were nearly identical to those in the paper), after which they extended the work along various dimensions. More details are available in their group reports.

2021

Replicating “Small share of US police draw third of complaints in big cities”

Students: Nikola Baci, Ahuva Bechhofer, Karen Britt, Vanessa Johnson, Xin Yi Li, Yongqi (Yuki) Li, Yasiris Ortiz, Andrea Ramirez, Adina Scheinfeld, Sambhav Shrestha, Anthony Vallejo, Matt Veng

This project looked at a recent analysis of complaints against police officers by The Financial Times. The original piece examined complaints filed in Chicago, New York, and Philadelphia and found a common thread: a comparatively small set (10%) of officers drew over a third of each city’s total complaints. The students worked in groups of two and reproduced the analysis from the article, starting from original public data sets from each city. They were able to exactly replicate the reported findings, after which they extended the work to investigate new questions of their own design. These extensions ranged from looking at how complaints varied by officers’ gender and race, to the demographics of the person who filed the complaint, to the likelihood that the complaints were sustained based on officer and complainant demographics.
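
The headline figure in that kind of analysis (the share of complaints drawn by the top 10% of officers) can be sketched as follows. This is a minimal illustration, not the students' or the article's code: the function name `top_share`, the toy officer IDs, and the choice to rank only officers who have at least one complaint are all assumptions made for the example.

```python
import math
from collections import Counter


def top_share(complaint_officer_ids, top_fraction=0.10):
    """Return the share of all complaints drawn by the most-complained-about officers.

    complaint_officer_ids: one officer ID per complaint (hypothetical input format).
    top_fraction: fraction of officers (among those with any complaint) to treat as "top".
    """
    counts = Counter(complaint_officer_ids)            # complaints per officer
    ranked = sorted(counts.values(), reverse=True)     # most complaints first
    n_top = max(1, math.ceil(top_fraction * len(ranked)))
    return sum(ranked[:n_top]) / sum(ranked)


# Hypothetical toy data: 10 officers, 19 complaints in total.
complaints = ["A"] * 8 + ["B"] * 3 + ["C", "D", "E", "F", "G", "H", "I", "J"]
share = top_share(complaints)  # officer "A" alone is the top 10%
```

In this toy data a single officer accounts for 8 of 19 complaints. Whether officers with zero complaints belong in the denominator is a design choice that changes the result, so a real replication would need to match whatever the original analysis did.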

Get the code and data on GitHub >

2020

Replicating ‘Predicting the Present’ with search data

Students: Iman Abakoyas, Rajiv Basnet, Hasanat Jahan, Gabrielle Martinez, Krushang Shah, Basira Shirzad, Tamar Yastrab, Xiaona Zhou

This year’s project involved replicating and extending a widely read paper (Choi and Varian, 2011) on using search data to predict current and future economic outcomes. Students worked in groups of two and wrote their own original code with two goals in mind: first to reproduce the results published in the paper, and second to extend those results in a direction of their choosing. The students were able to exactly replicate the paper’s results when using data provided by the authors, but saw some small discrepancies when using versions of the source data currently available online. We suspect these differences are due to changes in the underlying datasets and to unspecified preprocessing done by the authors. The students extended the paper in several ways: examining alternative models, forecasting on longer time horizons, and evaluating the value of search data on a longer timescale. In investigating the latter, the students found that the utility of search data has decreased since the time of the original publication, and that in recent years a simple baseline model that omits search data is, on average, more accurate than one that includes it.
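
One common way to frame a "more accurate on average" comparison between two forecasting models is mean absolute error on held-out periods. The sketch below assumes that framing; the function name and all numbers are hypothetical toy values, not results from the project or the paper.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed values and forecasts."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical toy series: observed values and two competing forecasts.
actual = [100.0, 104.0, 103.0, 108.0]
baseline_forecast = [101.0, 103.0, 105.0, 107.0]  # model without search data
search_forecast = [98.0, 106.0, 100.0, 111.0]     # model with search data

baseline_mae = mean_absolute_error(actual, baseline_forecast)
search_mae = mean_absolute_error(actual, search_forecast)
baseline_wins = baseline_mae < search_mae
```

With these made-up numbers the baseline has the lower error, which mirrors the direction of the finding described above but says nothing about its size.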

Get the data and code on GitHub >

(Note: our 2020 program was modified due to COVID-19, shortened from 8 weeks to 4 weeks and run virtually.)

2019

Replicating ‘An Empirical Analysis of Racial Differences in Police Use of Force’

Students: Brenda Fried, Naomi Moreira, Harpreet Gaur, Cindy Muso, Adnan Hoq, Etta Rapp, Emeka Mbazor, Roymill Terrero

This project replicates and extends a recent paper on racial bias in police use of force. We selected this paper because it is both widely read and also an ideal candidate for a data analysis replication. It uses relatively simple methodology that seems straightforward to implement and check, relies on two publicly available datasets, and contains more than 100 pages between the main text and extensive appendices. Despite this nearly ideal setting, completing the data analysis replication turned out to be much more complicated than expected and took several weeks itself, mainly for reasons that centered around how the original data were cleaned and featurized. These challenges came despite the extensive documentation in the paper and its appendix, but they also helped uncover insights that might not have been obvious from simply reading the paper. We extended the paper’s results through the addition of map and census information as well as predictive checks of the underlying models used in the paper. In this talk we discuss the various challenges we faced in replicating the results and the insights that the replication revealed.

Watch the talk >

View the slides >

Get the source code on GitHub >

2018

Exploring the Reliability of the NYC Subway System

Students: Akbar Mirza, Brian Hernandez, Amanda Rodriguez, Renzhentaxi Baerde, Phoebe Nguyen, Peter Farquharson, Ayliana Teitelbaum, Sasha Paulovich

The New York City subway is the largest rapid transit system in the world, serving approximately 5.5 million riders each day. Recently there has been a growing concern over the state of the subway system due to aging equipment as reflected in system-wide metrics such as “on-time percentage”, or how often trains run according to schedule. While these metrics provide some insight into the performance of the subway system, they fail to capture how riders experience the system. In this project we use recently released countdown clock data that logs where each train is reported to be at each minute of the day to gain a better understanding of how riders experience the subway system. We examine rider wait times and trip times, considering not just average but also worst-case performance of the system. We also compare the subway to above ground travel, investigate how changes to the system affect rider options, and look at how commutes vary across demographic groups. We find that the subway is typically quite reliable, but that averages can be misleading: variance in subway performance can account for up to a 50% difference between average and worst-case travel times. We also find a correlation between income and commute times and that small changes to the system (e.g., adding or removing stops or lines) can have large effects on riders’ options.
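
The gap between average and worst-case travel time can be illustrated with a small sketch. It assumes "worst case" means a high percentile (here the 95th, computed with the nearest-rank method) of observed trip durations. That definition, the function names, and the toy data are assumptions for illustration, not the project's actual method.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def worst_case_gap(trip_minutes, pct=95):
    """Relative difference between the pct-th percentile trip and the mean trip."""
    typical = mean(trip_minutes)
    worst = nearest_rank_percentile(trip_minutes, pct)
    return (worst - typical) / typical


# Hypothetical toy data: 20 trips on one route, mostly 20 minutes with two delays.
trips = [20] * 18 + [30, 40]
gap = worst_case_gap(trips)  # about 0.40: the 95th-percentile trip is ~40% longer than the mean
```

A rider planning around the mean would be late on the delayed trips, which is why the percentile view is closer to how riders experience the system.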

Watch the talk >

View the slides >

Get the source code on GitHub >

2017

Student Trajectories and School Choice in the NYC Public School System

Students: Keri Mallari, David Futran, Francois Mertil, Ilana Radinsky, Anandini Chawla, Rivka Schuster, Ro Liriano, Thoa Ta

New York City serves over one million public school students each year, yet relatively little is understood about how students progress through the school system. In this talk we use individual-level student data over a ten-year time period to explore how early test performance correlates with later success, to describe and predict which students leave the public school system, and to examine effects of the recently implemented high school choice system.

Watch the talk >

View the slides >

Read the paper >

Get the source code on GitHub >

2016

Airbnb: Predicting Loyalty

Students: Louise Lai, Kaciny Calixte, Jacqueline Curran, and Erica Ram

The advent of the sharing economy has redefined the way firms do business. Airbnb has led this revolution. With a valuation of $25 billion, it has become the world’s third most valued startup and has more rooms than the world’s largest hotel chain. Historically, customer loyalty was based on experience with a particular firm, but now it is based on experiences with many individuals. We chose to use the Inside Airbnb dataset to further investigate the evolving idea of loyalty. Airbnb has both hosts and guests as customers. Host loyalty is defined as a host renting consistently, and guest loyalty as guests returning frequently. We used decision trees to look at both the loyalty of the hosts and the guests. No matter the industry, market experts stand by measures of recency and frequency to predict loyalty. However, our model is able to improve upon this idea with added features, such as review text and amenities. The end result is a model that successfully predicts the return rate of hosts and guests to Airbnb with a high level of accuracy.

Watch the talk >

View the slides >

Read the paper >

Get the source code on GitHub >

Fare Share: Flow and Efficiency in NYC’s Taxi System

Students: Jai Punjwani, Abraham Neuwirth, Marieme Toure, and Fatima Chebchoub

New York City is home to millions of people who rely on its robust transportation system. The taxi system plays a critical role in helping people navigate the city. With access to information about every single trip that occurred in a yellow taxi in 2013, we were able to reveal patterns in how people move throughout the city. We also analyzed driver efficiency, showing that there is a substantial skill involved in driving a taxi, with some drivers consistently earning up to 30% more than average. Finally, we used the highly granular nature of this data to identify the locations of redundant trips and showed that a simple carpooling strategy could reduce the amount of money spent on taxis and the number of taxi trips taken by upwards of 7%. See an interactive map of travel patterns across neighborhoods.

Watch the talk >

View the slides >

Read the paper >

Get the source code on GitHub >

2015

The Cost of Public School

Students: Thomas Patino, Anastassiya Neznanova, Nikki Hanson, and Glenda Ascencio

New York City is home to the largest public school system in the country, which contains some of the best and worst schools in the state. Given this diversity, which often occurs over small geographic regions, there is extremely high demand for homes zoned for the best public schools in the city. We investigate and quantify this demand by analyzing over 10,000 home sales in different school zones across the city and reveal the implicit cost of purchasing a home zoned for each elementary school in the city. See an interactive map of school zone prices.

Watch the talk >

View the slides >

Get the source code on GitHub >

The Ins and Outs of the New York City Subway System

Students: Eiman Ahmed, Shannon Evans, Riva Tropp, and Steven Vazquez

Every day, the population of New York regions shrinks and swells as people travel into and around the city. With six million daily trips, the subway system is one of the main conduits for these travelers, but relatively little is known about the flow of subway passengers throughout the day. Using MTA’s public datasets, our team mapped the paths commuters take and, consequently, the substantial changes to the population in the city’s many regions.

Watch the talk >

View the slides >

Get the source code on GitHub >

2014

Self-Balancing Bikes

Students: Briana Vecchione, Franky Rodriguez, Donald Hanson II, Jahaziel Guzman

Bike sharing is an internationally implemented system for reducing public transit congestion, minimizing carbon emissions, and encouraging a healthy lifestyle. Since New York City’s launch of the CitiBike program in May 2013, however, various issues have arisen due to overcrowding and general flow. In response to these issues, CitiBike employees redistribute bicycles by vehicle throughout the New York City area. During the past year, over 500,000 bikes have been redistributed in this fashion. This solution is financially taxing, environmentally and economically inefficient, and often suffers from timing issues. What if CitiBike instead used its clientele to redistribute bicycles? In this talk, we describe the data analysis that we conducted in hopes of creating an incentive and rerouting scheme for riders to self-balance the system. We anticipate that we can reduce the need for vehicle-based redistribution by offering financial incentives to take bikes from relatively full stations and return bikes to relatively empty stations (with rerouting advice provided via an app).

Watch the talk >

View the slides >

Read the paper >

An Empirical Analysis of Stop-and-Frisk in New York City

Students: Md.Afzal Hossain, Khanna Pugach, Derek Sanz, Siobhan Wilmot-Dunbar

Between 2006 and 2012, the New York City Police Department made roughly four million stops as part of the city’s controversial stop-and-frisk program. We empirically study two aspects of the program by analyzing a large public dataset released by the police department that records all documented stops in the city. First, by comparing to block-level census data, we estimate stop rates for various demographic subgroups of the population. We find that the average annual number of stops of young black men exceeds the number of such individuals in the general population. This disparity is even more pronounced when we account for geography, with the number of stops of young black men in certain neighborhoods several times greater than the number of such individuals in the local population. Second, we statistically analyze the reasons recorded in our data that officers state for making each stop (e.g., “furtive movements” or “sights and sounds of criminal activity”). By comparing which stated reasons best predict whether a suspect is ultimately arrested, we develop simple heuristics to aid officers in making better stop decisions.
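
A very simple version of the second analysis is to rank stated stop reasons by how often stops citing them end in arrest. The sketch below assumes one reason per stop and uses raw arrest rates rather than a statistical model, so it is only a rough stand-in for the analysis described above. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def arrest_rate_by_reason(stops):
    """Rank stop reasons by the fraction of stops citing them that ended in arrest.

    stops: iterable of (reason, arrested) pairs, where arrested is a bool.
    Returns a list of (reason, rate) pairs, highest rate first.
    """
    totals = defaultdict(int)
    arrests = defaultdict(int)
    for reason, arrested in stops:
        totals[reason] += 1
        arrests[reason] += int(arrested)
    rates = {reason: arrests[reason] / totals[reason] for reason in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records: 10 stops per reason, with made-up outcomes.
records = (
    [("furtive movements", True)] * 1
    + [("furtive movements", False)] * 9
    + [("sights and sounds of criminal activity", True)] * 3
    + [("sights and sounds of criminal activity", False)] * 7
)
ranked = arrest_rate_by_reason(records)
```

In real stop records an officer can cite several reasons for one stop, so a fuller treatment would need to handle overlapping reasons, for example with a regression over indicator variables.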

Watch the talk >

View the slides >

Read the paper >