Investigating Student Mistakes in Introductory Data Science Programming

Technical Symposium on Computer Science Education

Published by ACM

Data Science (DS) has emerged as a new academic discipline in which students are introduced to data-centric thinking and to generating data-driven insights through programming. Unlike traditional introductory programming education, which focuses on program syntax and core Computer Science (CS) topics (e.g., algorithms and data structures), introductory DS education emphasizes skills such as studying the data at hand to gain insights and making effective use of programming libraries (e.g., re, NumPy, pandas, scikit-learn). To better understand learners' needs and pain points when they are introduced to DS programming, we investigated a large online course on data manipulation designed for graduate students who do not have an undergraduate degree in CS or Statistics. We qualitatively analyzed incorrect student code submissions for computational notebook-based programming assignments in Python. We identified common mistakes and grouped them into the following themes: (1) programming language and environment misconceptions, (2) logical mistakes due to misunderstanding the data or problem statement or to incorrectly dealing with missing values, (3) semantic mistakes from incorrect usage of DS libraries, and (4) suboptimal coding. Our work provides instructors with valuable insights for understanding student needs in introductory DS courses and improving course pedagogy, along with recommendations for developing assessment and feedback tools to better support students in large courses.
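As a purely illustrative sketch of theme (2), the snippet below shows one way a missing-value mistake could look in Python. It is not taken from the study's submissions: the `readings` data and both function names are hypothetical, and only the standard library is used so the example stays self-contained.

```python
import math

# Hypothetical readings; None and NaN both stand in for missing values.
readings = [12.5, float("nan"), 9.0, None, 14.5]


def mean_naive(values):
    # Mistake: NaN never compares equal to anything (including itself),
    # so `v != float("nan")` is always True and NaN is not filtered out.
    kept = [v for v in values if v is not None and v != float("nan")]
    return sum(kept) / len(kept)


def mean_skipping_missing(values):
    # Test for missingness explicitly with math.isnan instead of `==`/`!=`.
    kept = [v for v in values if v is not None and not math.isnan(v)]
    return sum(kept) / len(kept)


naive = mean_naive(readings)                # NaN propagates through the sum
careful = mean_skipping_missing(readings)   # mean of the three valid readings
print(naive, careful)
```

The naive version runs without raising an error, which is part of why such a mistake could be hard for a learner to notice: the result is silently NaN rather than a failure.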