Workshop on Data Science Innovation with NSF Big Data Hubs
October 29, 2018 - October 30, 2018

Location: Redmond, WA

  • Speaker: Franco Pestilli

    Neuroscience is at the forefront of science by reaching across disciplinary boundaries and promoting transdisciplinary research. This process can, in principle, facilitate discovery by convergent efforts from theoretical, experimental and cognitive neuroscience, as well as computer science and engineering. To ensure success, mechanisms to guarantee reproducibility of scientific results must be established. Open software development and data sharing are therefore paramount in the quest to achieve reproducibility. We present brainlife.io, a platform that addresses challenges of neuroscience reproducibility by providing integrative mechanisms for publishing data and algorithms while embedding them with computing resources to impact multiple scientific communities.

  • Speaker: Lucas Joppa

    The speed and scale at which climate systems are changing, and the enormity of the human impact of those changes, require a commensurate response in how society monitors, models, and manages climate systems. A key component of that response will emerge from the fundamentals of AI: transforming how we collect data, convert those data into actionable information, and communicate that information across the world. By training increasingly sophisticated algorithms with this unprecedented collection of data on dedicated computational infrastructure, we can combine human and computer intelligence in a way that will allow us to make increasingly informed and optimal choices about today and tomorrow.

  • Speaker: Pietro Michelucci

    Our greatest opportunity for problem-solving comes not from humans alone or from Artificial Intelligence (AI) alone, but from combining them in distributed networks. Leveraging the complementary abilities of humans and machines allows us to create unprecedented capabilities today. The EyesOnALZ citizen science project accelerates Alzheimer’s disease research by strategically combining machine learning and Crowd AI (human computation) methods. The specific human/machine partnerships that enabled this capability have co-evolved with algorithmic advancements and computing platforms like Azure.

  • Speaker: Gari Clifford

    Perhaps the most significant barrier to accurate machine learning on medical data is the lack of accurate labels on which to train. PhysioNet has set the gold standard in physiological databases. The last 18 years of PhysioNet/Computing in Cardiology (CinC) Challenges and related work have demonstrated that standard expert labels are far more error-prone than would be expected in a relatively well-established medical field. With such inconsistencies, it is impossible for a diagnostic system to realistically achieve performance measures above 80-90%, which in turn prevents its use without human oversight, an imperative for large-scale analysis. He will discuss solutions including a voting approach that efficiently combines multiple algorithms (and humans) of varying performance levels to improve both labels and classifier performance.
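
    The idea of voting across labelers of varying quality can be illustrated with a minimal sketch. This is not the speaker's method, which the abstract does not specify; it is a generic accuracy-weighted vote, assuming each labeler's accuracy has already been estimated. The labeler names, labels, and accuracy values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def majority_vote(votes):
    """Unweighted baseline: every labeler counts equally."""
    return Counter(votes.values()).most_common(1)[0][0]


def weighted_vote(votes, accuracies):
    """Accuracy-weighted vote (illustrative, not the speaker's algorithm).

    votes:      dict mapping labeler name -> proposed label
    accuracies: dict mapping labeler name -> assumed accuracy in (0.5, 1)
    """
    scores = defaultdict(float)
    for labeler, label in votes.items():
        acc = accuracies[labeler]
        # Log-odds weight: a labeler near chance (0.5) contributes almost nothing.
        scores[label] += math.log(acc / (1.0 - acc))
    return max(scores, key=scores.get)


# Hypothetical example: three weak labelers disagree with one strong labeler.
votes = {"weak_1": "normal", "weak_2": "normal", "weak_3": "normal", "strong": "AF"}
accuracies = {"weak_1": 0.55, "weak_2": 0.55, "weak_3": 0.55, "strong": 0.90}

baseline_label = majority_vote(votes)
fused_label = weighted_vote(votes, accuracies)
```

    In this made-up case the unweighted majority follows the three weak labelers, while the weighted vote follows the single strong one, which shows why performance levels matter when fusing labels.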

  • Speaker: Renata Rawlings-Goss

    The ability to utilize and understand data is an increasingly critical skill for the evolving 21st-century workforce. Sectors posting data-driven jobs, grants, and opportunities are realizing a critical shortfall in data-literate talent for the positions of today as well as tomorrow. To combat this shortfall, underrepresented groups and schools must be engaged in data science training. This fact has sparked a collective effort to design programs that engage a broader community around STEM and data science. As part of this effort, the South Big Data Hub has created a program, called DataUp, to accelerate data science education across the region.

  • Speaker: Eric Horvitz

    Eric will share directions and results enabled by the confluence of large-scale data resources, jumps in computational power, and advances in machine learning. He will focus on efforts that leverage learning and inference to assist people with decisions, touching on work in transportation, medicine, and human-machine collaboration.

  • Speaker: Sarah Stone

    The UW Data Science for Social Good (DSSG) program partners eScience Institute Data Scientists and Student Fellows from across the country with Project Leads from academia, government, and the private sector to find data-driven solutions to societal challenges. Previous projects span transportation, public health, urban planning, and disaster response. Project-based discussions around ethics, human-centered design and stakeholder collaboration are keystones of our program. Differences in prior experience and training among student fellows can pose a challenge, but become a strength in project work. Our experience supports the notion that DSSG programs can both impact social good and provide data science training for students from diverse disciplinary backgrounds.

  • Speaker: Karen Matthys

    Big Data market revenues are projected to reach $103 billion by 2027. This expanding field presents a great opportunity for women and minorities to take on technical and leadership roles in all sectors. One way that we are addressing this opportunity is through the Women in Data Science Conference (WiDS), which was launched at Stanford just 3 years ago and now reaches over 100,000 people worldwide. This talk will share outcomes from the WiDS global collaboration, lessons learned, and future plans. Karen will also cover other interdisciplinary collaborations at Stanford.

  • Speaker: Geralyn Miller

    This talk will explore why genomics workloads are perfect for the cloud and how Microsoft Research is using the cloud to disrupt genomics discovery.

  • Speaker: Ranveer Chandra

    Data-driven techniques help boost agricultural productivity by increasing yields, reducing losses and cutting down input costs. However, these techniques have seen sparse adoption owing to high costs of manual data collection and limited connectivity solutions. Our solution, called FarmBeats, is an end-to-end IoT & AI platform for agriculture that enables seamless data collection from various sensors, cameras and drones. Our system design explicitly accounts for weather-related power and Internet outages, which has enabled six-month-long deployments on two US farms. In this talk, Ranveer will describe the FarmBeats system and outline the AI challenges we are currently addressing for outdoor as well as indoor agriculture.

  • Speaker: Andrew Hoffman

    The recent partnership forged between Microsoft Research and the NSF Big Data Regional Innovation Hubs and Spokes furthers a relationship that began as early as 2010. The field of data science has shifted considerably in the intervening years, however, as have the organizational structures tasked with shepherding data science research, innovation, and education. This presentation provides a biotechnical analysis of this changing landscape. Touching on both high-level science funding policy initiatives and the more practical work of forging and carrying out collaborations between government, industry, non-profits, and academia to enable cloud-based computational research, it discusses the multiple ways in which participating entities come to valorize these partnerships.

  • Speaker: Erin Robinson

    A critical part of effective earth science data and information system interoperability involves collaboration across geographically and temporally distributed communities. The Earth Science Information Partners (ESIP) is a broad-based, distributed community of science, data and information technology practitioners from across science domains, economic sectors and the data lifecycle, primarily based in the United States. Over the last twenty years, ESIP’s open, participatory structure has provided a melting pot for coordinating around common areas of interest like data citation, experimenting with innovative ideas, and capturing and finding best practices and lessons learned from across the network. This talk will provide an overview of relevant activities ESIP is involved with and identify strategies for advancing data science research and innovation through open communities of practice like ESIP and the Big Data Innovation Hubs.

  • Speaker: Braden Tierney

    We do not have a grasp on the scope of the microbiome’s gene content, a question crucial for understanding the role of microbes in host health. To quantify this genetic universe, we undertook a meta-analysis of 3,500 human shotgun-sequencing samples from two body sites, the mouth and gut. We found that prior work has drastically underestimated the genetic richness of human microbiota by tens of millions of genes. These results serve as an explanation for the large heterogeneity of microbiome-derived human phenotypes, a path forward for gene-centric approaches in microbiome studies, and a quantification of the need for larger-scale metagenomic analyses than what currently exists.

  • Speaker: Meghan Houghton

    An update on the National Science Foundation’s investments in data science, including through the Big Data Regional Innovation Hubs program and Harnessing the Data Revolution—one of NSF’s 10 Big Ideas for Future Investment.

  • Speaker: Kristin Tolle

    As Microsoft expands its AI portfolio of products, we have a keen interest in leveraging these technologies to enable people to do more. This is particularly critical in the nonprofit space, where organizations are facing the world’s most challenging problems, from ensuring food and water security to enabling safer and more reliable disaster response. The Tech for Social Good team, and in particular the AI for Humanitarian Action program, is working to build scalable, reusable solutions with nonprofits so that those with similar needs and missions can do more to help others build better lives. AI for Humanitarian Action is the third and most recently launched pillar of this applied AI mission to provide solutions to organizations aligned with our mission. This talk will briefly cover our mission and engagement model and discuss some of the solutions we are building for this community.

  • Speaker: Shashi Shekhar

    Spatial big data, such as trajectories and satellite imagery, have transformed our society via popular applications for navigation, ride-sharing, precision agriculture, public health, and public safety. This is only a start, and bigger opportunities lie ahead. However, classical one-size-fits-all data science methods are grossly inadequate for analyzing spatial data due to severe problems such as gerrymandering and the very high cost of spurious patterns. To overcome the limitations of traditional data science, this presentation will summarize recent developments (spatial Hadoop, spatial statistics, spatial data mining, nano-satellites, high-definition roadmaps) and call for community action to improve data science curriculum and computational platforms.