October 8, 2012 - October 9, 2012

eScience Workshop 2012

Location: Chicago, IL, U.S.

Monday, October 8, 2012

  • Speaker: Tony Hey, Microsoft Research | slides

  • Chair: Kristin Tolle, Microsoft Research

    Speaker: Drew Purves, Microsoft Research | video | slides

    To manage the planet on which we all depend, we need to predict the future outcomes of various options. How would biofuel subsidies affect crop prices, and in turn deforestation? How would CO2 emissions affect climate change, and in turn fire? At present, we cannot make such predictions with any confidence. But, as I’ll show in this talk, a computational approach to environmental science can change that. I’ll explain how we built the first fully data-constrained model of the terrestrial carbon cycle, using Big Data, cloud computing, and machine learning. And I’ll demo similar models for global food production, Amazon deforestation, and bird biodiversity. The prototype tools on which these models have been built—for example, FetchClimate, Filzbach, WorldWide Telescope—are freely available, and will hopefully allow other scientists to adopt a rigorous approach to modeling the complexities of the biosphere.

  • Chair: Yan Xu, Microsoft Research

    Panel: Open Data for Open Science—Data Interoperability | video

    Speakers:

    • Robert Gurney, University of Reading
    • Philip Murphy, University of Redlands | slides
    • Karen Stocks, University of California, San Diego | slides
    • Yan Xu, Microsoft Research | slides
    • Ilya Zaslavsky, University of California, San Diego | slides

    The goal of cross-domain interoperability is to enable reuse of data and models outside the original context in which these data and models are collected and used and to facilitate analysis and modeling of physical processes that are not confined to disciplinary or jurisdictional boundaries. A new research initiative of the U.S. National Science Foundation, called EarthCube, is developing a roadmap to address challenges of interoperability in the Earth sciences and create a blueprint for community-guided cyberinfrastructure accessible to a broad range of geoscience researchers and students.

    The panel will discuss this and related initiatives and projects, focusing on challenges of data discovery, interpretation, access, and integration across domain information systems, assessment of their readiness for cross-domain integration, and technologies enabling interoperability in the geosciences.

  • Chair: Kristin Tolle, Microsoft Research

    Panel: Enabling Multi-Scale Science | video

    Speakers:

    • Roberto Cesar, University of Sao Paulo (USP) | slides
    • James Hunt, University of California, Berkeley | slides
    • Claudia Bauzer Medeiros, University of Campinas (UNICAMP) | slides

    eScience research increasingly involves the need to facilitate multi-scale problem solving that spans wide ranges of space and time scales. It requires collaboration among researchers and practitioners from multiple disciplines, each with their own orientations towards problem identification, solution formulation, and implementation. The panel aims to discuss some of the challenges of working in multi-scale scenarios.

    Panelists will present these challenges from two perspectives: applications and computing approaches. The first perspective will focus on issues such as the scientific profiles involved, the scales considered, the data collected and produced, and modeling and visualization needs. The second viewpoint will consider, among others, characteristics of data and storage structures to accommodate the wide variety of data scales and formats, language/workflow constructs that may facilitate the specification, execution, and interaction of models, and interface/interaction primitives.


    The Internet of Databases—Generalizing the Archaeo Informatics Approach | video | slides

    Speaker: Chris van der Meijden, Ludwig Maximilians University of Munich, Germany

    One thing we have learned from our Archaeo-Data-Network is that there is a need to split the meta-information of databases into two levels. The first level contains a centralized unique ID and a small amount of standard information. The second level of meta-information is defined by the archaeo scientist. This can be implemented for any kind of archaeo database, so the network’s extensibility is virtually unlimited. The advantage of this dual-meta approach is its flexible connectivity, which makes comprehensive data transparently available for general searching and mining. With this approach, huge, rigid archives can be connected to small, flexible databases for scientific analysis in any scientific domain. Combined with simple authorization management for unpublished data, we see in our system the potential to serve as a general blueprint for an eScience infrastructure that we call the Internet of databases.
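
    The dual-level split described above can be sketched as a simple record structure (a minimal illustration only; all field names here are hypothetical, not the project's actual schema):

```python
import uuid

def make_record(domain_fields):
    """Build a dual-level metadata record: a fixed first level
    (centralized unique ID plus a few standard fields) and a free
    second level defined entirely by the domain scientist."""
    return {
        # Level 1: centralized unique ID and minimal standard information
        "id": str(uuid.uuid4()),
        "standard": {
            "title": domain_fields.pop("title", ""),
            "license": "unpublished",
        },
        # Level 2: arbitrary, scientist-defined metadata
        "domain": domain_fields,
    }

record = make_record({"title": "Roman villa survey",
                      "period": "1st c. AD",
                      "site_code": "XY-17"})
```

    In such a scheme, only the first level would need to be indexed by the central network; everything under the second level stays under the domain scientist's control.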


    Combining Semantic Tagging and Support Vector Machines to Streamline the Analysis of Animal Accelerometry Data | video | slides

    Speaker: Nigel Ward, The University of Queensland

    Increasingly, animal biologists are taking advantage of low cost micro-sensor technology by deploying accelerometers to monitor the behaviour and movement of a broad range of species. The result is an avalanche of complex tri-axial accelerometer data streams that capture observations and measurements of a wide range of animal body motion and posture parameters. We present a system that supports storing, visualizing, annotating and automatic recognition of activities in accelerometer data streams by integrating semantic annotation and visualization services with Support Vector Machine techniques.
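
    The machine-learning half of such a pipeline can be sketched roughly as follows, assuming scikit-learn and synthetic windows in place of real accelerometer streams (the features and labels here are illustrative; the actual system's will differ):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def window_features(w):
    # Per-axis mean and standard deviation of a tri-axial window (n, 3)
    return np.concatenate([w.mean(axis=0), w.std(axis=0)])

# Synthetic windows: "resting" (low variance) vs "moving" (high variance)
windows = [rng.normal(0, 0.1, (50, 3)) for _ in range(40)] + \
          [rng.normal(0, 1.0, (50, 3)) for _ in range(40)]
labels = [0] * 40 + [1] * 40

X = np.array([window_features(w) for w in windows])
clf = SVC(kernel="rbf").fit(X, labels)

# Classify a new low-variance window
pred = clf.predict([window_features(rng.normal(0, 0.1, (50, 3)))])
```

    In practice the semantic annotations supply the training labels, and the feature set extends well beyond per-axis means and variances.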

  • Chair: Yan Xu, Microsoft Research | slides

    Panel: Handling Big Data for Environmental Informatics / Real-Time Environmental Observation, Modeling, and Decision Support | video

    Speakers:

    • Jeff Dozier, University of California, Santa Barbara | slides
    • David Maidment, University of Texas, Austin | slides
    • Barbara Minsker, University of Illinois, Urbana-Champaign | slides
    • Chaowei Yang, George Mason University | slides

    Earth observations and other environmental data collection methods help us accumulate terabytes to petabytes of datasets. This poses a grand challenge for informatics in environmental studies. We propose this session to capture the latest developments in Big Data collection, processing, and visualization in several aspects.

    With the increasing near-real-time availability of embedded and mobile sensors, radar, satellite, and social media data, the opportunities to improve understanding, modeling, and management of environmental systems, as well as the built and human systems that interact with them, are immense.

  • Chair: Dennis Gannon, Microsoft Research

    Active Publications | video

    Speakers:

    • Ian Foster, University of Chicago and Argonne National Laboratory | slides
    • Tanu Malik, University of Chicago and Argonne National Laboratory | slides

    The e-Science domain brings together scientists, experts, and engineers to build comprehensive, large-scale data and computational cyberinfrastructures. The objective is to advance knowledge discovery in the sciences and establish effective channels of communication between the various disciplines. Software, data, workflows, technical reports, and publications are often the modes of this communication. However, currently all these modes of communication are disconnected from each other.

    E-publishing is changing the nature of scientific communication through digital publication repositories and libraries. But the larger and more pertinent issue is connecting these still-static digital publication repositories to large amounts of computation, data, derived data, and extracted information.

  • Chair: Kenji Takeda, Microsoft Research | slides

    Panel: Cloud Computing – What Do Researchers Want? | video

    Speakers:

    • Fabrizio Gagliardi, Microsoft Research | slides
    • Dennis Gannon, Microsoft Research | slides
    • Marty Humphrey, University of Virginia | slides
    • Paul Watson, Newcastle University | slides

    Cloud computing for science is seeing take-up in many disciplines, but many researchers are skeptical. In this panel session we will discuss:

    • How researchers are using the cloud today
    • What they want/need for the future
    • Why they might not want to use the cloud
  • Chair: Harold Javid, Microsoft Research

    Machine-Assisted Thought | video | slides

    Speaker: Michael J. Kurtz, Harvard-Smithsonian Center for Astrophysics

    I suggest that there are two distinct branches of eScience, both fundamentally enabled by the explosion of capabilities inherent in the information age. The first concerns the use of numbers, measurements from arrays of sensors, outputs from simulations, and so forth. The techniques of eScience increase our ability to perceive massive amounts of data by factors of billions or trillions. I call this Machine Assisted Perception.

    The second branch of eScience concerns the use of words, the verbal abstractions used by humans to communicate ideas. The new technologies of digital libraries and search engines have already substantially changed the scholarly thought process, and growth in the capabilities of these technologies continues to be rapid. I call this machine/human collaboration Machine Assisted Thought.


    DemoFest | video | slides

    Chair: Jim Pinkelman, Microsoft Research

  • Presenter: Rob Fatland, Microsoft Research

    Layerscape is a set of (combined cloud/desktop) data visualization and collaboration tools provided at no cost by Microsoft Research. We describe how these tools (visualization engine, developer toolkit, RESTful API, Excel add-in, story authoring environment, collaboration/sharing website) can provide researchers and developers a way of addressing data deluge problems commonly faced in geoscience research. As a particular case study, we will discuss unfolding data streams from many sensors operated from autonomous underwater vehicles during a September 2012 experiment conducted by the Monterey Bay Aquarium Research Institute (MBARI) off the California coast. Additional visualizations will also be available for perusal and discussion, and may be freely searched and viewed at the support website.

  • Presenter: Ian Foster, University of Chicago and Argonne National Laboratory; Steve Tuecke, University of Chicago; Vas Vasiliadis, University of Chicago

    In millions of labs worldwide, researchers struggle with massive data, advanced software, complex experimental protocols, and burdensome reporting. The emergence of cloud computing offers the opportunity to accelerate discovery and innovation while reducing costs by outsourcing time-consuming information technology tasks from individual labs and institutions to third-party providers. Over the past two years, we have developed a cloud-hosted, high-performance data movement service that is currently used by thousands of researchers at campuses and institutions worldwide. We are expanding the capabilities we offer en route to our goal of delivering a comprehensive research data management solution comprising storage, sharing, cataloging, archiving, and other critical functions as a service. We expect these services will be particularly valuable to those investigators in small and medium-sized laboratories that face significant challenges in developing, deploying, and operating IT infrastructure to support their work.

  • Presenter: Eamonn Maguire, University of Oxford

    Minimum reporting guidelines, terminologies, and formats (referred to generally as community standards) are increasingly used in the structuring and curation of datasets, enabling varying degrees of data annotation and supporting reproducible research. But how can we enable researchers to make use of existing community standards, maximize curation and sharing, and subsequently reuse richly annotated experimental information? A successful example is provided by the Investigation/Study/Assay (ISA) open-source metadata tracking framework, supported by the growing ISA Commons community.

  • Presenter: Tanu Malik, University of Chicago and Argonne National Laboratory

    The exponential growth in the amount of scientific data means that revolutionary measures are needed for data management, analysis, and accessibility. Online scientific databases—such as the SkyServer in astronomy, the Protein Data Bank in biology, and PubChem in chemistry—are important repositories for publishing and accessing large scientific datasets. These databases have also become sources for new scientific research; researchers routinely interact with these repositories to search, download, and analyze relevant datasets. However, these interactions remain largely disconnected from the final outcomes of research, such as publications and journal articles. We will demonstrate components of the Science Object Linking and Embedding (SOLE) system, which aims to create interactive publications and make it easy to capture interactions with the online databases and associate them with publications.

  • Presenter: Carly Strasser, California Digital Library

    DataUp is a project sponsored by Microsoft Research and the Gordon and Betty Moore Foundation, conducted at the University of California Curation Center of the California Digital Library. The project’s goal was to develop tools that help researchers document, organize, preserve, and share their scientific data. We focused on assisting Earth, environmental, and ecological scientists, since these groups historically have not practiced good data stewardship. In this session, we will demonstrate the DataUp add-in for Excel and the DataUp web application. Both the add-in and the web application perform four main tasks:

    • Perform a best practices check to ensure good data organization
    • Help guide the user through creation of metadata for their Excel file
    • Help the user obtain a unique identifier for their dataset
    • Connect the user to a DataONE repository, where their data can be deposited and shared with others
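
    The first of these tasks, the best-practices check, can be sketched as follows (a minimal illustration with hypothetical rules and function names; the actual DataUp rule set is more extensive):

```python
def check_best_practices(header, rows):
    """Flag common spreadsheet problems: blank or duplicate column
    names, ragged rows, and empty cells."""
    problems = []
    if any(not h.strip() for h in header):
        problems.append("blank column name")
    if len(set(header)) != len(header):
        problems.append("duplicate column names")
    if any(len(r) != len(header) for r in rows):
        problems.append("ragged rows")
    if any(c == "" for r in rows for c in r):
        problems.append("empty cells")
    return problems

issues = check_best_practices(["site", "date", "date"],
                              [["A1", "2012-10-08", ""]])
```
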
  • Presenter: Michael Witt, Purdue University

    Databib is a free, global, online catalog of research data repositories. Librarians and other information professionals have identified and cataloged more than 300 data repositories that can be easily browsed and searched by users or integrated with other platforms or cyberinfrastructure. Databib can help researchers find appropriate repositories to deposit their data, and it gives consumers of data a tool to discover repositories of datasets that meet their research or learning needs. Users can submit new repositories to Databib, which are reviewed and curated by an international board of editors. All information from Databib has been contributed to the public domain using the Creative Commons Zero protocol. Supported machine interfaces and formats include RSS, OpenSearch, RDF/XML, Linked Data (RDFa), and social networks such as Twitter, Facebook, and Google+.

  • Presenter: Dong Xie, Oxford University

    At the 2010 eScience Workshop, I presented my work “SYSQ – Questionnaire System for Large Scale Depression Study.” Now, two years later, we are finishing the phenotype collection, and these data have already enabled us to publish more than 12 papers in various journals from an epidemiological perspective; the next round of papers, on the complete dataset, is in the making. Meanwhile, every two weeks we receive external hard drives from a sequencing centre (2 TB each), full of raw genome sequences from our patients and controls. These data need to be processed and associated with the phenotype so that, after several years of hard work, we can finally find the gene for depression. This task is by no means trivial. The processing pipeline needs to be built from scratch. It puts pressure on the IT and bioinformatics teams; with limited resources and no previously published work to draw on, one really needs to think outside the box.

  • Presenter: Yan Xu, Microsoft Research

    We will demonstrate how the Open Data Protocol, OData, can be used to release scientific data from silos. The demo will showcase examples of using OData as the glue to seamlessly solve data interoperability problems among heterogeneous data sources.
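
    As an illustration, an OData consumer can address any conforming service through uniform system query options such as `$filter`, `$select`, and `$top`. A minimal sketch of building such a request URL (the service URL and entity names below are hypothetical):

```python
from urllib.parse import quote

def odata_query(service_url, entity_set, filter=None, select=None, top=None):
    """Build an OData request URL from standard system query options."""
    opts = []
    if filter:
        opts.append("$filter=" + quote(filter))   # e.g. "Year eq 2012"
    if select:
        opts.append("$select=" + ",".join(select))
    if top:
        opts.append("$top=" + str(top))
    url = service_url.rstrip("/") + "/" + entity_set
    return url + ("?" + "&".join(opts) if opts else "")

url = odata_query("https://example.org/odata", "Observations",
                  filter="Year eq 2012", select=["Site", "Value"], top=10)
```

    Because every OData service exposes the same query grammar, the same client code can consume heterogeneous data sources without per-source adapters.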

Tuesday, October 9, 2012

    • Biology: A Move to Dry Labs | video

      Chair: Dan Fay, Microsoft Research | slides

      Speaker: David Heckerman, Microsoft Research | slides

      Since its beginning, the wet lab has been the key driver in biological discovery. Recently, however, more and more science is getting done in dry labs, those where only computational analysis is done. The presentation will include examples, ranging from genomics to vaccine design.

    • Chair: Gail Steinhart, Cornell University

      Panel: Educating Data Scientists for Scientific Data | video

      Moderator: Gail Steinhart, Cornell University

      Teaching Scientific Data Management in Data Science Education and Workforce Development Programs for Science Communities | video | slides

      Speaker: Robert R. Downs, Columbia University

      Recent popularity of data science has led to increased recognition of the need for education and workforce development in data science. However, definitions of the term, data science, vary and often focus on techniques for data analytics and visualization, omitting scientific data management and related topics associated with data policy, stewardship, and preservation. Scientific data management encompasses a variety of concepts and methods to foster continuing access and long-term stewardship of data for current and future users. Considering the needs for scientific data management knowledge and capabilities to facilitate improved and persistent accessibility and use of scientific data throughout the data lifecycle, instruction on topics in scientific data management is recommended for data science education and workforce development programs for science communities.


      Educating Scientists About the Data Life Cycle | slides

      Speaker: William Michener, University of New Mexico

      The research life cycle is well known and consists of an initial idea or question that, if sound, leads to submission and funding of a proposal, implementation of a study, and, ideally, to one or many publications that advance the state of knowledge. What is less well understood is how the research life cycle is related to the data life cycle. In this presentation we discuss approaches for educating scientists in eight phases of the data life cycle (for example, planning, data acquisition and organization, quality assurance/quality control, data description, data preservation, data exploration and discovery, data integration, and analysis and visualization). Specifically, we present the design and approaches used for developing learning modules, instructional materials and resources, and an innovative three-week experiential course that enables participants to more efficiently and effectively manage their research data and compete for research funding.


      Priorities for Data Curation Education: Data Center Partnerships and Long-Tail Science | video | slides

      Speaker: Carole Palmer, University of Illinois at Urbana-Champaign

      For science to fully exploit digital data in new and innovative ways, research data will need to be collected, curated, and made accessible and usable across domains. The need for workforce development in data curation systems and services has been recognized for many years, and education programs are beginning to mature. But to continue to build strong programs in this emerging field, current data curation practice and research needs to underpin goals for professional education. Having established a specialization in data curation in 2006, we have assessed our program’s progress to date and identified areas in need of further development to respond to trends in e-science. Analysis of student placements shows interesting trends in the institutions hiring data curation specialists and the nature of the positions, and evaluation of internships provided in national data centers has suggested important areas for further investment. In addition, our recent research on disciplinary differences in data sharing and the value of long-tail data in the sciences has direct implications for further development of data curation curriculum.


      Educating a New Breed of Data Scientists for Scientific Data Management | video | slides

      Speaker: Jian Qin, Syracuse University

      Data scientists play active roles in the design and implementation work of four related areas: data architecture, data acquisition, data analysis, and data archiving. While any data- and computing-related academic unit could offer a data science program or curriculum, each has its own flavor: statistics would weigh heavily toward data analytics, and computer science toward computational algorithms. The information schools are taking a more holistic approach to educating data scientists. This presentation reports on data science curriculum development and implementation at the Syracuse iSchool, which has been shaped by the quickly changing, data-intensive environment not only for science but also for business and research at large. Research projects that we conducted on scientific data management, with participation from the e-science student fellows, demonstrate the need for and significance of educating a new breed of data scientists who have the knowledge and skills to take on work in the four areas mentioned above.

    • Chair: Chris Mentzel, Gordon and Betty Moore Foundation


      The Utility of a Human/Computer Learning Network For Improving Biodiversity Conservation and Research in eBird | video | slides

      Speaker: Carl Lagoze, University of Michigan

      We describe our work to improve the quality and utility of citizen science contributions to eBird, arguably the largest biodiversity data collection project in existence. Citizen science (the use of “human sensors”) is especially important in a number of observation-based fields, such as astronomy, ecology, and ornithology, where the scale and geographic distribution of phenomena to be observed far exceeds the capabilities of the established research community. Our work is based on the notion of a Human/Computer Learning Network, in which the benefits of active learning (in both the machine learning sense and human learning sense) are cyclically fed back among human and computational participants.


      Tools and Techniques for Outreach and Popular Engagement in eScience | video | slides

      Speaker: Rafael Santos, Instituto Nacional de Pesquisas Espaciais

      Public participation in scientific research takes many forms: participation of volunteers in citizen science projects, monitoring of natural resources and phenomena, volunteering of computational resources for distributed data analysis tasks, and so forth.

      In this presentation, we comment on some of the computational tools, techniques, and case studies of applications that enable active public participation in scientific research. Of particular interest are applications that showcase the benefits of letting the public use the professional resources (in other words, the same data and computational resources that the scientists have access to) and return something back to the research behind it, such as applications that go beyond simple publication of scientific data or applications that use novel methods for user engagement. Examples of applications for scientific outreach that use specialized computational tools or techniques, and/or educational approaches, are also discussed.


      Big Data Processing on the Cheap | video | slides

      Speaker: Joe Hummel, University of California, Irvine

      Getting started with big data? Generating more and more data without the hardware resources to process it? This session will help newcomers to “big data” get started processing and visualizing their data, without the need for expensive computing resources. While these techniques may not produce lightning-fast results, you can at least get started with your analysis.

    • Chair: Kenji Takeda, Microsoft Research

      What Is a Data Scientist? | video

      Speakers:

      • Liz Lyon, UKOLN-DCC, University of Bath UK | slides
      • Kenji Takeda, Microsoft Research

      The term, data scientist, is becoming prevalent in science, engineering, business, and industry. We will explore how the term is used in different contexts, segments, and sectors; we will examine the different variants, flavours, and interpretations and try to answer the following questions:

      • What does a data scientist really do?
      • What skills does a data scientist need? How do they acquire them?
      • What tools, technologies and platforms are used by data scientists?
      • How can we build data scientist capacity and capability for the future?

      Informatics, Information Science, Computer Science, and Data Science Curricula | video | slides

      Speaker: Geoffrey Fox, Indiana University

      We describe a possible data science curriculum based on discussions at Indiana University and experience with our Informatics, Computer Science, and Library and Information Science programs. This leads to an interesting breadth of courses and student interests, which could address the many job opportunities. We suggest a collaboration to build a MOOC (online) offering with one initial target: minority-serving institutions.


      Data Science Curricula at the University of Washington eScience Institute | video | slides

      Speaker: Bill Howe, University of Washington

      The University of Washington eScience Institute is engaged in a number of educational efforts in data science, including certificate programs for professionals, workshops for students in domain science, a new data-oriented introductory programming course, and a data science MOOC to be offered through Coursera in the spring. We consider the tools, techniques, research topics, and skills to be well-aligned with the data-driven discovery emphasis of eScience itself—the only difference is the applications.

      We see several benefits in aligning these two areas. For example, students in science majors who are not pursuing research careers become more marketable. In the other direction, working professionals see opportunities to apply their skills to solve science problems—we have recruited volunteers from industry in this way. In this talk, I’ll discuss these activities, review our curriculum, and describe our next steps.


      Publishing and eScience | video

      Co-Chairs: Mark Abbott, Oregon State University; Jeff Dozier, University of California, Santa Barbara

      Scientific Publishing in a Connected, Mobile World | slides

      Speaker: Mark Abbott, Oregon State University

      New tools for content development and new distribution channels create opportunities for the scientific community, opening new venues for collaboration, review, and self-publication. However, publishing is at the heart of the culture of science, and several centuries of experience with publishing in journals will not simply vanish. Issues of peer review, reproducibility, integrity, and scientific context will need to be addressed before these new tools take hold.

      Open access is but one part of this conversation.


      How to Collaborate with the Crowd: a Method for “Publishing” Ongoing Work | slides

      Speaker: Jeff Dozier, University of California, Santa Barbara, Visiting Researcher Microsoft Research

      The typical model for interdisciplinary research starts with a small-group partnership, typically with colleagues who have known each other for a while. They learn to articulate problems across disciplinary boundaries and discover shared interests. They successfully seek funding, and work together for several years. This model works, but can be cumbersome. An alternative model is to express a sequence of processes and data that integrate to create a suite of data products, and to identify insertion points where expertise from another perspective might be able to contribute to a better solution.


      When Provenance Gets Real: Implications of Ubiquitous Provenance for Scientific Collaboration and Publishing | slides

      Speaker: James Frew, University of California, Santa Barbara

      We expect (or hope?) that the impending standardization of data models, ontologies, and services for information provenance will make scientific collaboration easier and scientific publishing more transparent. We propose a panel of active producers and users of provenance who will address scenarios such as:

      • “I’m a scientist, and this is what I would really like to tell someone with provenance.”
      • “I’m a scientist, and this is what I wish provenance would tell me when I use your data, join your project, or …”
      • “I build systems that capture and/or manage provenance, and this is what I’ve seen scientists actually do when they create and/or use provenance.”

      Data Journal Challenge for the Fourth Paradigm: Trust through Data on Environmental Studies and Projects | slides

      Speaker: Shuichi Iwata, The Graduate School of Project Design

      We review the landscape of recent big data issues bridging environmental studies and social expectations, with the aim of designing an e-Journal with data files and models. Data components are the keys that give semantics to original scientific papers, and also serve as keys for computational models. Structured data with explicit descriptions of their metadata can be managed, and their traceability can be realized systematically, step by step. However, almost all available data are unstructured, fragmented, and contain ambiguities and uncertainties. We discuss the balance between data quality and freshness, cost, and coverage, so as to draw a road map for a data journal, referring to two preliminary case studies: one on materials data and one on data arising from nuclear reactor accidents and problems.

    • Chair: Kristin Tolle, Microsoft Research

      Panel: Scientific Data: the Current Landscape, Challenges, and Solutions | video | slides

      Moderator: Carly Strasser, California Digital Library

      Speakers:

      • Jeff Dozier, University of California, Santa Barbara
      • Chris Mentzel, Gordon and Betty Moore Foundation
      • William Michener, University of New Mexico
      • Dave Vieglais, The University of Kansas
      • Stephanie Wright, University of Washington

      Funders, researchers, and public stakeholders increasingly see the need to better communicate and curate ever expanding bodies of research data. This panel will bring together many of the stakeholders in the scientific data community, including researchers, librarians, and data repositories.

      Before the panel commences, we will provide a brief introduction to scientific data to facilitate discussion. We will describe the current landscape of scientific data and its management, including publication, citation, archiving, and sharing of data. We will also describe existing tools for data management. The panel discussion will focus on identifying gaps and unmet needs in order to help chart a path for future policy, service, and infrastructure development.


      Novel Approaches to Data Visualization | video

      Chair: George Djorgovski, California Institute of Technology


      Data Visualization in Virtual Spaces and High Dimensions | slides

      Speaker: George Djorgovski, California Institute of Technology

      Visualization is a bridge between the quantitative content of data and human intuition and understanding. Effective visualization is a critical bottleneck as the complexity and dimensionality of data increase. I will describe some experiments in collaborative, multi-dimensional data visualization in immersive virtual reality.


      CT and Imaging Tools for Windows HPC Clusters and Azure Cloud | slides

      Speaker: Darren Thompson, CSIRO (Advanced Scientific Computing)

      Computed Tomography (CT) is a non-destructive imaging technique widely used across many scientific, industrial, and medical fields. It is both computationally and data intensive. Our group within CSIRO has been actively developing X-ray tomography and image processing software and systems for GPU-enabled Windows HPC clusters.

      A key goal of our systems is to provide our “end users”—researchers—with easy access to the tools, computational resources, and data via familiar interfaces and client applications without the need for specialized HPC expertise. We have recently explored the adaptation of our CT-reconstruction code to the Windows Azure cloud platform, for which we have constructed a working “proof-of-concept” system. However, at this stage, several challenges remain to be met in order to make it a truly viable alternative to our HPC cluster solution.


      Work in Progress Toward Enhancing Multidimensional Visualization with Analytical Workflows | slides

      Speaker: Dawn Wright, Environmental Systems Research Institute

      Big Data, particularly from terrestrial sensor networks and ocean observatories, exceeds the processing capacity and speed of conventional database systems and architectures, and requires visualization in three and four dimensions in order to understand the Earth processes at play. Successfully addressing the scientific challenges of Big Data requires integrative and innovative approaches to developing, managing, and visualizing extensive and diverse data sets, but is also critically dependent on effective analytical workflows. This talk will present an emerging agenda and work in progress toward this end at Environmental Systems Research Institute.

    • Host: Tony Hey, Microsoft Research | video (subsequent keynote address also on this video) | slides

    • Chair: Tony Hey, Microsoft Research

      Speaker: Antony John Williams, Royal Society of Chemistry | video (Jim Gray Award precedes keynote on this video) | slides

      In less than a decade, the Internet has provided us access to enormous quantities of chemistry data. Chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold: while online searches can now provide access to information associated with many tens of millions of chemicals, and can allow us to traverse patents, publications, and public-domain databases, the promise of high-quality data on the web needs to be tempered with caution.

      In recent years, the crowdsourcing approach to developing curated content has been growing. Can such approaches allow us to bring to bear the collective wisdom of the crowd to validate and enhance the availability of trusted chemistry data online, or are algorithms likely to be more powerful in terms of validating data? While it is now possible to search the web by using a query form natural to chemists—that of “structure searching the web”—scientists will likely have to accept joint responsibility for the quality of data online for the foreseeable future. Their participation is likely to come through engaging in open science, the provision of data under open licenses, and by offering their skills to the community.

      This presentation will provide an overview of the present state of chemistry data online, the challenges and risks of managing and accessing data in the wild, and how an Internet for chemistry continues to expand in scope and possibilities.