October 21, 2007 - October 23, 2007

eScience Workshop 2007

Location: Chapel Hill, North Carolina, US

Abstracts for Tuesday, October 23, 2007

  • Plenary Presentation
  • David Heckerman, Microsoft Research

    I will describe several challenges in the design of an HIV vaccine and show how we have addressed them with statistical models in combination with high-performance computing. The statistical models we use are generative models, sometimes called graphical models or Bayesian networks. I will also discuss how these models can be used for genome-wide association studies—the search for connections between our DNA and disease. Finally, I will talk about how our work with scientists has led to improvements in statistical methods for learning generative models from data.
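
    As a rough illustration of the kind of DNA-disease association question the talk touches on, the sketch below runs a single-marker chi-square test on an invented genotype-by-phenotype table. It is a toy example in Python/SciPy, not the generative-model methodology described in the abstract.

```python
# Illustrative only: a toy single-marker association test, not the
# statistical models described in the talk.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical genotype counts (rows: genotypes AA/Aa/aa,
# columns: cases vs. controls).
observed = np.array([
    [120,  90],   # AA
    [200, 210],   # Aa
    [ 80, 130],   # aa
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")
```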

  • Data Management and Standardization

  • Simon Coles, University of Southampton

    Based on the e-Bank-UK and Repository for the Laboratory (R4L) projects, a working model for a scientific data capture, management, curation and dissemination framework will be presented. The eCrystals repository has been constructed on an institutional repository platform and configured to ingest small-molecule crystallographic data generated by the UK National Crystallography Service, whilst the R4L repository supports a range of different types of analytical chemistry data. This model addresses the escalating data deluge problem by integrating digital library technologies with both the research laboratory and established publication and dissemination routes. The institutional model provides a potential mechanism for the long-term archival and availability of information, enabling an institution to capture its research data output through integration into the laboratory environment. The repository ingest process ensures full capture of laboratory data and effective metadata creation at the point the data is generated. A private archive provides effective management of the data, whilst an embargo procedure allows timely dissemination of results through a public archive. A schema for the dissemination of crystallographic data has been devised through consultation with the community, enabling effective harvesting by data centers and third-party aggregator services. The use of persistent identifiers provides a mechanism to permanently link the conventional scholarly article with its associated underlying dataset. Current work is investigating the issues associated with constructing a federation of data repositories (institutional and subject-based) and its long-term integration into the publishing and chemical information provision processes.
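
    To make the embargo and persistent-identifier workflow concrete, here is a minimal Python sketch of a deposit record that stays in a private archive until its embargo elapses. The class, field names and identifier are invented for illustration and do not reflect the actual eCrystals or R4L schemas.

```python
# A minimal sketch of the deposit/embargo idea; names are illustrative.
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Deposit:
    dataset_id: str              # persistent identifier for the dataset
    article_doi: Optional[str]   # link to the conventional scholarly article
    metadata: dict
    deposited: date
    embargo_days: int = 180

    def is_public(self, today: Optional[date] = None) -> bool:
        """A deposit moves from the private archive to the public one
        once its embargo period has elapsed."""
        today = today or date.today()
        return today >= self.deposited + timedelta(days=self.embargo_days)

d = Deposit(
    dataset_id="hdl:10.1000/crystal-0001",   # invented identifier
    article_doi=None,                        # linked once the article appears
    metadata={"technique": "single-crystal X-ray diffraction"},
    deposited=date(2007, 1, 15),
)
print(d.is_public(date(2007, 10, 23)))       # True: the embargo has elapsed
```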

  • Tom Jackson, University of York; Jim Austin, University of York

    We describe work within the CARMEN e-Science project, which is addressing the challenge of creating and managing experimental data and methods within the context of neuroscience research. The traditional research approach of testing a hypothesis and publishing the results is hampered in situations where others need to build on the results and need access to the data or original methods. CARMEN addresses this issue by allowing scientists to share data and methods within a collaborative Grid environment. The CARMEN platform (a CAIRN) is a Grid-based, shared data and services repository. Central to the data management challenge is providing the capability to data-mine large time-series data sets; the initial application of the system is spike train data from nerve cell recordings. We are examining how data can be represented effectively for diverse experimental methods. Also central to the investigations is providing generalized methods which allow users to publish and share their software services on the CAIRN for wider collaborative research. Initial investigations have shown that MATLAB is a common programming platform; hence, we aim to provide an interactive MATLAB environment on the CAIRN, with a library of dynamically deployable services. A sister project, DAME, developed a distributed signal search engine, the Signal Data Explorer (SDE), which provides a platform for managing, viewing and searching time-series data. Interoperability between this and other software services is being investigated in the project. Currently, the SDE invokes search services on data nodes using Pattern Match Controller (PMC) technology, allowing data to be searched remotely. We will generalize this function to allow any process to be run on remote data. CARMEN is keen to seek early engagement with Grid communities to facilitate constructive evaluation of the proposed approach.
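
    The remote time-series search performed by the SDE/PMC combination can be pictured as a sliding-window pattern match over a recorded trace. The Python sketch below shows that general idea on synthetic data; it is not the CARMEN, SDE or PMC implementation (which, as noted above, targets MATLAB-based services).

```python
# Illustrative sliding-window pattern search over a 1-D signal;
# not the SDE/PMC implementation, just the general idea.
import numpy as np

def best_matches(signal: np.ndarray, template: np.ndarray, top_k: int = 3):
    """Return the offsets where the template correlates most strongly
    with the signal (normalised cross-correlation)."""
    n = len(template)
    t = (template - template.mean()) / (template.std() + 1e-12)
    scores = []
    for i in range(len(signal) - n + 1):
        w = signal[i:i + n]
        w = (w - w.mean()) / (w.std() + 1e-12)
        scores.append((float(np.dot(w, t) / n), i))
    return sorted(scores, reverse=True)[:top_k]

rng = np.random.default_rng(0)
trace = rng.normal(size=1_000)          # stand-in for a spike-train trace
trace[400:420] += np.hanning(20) * 5    # embed a known waveform
print(best_matches(trace, np.hanning(20) * 5))
```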

  • Mark Bean, GlaxoSmithKline

    AnIML, the Analytical Information Markup Language, is an international effort toward an XML standard for analytical chemistry. A retrospective on the creation of XML standards based on XML Schema must include discussion of selecting a home for the standard, obtaining consensus, maintaining momentum over the years, and extensions of the standard in unexpected directions. AnIML standard development was hosted by the ASTM E13.15 Committee on Analytical Data and IUPAC, both international standards bodies, partially funded by the National Institute of Standards and Technology (NIST), but created by the efforts of volunteers across the globe with analytical domain and XML expertise. Merely scheduling meetings spanning a nine-hour time-zone difference presented a challenge, and both wiki and net-meeting technologies proved invaluable. AnIML consists of a flexible core XML schema which can be stretched around any analytical data set according to rules specified by extensible Technique Definition documents (LC, UV, NMR, MS, etc.) created by domain experts. These Technique Definitions can be extended to meet vendor- or industry-specific needs without breaking the core schema, which allows vendor-neutral applications (generic AnIML viewers) to be written. AnIML draws on experience with prior standards (JCAMP, ANDI). We will illustrate with examples from the core schema, Technique Definition documents, and AnIML XML files containing LC-MS data.
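
    To give a flavour of how a flexible core schema plus a technique definition might wrap an analytical data set, the sketch below builds a deliberately simplified, AnIML-like XML document with Python's standard library. The element and attribute names are invented for illustration and are not taken from the actual AnIML core schema.

```python
# A simplified, AnIML-like document built with the standard library;
# element and attribute names are illustrative, not the real AnIML schema.
import xml.etree.ElementTree as ET

doc = ET.Element("AnalyticalDocument")
exp = ET.SubElement(doc, "Experiment", technique="LC-MS")
series = ET.SubElement(exp, "Series", name="retention_time", unit="min")
for value in (0.5, 1.0, 1.5, 2.0):
    ET.SubElement(series, "Value").text = str(value)

print(ET.tostring(doc, encoding="unicode"))
# A vendor-neutral viewer would consult the technique definition ("LC-MS")
# to decide how to render the series, without any vendor-specific code.
```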

  • Scientific Workflow

  • Ewa Deelman, USC-Information Sciences Institute

    Many scientific applications, such as those in astronomy, earthquake science, and gravitational-wave physics, have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be solved, and these issues take different forms across the workflow lifecycle. During workflow creation, appropriate input data need to be discovered. During workflow mapping and execution, data need to be staged in to and out of the computational resources. As data are produced, they need to be archived with enough metadata and provenance information so that they can be interpreted and shared among collaborators. This talk will describe the workflow lifecycle and discuss the issues related to data management at each step. Examples of challenge problems will be given in the context of the following applications: CyberShake, an earthquake science computational platform; Montage, an astronomy application; and LIGO's binary inspiral search, a gravitational-wave physics application. These computations, represented as workflows, are running on today's national cyberinfrastructure, such as the OSG and the TeraGrid, and use workflow technologies such as Pegasus and DAGMan to map high-level workflow descriptions onto the available resources and execute the resulting computations. The talk will describe the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on the current cyberinfrastructure. Particular emphasis will be given to issues related to the management of data throughout the workflow lifecycle.
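
    The mapping step described above can be pictured as ordering a small task-dependency DAG whose nodes include explicit stage-in and stage-out steps. The Python sketch below illustrates only that idea, using invented task names; it is in no way the Pegasus or DAGMan machinery.

```python
# A toy workflow DAG with explicit stage-in/compute/stage-out steps;
# illustrates the mapping idea only, not Pegasus or DAGMan.
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# task -> set of tasks it depends on (names are invented)
dag = {
    "stage_in_raw":        set(),
    "reproject":           {"stage_in_raw"},
    "coadd":               {"reproject"},
    "stage_out_mosaic":    {"coadd"},
    "register_provenance": {"stage_out_mosaic"},
}

for task in TopologicalSorter(dag).static_order():
    print("run:", task)   # a real planner would submit these to the Grid
```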

  • Carole Goble, University of Manchester; David De Roure, The University of Southampton

    Workflows are scientific objects in their own right, to be exchanged and reused. myExperiment is an initiative to create a social networking environment for workflow workers. myExperiment is also planned as a marketplace to "shop" for workflows and services; a gateway to other environments; and a platform to launch workflows. We are currently beta testing the first phase of myExperiment, the social network, amongst a group of life scientists who develop Taverna workflows (www.mygrid.org.uk). We present the motivation for myExperiment and sketch the proposed capabilities. We report on the technical, political and sociological issues and our experiences so far. Our greatest challenge is how we work with the inherent self-interest of the scientist to gain trusted and enthusiastic participation in an inherently altruistic activity that relies on the network effects of many members.

  • Bertram Ludaescher, Shawn Bowers, Timothy McPhillips, Daniel Zinn; University of California, Davis

    Interest in scientific workflows has grown considerably in recent years. It is now generally recognized that workflows enable scientists to harness IT in new ways, thus promising to dramatically accelerate scientific discovery in the future. Advantages of scientific workflows over other solutions include workflow automation, optimization, and result reproducibility (via a provenance framework). An often overlooked, but crucially important, area is the modeling and design of scientific workflows. We believe that better support for rapid development, adaptation, and evolution of workflows is on the critical path to widespread adoption of this technology: e.g., workflow designs should be resilient to procedural changes (task insertion, removal, modification) and schema changes (in inputs and outputs). To this end, we compare extant models of computation (MoCs) underlying different kinds of workflows, i.e., current scientific workflows and more traditional business workflows. Our MoC comparison includes task dependency graphs (DAGs) common in Grid workflows, Petri nets, which are foundational for most business workflow approaches, dataflow process networks found in a number of scientific workflow systems, and XML stream processing models, among others. We argue that in many scientific applications, data coherence is a crucial but often neglected aspect that is rarely found in current MoCs, resulting in MoCs (such as vanilla Petri nets and process networks) that are unnecessarily cumbersome and not resilient to change. To overcome such shortcomings, we propose a simple hybrid MoC that elegantly combines features from several MoCs and paradigms. In essence, our hybrid MoC views a scientific workflow as a configurable, pipelining data transducer over XML data streams. By exploiting an assembly-line paradigm, data coherence and resilience to changes are achieved as well.
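
    As a loose analogue of the proposed hybrid MoC, the Python sketch below builds a generator-based pipeline in which each stage transforms only the records it understands and passes the rest through unchanged, which is one way to picture data coherence and resilience to change. It illustrates the paradigm only; it is not the authors' model of computation.

```python
# A generator-based pipeline as a loose analogue of a streaming data
# transducer; a sketch of the paradigm, not the authors' hybrid MoC.
def source():
    for record in ({"kind": "sequence", "data": "ACGT"},
                   {"kind": "note", "data": "pass me through"}):
        yield record

def align(stream):
    """Transform only the records this stage understands; pass the rest
    through untouched, keeping the pipeline resilient to upstream changes."""
    for record in stream:
        if record["kind"] == "sequence":
            record = {**record, "aligned": True}
        yield record

def sink(stream):
    for record in stream:
        print(record)

sink(align(source()))
```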

  • Data Mining

  • Alexander Tropsha, UNC-Chapel Hill; Julia Grace, UNC-Chapel Hill; Hao Xu, UNC-Chapel Hill; Tongan Zhao, UNC-Chapel Hill; Chris Grulke; Berk Zafer; Diane Pozefsky, UNC-Chapel Hill; Weifan Zheng, NCCU

    The NIH's Roadmap includes the Molecular Library Initiative (MLI) and the PubChem repository of biological assays of chemical compounds. In 2005 the MLI formed the Molecular Library Screening Centers Network (MLSCN). As of May 2007, there were 256 MLSCN bioassays deposited in PubChem covering over 140,000 chemicals, already making PubChem the largest publicly available repository of bioactivity data. It promises to be comparable to, if not exceeding, the largest bioinformatics databases. The Carolina Center for Exploratory Cheminformatics Research (CECCR) was founded with Roadmap funding in 2006 to develop research cheminformatics tools and software to address the data mining and knowledge discovery challenges created by the MLI and PubChem projects. We have developed and deployed a prototype cheminformatics web server called C-ChemBench. It includes modules designed to address the needs of all constituent groups of chemical biology and drug discovery specialists, i.e., computational chemists (Model Development Module), biologists (Predictions Module), chemists (Library Design Module), and bioinformaticians (CECCR Base Module). We shall discuss several cheminformatics-specific data mining and knowledge discovery technologies (such as Quantitative Structure-Activity Relationship (QSAR) modeling) for biological assay data analysis and provide several successful examples of applications. Our technologies (which also rely on distributed computing) afford robust and validated models capable of accurate prediction of properties for molecules not included in the training sets. This focus on knowledge discovery and property forecasting positions C-ChemBench as the major data-analytical and decision-support cheminformatics server in support of experimental chemical biology research.
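
    A generic QSAR-style workflow of the kind mentioned above (molecular descriptors in, validated activity predictions out) can be sketched in a few lines. The example below uses synthetic descriptors and scikit-learn purely for illustration; it is not the CECCR or C-ChemBench software.

```python
# A generic QSAR-style regression on synthetic descriptors, shown with
# scikit-learn for illustration only; not the CECCR/C-ChemBench code.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                                 # stand-in descriptors
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.3, size=500)    # stand-in activity

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# External validation on held-out "molecules", as the abstract stresses.
print("R^2 on held-out set:", round(r2_score(y_test, model.predict(X_test)), 3))
```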

  • George Djorgovski, Caltech; Roy Williams, Caltech; Ashish Mahabal, Caltech; Andrew Drake, Caltech; Matthew Graham, Caltech; Ciro Donalek, Caltech; Eilat Glikman, Caltech

    We describe an example of real-time mining of massive data streams, taken from the rapidly developing field of time-domain astronomy and synoptic digital sky surveys. A typical scientific process consists of an iterative loop of measurements, their analysis, additional measurements implied by the analysis, and so on. Typically this occurs on time scales of months or years, but if the relevant time scales of the phenomena under study are in the range of minutes to hours, the process must be automated, with no humans in the loop – especially if the data flux and volume are in the TB or PB range. We describe a system to discover, classify, and disseminate astronomical transient events, which involves a robotic telescope network with feedback: automatically requested follow-up observations are folded back into an iterative analysis and classification of observed events. These may include a variety of cosmic explosions (e.g., supernovae, cataclysmic variables), inherently variable objects (e.g., stars, quasars), moving objects (asteroids, dwarf planets), and possibly even some previously unknown types of objects and phenomena. Rapid and automated follow-up is essential for their physical understanding and scientific use. The system operates within a broader Virtual Observatory framework, and it includes a variety of computational components, from data reduction pipelines to federated archives, web services, machine learning, etc. It represents a test bed for many technologies needed for a full scientific exploitation of current and forthcoming synoptic sky surveys. It also has a broader relevance for other situations which require automated and rapid data mining of massive data streams.
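
    The detect, classify, and follow-up loop can be caricatured as a simple rule applied to each new detection against its archival baseline. The thresholds, field names and labels in the Python sketch below are invented for illustration and do not correspond to the project's actual pipeline or classifiers.

```python
# A toy version of the detect -> classify -> follow-up loop; thresholds,
# field names and labels are invented, not the project's pipeline.
from dataclasses import dataclass

@dataclass
class Detection:
    source_id: str
    magnitude: float            # new measurement
    baseline_magnitude: float   # archival brightness of the same source

def classify(det: Detection) -> str:
    delta = det.baseline_magnitude - det.magnitude  # brightening is positive
    if delta > 3.0:
        return "possible-supernova"
    if delta > 1.0:
        return "variable-candidate"
    return "uninteresting"

def handle(det: Detection) -> None:
    label = classify(det)
    if label != "uninteresting":
        print(f"{det.source_id}: {label} -> request robotic follow-up")
        # the follow-up observation would be fed back into classify()

handle(Detection("tr-0042", magnitude=17.1, baseline_magnitude=21.0))
```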

  • Firat Tekiner, University of Manchester; Sophia Ananiadou, University of Manchester

    The continuing rapid growth of data and knowledge expressed in the scientific literature has spurred huge interest in text mining (TM). The individual researcher cannot easily keep up with the literature in their domain, and knowledge silos further prevent integration and cross-disciplinary knowledge sharing. The National Centre for Text Mining (NaCTeM) offers TM services to the academic community that allow users to apply TM techniques to a variety of problems in their areas of interest. NaCTeM is entering a new phase, where the goal is to move from processing abstracts to full texts and to data-mine the voluminous results to discover relationships yielding new knowledge. The expansion to new domains and the increase in scale will massively increase the amount of data to be processed by the Centre (from gigabytes to terabytes). In this work we are investigating approaches that use high-performance computing (HPC) to tackle the data deluge problem for large-scale TM applications. Although TM applications are data-independent, handling large text data becomes an issue when full texts are considered, owing to the problem sizes involved. Each step in the TM pipeline adds further information to the initial raw text, and the data size increases as processing progresses through the pipeline. The initial work focuses on tagging and parsing of text using TM applications, and scaling up to 64-128 processors has been achieved. However, when scaling to a larger number of processors, data and work distribution will also be an issue due to the unstructured nature of the data available. In addition, we aim to create a framework to move and handle the large amounts of data passed between the many processes in the TM pipeline. We will discuss the challenges encountered when mining large volumes of text and the future work needed to text-mine full papers.
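
    The document-level parallelism exploited in the scaling experiments can be illustrated with the standard library alone: the sketch below farms a placeholder tagging step out to worker processes. The trivial tagger stands in for the Centre's real tagging and parsing components.

```python
# Distributing a trivial "tagging" step over worker processes with the
# standard library; the tagger is a placeholder, not NaCTeM's pipeline.
from multiprocessing import Pool

def tag(document: str) -> list:
    # Placeholder: a real pipeline would run part-of-speech tagging,
    # parsing and named-entity recognition, each enriching the text.
    return [(token, "TOKEN") for token in document.split()]

if __name__ == "__main__":
    documents = ["protein kinase inhibits growth"] * 1000  # stand-in corpus
    with Pool(processes=4) as pool:
        tagged = pool.map(tag, documents, chunksize=50)
    print(len(tagged), "documents tagged")
```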