October 13, 2006 - October 15, 2006

eScience Workshop 2006

Location: Baltimore, Maryland, US

  • Chair: Roger Barga

    Automatic Capture and Efficient Storage of eScience Experiment Provenance

    Roger Barga, Microsoft Research

    Workflow is playing an increasingly important role in conducting e-Science experiments, but current systems lack the necessary support for the collection and management of provenance data. We argue that eScience provenance data should be automatically generated by the workflow enactment engine and managed over time by an underlying storage service.

    In this presentation, we introduce a layered model for workflow execution provenance, which allows navigation from an abstract model of the experiment to instance data collected during a specific experiment run. We outline modest extensions to a commercial workflow engine so it will automatically capture this provenance data at runtime. We then present an approach to store this provenance data in a relational database engine. Finally, we identify important properties of provenance data captured by our model that can significantly reduce the amount of storage required, and demonstrate we can reduce the size of provenance data captured from an actual experiment to 0.4% of the original size, with modest performance overhead.
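
    As a rough illustration of the relational-storage idea only (the table and column names below are hypothetical, not the schema presented in the talk), one plausible layout separates the abstract workflow model, individual runs, per-activity events, and deduplicated data payloads:

```python
# Illustrative sketch only: a possible relational layout for layered workflow
# provenance (abstract model -> runs -> activity events -> data bindings).
# Table and column names are hypothetical, not the schema from the talk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflow_model (          -- abstract experiment description
    model_id INTEGER PRIMARY KEY,
    name     TEXT
);
CREATE TABLE workflow_run (            -- one enactment of a model
    run_id   INTEGER PRIMARY KEY,
    model_id INTEGER REFERENCES workflow_model(model_id),
    started  TEXT
);
CREATE TABLE activity_event (          -- per-activity execution records
    event_id INTEGER PRIMARY KEY,
    run_id   INTEGER REFERENCES workflow_run(run_id),
    activity TEXT,
    status   TEXT
);
CREATE TABLE data_item (               -- payloads stored once, keyed by content
    hash     TEXT PRIMARY KEY,         -- hash, so repeated values are not duplicated
    value    BLOB
);
CREATE TABLE data_binding (            -- links events to (deduplicated) data
    event_id INTEGER REFERENCES activity_event(event_id),
    role     TEXT,                     -- e.g. 'input' or 'output'
    hash     TEXT REFERENCES data_item(hash)
);
""")
conn.commit()
```

    Storing each distinct payload once and referencing it by hash is one way repeated intermediate data could be collapsed, in the spirit of the storage reduction reported above.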


    Taverna, a Workflow System for the Life Scientist in the Trenches

    Tom Oinn, Manchester University, UK

    Taverna is a workflow workbench designed for life scientists. It enables researchers with limited computing background and few technical resources to access and make use of global scientific resources. Taverna can link together local and remote data and analytical resources, in both the private and public domains, to run so-called “in silico experiments”. Taverna provides a language and open-source software tools that allow users to discover and access available web resources, construct complex analysis workflows, run these workflows over their own data or others’, and visualize the results. The principle is one of lowering the barrier to engagement for both the user and the service provider, and providing a lightweight, flexible solution wherever possible. Taverna also aims to support the whole in silico experiment lifecycle, emphasizing the management and mining of provenance metadata and the sharing of workflows amongst colleagues. In addition to the workbench, Taverna provides methods for easing the incorporation of new applications and old favorite tools, and promotes the role of common workflow patterns, building a body of verified protocol know-how and service collections for specific problem sets.


    GPFlow: A Pilot Workflow System for Interactive Bioinformatics

    James M. Hogan, Queensland University of Technology, Australia

    Modern genome bioinformatics is increasingly characterized by the use of application pipelines, with analyses realized through a chain of well-established and robust tools for tasks such as gene finding, sequence alignment, homology detection and motif discovery. Unfortunately, in many cases the pipeline is constructed through a laborious and error-prone process of manual data reformatting and transfer, greatly limiting throughput and the ability of the scientist to undertake novel investigations.

    The GPFlow project builds on Microsoft business workflow technology to provide a flexible and intuitive workflow environment for both routine and exploratory bioinformatics-based research. The system provides a high-level, interactive, web-based front end for scientists, using a workflow model based on a spreadsheet metaphor.

    This talk presents the results of an Australian Research Council/Microsoft funded pilot project, focused on the analysis of gene regulation in bacteria through the use of broad-scale comparative studies. While a substantial number of bacterial genomes (approximately 200) have now been sequenced, studies of regulation are hampered by a paucity of reliably annotated regulatory regions outside model organisms such as Escherichia coli and Bacillus subtilis. The problem is compounded by the advent of rapid sequencing technologies and the need to integrate newly sequenced genomes into the comparative data set quickly. Our demonstration workflows therefore support rapid determination of gene and regulatory homology across organisms, confirmation and discovery of regulatory motifs, and identification of their underlying relationships. We shall present results of comparative studies among various species of Chlamydia and Bacillus.


    Managing Exploratory Workflows

    Juliana Freire, University of Utah

    VisTrails is a new workflow management system which provides support for scientific data exploration and visualization. Whereas workflows have been traditionally used to automate repetitive tasks, for applications that are exploratory in nature very little is repeated—change is the norm. As a scientist generates and evaluates hypotheses about the data under study, a series of different, albeit related, workflows is created as the workflow is adjusted in an interactive process. VisTrails was designed to manage these rapidly evolving workflows. A novel feature of VisTrails is a change-based mechanism which uniformly captures provenance information for data products and for the workflows used to generate these products. By capturing the history of the exploration process and explicitly maintaining the relationships among the workflows created, VisTrails not only allows results to be reproduced, but also enables users to efficiently and effectively navigate the space of workflows in an exploration task. In addition, this provenance information is used to simplify the creation, maintenance and re-use of workflows; to optimize their execution; and to provide scalable mechanisms for collaborative exploration of large parameter spaces in a distributed setting. As an important goal of our project is to produce tools that scientists can use, VisTrails provides intuitive, point-and-click interfaces that allow users to interact with and query the provenance information, including the ability to visually compare different workflows and their results. In this talk, we will give an overview of VisTrails through a live demo of the system. More information about VisTrails is available at http://www.sci.utah.edu/~vgc/vistrails.
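
    As a toy illustration of a change-based approach (a sketch of the general idea, not VisTrails code), one can store only the actions that turn one workflow version into the next and rebuild any version by replaying its chain of actions:

```python
# Minimal sketch (not VisTrails itself) of change-based provenance: store only
# the action that transforms a parent version into a child version, and
# materialize any version by replaying actions from the root.

class Version:
    def __init__(self, vid, parent=None, action=None):
        self.vid, self.parent, self.action = vid, parent, action

def materialize(version):
    """Rebuild the workflow (here, just a set of module names) for a version."""
    actions = []
    v = version
    while v is not None:
        if v.action is not None:
            actions.append(v.action)
        v = v.parent
    workflow = set()
    for kind, module in reversed(actions):   # replay root -> leaf
        if kind == "add":
            workflow.add(module)
        elif kind == "delete":
            workflow.discard(module)
    return workflow

root = Version(0)
v1 = Version(1, root, ("add", "ReadData"))
v2 = Version(2, v1, ("add", "Isosurface"))
v3 = Version(3, v1, ("add", "VolumeRender"))   # a sibling branch: change, not loss
print(materialize(v2), materialize(v3))
```

    Because sibling versions share their common prefix of actions, trying an alternative never destroys the workflow it branched from, which is the property that makes exploratory work navigable.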

  • Chair: Kristin Tolle

    Organization and Infrastructure of the Cancer Biomedical Informatics Grid

    Peter A. Covitz, National Cancer Institute

    The mission of the National Cancer Institute (NCI) is to relieve suffering and death due to cancer. NCI leadership has determined that the scale of its enterprise has reached a level that demands new, more highly coordinated approaches to informatics resource development and management. The Cancer Biomedical Informatics Grid (caBIG) program was launched to meet this challenge.

    caBIG participants are organized into workspaces that tackle the various dimensions of the program. Two cross-cutting workspaces – one for Architecture, the other for Vocabularies and Common Data Elements – govern syntactic and semantic interoperability requirements. These cross-cutting workspaces provide best-practices guidance for technology developers as well as conduct reviews of system designs and data standards. Four domain workspaces build and test applications for Clinical Trials, Integrative Cancer Research, Imaging, and Tissue Banks and Pathology Tools, representing the highest-priority areas defined by the caBIG program members themselves. Strategic-level workspaces govern caBIG requirements for Training, Data Sharing and Intellectual Capital, and overall strategic planning.

    In its first year caBIG defined high-level interoperability and compatibility requirements for information models, common data elements, vocabularies, and programming interfaces. These categories were grouped into different degrees of stringency, labeled as the caBIG Bronze, Silver and Gold levels of compatibility. The Silver level is quite stringent, and demands that systems adopt and implement standards for model-driven and service-oriented architecture, metadata registration, controlled terminology, and application programming interfaces. The Gold level architecture consists of a formal data and analysis grid dubbed “caGrid” that future caBIG systems will register with and plug into. caGrid is based upon the Globus Toolkit (http://www.globus.org) and a number of additional technologies such as caCORE from the NCI (http://ncicb.nci.nih.gov/NCICB/infrastructure) and Mobius from Ohio State University (http://bmi.osu.edu/areas_and_projects/caBIG.cfm). More information is available at http://cabig.nci.nih.gov.


    CancerGrid: Model- and Metadata-Driven Clinical Trials Informatics

    Steve Harris & Jim Davies, Oxford University, UK

    The CancerGrid project (www.cancergrid.org) is developing a software architecture for ‘tissue-plus-data’ clinical trials. The project is using a model- and metadata-driven approach that makes the semantics of clinical trials explicit, facilitating dataset discovery and reuse.

    The architecture is based on open standards for the composition of appropriate services, such as: randomization, minimization, clinician identity, serious adverse events relay, remote data capture, drug allocation and warehousing, and form validation.

    CancerGrid is funded by the UK MRC, with additional support from the EPSRC and Microsoft Research. It brings together expertise in software engineering and cancer clinical trials from five universities: Cambridge, Oxford, Birmingham, London (UCL), and Belfast.

    The project has developed a CONSORT-compliant model for clinical trials, parameterized by clinical data elements hosted in metadata repositories. A model instance can be used to generate and configure services to run the trial it describes.
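
    As a purely hypothetical sketch of the metadata-driven idea (invented element names, not CancerGrid code), a trial model parameterized by clinical data elements could be turned into a form specification that a data-capture service renders:

```python
# Hypothetical sketch: given clinical data element definitions from a metadata
# repository, generate a case report form description that a data-capture
# service could render. Element names and structure are invented.

data_elements = [
    {"id": "CDE-0001", "name": "tumour_stage", "type": "enum",
     "values": ["I", "II", "III", "IV"]},
    {"id": "CDE-0002", "name": "haemoglobin_g_dl", "type": "decimal"},
]

def generate_form(trial_id, elements):
    """Turn data element definitions into a simple form specification."""
    return {
        "trial": trial_id,
        "fields": [
            {"label": e["name"], "cde": e["id"], "kind": e["type"],
             "choices": e.get("values", [])}
            for e in elements
        ],
    }

print(generate_form("TRIAL-001", data_elements))
```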

    This talk will describe the model, the services, and the technology employed, from XML databases to Office add-ins. It will demonstrate the aspects of the technology that have been completed, and outline plans for future releases.


    caBIG Smart Client Joining the Fight Against Cancer

    Tom Macura

    The cancer Biomedical Informatics Grid (caBIG), dubbed the WWW of Cancer Research, is a National Cancer Institute informatics project that is likely to become the authoritative source of knowledge related to cancer. caBIG is built on a host of open-source Java technologies.

    caBIG dotNET is an open-source web service and client API we have developed to expose high-level caBIG Java APIs to the .NET developer community. We have used caBIG dotNET to build two GUI Smart Clients: the xl-caBIG Smart Client and the xl-caBIG Smart Client MOBILE.

    The xl-caBIG Smart Client is a set of extensions to Microsoft Excel 2003 that gives scientists a graphical interface for accessing caBIG data-services. It provides users intuitive access to caBIG by leveraging their intimacy with the Windows environment and Excel’s statistical tools.

    The xl-caBIG Smart Client MOBILE is an alternative interface to the xl-caBIG Smart Client that is better suited for the unique input interface and limited screen area of mobile computing devices. This translates into caBIG access on a wide range of computing devices including PDAs and mobile phones.


    C-ME: a Smart Client to Enable Collaboration Using a Two- or Three-Dimensional Contextual Basis for Annotations

    Anand Kolatkar, The Scripps Research Institute

    The Collaborative Modeling Environment (C-ME) is a smart client that allows researchers to organize, visualize and share information with other researchers, using Microsoft Office SharePoint Server 2007 (MOSS) as a data store, Vista’s Windows Presentation Foundation for the graphics display, and Visual Studio 2005/C# as a development platform.

    C-ME addresses two important aspects of collaboration: context and information management. C-ME allows a researcher to use a 3D atomic structure model or a 2D image (e.g., an image of a slide containing cancer cells) as a contextual basis on which to attach annotations to specific atoms or groups of atoms of a 3D atomic structure, or to user-definable areas on a 2D image. These annotations (Office documents, URLs, notes, and screen captures) provide additional information about the atomic structure or cellular imagery; they are stored on MOSS and are accessible to other scientists using C-ME, provided they have appropriate permissions in Active Directory. Storing and managing the annotations on MOSS allows us to maintain a single copy of the information accessible to a collaborating group of researchers. Contributions to this single copy of information via additional annotations are again immediately available to the entire community.
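
    The annotation model described above can be pictured with a small, purely illustrative data structure (not C-ME code): each annotation carries an attachment and is anchored either to a set of atoms in a 3D structure or to a region of a 2D image:

```python
# Illustrative data-structure sketch (not C-ME code): an annotation anchored to
# a 3D or 2D context, together with the attachment it carries.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Annotation:
    author: str
    kind: str                       # e.g. 'document', 'url', 'note', 'screen_capture'
    payload: str                    # document path, URL, or note text
    atoms: List[int] = field(default_factory=list)              # 3D context: atom serial numbers
    image_region: Optional[Tuple[int, int, int, int]] = None    # 2D context: x, y, w, h

@dataclass
class Entity:
    name: str                       # e.g. a structure or slide identifier
    annotations: List[Annotation] = field(default_factory=list)

slide = Entity("slide_042")
slide.annotations.append(
    Annotation(author="akolatkar", kind="note",
               payload="Possible mitotic figure", image_region=(120, 340, 64, 64)))
print(len(slide.annotations))
```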

    Data organization is hierarchical, with projects at the top containing one or more entities, which can be annotated as described above. We are currently enhancing the existing 3D and 2D annotation capabilities to better match researchers’ needs, increasing drag-and-drop and one-click functionality for efficiency, standardizing the GUI under Vista, and further leveraging the search and indexing capabilities of MOSS. We are also looking for outside users to install and evaluate C-ME.

    The C-ME development team includes members of InterKnowlogy, Microsoft and the Kuhn-Stevens laboratories at The Scripps Research Institute. Support for C-ME is in part provided by the NIH NIGMS Protein Structure Initiative under grant U54-GM074961 and through NIH-NIAID Contract HHSN266200400058C.

  • Chair: Jignesh Patel

    Some Challenges in Integrating Information on Protein Interactions and a Partial Solution

    H.V. Jagadish, University of Michigan

    Independently constructed sources of (scientific) data frequently have overlapping, and sometimes contradictory, information content. Current approaches fall into two categories: force the integration step onto the user, or merely collate the data, at most transforming it into a common format. The first places an undue burden on the user to fit all of the jigsaw puzzle pieces together. The second leads to redundancy and possible inconsistency.

    We propose a third: deep data integration. The idea is to provide a cohesive view of all information currently available for a protein, interaction, or other object of scientific interest. Doing so requires that multiple pieces of data about the object, in different sources, first be identified as referring to the same object, if necessary through “third party” information; then that a single “record” be created comprising the union of the information in the matched records, keeping track of differences where they occur; and finally that the provenance of every value in the dataset be tracked, so scientists can judge which items to use and how to resolve differences.
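
    A hedged sketch of this idea (field and source names invented) merges matched records into a single view while every value keeps track of which sources supplied it, so disagreements remain visible:

```python
# Hedged sketch of "deep integration": records judged to refer to the same
# object are merged into one view, while every value records which source(s)
# supplied it. Field and source names are made up for illustration.

def deep_merge(records):
    """records: list of (source_name, dict) pairs for one matched object."""
    merged = {}
    for source, record in records:
        for fld, value in record.items():
            merged.setdefault(fld, {}).setdefault(value, set()).add(source)
    return merged   # field -> {value -> {sources}}; disagreements stay visible

matched = [
    ("SourceA", {"gene": "TP53", "organism": "H. sapiens"}),
    ("SourceB", {"gene": "TP53", "organism": "Homo sapiens"}),
]
print(deep_merge(matched)["organism"])
# {'H. sapiens': {'SourceA'}, 'Homo sapiens': {'SourceB'}}  -- conflict retained
```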

    The results of this process, as applied to protein interactions and pathways, are found in the Michigan Molecular Interactions Database (MiMI). MiMI deeply integrates interaction data from HPRD, DIP, BIND, GRID, IntAct, the Center for Cancer Systems Biology dataset, and the Max Delbruck Center dataset. Additionally, auxiliary data is used from GO, OrganelleDB, OrthoMCL, PFam, ProtoNet, miBlast, InterPro and IPI. MiMI is publicly available at http://mimi.ctaalliance.org.

    In this talk, I will discuss the desiderata for a protein interaction integrated information resource obtained from our user community, and outline the architecture of the system we have developed to address these needs.


    Theory in the Virtual Observatory

    Gerard Lemson, Max-Planck-Institut für extraterrestrische Physik, Germany

    I will discuss and demo efforts to introduce theory into the Virtual Observatory (VO). With the VO the astronomical community aims to create an e-science infrastructure to facilitate online access to astronomical data sets and applications.

    The efforts of individual national VOs are organized in the International VO Alliance (IVOA), which seeks to define standard protocols to homogenize data access and enable interoperability of distributed services. VO efforts have generally concentrated on observational data, but recently interest has grown to include the results of large-scale computer simulations. The goal is the dissemination of simulation data per se, in particular finding ways of using such data for the planning, prediction and interpretation of observations.

    Simulated data sets are in general very different from observational ones and need special treatment in the VO. As an example of this I present a relational data structure for efficiently storing tree structures that represent the formation history of objects in the universe. An implementation of this is used in a web service exposing a SQL Server database that stores results of the largest cosmological simulation to date.
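
    One common relational trick for storing such formation trees, shown below purely as an illustration rather than as the design used in the talk, is to number nodes in depth-first order and record for each node the last identifier in its subtree; retrieving a full formation history then becomes a simple range query:

```python
# Illustrative only: depth-first labelling of a merger tree so that a node's
# entire progenitor subtree can be fetched with a range query.

def label_depth_first(tree, root):
    """tree: dict node -> list of progenitor nodes. Returns node -> (id, last_id)."""
    labels, counter = {}, [0]
    def visit(node):
        my_id = counter[0]
        counter[0] += 1
        for child in tree.get(node, []):
            visit(child)
        labels[node] = (my_id, counter[0] - 1)   # (depth-first id, last id in subtree)
    visit(root)
    return labels

tree = {"halo_z0": ["prog_a", "prog_b"], "prog_a": ["prog_a1"]}
labels = label_depth_first(tree, "halo_z0")
lo, hi = labels["prog_a"]
subtree = [n for n, (i, _) in labels.items() if lo <= i <= hi]
print(subtree)   # prog_a and its progenitors -- the rows a range query would return
```

    In SQL the equivalent retrieval is a WHERE clause of the form df_id BETWEEN lo AND hi, which is what makes this kind of layout attractive for a large relational store.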

    To facilitate the use of specialized simulation data by observers, the community has invented the idea of the “virtual telescope”. These are services that mimic real telescope observations and produce results that can be directly compared to observations. I will show examples producing optical galaxy catalogues and X-Ray observations of galaxy clusters.

    The implementation of virtual telescopes requires specialized expertise not generally available at a single location. Furthermore, to produce mock observations in sufficient detail for scientific purposes requires a high performance computational infrastructure.

    The VO offers the appropriate framework for resolving both these issues and I will conclude with thoughts on the steps that are required to make this a reality.


    Practical Experience with a Data Centric Model for Large-Scale, Heterogeneous Data Integration Challenges

    Michael Gillam, Azyxxi

    Heterogeneous data integration in the scientific and clinical domains is an often complex and costly process. The practical challenges have become more apparent with the ongoing technological struggles and public failures of high-profile government and corporate data integration efforts. We describe a data-centric model of heterogeneous data integration that combines data atomicity with metadata descriptors to create an architectural infrastructure that is highly flexible, scalable and adaptable.
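
    A minimal sketch of such a data-centric layout (illustrative names, not the actual schema): each observation is stored as an atomic row together with metadata descriptors, so new data types require no schema change:

```python
# Hedged sketch of a "data-centric" layout in the spirit described above:
# every observation is an atomic row plus metadata descriptors, so adding new
# data types needs no schema change. Names and columns are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE atom (
    atom_id   INTEGER PRIMARY KEY,
    patient   TEXT,
    kind      TEXT,      -- metadata descriptor, e.g. 'lab', 'image', 'video'
    name      TEXT,      -- e.g. 'serum_glucose'
    value     TEXT,      -- scalar value or a reference to bulk storage
    recorded  TEXT
);
""")
conn.execute("INSERT INTO atom VALUES (NULL, 'p001', 'lab', 'serum_glucose', '5.4', '2006-10-13')")
conn.execute("INSERT INTO atom VALUES (NULL, 'p001', 'image', 'chest_xray', 'blob://1a2b', '2006-10-13')")
rows = conn.execute("SELECT kind, name, value FROM atom WHERE patient = 'p001'").fetchall()
print(rows)   # lab values and image references retrieved through one query path
```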

    The practical application of this approach has created the most diverse real-time clinical data repository available. The system is in live use across eight hospitals, with over 80,000 clinical queries per hospital per day. The database is over 60 terabytes in size, with 500 terabytes of installed capacity. Over 11,000 heterogeneous data elements are stored, effectively integrating textual, image and streaming video data seamlessly into a single architecture. Over 1,500 live data streams feed data into the system and are maintained using only half the time of one full-time employee. Data retrieval for common clinical queries is targeted at one-eighth-of-a-second response times. All data within the system, currently spanning 10 years of live clinical use, are retrievable in real time. No data are offloaded or archived inaccessibly. The system has had 99.997% uptime for the last 10 years.

    The success of the data-centric approach has dramatic implications for many large-scale, heterogeneous data integration projects where data types are numerous and diverse, highly scalable infrastructures are required, and data specifications are imprecise or evolving.


    Managing Satellite Image Time Series and Sensor Data Streams for Agro-Environmental Monitoring

    Claudia Bauzer Medeiros, UNICAMP, Brazil

    The WebMaps project is a multidisciplinary effort involving computer scientists and experts on agricultural and environmental sciences, whose goal is to develop a platform based on Web services for agro-environmental planning in Brazil.

    One of the main challenges concerns providing users with the means to summarize and analyze long time series of satellite images. This analysis must correlate these series, for arbitrary temporal intervals and regions, with data streams from distinct kinds of weather sensor networks (for rainfall, humidity, temperature, etc.). Additional sources include data that allow the characterization of a region (e.g., relief or soil properties), crop physiological characteristics, and human occupation factors.

    Data quality and provenance are important factors, directly influencing analysis results. Besides massive data volumes, several other factors complicate data management and analysis.

    Heterogeneity is a big barrier – many kinds of data need to be considered, and weather sensors are varied and often faulty. Moreover, data correlations must consider a variety of time-warp factors. For instance, the effect of rainfall in a region, combined with temperature and soil parameters, takes months to be reflected in vegetation growth detected by satellite imagery.
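
    As a toy illustration of the kind of lagged correlation involved (synthetic data, not WebMaps code), one can scan candidate lags between a rainfall series and a vegetation index and keep the lag with the strongest correlation:

```python
# Toy illustration with synthetic data: find the lag (in months) at which
# rainfall best correlates with a vegetation index such as NDVI.
import numpy as np

rng = np.random.default_rng(0)
months = 120
rain = rng.gamma(2.0, 50.0, months)                 # synthetic monthly rainfall
lag_true = 3
ndvi = 0.3 + 0.001 * np.roll(rain, lag_true) + rng.normal(0, 0.01, months)

def best_lag(x, y, max_lag=6):
    """Return the lag in [0, max_lag] maximizing the correlation of x with y shifted by lag."""
    scores = {}
    for lag in range(max_lag + 1):
        if lag == 0:
            scores[lag] = np.corrcoef(x, y)[0, 1]
        else:
            scores[lag] = np.corrcoef(x[:-lag], y[lag:])[0, 1]
    return max(scores, key=scores.get), scores

lag, scores = best_lag(rain, ndvi)
print(lag)   # expected to recover a lag close to 3 on this synthetic series
```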

    Other factors – e.g., effects of human activity – cannot be directly measured, and must thus be derived from primary sources such as images. Furthermore, there is a wide range of user profiles, with many kinds of analysis and summarization needs.

    A first prototype is available; it concentrates on spatio-temporal analysis of image series, producing graphs that show vegetation evolution in arbitrary areas. Present data management efforts include analysis of time-series co-evolution, visualization, annotation, and provenance management.

    The presentation will concentrate on handling heterogeneity, time-series co-evolution, and implementation difficulties. The same problems are cropping up in other projects, in biodiversity and health care.

  • Chair: Marty Humphrey

    From Terabytes to Petabytes: Towards Truly Astronomical Data Sizes

    Ani Thakar, The Johns Hopkins University

    The Sloan Digital Sky Survey (SDSS) has been serving a multi-Terabyte catalog dataset to the astronomical community for a few years now. By the beginning of the next decade, the Large Synoptic Survey Telescope (LSST) will be acquiring data at the rate of one SDSS every 3-4 nights and serving a Petabyte-scale dataset to the community by 2015 or so.

    I will discuss the lessons learned from SDSS and how they will guide the LSST data management design. In particular, I will highlight the developments at the Johns Hopkins University (JHU) in online database access, asynchronous query execution, data partitioning, and Web services as the pillars upon which petascale data access will be built. I will treat these topics in the larger context of the international Virtual Observatory (VO) effort that seeks to bring these technologies together so that all astronomical data is federated and accessible in an efficient manner.

    JHU is a major participant in the three projects that I will discuss – SDSS, LSST and the VO – and is also building a 100-Terabyte data analysis facility to analyze data, one Terabyte at a time, from large-scale turbulence simulations.


    A Web Service Approach to Knowledge Discovery in Astronomy

    Andrew Connolly, University of Pittsburgh

    Large scale astronomical surveys are providing a panchromatic view of the local and distant universe covering wavelengths from X-rays to radio. With the development of the National Virtual Observatory (NVO) that federates these disparate data sets, astronomers can now query the sky across many decades of the electromagnetic spectrum.

    The challenges now faced by astronomy are not just how to organize and efficiently distribute these data, but how to make use of these resources to enable a new era of scientific discovery. I will talk here about steps to integrate efficient machine learning techniques into the NVO to facilitate the analysis and visualization of large data sets. I will focus on a web-service-based framework that enables users to upload raw imaging data and receive back astrometrically and photometrically calibrated images and source catalogs, together with cross-matches of these sources against the full spectrum of catalogs available through the NVO. I will show how we integrate data mining tools into this web-service framework to automatically identify and classify unusual sources, either from the resulting catalogs (e.g., using mixture models for density estimation) or directly from the images (e.g., by subtracting images observed at an earlier epoch).
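
    As a toy example of the density-estimation route (synthetic data, not the NVO services themselves), a mixture model can flag catalog entries that fall in low-density regions of a colour-colour space:

```python
# Toy sketch: fit a Gaussian mixture to synthetic two-colour catalog data and
# flag the lowest-density sources as candidate "unusual" objects.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
normal = rng.normal(loc=[0.5, 0.2], scale=0.1, size=(1000, 2))   # bulk population
odd = rng.uniform(low=-1.0, high=2.0, size=(10, 2))              # scattered outliers
colours = np.vstack([normal, odd])

gmm = GaussianMixture(n_components=3, random_state=0).fit(colours)
log_density = gmm.score_samples(colours)
threshold = np.percentile(log_density, 1)            # flag the bottom 1 percent
candidates = np.where(log_density < threshold)[0]
print(candidates)   # indices of low-density (potentially unusual) sources
```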

    As I will show, these tools are accessible to professional and amateur astronomers alike, and have already been used to detect supernovae within images and to identify very high redshift galaxies.


    The Astronomical Cross Database Challenge

    Maria A. Nieto-Santisteban, The Johns Hopkins University

    Astronomy, like many other eSciences, has a strong need for efficient database cross-reference procedures. Finding neighboring sources, within either the same catalog or across different catalogs, is one of the most requested capabilities. Although there are many astronomical tools capable of finding sources near other sources, they cannot handle the volume of objects that current and future astronomical surveys such as the Sloan Digital Sky Survey or the Large Synoptic Survey Telescope are generating. Since we speak of terabytes of data and billions of records, using traditional file systems to store, access, and search the data is no longer an option.

    Astronomy is finally moving to Database Management Systems (DBMS). Even though DBMS are suited to efficient data manipulation and fast access, managing such a large volume requires special parallelization, searching and indexing algorithms. We have developed a zoning algorithm that not only speeds up all-to-all neighbor searches using only relational algebra, but also partitions and distributes the workload across computers efficiently. Using this technique we can bring the problem of cross-identifying two catalogs of one billion objects each down to about an hour.
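
    A highly simplified sketch of the zoning idea, shown only as an illustration of the principle rather than the production algorithm: sources are bucketed into fixed-height declination zones, so a neighbor search for any source needs to compare it only against objects in its own and adjacent zones:

```python
# Simplified sketch of zone-based cross-matching: partition by declination
# zones so candidate pairs are limited to adjacent zones. This illustrates the
# idea only; it is not the production implementation described in the talk.
from collections import defaultdict
from math import cos, radians

ZONE_HEIGHT = 0.5 / 60.0        # zone height in degrees (30 arcsec, illustrative)
RADIUS = 10.0 / 3600.0          # match radius in degrees (10 arcsec)

def zone_of(dec):
    return int((dec + 90.0) / ZONE_HEIGHT)

def build_zones(catalog):
    zones = defaultdict(list)   # zone id -> list of (ra, dec, obj_id)
    for obj_id, ra, dec in catalog:
        zones[zone_of(dec)].append((ra, dec, obj_id))
    return zones

def crossmatch(cat_a, cat_b):
    zones_b = build_zones(cat_b)
    matches = []
    for obj_id, ra, dec in cat_a:
        z = zone_of(dec)
        for zz in (z - 1, z, z + 1):               # own and adjacent zones only
            for ra_b, dec_b, id_b in zones_b.get(zz, []):
                dra = (ra - ra_b) * cos(radians(dec))
                ddec = dec - dec_b
                if dra * dra + ddec * ddec <= RADIUS * RADIUS:
                    matches.append((obj_id, id_b))
    return matches

cat_a = [("a1", 180.0000, 2.0000)]
cat_b = [("b1", 180.0010, 2.0005), ("b2", 10.0, -30.0)]
print(crossmatch(cat_a, cat_b))   # [('a1', 'b1')]
```

    Because the zone height is chosen larger than the match radius, only adjacent zones ever need to be examined, which is what lets the search be expressed as ordinary relational joins and partitioned across machines by zone.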

    The challenge remains, though, in the many-to-many cross-matching process, where ‘many’ means billions of records per catalog and the number of catalogs is in the tens. In this talk we will present our experience working with very large astronomical catalogs and describe a framework that would allow for large-scale data access and cross-matching.


    Proteus RTI: A Simple Framework for On-The-Fly Integration of Biomedical Web Services

    Shahram Ghandeharizadeh, USC

    On-the-fly integration refers to scenarios where a scientist wants to integrate a Web Service immediately after discovering it. The challenge is to significantly reduce the required information technology skills, empowering the scientist to focus on the domain-specific problem. Proteus RTI is a first step towards addressing this challenge. It includes a simple interface that enables a scientist to register Web Services, compose them into plans, execute a plan to obtain results, and share plans with other members of their community.

    This presentation provides an overview of Proteus RTI and its components. We present several animations showcasing Proteus RTI with a variety of scientific Web Services. In one example, we compose a plan that invokes different operations of the NCBI Web Service to retrieve information pertaining to a keyword, such as asthma, from all NCBI databases. In a more complex example, we compose KEGG’s FIND operation with NCBI’s eSearch to retrieve all matching molecules, with their definitions and corresponding source-specific IDs.
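
    The composition pattern can be pictured with a rough sketch written against the present-day NCBI E-utilities REST interface, purely for illustration (the services discussed in the talk were accessed through their own APIs, not this code): the output of one service call becomes the input of the next:

```python
# Rough sketch of two-step service composition using the current NCBI
# E-utilities REST interface for illustration only.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch(db, term, retmax=5):
    """Step 1: search an NCBI database for a keyword and return record IDs."""
    url = f"{EUTILS}/esearch.fcgi?" + urllib.parse.urlencode(
        {"db": db, "term": term, "retmax": retmax})
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    return [e.text for e in tree.findall(".//IdList/Id")]

def esummary(db, ids):
    """Step 2: feed the IDs from step 1 into a second service call."""
    url = f"{EUTILS}/esummary.fcgi?" + urllib.parse.urlencode(
        {"db": db, "id": ",".join(ids)})
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()

ids = esearch("pubmed", "asthma")
print(esummary("pubmed", ids)[:200])   # output of one service becomes input to the next
```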

    Animations showing examples discussed in this short abstract (along with others) are available from http://proteusrti.usc.edu. Proteus RTI is available for download from this URL.