October 21, 2007 - October 23, 2007

eScience Workshop 2007

Location: Chapel Hill, North Carolina, US

Abstracts for Sunday, October 21, 2007

  • Keynote Presentation
  • Kelvin K. Droegemeier, School of Meteorology, University of Oklahoma

    Those who have experienced the devastation of a tornado, the raging waters of a flash flood, or the paralyzing impacts of lake-effect snows understand that mesoscale weather develops rapidly, often with considerable uncertainty with regard to location. Such weather is also locally intense and frequently influenced by processes on both larger and smaller scales. Ironically, few of the technologies used to observe the atmosphere, predict its evolution, and compute, transmit, or store information about it operate in a manner that accommodates the dynamic behavior of mesoscale weather. Radars do not adaptively scan specific regions of thunderstorms; numerical models are run largely on fixed time schedules in fixed configurations; and cyber infrastructure does not allow meteorological tools to run on demand, change configurations in response to the weather, or provide the fault tolerance needed for rapid reconfiguration. As a result, today’s weather technology is highly constrained and far from optimal when applied to any particular situation.

    This presentation describes a major paradigm shift now underway in the field of meteorology — away from today’s environment in which remote sensing systems, atmospheric prediction models, and hazardous weather detection systems operate in fixed configurations, and on fixed schedules largely independent of weather — to one in which they can change their configuration dynamically in response to the evolving weather. This transformation involves the creation of adaptive radars, Grid-enabled analysis and forecast systems, and associated cyber infrastructure that operate automatically on demand. In addition to describing the research and technology development being performed to establish this capability within a service oriented architecture, I discuss the associated economic and societal implications of dynamically adaptive weather sensing, analysis and prediction systems.

  • Support for Large-Scale Science: Grumman Auditorium

  • Dan Werthimer, University of California, Berkeley

    I will discuss the possibility of life in the universe, SETI@home, public participation distributed computing, and real-time petaop/sec FPGA-based supercomputing. Next-generation radio telescopes, such as the Allen Telescope Array and the Square Kilometer Array, are composed of hundreds to thousands of smaller telescopes; these large arrays require peta-ops per second of real-time processing. I will describe these telescopes, and the motivation for peta-op supercomputing. Such computational requirements are far beyond the capabilities of general-purpose computing clusters (e.g., Beowulf clusters) or supercomputers. Traditionally, instrumentation for radio telescope arrays has been built from highly specialized custom chips that take ten years to design and debug; such chips are very expensive, inflexible, and usually out of date before they work well. I’ll present some of the new software tools that make it relatively easy to program FPGAs, as well as some general-purpose open-source hardware and software modules we’ve developed to build a variety of real-time petaop/second supercomputers. More information is available at http://casper.berkeley.edu and http://seti.berkeley.edu.

  • Geoffrey Fox, Xiaohong Qiu, Huapeng Yuan, Marlon Pierce, David Wild, Rajarshi Guha, Indiana University; Georgio Chrysanthakopoulos, Henrik Frystyk Nielsen, Microsoft

    The eScience paradigm for chemical informatics links computational chemistry simulations, large archival databases such as PubChem, and the rapidly growing volumes of data from high-throughput devices. We have built such a Grid for scientific discovery at the interface of biology and chemistry (drug discovery). We expect eScience to require the integration of both distributed and parallel technologies; Intel has highlighted the potential importance of data mining applications as synergistic with both the data deluge and the growing power of multicore systems. Our parallel programming model decomposes problems into services, as in traditional eScience approaches, and then uses optimized parallel algorithms for the services. This is consistent with the split between efficiency and productivity layers in the Berkeley approach to parallel computing. We implement the productivity layer with Grid workflows or Web 2.0 mashups over services that use, where needed, high-performance parallel algorithms developed by experts and packaged as a library of services for broad use. We discuss parallel clustering of chemical compounds from NIH PubChem. We chose an improved K-Means clustering that has scaling parallelism and uses annealing on the resolution of chemical property space to avoid local minima. We use the Microsoft Concurrency and Coordination Runtime (CCR), as it gives good performance at the MPI layer, and we use its coupling to the DSS service model, which is a natural platform for the service productivity layer. The parallel overhead consists of Windows thread scheduling, memory bandwidth limitations, and CCR synchronization overheads, and totals 10-15% (a speedup of 7 on an 8-core system) for a realistic PubChem application, with load imbalance from scheduling being the dominant effect.
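
    As a quick consistency check on those figures (a back-of-the-envelope calculation, not part of the abstract): with one common definition of parallel overhead f, speedup on p cores is S = p/(1+f), so the reported speedup of 7 on 8 cores implies

```latex
f = \frac{p}{S} - 1 = \frac{8}{7} - 1 \approx 0.14
```

    which falls within the stated 10-15% range.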

  • Ioan Raicu, Yong Zhao, Ian Foster, University of Chicago; Alex Szalay, The Johns Hopkins University

    Scientific and data-intensive applications often require exploratory analysis on large datasets, which is often carried out on large scale distributed resources where data locality is crucial to achieve high system throughput and performance. We propose a data diffusion approach that acquires resources for data analysis dynamically, schedules computations as close to data as possible, and replicates data in response to workloads. As demand increases, more resources are acquired and cached to allow faster response to subsequent requests; resources are released when demand drops. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on the application workloads and the performance characteristics of the underlying infrastructure. This data diffusion concept is reminiscent of cooperative Web-caching and peer-to-peer storage systems. Other data-aware scheduling approaches assume static or dedicated resources, which can be expensive and inefficient if load varies significantly. The challenge for our approach is that we need to co-allocate storage resources with computation resources in order to enable the efficient analysis of possibly terabytes of data without prior knowledge of the characteristics of application workloads. To explore the proposed data diffusion, we have developed Falkon, which provides dynamic acquisition and release of resources and the dispatch of analysis tasks to those resources. We have extended Falkon to allow the compute resources to cache data to local disks, and perform task dispatch via a data-aware scheduler. The integration of Falkon and the Swift parallel programming system provides us with access to a large number of applications from astronomy, astrophysics, medicine, and other domains, with varying datasets, workloads, and analysis codes.
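
    A minimal sketch of the data-aware scheduling idea described above, assuming a hypothetical dispatcher class and worker names (illustrative Python, not Falkon's actual interface): tasks go to a worker that already caches their input, otherwise to the least-loaded worker, which then caches the file.

```python
# Hypothetical data-aware dispatcher in the spirit of data diffusion: prefer
# workers that already cache a task's input file, fall back to the least-loaded
# worker (which then pulls and caches the file). Names are illustrative only.
from collections import defaultdict

class DataAwareDispatcher:
    def __init__(self, workers):
        self.cache = defaultdict(set)          # worker -> set of cached files
        self.load = {w: 0 for w in workers}    # worker -> number of queued tasks

    def dispatch(self, task_id, input_file):
        # Prefer workers that already hold the input (data locality).
        holders = [w for w, files in self.cache.items() if input_file in files]
        if holders:
            worker = min(holders, key=self.load.__getitem__)
        else:
            worker = min(self.load, key=self.load.__getitem__)
            self.cache[worker].add(input_file)  # worker pulls and caches the file
        self.load[worker] += 1
        return worker

if __name__ == "__main__":
    d = DataAwareDispatcher(["w1", "w2", "w3"])
    print(d.dispatch("t1", "sky_patch_042.fits"))  # cold: goes to least-loaded worker
    print(d.dispatch("t2", "sky_patch_042.fits"))  # warm: reuses the cached copy
```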

  • Computational Modeling in the Life Sciences: Redbud A+B

  • Radhika Nagpal, Harvard University

    In multi-cellular tissues, simple cell behaviors can lead to complex global properties, from wound repair to rapid change in morphology. Understanding the relationship between local cell decisions and system-level behaviors is critical for many reasons: to form and validate cell behavior hypotheses, to predict the effect of aberrant cell behaviors, and to provide new (and sometimes counter-intuitive) insights into tissue behavior. In this talk, I will present our recent work on an abstract model of cell division in the developing fruit fly wing that led to a novel insight into the robustness of proliferating epithelial tissues [1]. Based on time-lapse movies of early wing development, we developed a simple logical model, represented by a first-order Markov chain, of the cell division process and its impact on the graph topology of the tissue network. This mathematical model led to an unexpected prediction: that the stochastic process of cell division will drive the proliferating tissue, as a whole, to adopt a fixed distribution of polygonal cell shapes, regardless of initial tissue topology. This predicted distribution is strongly observed in diverse organisms, not only fruit fly, but also hydra and frogs, suggesting that this may be a fundamental property of proliferating epithelial tissues. Epithelial tissues are ubiquitous throughout the animal kingdom and form many structures in the human body. This work suggests a simple emergent mechanism for regulating cell shape and topology during rapid proliferation, and has many implications for multi-cellular development and disease that we are now investigating.

    [1] Gibson, Patel, Nagpal, Perrimon, “The Emergence of Geometric Order in Proliferating Metazoan Epithelia”, Nature 442(7106):1038-41, Aug 31, 2006.
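
    A small illustration of the kind of first-order Markov chain argument described above (the transition matrix below is hypothetical, not the one derived in [1]): an irreducible, aperiodic chain over polygon classes drives different initial shape distributions to the same fixed distribution.

```python
# Illustrative sketch (not the transition matrix from Gibson et al. 2006):
# a first-order Markov chain over polygon classes (4- to 9-sided cells).
# Repeatedly applying the chain drives any initial shape distribution to the
# same stationary distribution, mirroring the paper's qualitative prediction.
import numpy as np

classes = [4, 5, 6, 7, 8, 9]
# Hypothetical row-stochastic matrix P[i, j] = Pr(next class j | current class i).
P = np.array([
    [0.10, 0.50, 0.30, 0.08, 0.02, 0.00],
    [0.05, 0.30, 0.45, 0.15, 0.04, 0.01],
    [0.02, 0.20, 0.45, 0.25, 0.06, 0.02],
    [0.01, 0.12, 0.40, 0.32, 0.12, 0.03],
    [0.01, 0.08, 0.30, 0.35, 0.20, 0.06],
    [0.00, 0.05, 0.25, 0.35, 0.25, 0.10],
])

def evolve(p0, steps=50):
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        p = p @ P
    return p

uniform = np.full(6, 1 / 6)                 # one possible initial topology
all_hex = np.array([0, 0, 1, 0, 0, 0.0])    # a perfectly hexagonal sheet
print(np.round(evolve(uniform), 3))
print(np.round(evolve(all_hex), 3))          # converges to the same distribution
```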

  • Andrew Phillips, Microsoft Research

    This talk presents a programming language for designing and simulating computer models of biological systems. The language is based on a computational formalism known as the pi-calculus, and the simulation algorithm is based on standard kinetic theory of physical chemistry. The language will first be presented using a simple graphical notation, which will subsequently be used to model and simulate a couple of intriguing biological systems, namely a genetic oscillator and a key pathway of the immune system. One of the benefits of the language is its scalability: large models of biological systems can be programmed from simple components in a modular fashion. The first system is a genetic oscillator built from simple computational elements. We use probabilistic analysis of our model to characterize the parameter space in which regular oscillations are obtained, and validate our calculations through simulation. We also explore different levels of abstraction for our model by exploiting the modularity of our approach, which allows increasing levels of detail to be included without changing the overall structure of the program. Our design principles could in future be used to engineer robust genetic oscillators in living cells. The second system is an executable computational model of MHC Class I Antigen Presentation. This is a key pathway of our immune system, which is able to detect the presence of potentially harmful intruders in our cells, such as viruses or bacteria. By simulating and analyzing this model, we gain some insight into how the pathway functions and offer an explanation for some of the variability present in the human immune system.
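
    Stochastic simulation of chemical kinetics of this kind is typically Gillespie-style; the sketch below is a toy single-gene production/degradation system in Python, not the pi-calculus models or the simulator discussed in the talk.

```python
# Minimal Gillespie-style stochastic simulation (illustrative only; the talk's
# simulator is driven by a stochastic pi-calculus model, not this toy system).
# Reactions: gene -> gene + protein (rate k_p), protein -> 0 (rate k_d * P).
import random

def gillespie(k_p=10.0, k_d=0.1, t_end=100.0):
    t, protein = 0.0, 0
    trace = [(t, protein)]
    while t < t_end:
        a1 = k_p                 # production propensity
        a2 = k_d * protein       # degradation propensity
        a0 = a1 + a2
        t += random.expovariate(a0)          # time to next reaction
        if random.random() * a0 < a1:        # choose which reaction fires
            protein += 1
        else:
            protein -= 1
        trace.append((t, protein))
    return trace

if __name__ == "__main__":
    final_t, final_p = gillespie()[-1]
    print(f"t = {final_t:.1f}, protein copies = {final_p}")  # fluctuates near k_p/k_d = 100
```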

  • Matteo Cavaliere, Radu Mardare, Sean Sedwards, MSR-UNITN CoSBi

    We present a modeling framework and computational paradigm called Colonies of Synchronizing Agents (CSAs), which abstracts the intracellular and intercellular mechanisms of biological tissues. Our motivation is to describe complex biological systems in a formal way, such that it is possible to model, analyze and predict their properties. From these analyses we may also gain insight to inform the creation of new computational devices and techniques. The core model is based on a multiset of agents (which can be thought of as populations of cells or molecules) in a common environment. Each agent has contents in the form of a multiset of atomic objects (i.e., chemicals or the properties of individual molecules), which are updated by rewriting rules. Hence, the model has a certain elegant simplicity, being essentially a multiset of multisets, acted upon by multiset rewriting rules. Rules may act on individual agents (thus representing intracellular action) or may synchronize the contents of pairs of agents (representing intercellular action). An extended model includes Euclidean space and rules to facilitate the movement of agents within the space. The extended model also includes rules to control agent division and agent death, thus providing a full repertoire of common cell and molecule primitive behaviors. The formal basis of the model allows us to investigate static and dynamic properties of CSAs using tools from computer science, e.g., from automata theory, logic and game theory. In this way we hope to model and investigate complex biological phenomena, such as those found in the immune system and in morphogenesis. In particular, we are interested in robust pattern formation, which is the basis of complexity in nature.
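
    A toy rendering of the core model, assuming nothing beyond what is described above (agents as multisets of objects, with intracellular and pairwise synchronizing rewriting rules); the rule format and function names are illustrative, not the authors' formalism.

```python
# Toy rendering of the CSA idea (not the authors' formalism): agents are
# multisets of atomic objects; an intracellular rule rewrites one agent's
# contents, an intercellular rule synchronizes a pair of agents.
from collections import Counter

def apply_internal(agent, lhs, rhs):
    """Rewrite lhs -> rhs inside one agent if lhs is contained in it."""
    if all(agent[o] >= n for o, n in lhs.items()):
        agent.subtract(lhs)
        agent.update(rhs)
        return True
    return False

def apply_sync(a, b, lhs_a, lhs_b, rhs_a, rhs_b):
    """Synchronized rewrite across a pair of agents (intercellular action)."""
    if all(a[o] >= n for o, n in lhs_a.items()) and all(b[o] >= n for o, n in lhs_b.items()):
        a.subtract(lhs_a); a.update(rhs_a)
        b.subtract(lhs_b); b.update(rhs_b)
        return True
    return False

cell1 = Counter({"signal": 2, "receptor": 1})
cell2 = Counter({"receptor": 1})
apply_internal(cell1, Counter(signal=1), Counter(messenger=1))      # intracellular step
apply_sync(cell1, cell2, Counter(signal=1), Counter(receptor=1),
           Counter(), Counter(active_receptor=1))                   # intercellular step
print(cell1, cell2)
```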

  • Scholarly Publication: Bellflower A+B

  • Jane Hunter, Kwok Cheung, University of Queensland

    Scientists are under increasing pressure to publish their raw data, derived data and methodology, along with their traditional scholarly publications, in open archives. The goal is to enable the verification and repeatability of results by other scientists in the field and hence to encourage the re-use of research data and a reduction in the duplication of research effort. Many scientists and scientific communities would be willing to do this if they had simple, efficient tools and the underlying infrastructure to streamline the process. Currently there are relatively few tools to support these new forms of scientific publishing, and those that do exist are not integrated with existing repository infrastructure. In this presentation, I will describe a system that we have been developing to streamline the process of authoring, publishing and sharing compound scientific objects with built-in provenance information. SCOPE is a graphical authoring tool that assists scientists with the tasks of: creating a compound scientific publication package (as an OAI-ORE named graph); attaching a brief metadata description and a Creative Commons license to the object; and publishing it to a Fedora repository for discovery, re-use, peer-review or e-learning. The SCOPE system comprises:

    • A Provenance Explorer window which uses RDF graphs generated from laboratory and scientific workflow systems (e.g., myTea, Kepler, Taverna) to visualize provenance.
    • A publishing window into which the author drags and drops nodes from the provenance explorer.

    External objects (retrieved via an IE web browser) may also be dragged and dropped into the window to form new nodes. Relationships between nodes can either be dynamically inferred (using the Algernon inference engine) or (in the case of external objects) manually defined by the author who draws and tags new links.

  • Carl Lagoze, Cornell University; Herbert Van de Sompel, Los Alamos National Laboratory

    Information objects used in eScience are frequently compound in nature, consisting of a variety of media resources with rich inter-relationships. Infrastructure to support the expression and exchange of information about these compound information objects is essential for the deployment of eScience across disciplines. The Open Archives Initiative – Object Reuse and Exchange (OAI-ORE) project is developing standards to facilitate discovery, use and re-use of these new types of compound scholarly resources by networked services and applications. OAI-ORE is funded by the Andrew W. Mellon Foundation, Microsoft, and the NSF. In this talk, we will introduce a preliminary version of the core OAI-ORE data models and protocols. These models are based on the notion of bounded, named aggregations of web resources in which the resources and their relationships are typed. The protocols cover discovering such aggregations, retrieving their representations, and updating their constituency. This allows, for example, other web resources, including other compound objects, to reference these aggregations to express citation, annotation, provenance, and other relationships vital to the scholarly process. We also introduce use cases illustrating possible applications of the OAI-ORE work. The eChemistry project, funded by Microsoft, will deploy an OAI-ORE-based infrastructure for the exchange of molecular-centric information and the linkage of that information to researchers, experiments, publications, etc. A digital preservation prototype illustrates how the Web-centric approach of OAI-ORE could empower the Internet Archive to readily archive compound information objects.
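
    A rough sketch of what a typed aggregation of web resources might look like, expressed with rdflib; the URIs, the choice of Dublin Core term, and the exact ORE vocabulary usage are assumptions for illustration, not the finalized OAI-ORE model.

```python
# Sketch of a named aggregation of web resources with typed relationships,
# in the spirit of the OAI-ORE data model described above. The URIs and the
# exact terms are illustrative assumptions, not the finalized ORE spec.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
agg = URIRef("http://example.org/aggregations/experiment-42")
article = URIRef("http://example.org/papers/42.pdf")
dataset = URIRef("http://example.org/data/42.csv")

g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, ORE.aggregates, article))
g.add((agg, ORE.aggregates, dataset))
g.add((dataset, DCTERMS.isReferencedBy, article))   # typed relationship between parts

print(g.serialize(format="turtle"))
```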

  • Julius Lucks, Cornell University; Simeon Warner, Cornell Information Science; Thorsten Schwander, arXiv.org; Paul Ginsparg, Cornell University

    The Cornell University e-print arXiv is a document submission and retrieval system that is heavily used by the physics, mathematics and computer science communities. It has become the primary means of communicating cutting-edge manuscripts on current and ongoing research. The open-access arXiv e-print repository is available worldwide and presents no entry barriers to readers, thus facilitating scholarly communication. Manuscripts are often submitted to the arXiv before they are published by more traditional means. In some cases they may never be submitted or published elsewhere, and in others, arXiv-hosted manuscripts are used as the submission channel to traditional publishers such as the American Physical Society, and to newer forms of publication such as the Journal of High Energy Physics and overlay journals. The primary interface to the arXiv has been human-oriented HTML web pages. In this talk, we outline the design of a web-service interface to the arXiv, permitting programmatic access to e-print content and metadata. We discuss design considerations of the interface that facilitate new and creative use of the vast body of material on the arXiv by providing a low barrier to entry for application developers. We outline mock applications that will greatly benefit from this interface, including an alternative arXiv human interface that is designed to preserve contextual information when performing searches. We finish with an invitation to participate in the growing developer community surrounding the arXiv web-services interface.
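
    For flavor, a sketch of programmatic e-print access using arXiv's publicly documented Atom-based query endpoint (export.arxiv.org); the endpoint and parameters shown are an assumption for illustration and may differ from the interface described in this talk.

```python
# Illustration of programmatic e-print access in the spirit of the talk,
# using arXiv's public Atom-based query API. The interface discussed in the
# talk may differ from this publicly documented endpoint.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

query = urllib.parse.urlencode({
    "search_query": "cat:hep-th AND abs:holography",
    "start": 0,
    "max_results": 5,
})
with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{query}") as resp:
    feed = ET.fromstring(resp.read())

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in feed.findall("atom:entry", ns):
    print(entry.findtext("atom:id", namespaces=ns),
          entry.findtext("atom:title", namespaces=ns).strip())
```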

  • Sensors and Mapping: Grumman Auditorium

  • Andreas Terzis, Alex Szalay, Katalin Szlavecz, Johns Hopkins University

    It is possible today to build large wireless sensor networks (WSNs) for observing the natural environment. For example, a network currently under deployment by the authors can generate 1.7 GB of raw sensor measurements per year. However, scale is not the only challenge these datasets present. They are also incomplete, noisy and highly redundant. Measurements are taken at finite locations with finite frequency, potentially missing events of limited spatial and temporal scope. The sensors used in these networks are prone to errors and failures. Finally, collected data present the proverbial needle-in-the-haystack problem: scientists are interested in subtle signals superimposed on diurnal and seasonal cycles. While similar problems have been explored in other domains, the added challenge posed in the context of WSNs is that these problems have to be solved in an online and distributed way. Off-line processing of previously collected data is inadequate, since delivering uninteresting data is very expensive: such data consume precious resources until they are deposited to archival storage. Moreover, closed-loop sensor networks cannot afford to operate with incomplete or faulty data, since they make decisions (e.g., engaging actuators) that depend on previous measurements. We present statistical techniques, based on principal component analysis, for detecting and classifying novel events in environmental wireless sensor networks. These events are defined as deviations from underlying normal trends. Our techniques can be used to dynamically adjust the network’s behavior and to detect faulty sensors. The proposed mechanisms are lightweight enough to be implemented on existing motes. We argue that our approach is the first step towards sensor networks that can discover the underlying structures of the phenomena they monitor.
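
    A minimal sketch of PCA-based novelty detection of the kind described above (illustrative numpy, not the authors' in-network implementation): fit principal components on normal data, then flag samples whose residual energy outside the retained subspace exceeds a threshold.

```python
# Sketch of PCA-based novelty detection for sensor data (illustrative, not the
# authors' implementation): fit principal components on "normal" training
# samples, then flag samples whose residual energy outside the retained
# subspace exceeds a threshold.
import numpy as np

def fit_pca(X, k):
    """Return the mean and top-k principal directions of the rows of X."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def residual_energy(x, mu, components):
    """Squared norm of the part of x not explained by the retained subspace."""
    centered = x - mu
    projection = components.T @ (components @ centered)
    return float(np.sum((centered - projection) ** 2))

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # synthetic training data
mu, comps = fit_pca(normal, k=3)
threshold = np.percentile(
    [residual_energy(row, mu, comps) for row in normal], 99)

new_sample = normal[0] + 8.0                                  # a strongly deviated reading
print(residual_energy(new_sample, mu, comps) > threshold)     # likely flagged as novel
```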

  • Jeff Gehlhausen, Stephanie Puchalski, Mehmet Dalkilic, Claudia Johnson, Erika Elswick, Indiana University

    A growing debate about global warming has forced reexamination of the fossil record in novel ways. We examine the chiton, approaching the problem by creating a visualization and mining system for the organism, focused on where specimens were discovered. Our visualization tool makes use of a novel Virtual Earth clustering technique, originally published on viavirtualearth.com, that efficiently groups items based on the zoom level in Virtual Earth. As the mouse is used to zoom into the canvas of Virtual Earth, more detailed plot data points are shown, and as the mouse is used to zoom out, the data points are aggregated into clusters so that less data is plotted on the map. Since the clusters are grouped by location and zoom level, data is simply being compacted in order to improve performance. Much of the application’s functionality is derived from JavaScript MouseOver functionality. As different plots are placed on the map according to the different phylogenetic filters, a MouseOver event presents the user with a popup box displaying the name and ID of the chiton. The popup also shows how many records are contained in the cluster. There may be many records at a point; the next and previous buttons on the popup box allow the user to cycle through the different records. Additionally, more detailed time and phylogenetic information is presented below the map. The data points are placed on the map according to different phylogenetic constraints; currently the user can select Order, Suborder, and Family taxonomic data regarding the chiton dataset. The application uses the ASP.NET AJAX Toolkit controls to query the server and retrieve the correct taxonomic data for its parent in the phylogenetic hierarchy after an event is fired by a selection in the drop-down list.
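
    A simplified sketch of zoom-dependent clustering in the spirit of the technique described above (the grid rule and data are illustrative, not the published Virtual Earth implementation): bin points into a grid whose cell size shrinks as the zoom level grows.

```python
# Sketch of zoom-dependent point clustering: points are binned into a grid
# whose cell size shrinks as the zoom level grows, so zooming out aggregates
# nearby specimens into single map markers. Illustrative only.
from collections import defaultdict

def cluster_for_zoom(points, zoom):
    """points: list of (lat, lon, record_id); returns record ids grouped per grid cell."""
    cell_deg = 360.0 / (2 ** zoom)          # grid cell size in degrees
    clusters = defaultdict(list)
    for lat, lon, record_id in points:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        clusters[key].append(record_id)
    return clusters

chitons = [(35.9, -75.6, "chiton-001"), (36.2, -75.1, "chiton-002"),
           (21.3, -157.8, "chiton-003")]
print(len(cluster_for_zoom(chitons, zoom=3)))    # coarse zoom: 2 aggregated markers
print(len(cluster_for_zoom(chitons, zoom=12)))   # fine zoom: 3 individual markers
```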

  • Dennis Fatland, Microsoft; Matt Heavner, Eran Hood, Cathy Connor, University of Alaska Southeast

    The South East Alaska MOnitoring Network for Science, Telecommunications, Education, and Research is a NASA-sponsored smart sensor web project designed to support collaborative environmental science with near-real-time recovery of large volumes of environmental data. The Year One geographic focus is the Lemon Creek watershed near Juneau, Alaska, with expansion planned in subsequent years up into the Juneau Icefield and into the coastal marine environment of the Alexander Archipelago and the Tongass National Forest. Implementation is motivated by problems in hydrology and cryosphere science as well as by gaps in the relationship between science, technology, and education. We describe initial results from 2007, the underlying system architecture, and the project's initiation of inquiry-driven classroom learning.

  • Databases: Redbud A+B

  • Qi Sun, Lalit Ponnala, Cornell University

    Managing data for high-throughput genomics studies requires researchers to deal with challenges including the integration of heterogeneous data sets, tools for data access, and confidentiality. We have been working with researchers in Dr. Ron Crystal’s group at Weill Cornell Medical School to develop a data processing pipeline for their COPD project (Chronic Obstructive Pulmonary Disease, the 4th leading cause of death in the United States). Microsoft SQL Server 2005 was used as the database engine for this project. We developed a schema that can accommodate the heterogeneous multimedia clinical data and the high-throughput genotyping data from multiple platforms. The built-in SQL Server encryption functions were used for storing sensitive patient information. On the client side, users can enter and retrieve data through VSTO add-ins for Excel, as most researchers are already familiar with Excel. Our experience showed that the SOAP web service and the Excel-based client applications can be a versatile solution for data integration in high-throughput genomics and proteomics projects.

  • Robin Gutell, Weijia Xu, University of Texas at Austin; Stuart Ozer, Microsoft

    Comparative studies of RNA sequences can decipher the structure, function and evolution of cellular components. The tremendous increase in available sequences and related biological information creates opportunities to improve the accuracy and detail of these studies while presenting new computational challenges for performance and scalability. To fully utilize this large increase in knowledge, the information must be organized for efficient retrieval and integrated for multi-dimensional analysis. With this, biologists are able to invent new comparative sequence analysis protocols that will yield new and different structural and functional information. There is therefore a constant need to reinvent existing turnkey computational solutions to accommodate the increasing volume of data and new types of information. Managing sequences in a relational database provides an effective, scalable method to access large volumes of data of different types through the mature machinery of the database, and a simplified programming interface for analyzing stored information through SQL. Based on Microsoft SQL Server, we have designed and implemented the RNA Comparative Analysis Database (rCAD), which supports comparative analysis of RNA sequence and structure and unites, for the first time in a single environment, the multiple dimensions of information necessary for alignment viewing, sequence metadata, structural annotations, structure prediction studies, structural statistics of different motifs, and phylogenetic analysis. The system provides a queryable environment that supports efficient updates and rich analytics. We will show how the performance and scalability of basic analysis tasks, such as covariation analysis, can be improved using rCAD. We will also demonstrate the flexibility of using rCAD to build SQL solutions for innovative and complicated analysis problems.
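
    For concreteness, a small sketch of the covariation statistic commonly used in comparative RNA analysis, mutual information between alignment columns (illustrative Python; rCAD's SQL-based implementation is not shown here):

```python
# Mutual information between two alignment columns is high when bases co-vary,
# as in Watson-Crick paired positions. Toy data, illustrative implementation.
import math
from collections import Counter

def mutual_information(col_i, col_j):
    n = len(col_i)
    fi, fj = Counter(col_i), Counter(col_j)
    fij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in fij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi

# Toy alignment: columns 1 and 2 co-vary (always complementary), column 3 does not.
col1 = list("GGCCAAUU")
col2 = list("CCGGUUAA")
col3 = list("AGCUAGCU")
print(round(mutual_information(col1, col2), 3))  # high: 2.0 bits
print(round(mutual_information(col1, col3), 3))  # lower: 1.0 bit
```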

  • Catharine Van Ingen, Microsoft Research; Deb Agarwal, Berkeley Water Center

    Many ecological and hydrological science collaborations are starting to use relational databases to collect, curate and archive their data. Data cube (OLAP) technology can be used in combination with a relational database to simply and efficiently compute aggregates of temporal, spatial and other data dimensions commonly used for data analysis. Over the last year, we’ve built a number of data cubes to support different ecological science goals. This talk explores the commonalities between these cubes and the differences along with some of the reasons for each. While we are hand crafting each cube today, our goal is a methodology that produces a family of cubes that can be used across a number of scientific investigations and related disciplines.
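
    As a loose analogy to what a data cube precomputes, the sketch below rolls a toy set of flux measurements up by site and by month in one step (pandas is used purely for illustration; it is not the OLAP stack discussed in the talk).

```python
# Toy roll-up over the temporal and spatial dimensions of some measurements,
# analogous to the aggregates a data cube would precompute. Data are invented.
import pandas as pd

obs = pd.DataFrame({
    "site": ["tower_A", "tower_A", "tower_B", "tower_B"],
    "time": pd.to_datetime(["2007-06-01", "2007-07-01", "2007-06-15", "2007-07-15"]),
    "flux": [2.1, 2.4, 1.8, 1.9],
})

# Mean flux per site per month: site becomes the row dimension, month the column dimension.
cube = (obs.groupby(["site", obs["time"].dt.to_period("M")])["flux"]
           .mean()
           .unstack())
print(cube)
```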

  • Algorithms: Bellflower A+B

  • Wei Wang, Fernando Pardo-Manuel de Villena, University of North Carolina

    With the realization that a new model population was needed to understand human diseases with complex etiologies, a genetically diverse reference population of mice called the Collaborative Cross (CC) was proposed. The CC is a large, novel panel of recombinant inbred (RI) lines that combines the genomes of genetically diverse founder strains to capture almost 90% of the known variation present in laboratory mice, and it is designed specifically for complex trait analysis. The CC becomes the focal point for cumulative and integrated data collection, giving rise to the detection of networks of functionally important relationships among diverse sets of biological and physiological phenotypes and a new view of the mammalian organism as a whole and interconnected system. The volume and diversity of the data offer unique challenges, whose solutions will advance both our understanding of the underlying biology and the tools for computational analysis. The data will eventually contain high-density SNPs (single nucleotide polymorphisms), or even whole genome sequences, for hundreds of CC lines, along with millions of phenotypic measurements (molecular and physiological) and other derived variables. New data mining and knowledge discovery techniques are needed for efficient and comprehensive analysis. In collaboration with geneticists, we are developing novel and scalable data management and computational techniques to enable high-throughput genetic network analysis, real-time genome-wide exploratory analysis, and interactive visualization. The methods are designed to support instant access and computation for any user-specified regions and enable fast and accurate correlation calculation and retrieval of loci with high linkage disequilibrium. The outcome is a fast SNP query engine that allows for large permutation evaluation and association tests and interactive visualization.
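
    One of the core computations such a SNP query engine must perform at scale is pairwise linkage disequilibrium; a minimal sketch with hypothetical haplotype data (not the authors' code):

```python
# Pairwise linkage disequilibrium (r^2) between two biallelic SNPs, with
# haplotypes coded 0/1. Illustrative only; data are invented.
import numpy as np

def ld_r_squared(snp_a, snp_b):
    """r^2 between two biallelic SNPs given 0/1 haplotype vectors."""
    a, b = np.asarray(snp_a, float), np.asarray(snp_b, float)
    p_a, p_b = a.mean(), b.mean()
    d = (a * b).mean() - p_a * p_b          # D = p_AB - p_A * p_B
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

snp1 = [0, 0, 1, 1, 0, 1, 1, 0]
snp2 = [0, 0, 1, 1, 0, 1, 0, 0]              # mostly co-inherited with snp1
print(round(ld_r_squared(snp1, snp2), 3))    # 0.6
```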

  • Alexander Gray, Georgia Institute of Technology

    The FASTlab develops novel algorithms and data structures for making the analysis of massive scientific datasets possible, with state-of-the-art machine learning and statistical methods and whatever other operations on data lie on our scientist collaborators’ critical paths. I’ll describe two of our latest projects.

    1. A central long-standing problem in protein folding is the determination of an approximating energy function which is both tractable and accurate enough to achieve realistic folds. Working with Jeff Skolnick, one of the leaders in the field, we are approaching the problem via machine learning rather than chemical theory alone. Using a customized machine learning method and fast algorithms allowing the use of massive datasets of protein conformations, we appear to be outperforming state-of-the-art hand-built energy functions in preliminary qualitative results, and we believe we have only begun exploring this new paradigm.

    2. Starting next summer, the Large Hadron Collider will generate 40 million data points per second, continuously for 15 years. Since most of this data must be discarded, much activity will surround the decisions about which events to keep, a.k.a. trigger tuning. We are working with the ATLAS detector team on new data structures for fast high-dimensional range querying, which will allow interactive trigger tuning on the physicist’s desktop, and ultimately a scheme for automatic trigger tuning. We are excited about the possibility of computer science playing a critical role in the world’s largest scientific experiment.
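
    A rough illustration of range-style event selection with a space-partitioning tree (a radius query via SciPy's kd-tree stands in here for the group's own high-dimensional range-query structures, which are not shown):

```python
# Select events whose feature vectors fall near a query point using a kd-tree.
# A radius query stands in for the rectangular range queries used in trigger
# tuning; data and dimensions are invented for illustration.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
events = rng.uniform(size=(100_000, 8))      # 100k synthetic events, 8 features each

tree = cKDTree(events)                       # build the index once
query = np.full(8, 0.5)
hits = tree.query_ball_point(query, r=0.3)   # indices of events inside the ball
print(len(hits), "events selected for the trigger")
```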

  • Anne Trefethen, Daniel Goodman, Stef Salvini; Oxford University

    Matlab has become one of the essential computational tools for many engineers and scientists. The environment enables quick development of applications with integrated visualization and many toolboxes aimed at specific algorithmic or application areas. One concern for Matlab users has become the issues of tackling larger-scale problems, and utilizing multiple processors –be that clusters of processors or a system with multicore processors. In response to these user requirements the MathWorks have also developed a toolbox for the support of distributed Matlab enabling applications that are suited to embarrassingly parallel or loosely coupled computations and this is increasingly supporting more fine-grained applications. We will give an initial report on our use of this environment on a number of applications on the Microsoft CCS, considering the ease of integration with the system and providing a view of future tools and techniques for enhancing the existing system.