October 13, 2006 - October 15, 2006

eScience Workshop 2006

Location: Baltimore, Maryland, US

  • Chair: Catharine van Ingen

    Stream Scouts: A Tiered Smart Client System for Annotating the Land-Water Interface

    Piotr Parasiewicz & Chris Pal, University of Massachusetts at Amherst

    We have developed a smart client system that will facilitate ad-hoc classification and validation of hydraulic features on very recent, high-resolution aerial photography of streams and rivers. It is geared toward aspects relevant to the protection of both human uses (i.e., drinking water quality, hydropower, flood protection) and ecological status. Specifically, we extend our previous setup of small handheld Pocket PC devices used to take field measurements to form ad-hoc wireless networks and communicate with Tablet PC servers located nearby on a boat or on shore. A heavyweight desktop server in a distant location hosts the river-habitat database, runs complex aquatic habitat simulation models and management applications over the internet, and facilitates data exchange with the Tablet PC. This system supports semi-real-time simulations of aquatic habitats and the creation of a self-learning database, enabling the development of algorithms for classification of critical habitat features from aerial photographs.
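
    As a concrete illustration of the tiered client architecture described above, the Python sketch below shows the kind of annotation record a handheld field client might relay to a nearby Tablet PC server. The field names, the server URL, and the JSON-over-HTTP transport are illustrative assumptions, not the project's actual schema or protocol.

        # Hypothetical sketch: record fields, endpoint, and transport are assumptions.
        import json
        import urllib.request
        from dataclasses import dataclass, asdict

        @dataclass
        class FeatureAnnotation:
            site_id: str        # survey reach identifier
            lat: float          # latitude of the annotated feature
            lon: float          # longitude of the annotated feature
            feature_type: str   # e.g. "riffle", "pool", "undercut bank"
            depth_m: float      # measured water depth at the point
            observed_utc: str   # ISO 8601 timestamp from the handheld device

        def relay_to_tablet(annotation: FeatureAnnotation,
                            server_url: str = "http://192.168.0.10:8080/annotations") -> int:
            """Send one field annotation from the handheld to the nearby Tablet PC server."""
            payload = json.dumps(asdict(annotation)).encode("utf-8")
            req = urllib.request.Request(server_url, data=payload,
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req) as resp:
                return resp.getcode()  # caller can queue and retry when the ad-hoc network drops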

    This project is a natural extension of the TerraServer project, carrying methods from computational geography into the domains of computational hydrology and computational ecology by enabling the annotations necessary for simulating the dynamics of a watershed. Such a system is in high demand among environmental scientists and resource managers.


    River Basin Scale Water-Quality Modeling using the CUAHSI Hydrologic Information System

    Jonathan Goodall & Song Qian, Duke University

    The Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) is a partnership of over 100 universities in the United States. Informatics is a pillar of the CUAHSI vision, and for the past three years a team of hydrologists and computer scientists has been working together to prototype a Hydrologic Information System to support river-basin-scale hydrologic science and management.

    The Hydrologic Information System consists of (1) standard signatures for hydrologic data delivery web services, (2) a standard database schema for hydrologic observations, and (3) ontologies for relating water quality parameters collected and maintained by different federal, state, and local agencies. The eventual goal of this system is to make possible hydrologic assessments and models previously too complex for individual scientists to undertake. The Hydrologic Information System, therefore, is a means to an end: the cyberinfrastructure necessary to advance scientific understanding. To illustrate this, we will present a large-scale water quality model that utilizes various components of the Hydrologic Information System.

    This water quality model has been applied to major river basins in the United States (Chesapeake Bay, Mississippi-Missouri, etc.), but was previously limited to modeling long-term averages in nutrient loadings.

    With the Hydrologic Information System, it is now feasible to gather, integrate, and summarize the nation’s water quality records to support a temporally dynamic version of the model. This new version of the model improves our understanding of how landscape changes impact water quality and, just as important, provides evidence of the ultimate impact of water resources management decisions toward improving the nation’s water quality.
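
    As a rough sketch of the kind of temporally dynamic summarization the Hydrologic Information System makes feasible, the Python fragment below aggregates paired nutrient concentration and discharge observations into seasonal loads. The column names and unit constant are illustrative assumptions, not the actual model inputs.

        # Hedged sketch: seasonal nitrogen loads from paired concentration and discharge series.
        import pandas as pd

        def seasonal_loads(obs: pd.DataFrame) -> pd.DataFrame:
            """obs columns (assumed): site_id, datetime, no3_mg_per_l, discharge_m3_per_s."""
            obs = obs.copy()
            obs["datetime"] = pd.to_datetime(obs["datetime"])
            # instantaneous load in kg/day: mg/L * m3/s -> g/s, times 86400 s/day / 1000 = 86.4
            obs["load_kg_per_day"] = obs["no3_mg_per_l"] * obs["discharge_m3_per_s"] * 86.4
            obs["season"] = obs["datetime"].dt.to_period("Q")
            return (obs.groupby(["site_id", "season"])["load_kg_per_day"]
                       .mean()
                       .rename("mean_load_kg_per_day")
                       .reset_index())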


    Space-Time Series of MODIS Snow Cover Products for Hydrologic Science

    Jeff Dozier & James E. Frew, University of California Santa Barbara

    The Moderate-Resolution Imaging Spectroradiometer (MODIS) flies on two NASA/EOS satellites, each imaging most of the Earth every day, Terra in the morning and Aqua in the afternoon. MODIS has 36 spectral bands covering wavelengths from 0.4 to 14.4 µm: 2 at 250 m spatial resolution, 5 at 500 m, and 29 at 1 km. Using reflectance values from the 7 “land” bands with 250 or 500 m resolution, along with a 1 km cloud product, we estimate the fraction of each 500 m pixel that snow covers, along with the albedo (reflectance) of that snow. Such products are then used in hydrologic models in several mountainous basins. The daily products have glitches: sometimes the sensor cannot view the surface because of cloud cover, and even in the absence of clouds, an off-nadir view in a vegetated area “sees” less ground area than a nadir view. Therefore, we must use the daily time series in an intelligent way to improve the estimate of the measured snow properties for a particular day. We consider two scenarios: one is the “forecast” mode, whereby we use the past, but not the future, to estimate the snow-covered area and albedo on that day; the other is the “retrospective” mode, whereby in the summer after the snow is gone we reconstruct the history of the snow properties for that water year.

    This space-time interpolation presents both scientific and data management challenges. The scientific question is: how do we use our knowledge of viewing geometry, snow accumulation, and ablation, along with available ground data, to devise a scheme that is better than generic multidimensional interpolation? The data management challenge involves large three-dimensional objects, identification of erroneous data, and keeping track of the lineage of how a set of pixel values has been interpreted.
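
    The two interpolation modes can be illustrated with a minimal Python sketch on a single pixel's daily fractional snow-covered area. The simple persistence and time interpolation used here are placeholders for the scheme discussed above, which also draws on viewing geometry and ground data.

        # Illustrative sketch only: placeholder gap-filling for one pixel's daily series.
        import numpy as np
        import pandas as pd

        def forecast_mode(fsca: pd.Series) -> pd.Series:
            """Fill each obscured day using only past clear-sky views (persistence)."""
            return fsca.ffill()

        def retrospective_mode(fsca: pd.Series) -> pd.Series:
            """After the melt season, interpolate gaps using observations on both sides."""
            return fsca.interpolate(method="time", limit_direction="both")

        if __name__ == "__main__":
            days = pd.date_range("2006-03-01", periods=10, freq="D")
            fsca = pd.Series([0.9, np.nan, np.nan, 0.7, np.nan, 0.55, np.nan, np.nan, 0.3, 0.2],
                             index=days)  # NaN = cloud-obscured or poor viewing geometry
            print(forecast_mode(fsca).round(2).tolist())
            print(retrospective_mode(fsca).round(2).tolist())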


    Web Services for Unified Access to National Hydrologic Data Repositories

    I. Zaslavsky, D. Valentine, B. Jennings, UCSD; D. Maidment, University of Texas – Austin

    The CUAHSI hydrologic information system (HIS) is designed to be a multi-tier network of grid nodes for publishing, accessing, querying, and visualizing distributed hydrologic observation data for any location or region in the United States. The core of the system is a set of web services that provide uniform programmatic access to heterogeneous federal data repositories as well as to researcher-contributed observation datasets.

    The currently available second generation of services supports data and metadata discovery and retrieval from USGS NWIS (streamflow, groundwater, and water quality repositories), DAYMET daily observations, NASA MODIS, and Unidata NAM streams, with several additional web service wrappers being added (EPA STORET, NCDC ASOS, USGS NAWQA). Accessed from a single discovery interface developed as an ASP.NET application over ESRI’s ArcGIS Server, the web services support comprehensive hydrologic analysis at the catchment, watershed, and regional levels.

    Different repositories of hydrologic data use different vocabularies and support different types of query access. Resolving the semantic and structural heterogeneities and distilling a generic set of service signatures is one of the main scalability challenges in this project, and a requirement in our web service design. To achieve uniformity of the web services API, different data holdings are modeled following the CUAHSI Observation Data Model. The web service responses are document-based and use an XML schema to express the semantics in a standard format. Access to station metadata is provided via the web service methods GetSites, GetSiteInfo, and GetVariableInfo, while observation values are retrieved via a generic GetValues method. The methods may execute over locally stored metadata (in SQL Server 2005) or request the information from remote repositories directly. The services are implemented in ASP.NET 2.0 (C#) and tested with both .NET and Java clients.
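
    A minimal Python sketch of a client calling the service methods named above is given below. The WSDL endpoint, site and variable codes, and exact parameter names are placeholder assumptions; consult the CUAHSI HIS service documentation for the actual signatures.

        # Hedged sketch: placeholder WSDL, codes, and parameter names.
        from zeep import Client  # generic SOAP client; .NET and Java clients work analogously

        WSDL = "http://example.org/cuahsi_his.asmx?WSDL"  # placeholder endpoint

        def fetch_daily_flow():
            client = Client(WSDL)
            # Discover what a site measures, then pull a time series of values.
            site_info = client.service.GetSiteInfo(site="NWIS:02089500")
            values_xml = client.service.GetValues(location="NWIS:02089500",
                                                  variable="NWIS:00060",   # discharge
                                                  startDate="2005-01-01",
                                                  endDate="2005-12-31")
            return site_info, values_xml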

    The CUAHSI HIS project is funded by NSF through 2011. More information about it is available from http://www.cuahsi.org/his.


    Early Experience Prototyping a Scientific Data Server for Environmental Data

    Catharine van Ingen, Microsoft

    There is an increasing desire to do science at scales larger than a single site or watershed, and over times measured in years rather than seasons. This implies that the quantity and diversity of data handled by an individual scientist or small group of scientists is increasing even without the “data deluge” associated with inexpensive sensors.

    Unfortunately, the quality and quantity of available data and metadata vary widely, and algorithms for deriving science measurements from observations are still evolving. Also, the existence of an Internet archive does not guarantee the quality of the data it contains, and the data can easily become lost or corrupted through subsequent handling. Local data recalibration, additional data derivation, and gap-filling are seldom tracked, leading to confusion when comparing results. Today, the tasks of data collection, sharing, and mining are often a significant barrier to cross-site and regional analyses.

    We are developing a prototype scientific data server for data sharing and curation by small groups of collaborators. This data server forms the storage part of a laboratory information management system or LIMS. The server also performs simple data mining and visualization of diverse datasets with diverse metadata. Our goal is to enable researchers to collect and share data over the long time scales typically necessary for environmental research as well as to simply analyze the data as a whole, thus dramatically increasing the feasible spatial or temporal scale of such studies.

    The scientific data server prototype is being developed using Ameriflux data, in cooperation with key scientists attempting continental-scale work on the global carbon cycle using long-term local measurements. The Ameriflux measurement network consists of 149 micro-meteorological towers across the Americas. The collaboration is communal: each principal investigator acts independently to prepare and publish data to the Oak Ridge repository. One of the near-term challenges for the Ameriflux and global FLUXNET communities is to enable cross-site analyses across sites with similar location, ecosystem, climate, or other characteristics. A longer-term challenge is to link the flux data to other related data such as MODIS satellite imagery.
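
    The cross-site roll-up described above might look like the following Python sketch, which groups flux-tower records by ecosystem and climate class. The column names are assumptions rather than the actual Ameriflux schema.

        # Hedged sketch: summarize annual carbon exchange across sites with similar characteristics.
        import pandas as pd

        def cross_site_summary(fluxes: pd.DataFrame) -> pd.DataFrame:
            """fluxes columns (assumed): site, igbp_class, koeppen_class, nee_gc_m2_yr."""
            return (fluxes.groupby(["igbp_class", "koeppen_class"])
                          .agg(sites=("site", "nunique"),
                               mean_nee=("nee_gc_m2_yr", "mean"),
                               nee_spread=("nee_gc_m2_yr", "std"))
                          .reset_index())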

  • Chair: Chi Dang

    Real-Time Transcription of Radiology Dictation: A Case Study for Multimedia Tablet PCs

    Wuchun Feng, Virginia Tech

    We present the design and implementation of an integrated multimodal interface that delivers instant turnaround on transcribing radiology dictation. This instant turnaround virtually eliminates hospital liability with respect to improper transcriptions of oral dictations and all but eliminates the need for transcribers. The multimodal interface seamlessly integrates three modes of input (speech, handwriting, and written gestures) to provide an easy-to-use system for the radiologist.

    Although computers have quickly become an essential part of today’s society, their ubiquity has been stymied because many still find the computer “unnatural” (and even difficult) to use. While scientists and engineers take their computer skills for granted, a large number of potential users still have limited experience in using a computer. To make computers (or products with embedded computers, e.g., an automobile) easier and more natural to use, manufacturers have proposed the use of a speech recognition system. Even for computer-savvy users, speech can be used to boost productivity because nearly everyone can talk faster than they can type, typically more than 200 words per minute (wpm) versus 50 to 75 wpm. However, speech recognition is never perfect; recognition errors are made. In order to correct these errors, the end user currently uses a keyboard and mouse.

    Instead, we propose a system that seamlessly integrates speech, handwriting, and written gestures and provides a natural multimodal interface to the computer. To ensure that the interface is easier to use than a keyboard-and-mouse interface, the speech recognizer must have a high recognition rate, e.g., 95%, and the handwriting and gesture recognizers should provide nearly error-free recognition of stylus-entered handwriting and gestures, respectively, to correct errors made by the speech recognizer. These corrections can then be applied to the speech recognizer itself to improve future recognition.
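
    A minimal Python sketch of the correction loop is shown below: stylus corrections replace misrecognized words and are logged as confusion pairs that a later adaptation step could feed back to the speech recognizer. The data structures are illustrative, not the system's actual interfaces.

        # Hedged sketch: log stylus corrections for later recognizer adaptation.
        from collections import Counter

        correction_log: Counter = Counter()

        def apply_correction(transcript: list[str], index: int, corrected_word: str) -> list[str]:
            """Replace the word at `index` with the handwriting-recognized correction."""
            misheard = transcript[index]
            correction_log[(misheard, corrected_word)] += 1
            fixed = list(transcript)
            fixed[index] = corrected_word
            return fixed

        def adaptation_pairs(min_count: int = 3):
            """Frequent confusions worth adding to the recognizer's adaptation data."""
            return [pair for pair, n in correction_log.items() if n >= min_count]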


    Facilitating Understanding and Retention of Health Information

    Gondy Leroy, Claremont Graduate University

    Billions of people read online health information without understanding it, which is unfortunate since it affects their healthcare decisions. Current research focuses almost exclusively on measuring readability and (re)writing texts so they require lower reading levels. However, rewriting all texts is infeasible and little research has been done to help consumers otherwise.

    We focus on automated tools that facilitate understanding and retention of information. We found that consumers not only read at lower grade levels but also use a significantly different vocabulary than healthcare providers. We have developed a vocabulary-based naïve Bayes classifier that distinguishes with 96% accuracy between three levels of medical specificity in text. Applying this classifier to a sample of online texts showed that only 4% of texts by governments, pharmaceutical companies, and non-profits use consumer-level vocabulary. As a first step, we are developing a table of contents (ToC) algorithm that automatically imposes a semantic structure. The ToC shows important concepts in the text; selecting these concepts highlights the key terms and bolds the surrounding text. This makes searching the text easier and, more importantly, may improve understanding and retention of information. The ToC visually chunks the information into easy-to-understand groups, which may facilitate transfer from working memory to long-term memory. This is especially important for the elderly, our focus group, who often have physical ailments, deteriorated eyesight, and decreased working memory.
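
    In the spirit of the classifier described above, the following scikit-learn sketch trains a vocabulary-based naive Bayes model on three specificity levels. The tiny training set and the pipeline are stand-ins, not the authors' model or data.

        # Illustrative stand-in: vocabulary-based naive Bayes over three specificity levels.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        train_texts = [
            "stomach ache and feeling sick after eating",       # consumer vocabulary
            "patient reports abdominal pain and nausea",        # intermediate vocabulary
            "epigastric tenderness with postprandial emesis",   # specialist vocabulary
        ]
        train_levels = ["consumer", "intermediate", "specialist"]

        model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
        model.fit(train_texts, train_levels)

        print(model.predict(["severe dyspepsia with emesis"]))  # likely "specialist"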

    Results from our pilot study with a first prototype indicate that question answering with the original text present worked as well with the ToC as without it. Remembering the correct answers and recalling additional information afterwards was better with the ToC. A complete user study comparing the elderly with other adults is ongoing.


    Motion-Synchronized Intensity Modulated Arc Therapy

    Shuang (Sean) Luan, University of New Mexico

    Modern radiotherapy is a minimally invasive treatment technique that uses high-energy X-rays to destroy tumors. The quality of a radiotherapy plan is normally measured by its dose conformity and treatment time. The dose conformity specifies how well the high radiation dose region conforms to the target tumor while sparing the surrounding normal tissues, and the treatment time describes how long a treatment takes and how efficiently the treatment machines are utilized.

    One of the biggest challenges in modern radiotherapy is to treat tumors in and near the thorax, because they are subject to substantial breathing-induced motion and their anatomies during treatment may vary significantly from those used for treatment planning. To compensate for such target variations, image-guidance techniques such as 4-D CT and motion tracking have recently been employed in radiotherapy to provide real-time adjustment of the treatment. The most popular image-guidance technique at present is “gating”: the key idea is to treat the patient only at a certain phase of the breathing cycle. Since most of the treatment time is spent waiting for the patient to enter the correct breathing phase, gating can be very inefficient. Further, by only treating the patient at a chosen breathing phase, gating fails to take advantage of 4-D imaging technologies, which can record the patient’s anatomical changes with respect to time.
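
    A toy calculation, using assumed numbers, makes the efficiency argument concrete: with the beam on only during a narrow gate around one breathing phase, most of each cycle is spent waiting, whereas a delivery that treats through all phases keeps the beam on continuously.

        # Toy arithmetic with assumed numbers; not clinical parameters.
        breath_period_s = 4.0       # assumed breathing cycle length
        gate_window_s = 0.8         # assumed beam-on window per cycle for gating
        beam_on_needed_s = 120.0    # assumed total beam-on time a plan requires

        gated_duty_cycle = gate_window_s / breath_period_s            # 0.2
        gated_treatment_time = beam_on_needed_s / gated_duty_cycle    # 600 s
        synchronized_treatment_time = beam_on_needed_s                # ~120 s, beam on in every phase

        print(f"gating:       {gated_treatment_time:.0f} s at duty cycle {gated_duty_cycle:.0%}")
        print(f"synchronized: {synchronized_treatment_time:.0f} s")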

    We have developed an image-guided radiotherapy technique for compensating for breathing-induced motion called motion-synchronized intensity-modulated arc therapy (IMAT). A prototype planning system running on Microsoft Windows has been implemented using Microsoft Visual C++. Unlike gating, our new scheme makes full use of 4-D CT and motion tracking and treats the patient at all breathing phases. Our preliminary study has shown that motion-synchronized IMAT can produce treatment plans with both superior dose conformity and short treatment times.


    Systems Biology and Proteomics in Drug and Biomarker Discovery

    Mark Boguski, Novartis

    Recent advances in the “omics” technologies, scientific computing, and mathematical modeling of biological processes have started to fundamentally impact the way we approach drug discovery. Recent years have witnessed the development of genome-scale functional screens, large collections of reagents such as RNAi libraries, protein microarrays, and databases and algorithms for text mining and data analysis.

    Taken together, these tools enable unprecedented descriptions of complex biological systems, which are testable by mathematical modeling and simulation. While the methods and tools are advancing, it is their iterative and integrated application that defines the systems biology approach.


    The Cardiovascular Research Grid

    Raimond L. Winslow, JHU

    The Cardiovascular Research Grid (CVRG) project is a national collaborative effort involving investigators at Johns Hopkins University, Ohio State University, and the University of California at San Diego. The goals of this project are to leverage existing grid computing middleware developed as part of the Biomedical Informatics Research Network (BIRN, a brain image sharing and data analysis grid) and the cancer Biomedical Informatics Grid (caBIG, for sharing of cancer data) to create a national resource for sharing and analysis of multi-scale cardiovascular data.

    Data to be shared include gene and protein expression data, electrophysiological (time series) data, multimodal 3D and 4D image data and de-identified clinical data. Analysis tools to be developed and shared include machine learning methods for predicting risk of Sudden Cardiac Death based on these multi-scale data, computational anatomy tools for detecting abnormalities of heart shape and motion, and computational models of heart function in health and disease.

  • Chair: George Spix

    It Takes Two (or More) to Place Data

    Miron Livny, University of Wisconsin-Madison

    Data-intensive e-Science is by no means immune to the classical data placement problem: caching input data close to where the computation takes place and caching output data on its way to the designated storage/archiving destination. As for other aspects of e-Science, the scale, heterogeneity, and dynamics of the infrastructure and the workload increase the complexity of providing a dependable, managed data placement capability. When a “chunk” of bytes is cached, a source site and a destination site are actively engaged in the placement of the data. While the destination has to provide the storage space (we refer to it as a lot) to “park” the data, both sites need to co-allocate local resources like disk bandwidth and memory buffers in support of the transfer activity.

    Operational cyber-infrastructure at the campus level (like the Grid Laboratory of Wisconsin (GLOW)) and at the national level (like the Open Science Grid (OSG)) exposes the limitations and deficiencies of existing storage and data handling tools and protocols. While most of them focus on network performance, they offer very little in the way of local resource manageability and coordination. Recent work to enhance the capabilities of existing tools like GridFTP and protocols like the Storage Resource Manager (SRM), as well as specialized job managers like Stork, points to promising approaches to the data placement problem for e-Science applications. Some of this work employs matchmaking techniques to coordinate the allocation of resources at the two end points. These techniques allow the parties to express their autonomous resource allocation policies and enforce them locally. By elevating data placement to the same level as computing, data caching tasks can be easily included in workflows so that all aspects of the workload can be treated uniformly.
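
    The matchmaking idea can be sketched in a few lines of Python: a transfer request and a destination storage lot each advertise attributes and policies, and a match succeeds only when both sides accept the other. The attribute names are illustrative and are not ClassAd, SRM, or Stork syntax.

        # Hedged sketch: two-sided matchmaking between a placement request and storage lots.
        def matches(request: dict, lot: dict) -> bool:
            fits = (request["size_gb"] <= lot["free_gb"] and
                    request["min_disk_mb_s"] <= lot["disk_bandwidth_mb_s"])
            dest_policy_ok = lot["accepts_vo"] == request["vo"]                   # destination's local policy
            source_policy_ok = request["max_cost_per_gb"] >= lot["cost_per_gb"]   # requester's policy
            return fits and dest_policy_ok and source_policy_ok

        request = {"size_gb": 250, "min_disk_mb_s": 40, "vo": "cms", "max_cost_per_gb": 0.05}
        lots = [
            {"name": "uw-glow-pool1", "free_gb": 180, "disk_bandwidth_mb_s": 90,
             "accepts_vo": "cms", "cost_per_gb": 0.02},
            {"name": "osg-site-a", "free_gb": 900, "disk_bandwidth_mb_s": 60,
             "accepts_vo": "cms", "cost_per_gb": 0.03},
        ]
        print([lot["name"] for lot in lots if matches(request, lot)])  # -> ['osg-site-a']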


    COMPASS – Staying Found in a Material World

    Gerd Heber & Anthony R. Ingraffea, Cornell Theory Center

    The Computational Materials Portal and Adaptive Simulation System (COMPASS) is an attempt to deliver certain Computational Materials services and resources over the World Wide Web to the desks of engineers, researchers, and students in academia, government, and industry. Currently, COMPASS resources and services are available to human and non-human end users through a portal site or XML Web services. The services and resources offered include modeling tools, simulation capabilities, imagery, and other data contributed by domain experts. With COMPASS services, each authorized user can create new resources and further process them in a private workspace.

    COMPASS is a multi-tiered system which brings to bear a set of technologies. Its web tier is implemented in Microsoft ASP.NET 2.0 and Atlas. In addition to traditional RDBMS use, the middle tier and back end leverage several of the capabilities introduced with Microsoft SQL Server 2005, e.g., the native XML type and the integrated CLR. Other technologies employed are RDF/XML for metadata management and OpenDX/JDX for local and remote visualization.

    COMPASS is a work in progress: the presentation is a status report and will highlight some of the present challenges. Among them are ambient findability (find anything from anywhere, anytime) and data resource federation and replication. COMPASS grew out of and is currently supported by the DARPA SIPS (Structural Integrity and Prognosis System) effort, which aims at dramatically improving predictions of usable vehicle life based on field data, the best available materials science, and multi-scale simulation.


    HPC Profile: Interoperable, Standards-based Batch Job Scheduling of Scientific/Technical Applications

    Marty Humphrey, University of Virginia

    eScientists often use high-end computing platforms such as computational clusters to perform complex simulations. Currently, per-machine idiosyncratic interfaces and behaviors make seamless access across these platforms nearly impossible, forcing often-fragile middleware (or, more likely, the end eScientist) to manually deal with underlying differences between these back-end resources. In collaboration with Microsoft and Platform Computing in the context of the Open Grid Forum, we have recently completed a standards-based “HPC Profile” based on Web services.

    The core of the HPC Profile is the Job Submission Description Language (JSDL) and the Open Grid Services Architecture (OGSA) Basic Execution Services (BES). JSDL is a proposed standard that describes the requirements of computational jobs for submission to resources. BES is an emerging standard for specifying a service to which clients can send requests to initiate, monitor, and manage computational activities. The HPC Profile augments, clarifies and restricts JSDL and BES to create the minimal interoperable environment for realizing the vertical use case of batch job scheduling of scientific/technical applications. The HPC Profile is the cornerstone of the “Evolutionary approach to realizing the Grid vision” of Theimer, Parastatidis, Hey, Humphrey, and Fox.
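
    For readers unfamiliar with JSDL, the Python sketch below assembles a minimal job description of the kind a client would hand to a BES CreateActivity operation. The element and namespace names follow the published JSDL 1.0 specification, but the structure here is illustrative and should be validated against the schema before use.

        # Hedged sketch: build a minimal JSDL document; validate against the JSDL 1.0 schema.
        import xml.etree.ElementTree as ET

        JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
        POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

        job = ET.Element(f"{{{JSDL}}}JobDefinition")
        desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")
        app = ET.SubElement(desc, f"{{{JSDL}}}Application")
        posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
        ET.SubElement(posix, f"{{{POSIX}}}Executable").text = "/usr/local/bin/simulate"
        ET.SubElement(posix, f"{{{POSIX}}}Argument").text = "--input=run42.dat"
        ET.SubElement(posix, f"{{{POSIX}}}Output").text = "run42.log"

        # The serialized document would be submitted to a BES endpoint for execution.
        print(ET.tostring(job, encoding="unicode"))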

    In this talk, I give an overview of the HPC Profile, emphasizing its impact on the end eScientists. I will describe the state of interoperable open-source implementations of the HPC Profile (as well as the development of compliance tests for future implementations). I will give a demo of a Web Part that can be utilized by Microsoft Office SharePoint Server 2007 as a key component for building a collaboration site for an eScience project. I will conclude with some thoughts on a potential Data Profile, which builds on the success of the HPC Profile to construct a corresponding standards-based approach for interoperable data federation and management.


    Globally Distributed Computing and Networking for Particle Physics Event Analysis

    Julian Bunn, Caltech

    Excitement in anticipation of the first proton beams at CERN’s Large Hadron Collider (LHC) in 2007 is reaching new heights amongst physicists engaged in the Compact Muon Solenoid experiment, one of four detectors that will be used to capture and analyze the LHC data in search of the Higgs Boson and new physics.

    The expected LHC data rates (200 Mbytes/sec to 1.5 Gbytes/sec) give rise to unusually large datasets, which must be distributed, processed, and analyzed by a worldwide community of scientists and engineers according to the decentralized, tiered model of computing developed at Caltech in 1997 and since adopted by these experiments. Over the last eight years at Caltech we have been actively planning and developing computing infrastructure to meet this data challenge. The effort has several thrusts: planning, testing, evaluating, and deploying in production high-speed intercontinental networks to carry scientific data on behalf of the community (LHCNet); developing and deploying Grid-based physics software and tools, with a particular focus on event data analysis (Clarens); and creating a worldwide real-time monitoring and control infrastructure for systems, networks, and services based on an agent architecture (MonALISA).

    In my presentation I will describe these activities and paint a picture of how we expect to extract and analyze the LHC data for evidence of new physics over the next decade and beyond.


    Accelerating Statistical Biomedical Data Analysis Using a PC-Cluster Based Distributed Computing Technology

    Yibin Dong, Virginia Tech

    Small-sample-size problems in biomedical research using genomic datasets bring challenges of relevance and scalability to research scientists. To obtain statistical significance in data analysis, the same computing task is usually repeated hundreds of times; on a single computer, these independent tasks queue up and become a bottleneck in statistical data analysis. One solution to accelerate statistical biomedical data analysis is cluster computing, which has typically been Linux-based. However, many research scientists who are used to the Microsoft Windows environment may not be keen to switch to an operating system that is new to them.

    In May 2006, researchers at the Virginia Tech Advanced Research Institute built a multi-node parallel computer using the beta version of Microsoft Windows Compute Cluster Server (CCS) 2003. It was built from 16 HP ProLiant DL145 Generation 2 servers over a period of two months, during which we successfully tested two in-house bioinformatics applications on CCS, robust biomarker selection and predictor performance estimation, using the MATLAB Distributed Computing Toolbox (DCT). The time reduction rates of the two applications on the 16-node compute cluster are 84.53% and 92.08%, respectively, compared to running the same applications on a single computer.
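
    The pattern, repeating the same statistical task many times and fanning the repetitions out across workers, is shown below as a Python multiprocessing stand-in for the MATLAB Distributed Computing Toolbox workflow used in this work. The task body and run count are placeholders, and the time-reduction metric simply restates the figure quoted above.

        # Stand-in sketch (Python multiprocessing rather than MATLAB DCT); task body is a placeholder.
        import time
        import random
        from multiprocessing import Pool

        def one_bootstrap_run(seed: int) -> float:
            """Placeholder for one repetition of a resampling-based biomarker-selection task."""
            rng = random.Random(seed)
            sample = [rng.gauss(0, 1) for _ in range(50_000)]
            return sum(sample) / len(sample)

        def time_reduction_rate(serial_s: float, parallel_s: float) -> float:
            """Metric quoted in the abstract, e.g. 84.53% on the 16-node cluster."""
            return 100.0 * (serial_s - parallel_s) / serial_s

        if __name__ == "__main__":
            t0 = time.time()
            with Pool() as pool:
                estimates = pool.map(one_bootstrap_run, range(200))
            print(f"{len(estimates)} runs in {time.time() - t0:.1f} s")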

    The improved performance of the Windows CCS cluster shows the feasibility of applying Windows-based high-performance computing to small and medium-sized biomedical research groups, offering benefits such as increased computational performance, easy deployment and use, high scalability, and strong security.

  • Chair: Steven Meacham

    National Science Foundation and e-Science: Now and Next Steps

    Maria Zemankova, National Science Foundation

    The National Science Foundation (NSF, http://www.nsf.gov) mission, as stated in the NSF Act of 1950, is: “To promote the progress of science; to advance national health, prosperity, and welfare; to secure the national defense; and for other purposes.” 55 years later, the National Science Board (NSB) that governs NSF articulated the “2020 Vision for the National Science Foundation” (http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsb05142) and also published a report on “Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century” (http://www.nsf.gov/pubs/2005/nsb0540). This year, NSF established a new Office of Cyberinfrastructure that coordinates and supports the acquisition, development, and provision of state-of-the-art cyberinfrastructure resources, tools, and services essential to the conduct of 21st-century science and engineering research and education. The common theme is the need to promote the conduct of research in the new information-, computation-, and communications-based knowledge discovery and sharing paradigm, i.e., to develop an “e-Science” research infrastructure.

    The Computer & Information Science & Engineering (CISE) directorate and NSF’s science, engineering, educational, and infrastructure programs foster synergistic collaboration for the advancement of both CISE and domain areas. NSF supports innovative techniques for exploiting and enhancing existing information, computation, and communications technologies to support domain-specific research problems, large-scale transformative research projects, or collaborative research activities with other partners, including industry or international research communities. Supported and proposed research spans new methods for modeling new, complex data types; efficient techniques for collecting, storing, and accessing large volumes of dynamic data; development of effective knowledge discovery environments, including analysis, visualization, and simulation techniques; distributed collaboration and discovery process management (grids, scientific workflows); research creativity support tools; e-Science interdisciplinary curriculum development; long-term knowledge evolution and sharing; and innovations in the publishing and archival of scientific literature, results, and data.

    This presentation will provide information on existing research and infrastructure projects, current support opportunities, and outline future plans and wishes.


    SETI@home and Public Participation Scientific Computing

    Dan Werthimer, University of California, Berkeley

    Werthimer will discuss the possibility of life in the universe and the search for radio and optical signals from other civilizations. SETI@home analyzes data from the world’s largest radio telescope using desktop computers from five million volunteers in 226 countries.

    SETI@home participants have contributed two million years of computer time and have formed one of Earth’s most powerful supercomputers. Users have the small but captivating possibility that their computer will detect the first signal from a civilization beyond Earth.

    Werthimer will also discuss plans for future SETI experiments, petaop/sec FPGA based computing, and open source code for public participation distributed computing (BOINC — Berkeley Open Infrastructure for Network Computing).


    Organizing, Analyzing and Visualizing Data on the TeraGrid

    Kelly P. Gaither, TACC

    eScience is a term used to describe computationally intensive science carried out in highly distributed network environments or using immense data sets requiring grid computing technologies. A classic example of a large-scale eScience project is the TeraGrid, an open scientific discovery infrastructure combining leadership-class resources at nine partner sites to create an integrated, persistent computational resource. The TeraGrid integrates high-performance computers, data and visualization resources, software tools, and high-end experimental facilities around the country. These integrated resources include more than 102 teraflops of computing capability and more than 15 petabytes of online and archival data storage with rapid access and retrieval over high-performance networks. Through the TeraGrid, researchers can access over 100 discipline-specific databases.

    I currently serve as the Area Director for Data, Information Services, Visualization and Scheduling (DIVS) for the TeraGrid Grid Integration Group (GIG). In this role, I am keenly aware of the impending data and analysis issues facing our e-Science community. Data management and visualization have become priorities for the national user community and, consequently, for the TeraGrid. In this day of information proliferation, the need for rapid analysis and discovery is critical. Information and data are being generated at an alarming rate through measurement, sensors, and simulation. We, as scientists and technologists, are beginning to better understand the management and manipulation of massive data stores, whether through storage, co-location, or rapid movement. Extracting information through visualization, however, is a challenge that is still in its infancy. We have explored the issues that data analysis and visualization face and the relationships that exist between data type, size, and structure and the corresponding analysis techniques. I will present and discuss the issues that we face on the TeraGrid with regard to the organization and analysis of large-scale data, and strategies going forward.


    Sector – An E-Science Platform for Distributing Large Scientific Data Sets

    Robert Grossman, University of Illinois at Chicago

    In this talk, we show how a peer-to-peer system called Sector can be used to distribute large scientific data sets over wide-area high performance networks. We also describe how Sector has been used recently to distribute data from the Sloan Digital Sky Survey (SDSS).

    Sector is designed to exploit the bandwidth available in wide-area, high-performance networks and to do so in a way that is fair to other high-volume flows and friendly to traditional TCP flows. Sector employs a high-performance data transport protocol called UDT to achieve this. Sector has been used to transport the SDSS BESTDR5 catalog data, which is over 1 TB in size, to locations in North America, Europe, and Asia. Sector is designed to provide simple access to remote and distributed data: no infrastructure is required other than a fast network and a small Sector client application; in contrast, installing and operating the infrastructure for a data grid can sometimes be challenging.

    We also describe a distributed system for integrating and analyzing data, called Angle, that is built over Sector. Angle is designed to perform row and column operations on distributed data. In contrast to systems that rely exclusively on a database or data warehouse for data integration and require the full semantic integration of different data schemas, Angle also supports data integration using globally unique identifiers called universal keys that can be attached to distributed data attributes. For many applications, having one or more universal keys is often surprisingly useful. For example, geo-spatial applications can use universal keys specifying a specific latitude-longitude coordinate system for data integration operations; astronomical applications can use universal keys specifying a specific right ascension-declination coordinate system; and so on.
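
    A minimal Python sketch of integration via a universal key is shown below: two independently curated tables are joined not by reconciling their schemas but by attaching the same coarse latitude-longitude grid-cell key to each. The table layouts and cell size are illustrative assumptions.

        # Hedged sketch: join two tables on a shared spatial universal key.
        import pandas as pd

        def universal_geo_key(lat: float, lon: float, cell_deg: float = 0.5) -> str:
            """Map a coordinate to a grid-cell identifier shared by all participating sites."""
            return f"{round(lat / cell_deg) * cell_deg:.1f}_{round(lon / cell_deg) * cell_deg:.1f}"

        sensors = pd.DataFrame({"lat": [41.9, 34.1], "lon": [-87.6, -118.2], "pm25": [14.2, 22.7]})
        landuse = pd.DataFrame({"lat": [41.88, 34.05], "lon": [-87.63, -118.24],
                                "class": ["urban", "urban"]})

        for table in (sensors, landuse):
            table["geo_key"] = [universal_geo_key(a, b) for a, b in zip(table["lat"], table["lon"])]

        print(sensors.merge(landuse[["geo_key", "class"]], on="geo_key", how="left"))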