October 21, 2007 - October 23, 2007

eScience Workshop 2007

Location: Chapel Hill, North Carolina, US

Abstracts for Monday, October 22, 2007

  • Plenary Presentation
  • Daniel A. Reed, Chancellor’s Eminent Professor, University of North Carolina at Chapel Hill, Director, Renaissance Computing Institute

    Ten years – a geological epoch on the computing time scale. Looking back, a decade brought the web and consumer email, digital cameras and music, broadband networking, multifunction cell phones, WiFi, HDTV, telematics, multiplayer games, electronic commerce and mainstream computational science. It also brought spam, identity theft, software insecurity, globalization, information warfare, blurred work-life boundaries, distributed sensors and inexpensive storage and clusters. What will another decade of technology advances bring to scientific discovery? As Yogi Berra famously noted, “It’s hard to make predictions, especially about the future.” Without doubt, though, scientific discovery via computing is moving rapidly from a world of homogeneous parallel systems to a world of distributed software, virtual organizations and high-performance, deeply parallel systems. In addition, a tsunami of new experimental and computational data and a suite of increasingly ubiquitous sensors pose equally vexing problems in data analysis, transport, visualization and collaboration. This talk describes a Renaissance vision and approach to solving some of today’s most challenging scientific and societal problems using powerful new computing tools likely to emerge over the next decade.

  • Environmental eScience

  • Keith Grochow, University of Washington

    It is clear that to better understand our planet we need to better understand the oceans. To accomplish this, the University of Washington is currently creating the design for an unprecedented oceanographic sensor system off the coast of the northwestern United States. This will provide a continuous presence on the ocean floor and throughout the water column in areas of scientific interest. To compare and create designs for the basic infrastructure and specific experiments as effectively as possible, we have developed COVE, an interactive environment that allows a broad range of scientists and engineers to work together in this activity. Through a combination of bathymetry and data visualization, interactive layout tools, and workflow integration, COVE provides an intuitive shared environment to quickly and cheaply test ideas and compare various approaches. We have deployed the system across the current design team with positive results and are investigating community outreach and educational scenarios for our work.

  • Dennis Gannon, Indiana University; Beth Plale, Indiana University

    This presentation describes work on a programming framework that combines rule-based event monitoring and adaptive workflow systems. The domain of application for this work spans a wide variety of problems in eScience in which external events, such as those detected by sensors, human actions, or database updates, must automatically trigger computationally significant actions. The results of these actions may feed back into the event stream and trigger additional actions, or they may adaptively alter the path of other computations. The philosophy underlying this work is driven by an observation on the iterative nature of knowledge discovery. In data-driven application domains, knowledge acquisition is often a discovery process carried out by successive refinement. A scientist has an initial hunch about the behavior of a process or system and progresses to an answer through a process that combines and repeats discovery and hypothesis testing. The work will be validated by its applicability to three diverse use cases. The first involves monitoring Doppler radar streams for severe storm signatures and automatically launching tornado forecast workflows. The second involves the iterative exploration of ligand-protein binding workflows in the drug discovery pipeline. In both of these cases the computational and data analysis actions are non-trivial and require distributed resources. The third use case addresses the issue of maintaining and optimizing underlying computational resources and adapting the workflows of other active queries in the system in response to significant events. The proposed programming model is a reactive rule model, which we conjecture can very fruitfully leverage complex event detection and adaptive distributed workflow.
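
    As a rough illustration of the event-condition-action style that such a reactive rule model implies, the sketch below matches incoming events against rules and fires their actions; the event fields and the print statement standing in for a workflow launch are hypothetical placeholders, not the framework described in the talk.

    ```python
    # Minimal sketch of an event-condition-action (ECA) rule loop; the event
    # fields and the "launch" action are illustrative placeholders only.
    from dataclasses import dataclass
    from typing import Callable, List


    @dataclass
    class Rule:
        condition: Callable[[dict], bool]   # predicate over an incoming event
        action: Callable[[dict], None]      # e.g., launch a forecast workflow


    def run_rules(events: List[dict], rules: List[Rule]) -> None:
        """Apply every matching rule to every incoming event."""
        for event in events:
            for rule in rules:
                if rule.condition(event):
                    rule.action(event)


    # Example: trigger a placeholder forecast workflow on a severe-storm signature.
    severe_storm = Rule(
        condition=lambda e: e.get("type") == "radar" and e.get("reflectivity", 0) > 55,
        action=lambda e: print(f"launching forecast workflow for site {e['site']}"),
    )

    run_rules([{"type": "radar", "site": "KTLX", "reflectivity": 62}], [severe_storm])
    ```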

  • Robert Gurney, University of Reading; Jon Blower, University of Reading; Ned Garnett, NERC

    Environmental science is undergoing a revolution. The availability of frequent global observations, particularly from satellites, the availability of high-performance computing to model the environment at high spatial and temporal resolutions with processes represented explicitly, and the ability to organize, compare, visualize and exchange observations and model results give us, for the first time, an ability to predict both natural and human-induced changes and to place error budgets on those predictions. eScience is vital for this revolution, allowing the sharing of both data and models and allowing large model ensembles to be run easily. Several examples will be given, drawing on UK experience with its environmental eScience programs. A particular focus is on climate prediction, where climate models with reduced resolution can be run on clusters as ensembles to understand the errors in predictions, either by running ensembles with very many members or by running for very long periods to understand changes over geologic time. It has been found, for instance, that climate over the next century could warm by up to 12°C, or not at all, despite the generally published warming of about 2-5°C predicted with smaller ensembles. Other results come from confronting ocean models with observations, which helps us understand ocean overturning and its variability in the 20th century, aids predictions of coastal waters globally, and clarifies the role of Arctic sea ice in the global system. All of these projects have led to internationally leading sets of publications. We will also look forward to where the next computing developments are needed to sustain these environmental developments, and discuss how this dialogue between environmental and computer scientists can be strengthened.

  • Knowledge Modeling and Discovery

  • Catherine Blake, UNC Chapel Hill

    Cyberinfrastructure provides students, scientists and policy makers with an unprecedented quantity of information: in biomedicine, for example, PubMed adds 12,000 new citations each week and the top chemistry journals publish more than a hundred thousand articles in a single year. Despite advances in information access, the quantity of information far exceeds human cognitive processing capacity. Consider a breast cancer scientist who must sift through the 12,600 articles published during the 28 months required to conduct a systematic review, a process used to resolve conflicting evidence. In addition to quantity, evidence related to the complex research questions posed by scientists transcends traditional disciplinary boundaries and thus requires a multi-, inter-, or trans-disciplinary approach. I will describe how recent advances in natural language processing, specifically in recognizing textual entailment and in generating multi-document summaries, can enable new kinds of e-science. This next generation of information tools recognizes the contradictions and redundancies that are inevitable in the information-intensive environment in which a scientist operates. Using existing systems that account for complex interdependencies between scientific articles as examples, I will show how these systems embody the shift from information retrieval to information synthesis. I will conclude with preliminary results from Claim Jumper, a system that captures the spirit of gold-miners searching for nuggets of knowledge in a new frontier, and reflects a scientist’s transition through traditional disciplinary boundaries. Given a topic, query and set of articles, Claim Jumper generates a fluent, well-organized summary from published literature that accounts for redundancy.
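
    To make the redundancy-handling idea concrete, here is a generic sketch of redundancy-aware sentence selection in the spirit of maximal marginal relevance; it is not the Claim Jumper algorithm, and the toy Jaccard similarity and example sentences are assumptions made purely for illustration.

    ```python
    # Sketch of redundancy-aware sentence selection (maximal-marginal-relevance
    # style); a generic illustration, not the Claim Jumper system itself.
    from typing import Callable, List


    def select_sentences(query: str, sentences: List[str],
                         sim: Callable[[str, str], float],
                         k: int = 5, lam: float = 0.7) -> List[str]:
        """Greedily pick sentences relevant to the query but dissimilar to
        sentences already chosen, trading relevance against redundancy."""
        chosen: List[str] = []
        candidates = list(sentences)
        while candidates and len(chosen) < k:
            def score(s: str) -> float:
                redundancy = max((sim(s, c) for c in chosen), default=0.0)
                return lam * sim(s, query) - (1 - lam) * redundancy
            best = max(candidates, key=score)
            chosen.append(best)
            candidates.remove(best)
        return chosen


    def jaccard(a: str, b: str) -> float:
        """Toy similarity: word-set overlap (Jaccard index)."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


    docs = ["tamoxifen reduces breast cancer recurrence",
            "tamoxifen lowers recurrence of breast cancer",
            "exercise improves survivorship outcomes"]
    print(select_sentences("breast cancer recurrence", docs, jaccard, k=2))
    ```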

  • Pat Langley, Arizona State University

    Most research on computational knowledge discovery has focused on descriptive models that only summarize data and utilized formalisms developed in AI or statistics. In contrast, scientists typically aim to develop explanatory models that make contact with background knowledge and use established scientific notations. In this talk, I present an approach to computational discovery that encodes scientific models as sets of processes that incorporate differential equations, simulates these models’ behavior over time, incorporates background knowledge to constrain model construction, and induces the models from time-series data. I illustrate this framework on data and models from a number of fields, including ecology, environmental science, and biochemistry. In addition, I report on recent extensions that draw upon additional knowledge to reduce search, that combine models to lower generalization error, and that handle data sets with missing observations. Moreover, rather than aiming to automate construction of such models, I describe our efforts to embed these methods in an interactive software environment that lets scientist and computer jointly create and revise explanations of observed phenomena. This talk describes joint work with Kevin Arrigo, Stuart Borrett, Matthew Bravo, Will Bridewell, and Ljupco Todorovski.
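
    The following sketch illustrates the general approach in miniature, under simplifying assumptions: a candidate process model is encoded as an ordinary differential equation (logistic growth, chosen arbitrarily here), simulated over time with SciPy, and scored against observed time-series data. It is only a caricature of the inductive process modeling framework described in the talk.

    ```python
    # Minimal sketch: encode a candidate process model as an ODE, simulate it,
    # and score it against observed data. The logistic model and sum-of-squares
    # score are illustrative choices, not the talk's specific formalism.
    import numpy as np
    from scipy.integrate import odeint


    def logistic(x, t, r, K):
        """Candidate process: logistic growth dx/dt = r*x*(1 - x/K)."""
        return r * x * (1 - x / K)


    def model_error(params, times, observed, x0):
        """Sum-of-squares mismatch between simulated and observed trajectories."""
        r, K = params
        simulated = odeint(logistic, x0, times, args=(r, K)).ravel()
        return float(np.sum((simulated - observed) ** 2))


    times = np.linspace(0, 10, 50)
    observed = odeint(logistic, 0.1, times, args=(0.9, 5.0)).ravel()  # synthetic data
    print(model_error((0.9, 5.0), times, observed, 0.1))  # ~0 for the true parameters
    ```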

  • Sherrilynne Fuller, University of Washington

    The introduction of sophisticated web search engines has greatly improved the ability of scientists to identify relevant research; however, the retrieved set of articles, even when utilizing the advanced search capabilities of search engines such as Google Scholar or PubMed, overwhelms even the most diligent researcher. Tools are needed to help scientists rapidly tease out relevant research findings that will contribute to their picture of potential directions for future research and will thus enhance the hypothesis generation process. Specialized query and visualization tools are needed which support extraction and navigation of research findings and enhance a natural question/answer approach, a critical aspect of the hypothesis generation process. For example: given a connection between x and y, what else do we know about other connections to each that suggests a mechanism for action? At the present time web search engines do not extensively leverage findings from critical information retrieval research regarding document structure or research about the behavior of scientists, particularly in the areas of hypothesis generation and scientific creativity. A review of relevant research findings from the University of Washington and elsewhere will be presented and potential directions for future work will be discussed.

  • Synthetic Biology

  • Bhaskar DasGupta, UIC

    Our (Albert, DasGupta, Dondi, Kachalo, Sontag, Zelikovsky, Westbrooks) work proposes a novel computational method to solve the biologically important problem of signal transduction network synthesis from indirect causal evidence. This is a significant and topical problem because there are currently no high-throughput experimental methods for constructing signal transduction networks, and the understanding of many signaling processes is limited to knowledge of the signal(s) and of key mediators’ positive or negative effects on the whole process. We illustrate the biological usability of our software by applying it to a previously published signal transduction network and by using it to synthesize and simplify a novel network corresponding to activation-induced cell death in large granular lymphocyte leukemia. Our methodology serves as an important first step in formalizing the logical substrate of a signal transduction network, allowing biologists to simultaneously synthesize their knowledge and formalize their hypotheses regarding a signal transduction network. Therefore we expect that our work will appeal to a broad audience of biologists. The novelty of our algorithmic methodology, based on non-trivial combinatorial optimization techniques, makes it appealing to a broad audience of computational biologists as well. The relevant software, NET-SYNTHESIS, is freely available for download.

    References: R. Albert, B. DasGupta, R. Dondi, S. Kachalo, E. Sontag, A. Zelikovsky and K. Westbrooks, “A Novel Method for Signal Transduction Network Inference from Indirect Experimental Evidence,” Journal of Computational Biology, 14(7), 927-949, 2007. R. Albert, B. DasGupta, R. Dondi and E. Sontag, “Inferring (Biological) Signal Transduction Networks via Transitive Reductions of Directed Graphs,” to appear in Algorithmica.
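
    For intuition about the transitive reduction step named in the second reference, the sketch below implements plain transitive reduction on an unsigned directed graph; the published method additionally handles edge signs and indirect evidence, so this covers only the core graph-theoretic idea.

    ```python
    # Simplified illustration of transitive reduction: drop an edge u->v whenever
    # v is still reachable from u through a longer path. Unsigned case only.
    from typing import Dict, Set


    def reachable(graph: Dict[str, Set[str]], start: str, target: str,
                  banned_edge: tuple) -> bool:
        """Depth-first search that ignores one specific edge."""
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, set()):
                if (node, nxt) == banned_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False


    def transitive_reduction(graph: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
        reduced = {u: set(vs) for u, vs in graph.items()}
        for u in graph:
            for v in list(graph[u]):
                if reachable(reduced, u, v, banned_edge=(u, v)):
                    reduced[u].discard(v)   # v is reachable indirectly; edge is redundant
        return reduced


    print(transitive_reduction({"a": {"b", "c"}, "b": {"c"}, "c": set()}))
    # {'a': {'b'}, 'b': {'c'}, 'c': set()}
    ```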

  • Jean Peccoud, Virginia Bioinformatics Institute; Yizhi Cai, Virginia Bioinformatics Institute

    The sequence of an artificial genetic construct is composed of multiple functional fragments, or genetic parts, involved in different molecular steps of gene expression. Biologists have deciphered structural rules that the design of genetic constructs needs to follow in order to ensure successful completion of the gene expression process, but these rules have not been formalized, making it challenging for non-specialists to benefit from the recent progress in gene synthesis. We show that context-free grammars (CFG) can formalize these design principles. This approach provides a path to organizing libraries of genetic parts according to their biological functions, which correspond to the syntactic categories of the CFG. It also provides a framework for the systematic design of new genetic constructs consistent with the design principles expressed in the CFG. Using parsing algorithms, this syntactic model enables the verification of existing constructs. We illustrate these possibilities by describing a CFG that generates the most common architectures of genetic constructs in E. coli. GenoCAD allows biologists to experiment with the algorithms outlined in this presentation.
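
    A toy version of this idea, with a deliberately simplified grammar covering a single promoter-RBS-CDS-terminator cassette (an assumption for illustration, not GenoCAD’s actual grammars), might look like this:

    ```python
    # Toy context-free grammar for an E. coli expression cassette and a tiny
    # parser that checks whether a list of parts is syntactically valid.
    GRAMMAR = {
        "Construct": [["Cassette"], ["Cassette", "Construct"]],
        "Cassette":  [["promoter", "RBS", "CDS", "terminator"]],
    }
    TERMINALS = {"promoter", "RBS", "CDS", "terminator"}


    def parses(symbol, parts):
        """Return True if `parts` can be derived from `symbol`."""
        if symbol in TERMINALS:
            return len(parts) == 1 and parts[0] == symbol
        return any(matches(production, parts) for production in GRAMMAR[symbol])


    def matches(production, parts):
        """Try every way of splitting `parts` among the production's symbols."""
        if not production:
            return not parts
        head, rest = production[0], production[1:]
        return any(parses(head, parts[:i]) and matches(rest, parts[i:])
                   for i in range(1, len(parts) + 1))


    print(parses("Construct", ["promoter", "RBS", "CDS", "terminator"]))   # True
    print(parses("Construct", ["promoter", "CDS", "RBS", "terminator"]))   # False
    ```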

  • Joel Bader, Johns Hopkins University; Jef Boeke, Johns Hopkins School of Medicine; Sarah Richardson, Johns Hopkins School of Medicine

    Eukaryotic genomes, including ours, have many parts, such as introns, transposons and redundant genes, whose functions remain obscure or appear unnecessary. To understand how life works, we are redesigning the yeast genome by refactoring its organization and inserting ‘debug code’ for downstream functional tests. Producing the synthetic genome requires a team of designers who are supported by BioStudio, an integrated design environment (IDE) modeled on the IDEs and revision control systems used by software developers. The synthetic life will be programmed to jettison genome chunks under wet-lab triggers, producing a combinatorial regression test of gene dependencies. The gene dependency network resembles a social network or the WWW, except that edges are more akin to social antipathy than friendship. We will present initial results of new data mining algorithms we have developed to analyze antipathy networks, including graph diffusion algorithms similar to PageRank but adapted for negative edge weights, and a variational Bayes method for fuzzy clustering of gene modules. Portions of this work were supported through the Microsoft eScience program.
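
    As a generic illustration of diffusion over a signed network (the adaptation sketched here is an assumption; the talk’s actual algorithms may differ in important ways), the code below runs a PageRank-style power iteration on an adjacency matrix whose negative entries encode antipathy:

    ```python
    # Generic sketch of score diffusion over a signed network via power
    # iteration; illustrative only, not the authors' published algorithm.
    import numpy as np


    def signed_diffusion(adj: np.ndarray, damping: float = 0.85,
                         iters: int = 100) -> np.ndarray:
        """Iterate s <- damping * W s + (1 - damping)/n, where W is the
        adjacency matrix column-normalized by absolute edge weight."""
        n = adj.shape[0]
        col_norm = np.abs(adj).sum(axis=0)
        col_norm[col_norm == 0] = 1.0            # avoid division by zero
        W = adj / col_norm
        s = np.full(n, 1.0 / n)
        for _ in range(iters):
            s = damping * (W @ s) + (1 - damping) / n
        return s


    # Three genes: a mutual negative interaction between 0 and 1,
    # and a positive link between 0 and 2.
    adj = np.array([[0.0, -1.0, 1.0],
                    [-1.0, 0.0, 0.0],
                    [0.0,  0.0, 0.0]])
    print(signed_diffusion(adj))
    ```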

  • E-Neuroscience

  • Jano Van Hemert, National e-Science Centre; Douglas Armstrong, University of Edinburgh; Malcolm Atkinson, National e-Science Centre

    Research into animal and human health covers a vast array of biological components and functions. Yet strategies to simulate biological systems across multiple levels, by integrating many components and modeling their interaction, are largely undeveloped. We will explore how this challenge can be approached by considering how to build a virtual fly brain. This offers a new proving ground for collaboration between e-Scientists, biologists and neuroinformaticists. Mental health accounts for 11% of the global disease burden; it is growing rapidly, yet it is one of the most challenging areas for drug discovery and development. Realistic models that capture the processes of the human brain would provide new insights into the diagnosis and treatment of certain disorders. However, to achieve this, we need to begin by working from much simpler models. The brain of Drosophila contains in the region of 100,000 neurons; it provides perhaps the simplest brain capable of what we would consider complex behavior, much of which offers insight into animal and human cognition. The genome was sequenced in 2000 and efforts to improve its functional annotation are highly integrated (www.flybase.org). Of the estimated 12,000 Drosophila genes, more than 2,000 are conserved in human disease indications. In order to bring together the many disciplines, the e-Science Institute of the UK has sponsored a theme to establish programs with a point of focus for bioinformatics and neuroinformatics in Drosophila, such that gaps in the current databases, biological domain and modeling/simulation efforts can be identified and translated into new projects. In the context of e-Science, the project will serve as a test bed for a new service-oriented platform to enable a distributed data integration and data mining infrastructure, which will be developed in a European project.

  • Hanspeter Pfister, IIC, Harvard University; Michael Cohen, Microsoft Research; Jeff Lichtman, Harvard University; Clay Reid, Harvard University; Alex Colburn, Washington University

    Determining the detailed connections in brain circuits is a fundamental unsolved problem in neuroscience. Understanding this circuitry will enable brain scientists to confirm or refute existing models, develop new ones, and come closer to an understanding of how the brain works. However, advances in image acquisition have not yet been matched by advances in algorithms and implementations that will be capable of enabling the analysis of neural circuitry. The primary challenges are:

    1. The robust alignment of high resolution images of 2D slices of neural tissue to construct three dimensional volumes
    2. Interactive visualization of volumetric petascale data
    3. Automatic segmentation of neural structures such as axons and synapses to define neural circuit elements
    4. Network analysis to enable the comparison of neural circuits

    The Harvard Center for Brain Science (CBS) and the Harvard Initiative in Innovative Computing (IIC) in collaboration with Microsoft Research are addressing these challenges by developing algorithms and tools to enable petascale analysis of neural circuits. In particular, we are developing algorithms capable of reconstructing large 3D volumes from collections of 2D images, scalable interactive visualization tools, semi-automatic segmentation methods for neural circuitry, and novel graph matching tools for connectivity analysis of neural circuits.
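
    As a small, self-contained example of the first challenge under strong simplifying assumptions, the sketch below estimates a pure translation between two adjacent slices by phase correlation; real alignment at this scale must also handle rotation, nonlinear distortion, and out-of-core data, so this shows only the core idea.

    ```python
    # Sketch of 2D slice alignment by phase correlation (translation only).
    import numpy as np


    def estimate_shift(slice_a: np.ndarray, slice_b: np.ndarray):
        """Return the (row, col) shift of slice_b relative to slice_a,
        i.e. slice_b is approximately slice_a rolled by this amount."""
        fa, fb = np.fft.fft2(slice_a), np.fft.fft2(slice_b)
        cross_power = np.conj(fa) * fb
        cross_power /= np.abs(cross_power) + 1e-12       # keep phase only
        correlation = np.fft.ifft2(cross_power).real
        peak = np.unravel_index(np.argmax(correlation), correlation.shape)
        # Peaks past the midpoint correspond to negative shifts (FFT wrap-around).
        return tuple(int(p) - s if p > s // 2 else int(p)
                     for p, s in zip(peak, correlation.shape))


    rng = np.random.default_rng(0)
    a = rng.random((128, 128))
    b = np.roll(a, shift=(5, -3), axis=(0, 1))   # simulate a translated adjacent slice
    print(estimate_shift(a, b))                  # expected: (5, -3)
    ```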

  • Tim Clark, Harvard IIC

    Alzheimer Disease (AD) and other neurodegenerative disorders (Parkinson’s, Huntington’s, ALS, etc.) are the poster children for science on the semantic web. Progress in these fields is dependent upon coordination and integration of knowledge developed in many research subspecialties, from genetics to brain imaging. Despite a massive increase in the quantity of information generated as research findings in AD over the past 20 years, there is still no consensus on the etiology of the disease. Integrating the findings of many disparate specialist fields into testable hypotheses, and evaluating competing hypotheses against one another, is still extremely challenging. Our group has developed an ontology of scientific discourse adapted for practical use by neuroscientists, with an annotation tool for organizing knowledge around semantically structured hypotheses. This tool has enabled us, working with leading AD researchers and science editors, to compile many of the core hypotheses developed by scientists in the field for presentation to the research community in semantic web format with their supporting evidence, relationships to other hypotheses at the level of claims, and other key related information. This model of wrapping scientific context, in the form of semantic metadata, around more traditional digital content on the web is extensible across many biomedical research disciplines. We believe it will enable more rapid and efficient progress towards curing several devastating diseases.

  • Dejing Dou, University of Oregon

    Event-related potentials (ERP) are brain electrophysiological patterns created by averaging electroencephalographic (EEG) data, time-locked to events of interest (e.g., stimulus or response onset). In this work, we propose a generic framework for mining and developing domain ontologies and apply it to mine brainwave (ERP) ontologies. The concepts and relationships in ERP ontologies can be mined according to the following steps: pattern decomposition, extraction of summary metrics for concept candidates, hierarchical clustering of patterns for classes and class taxonomies, and clustering-based classification and association rule mining for relationships (axioms) of concepts. We have applied this process to several dense-array (128-channel) ERP datasets. Results suggest good correspondence between mined concepts and rules, on the one hand, and patterns and rules that were independently formulated by domain experts, on the other. Data mining results also suggest ways in which expert-defined rules might be refined to improve ontology representation and classification results. The next goal of our ERP ontology mining framework is to address some long-standing challenges in conducting large-scale comparison and integration of results across ERP paradigms and laboratories. Along this direction, we are conducting two research projects: i) semantic data modeling and query answering based on ERP ontologies and ii) mapping discovery across multi-modality ontologies (i.e., surface space vs. source space). In a more general context, this work illustrates the promise of an interdisciplinary research program that combines data mining, neuroinformatics and ontology engineering to address real-world problems. This talk is mainly based on our paper in KDD 2007.
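
    A minimal sketch of the hierarchical clustering step, using fabricated latency and amplitude values rather than real ERP summary metrics, might proceed as follows:

    ```python
    # Sketch of one pipeline step: hierarchical clustering of summary metrics
    # extracted from ERP patterns to propose candidate concept classes.
    # The metric values below are fabricated for illustration.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # Each row: [peak latency in ms, mean amplitude in microvolts] for one pattern.
    patterns = np.array([
        [100.0,  2.1], [105.0,  2.4],      # early positive patterns
        [170.0, -3.0], [175.0, -2.8],      # N170-like negative patterns
        [300.0,  4.5], [310.0,  4.2],      # P300-like patterns
    ])

    # Standardize the metrics so latency does not dominate the distance measure.
    z = (patterns - patterns.mean(axis=0)) / patterns.std(axis=0)

    tree = linkage(z, method="average")            # agglomerative clustering
    labels = fcluster(tree, t=3, criterion="maxclust")
    print(labels)                                  # three candidate concept classes
    ```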

  • Social Networking Tools for Science

  • Michael Kurtz, Smithsonian

    In approximately the past two decades, the promise of future computer-network-based data and information systems has become a reality exceeding all but the most optimistic predictions. Presently the once separate concepts of measured data, processed data, scientific paper, scientific journal, author, reader, publisher, and library are merging in often unexpected ways. In the next couple of decades, standardized data formats, standardized data reduction capabilities and deep standardized mark-up will combine with automated workflow systems to form scientific communication units (journal articles or their successors) with profound capabilities for additional discovery. Our literature will become alive. Clearly, once we create a system where journal articles permit reader-mediated discovery, we will also have a system which will permit automated discovery by software agents, and meta-discovery by meta software agents looking at the output of the “lower level” agents, and so on. In this talk I will discuss these changes from a structural communications point of view, complementary to the data/database-centric view of Szalay and Gray (Nature 440, 413 (2006)).

  • Jeremy Frey, University of Southampton

    “Data, data everywhere but not any time to think” is a possible mantra for the problems of scientific data overload. The CombeChem project (www.combechem.org) takes a holistic approach to the undertaking of scientific laboratory research with a view to improving the quality, accessibility and re-use of chemical information. The project is investigating the use of e-Science technologies based on the idea of Publication@Source. This approach highlights the researcher’s responsibility to collect the scientific data with the fullest possible context from the start of the research process and to ensure that none of the material or context is lost as the data is processed, refined, analyzed and disseminated. Part of the CombeChem project investigated how e-Science could provide the mechanisms to support this ideal of laboratory research in a globalized and multidisciplinary world. I will illustrate how the ideas of e-Science, together with current collaborative tools such as blogs and wikis, can be applied to provide a “Smart Laboratory Environment” that works with researchers to improve the quality of research. Examples will be given of the use of tablet PCs as Electronic Laboratory Notebooks (ELN) that enable the recording of semantically rich statements about the research process, and of the use of blogs as laboratory notebooks for collaborative research. Examples will also be given from the laboratory perspective, where e-Science technology has been used to enhance remote monitoring and control of smart laboratories, by elevating laboratory equipment to first-class members of the networked community and by converting simple equipment into “Blogjects”. Once the information has been collected, techniques are needed to view, integrate and review the information in the ELNs and blogs, and I will discuss how the Semantic Web and Web 2.0 can play their part in this.

  • James Hendler, RPI

    Work in creating infrastructures for scientific collaboration has largely given way to the generalized technologies of the Web, with scientists now relying on blogs, “overlay” journals and scientific wikis. The next Web technology likely to affect scientists is the emerging Semantic Web. In the UK, an investment in eScience saw a number of approaches that used this technology to provide better data integration, management of scientific workflows, and provenance tracking of information in scientific systems. The use of ontologies in scientific applications matched Semantic Web technologies well, and projects showed how this technology could be applied to the needs of scientists. These techniques have been gaining wider visibility with their integration into large efforts such as the new “Encyclopedia of Life.” At the same time that the Web has been having a significant impact on science, the inverse has also been true. Major breakthroughs in new applications for the Web have come about by using scientific tools to analyze the Web and identify techniques scalable to Web sizes. The best known of these is the PageRank algorithm that gave rise to Google. Other efforts have included architecture and engineering work to help steer Web development, and social science efforts to understand why some technologies and/or efforts have scaled well (such as Wikipedia) while others have not achieved the critical mass to succeed. We present both the capabilities that new Web technologies will provide to science and the need for a better science of the Web. We argue that more interaction between the traditional sciences and the emerging “web scientists” will lead to new synergies that will have revolutionary impact both on the use of the Web in science and on the science of the Web. This talk includes joint work with T. Berners-Lee, W. Hall, N. Shadbolt, and D. Weitzner.

  • Joel Sachs, Cynthia Parr, UMBC; David Wang, University of Maryland; Andriy Parafiynyk, Microsoft; Timothy Finin, UMBC

    Eco-blogs are becoming popular amongst both amateur nature lovers and working biologists. Subject matter varies, but entries typically include date, location, observed taxa, and a description of behavior. These observations can be an important part of the ecological record, especially in domains (such as invasive species science) where amateur reporting plays an important role, and in the study of environmental response to climate change. To enable our goal of a human sensor net, we have developed SPOTter, a Firefox extension that enables the easy creation of RDF data by citizen scientists. SPOTter is not tied to a particular blogging platform, and can be used both to add semantic markup to one’s own blog posts and to annotate posts or images on other websites, such as Flickr. Once RDF is generated, we can apply much of the machinery we have developed as part of the SPIRE project. This includes Swoogle, our Semantic Web search engine; Tripleshop, our distributed dataset constructor; and ETHAN, our evolutionary trees and natural history ontology. We are then able to issue queries like “What was the northernmost spotting of the Emerald Ash Borer last year?” or “Show all sightings of invasive plants in California.” We experiment with our approach on the Fieldmarking blog. We also expressed in RDF the 1,200 observations from the first Blogger Bioblitz and, through integration with other ontologies, were able to respond to ad-hoc queries. Our talk will demonstrate how eco-blog posts end up on the Semantic Web, where they can be integrated with existing natural history information and queried. We will illustrate how scientists can share data by annotating it with RDF, publishing it via plug-ins to popular software, and making it accessible via new tools and Web mashups. Issues of provenance and reliability will also be addressed.
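
    To suggest how such an observation might look once it reaches the Semantic Web, the sketch below builds a single RDF sighting and runs a SPARQL query over it; the namespace and property names are hypothetical stand-ins, not the actual SPIRE or ETHAN vocabularies.

    ```python
    # Sketch of an eco-blog observation expressed as RDF and queried with SPARQL.
    # The obs: namespace and its properties are invented for this illustration.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    OBS = Namespace("http://example.org/observation#")

    g = Graph()
    sighting = URIRef("http://example.org/post/42")
    g.add((sighting, RDF.type, OBS.Sighting))
    g.add((sighting, OBS.taxon, Literal("Agrilus planipennis")))   # Emerald Ash Borer
    g.add((sighting, OBS.latitude, Literal(44.97, datatype=XSD.decimal)))
    g.add((sighting, OBS.date, Literal("2007-06-14", datatype=XSD.date)))

    # "Northernmost spotting" style query: order sightings of the taxon by latitude.
    results = g.query("""
        PREFIX obs: <http://example.org/observation#>
        SELECT ?s ?lat WHERE {
            ?s a obs:Sighting ;
               obs:taxon "Agrilus planipennis" ;
               obs:latitude ?lat .
        } ORDER BY DESC(?lat)
    """)
    for row in results:
        print(row.s, row.lat)
    ```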

  • Shared Tasks and Shared Infrastructure

  • Steve Harris, Oxford University; Jim Davies, Oxford University Computing Laboratory; Charles Crichton, Oxford University Computing Laboratory; Peter Maccallum, CR-UK Cancer Research Institute, Cambridge; Lorna Morris, CR-UK Cancer Research Institute, Cambridge

    The Software Engineering Group at Oxford are developing a model- and metadata-driven architecture for research informatics. The architecture is being evaluated for use in large-scale clinical trials on both sides of the Atlantic, and is being integrated with the NCI cancer Biomedical Informatics Grid (caBIG); it is being enhanced and generalized for a wider range of applications, and for use in other scientific disciplines. We present the achievements to date, and the lessons learnt in developing frameworks for semantics-driven data acquisition and processing. We report on the deployment of the (open) architecture on widely-used commercial technology – Microsoft SharePoint, Office, and .NET services – aimed at the expectations and requirements of a wide range of stakeholders. We discuss extensions of the approach to the model-driven development of laboratory information systems, including semantic annotation of tissue collections and microarray data. We discuss techniques for identifying candidate semantic metadata elements from existing artifacts (developed in collaboration with the Veterans Health Administration) and the deployment of federated metadata registries (developed in collaboration with caBIG). We explain how the approach may be generalized to cover semantics-driven data acquisition and processing in other disciplines.

  • Aman Kansal, Microsoft Research; Feng Zhao, Microsoft Research; Suman Nath, Microsoft Research

    Many advances in science come from observing the previously unobserved. However, developing, deploying, and maintaining the instrumentation required to observe the phenomenon under investigation is a significant overhead for scientists. In most cases, scientists are restricted to collecting data using limited individual resources. As a first step to overcome this limitation, central archives for sharing data have emerged, so that data collected in individual experiments can be re-used by others. We take the next step in this direction: we build an infrastructure, SenseWeb, to enable sharing the sensing instrumentation itself among multiple teams. The key idea is as follows. A scientist deploys sensors to observe a phenomenon, say soil moisture, at their site. The sensors are shared over SenseWeb. Other scientists interested in soil moisture can conduct experiments using these sensors through SenseWeb. Further, other ecologists may deploy similar sensors at their sites and share them. The scientist can now use SenseWeb to access not only her own sensors but also these other similar ones. What emerges is a “macroscope” of shared sensors measuring the phenomenon at a scale that no single scientist could instrument alone. New experiments are enabled, providing new insights by probing a phenomenon from multiple sites. The barrier to discovery is reduced, as many experiments can begin without deployment overhead. SenseWeb addresses challenges in supporting highly heterogeneous sensors, each with its own capabilities, precision, and willingness to share. It is built for scalability, allowing multiple concurrent experiments to access common resources. Its map-based web interface provides data visualization. Our prototype is currently used by nearly a dozen research teams to share sensors observing phenomena ranging from coral ecosystems to urban activity.
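
    The sharing model can be illustrated with a small conceptual sketch (this is not the actual SenseWeb API; all class and field names are assumptions): sensor owners publish sensors with a phenomenon and location, and other scientists query the shared pool by phenomenon and bounding box.

    ```python
    # Conceptual sketch of shared-sensor publishing and querying; names invented.
    from dataclasses import dataclass
    from typing import Callable, List


    @dataclass
    class Sensor:
        owner: str
        phenomenon: str            # e.g., "soil moisture"
        lat: float
        lon: float
        read: Callable[[], float]  # callback returning the latest reading


    class SensorRegistry:
        def __init__(self):
            self._sensors: List[Sensor] = []

        def publish(self, sensor: Sensor) -> None:
            self._sensors.append(sensor)

        def query(self, phenomenon: str, bbox: tuple) -> List[Sensor]:
            """Find shared sensors of a phenomenon inside (lat0, lat1, lon0, lon1)."""
            lat0, lat1, lon0, lon1 = bbox
            return [s for s in self._sensors
                    if s.phenomenon == phenomenon
                    and lat0 <= s.lat <= lat1 and lon0 <= s.lon <= lon1]


    registry = SensorRegistry()
    registry.publish(Sensor("alice", "soil moisture", 47.6, -122.3, read=lambda: 0.23))
    registry.publish(Sensor("bob", "soil moisture", 35.9, -79.0, read=lambda: 0.31))

    for sensor in registry.query("soil moisture", bbox=(30.0, 40.0, -90.0, -70.0)):
        print(sensor.owner, sensor.read())
    ```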

  • Eric Rouchka, University of Louisville; Yetu Yachim, University of Louisville

    Background: Biological imaging techniques, coupled with the affordability of large-scale storage systems, have made it possible to construct databases of high-resolution images. It is not uncommon for such images to exceed 500 MB in size. Conventional approaches for viewing and manipulating these images have typically been reserved for desktop applications that tend to be slow and resource-exhaustive. While this sort of approach may be acceptable for single-user applications, the internet has made it possible for geographically sparse research teams to form. Bandwidth bottlenecks do not allow for the effective real-time sharing of these high-resolution images without loss of detail due to image compression. Results: We have created a system, YMAGE, for the storage, distribution, and shared annotation of high-resolution images. Users of the YMAGE system will be able to create and connect to YMAGE-registered servers distributed across the internet. YMAGE allows the resolution of the images to be maintained by requesting and sending only the viewable region of the image, which can be changed using zooming utilities. The initial application of YMAGE is for in-situ hybridization images, such as those created through the Allen Brain Atlas project. However, the extensibility of YMAGE allows high-resolution images of any nature to be shared across geographically distributed research groups without loss of information due to image compression. YMAGE users log in to a shared user database where they are validated. Each image can be assigned as belonging to a group of users, including a public group. Users are able to view annotations for each image assigned by various research groups, and are able to add their own annotations as well.
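
    The "send only the viewable region" idea can be sketched with generic tile-pyramid arithmetic (the tiling scheme here is an assumption, not YMAGE's actual protocol): given a viewport and zoom level, compute which tiles the client must request.

    ```python
    # Illustrative sketch of viewport-driven tile requests over an image pyramid.
    from math import floor

    TILE = 256  # pixels per tile edge


    def tiles_for_viewport(x, y, width, height, zoom):
        """Return (zoom, col, row) keys for the tiles covering a viewport whose
        top-left corner is (x, y) in full-resolution pixels. At zoom level z the
        image is downscaled by 2**z, so viewport coordinates shrink accordingly."""
        scale = 2 ** zoom
        x0, y0 = x / scale, y / scale
        x1, y1 = (x + width) / scale, (y + height) / scale
        cols = range(floor(x0 / TILE), floor((x1 - 1) / TILE) + 1)
        rows = range(floor(y0 / TILE), floor((y1 - 1) / TILE) + 1)
        return [(zoom, c, r) for r in rows for c in cols]


    # A 1024x768 viewport near the origin of a huge image, viewed at zoom level 2:
    print(tiles_for_viewport(x=0, y=0, width=1024, height=768, zoom=2))
    # -> [(2, 0, 0)]  (the whole viewport fits in one 256x256 tile after 4x downscale)
    ```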

  • Jeffrey Grethe, Univ. of California, San Diego; Mark Ellisman, NCMIR, University of California, San Diego

    The Biomedical Informatics Research Network (BIRN) promotes advances in biomedical and health care research through the development and support of a cyberinfrastructure that facilitates data sharing and fosters a new biomedical collaborative culture. Sponsored by the NIH’s National Center for Research Resources, BIRN’s infrastructure consists of a cohesive implementation of key information technologies and applications specifically designed to support biomedical scientists in conducting their research. By intertwining concurrent revolutions occurring in biomedicine and information technology, BIRN is enabling researchers to participate in large-scale, cross-institutional research studies where they are able to acquire, share, analyze, mine and interpret data acquired at multiple sites using advanced processing and visualization tools. Some core components of this infrastructure, designed around a flexible large-scale grid model, include: a scalable and powerful data integration environment that allows users to access multiple databases as if they were a single database; the use and development of ontologies and data exchange standards; and a user portal that provides a common user interface, encouraging greater collaboration among researchers and offering access to a powerful suite of biomedical tools. The growing BIRN consortium currently involves more than 40 research sites that participate in one or more BIRN-related projects. The BIRN Coordinating Center is orchestrating the development and deployment of key infrastructure components for immediate and long-range support of the biomedical and clinical research being pursued. Building on this foundation, the NIH has recently released Program Announcements that encourage researchers to use the BIRN infrastructure to share data and tools or to use the infrastructure to federate significant data sets.