Big Data Blows into the Windy City

Ensuring data discoverability, accessibility, and consumability

This week, the annual Microsoft eScience Workshop is being held in Chicago (the “Windy City”), providing an unparalleled opportunity for domain scientists, researchers, and technologists to discuss the benefits and difficulties of incorporating more computing and information technology into the scientific process. Over the years, the eScience workshop has provided a forum where scientists can voice their data and technology challenges and get input from those who’ve confronted similar issues.
 
Front and center this year are topics related to Big Data—be it the management of the rising data flood, the analysis of the data tsunami, or even the visualization of the data explosion. In addition, this year’s workshop explores questions about how to train and develop data scientists, and how citizen scientists can play a role in gaining insights from the vast amounts of information.

Many of these topics are examined in the book The Fourth Paradigm: Data-Intensive Scientific Discovery, which is an excellent resource for these discussions. As that book shows, the Big Data “opportunity” has been building for some time, but it has now reached a tipping point in awareness across more science domains. The commoditization of devices, sensors, storage, and connectivity, paired with technologies like cloud computing, has made capturing and maintaining all the data in those science domains a plausible reality. As a result, scientists are thinking about what can be done, rather than lamenting what could be done if only they had the research infrastructure.
 
In preparing for this year’s event, I looked back at the very first Microsoft eScience Workshop, held in 2004. I revisited Jim Gray’s keynote and put together this six-slide composite of the main challenges Jim identified back then. As you’ll notice, while some progress has been made, many of those challenges are still being addressed. For instance, global federation has remained a key issue for distributed and disparate databases. Do you move all the data to one location? Or do you ensure that the data owners continue to curate the data and safeguard the quality of the datasets? The approach taken by SkyQuery has really advanced federation by demonstrating how multiple datasets can be queried seamlessly and by implementing novel approaches, such as spatial join queries. If you want more details, check out the paper SkyQuery: A WebService Approach to Federate Databases.

Six-slide composite of the main challenges that Jim Gray identified at the first Microsoft eScience Workshop in 2004

To truly tackle these data challenges, scientific datasets need the following attributes: discoverability, accessibility, and consumability. If a dataset doesn’t have all three, it might as well be kept in a file cabinet. Much work has been done lately on discoverability: for example, the emergence of various “data.gov” domain science catalogs, and even commercial ones such as the Windows Azure Marketplace. The “Open Data for Open Science” session at this year’s eScience Workshop explores how to address some of these challenges from the science side and looks at how simple, Internet-based protocols, such as OData (the Open Data Protocol), can help ensure that the end-user scientist can use the data.
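To give a feel for what "simple, Internet-based" means in the case of OData, here is a minimal sketch of how a scientist's script might compose a query URL using the protocol's standard system query options ($filter, $select, $top). The service root, the "Observations" entity set, and the field names are hypothetical placeholders rather than a real endpoint, and the helper function is my own illustration, not part of any OData library.

```python
from urllib.parse import quote


def build_odata_query(service_root, entity_set, filter_expr=None, select=None, top=None):
    """Compose an OData query URL from a few standard system query options."""
    options = []
    if filter_expr:
        # Percent-encode spaces and other unsafe characters in the filter expression.
        options.append("$filter=" + quote(filter_expr, safe="()',"))
    if select:
        # $select takes a comma-separated list of property names.
        options.append("$select=" + ",".join(select))
    if top is not None:
        # $top limits the number of entities returned.
        options.append("$top=" + str(int(top)))

    url = service_root.rstrip("/") + "/" + entity_set
    if options:
        url += "?" + "&".join(options)
    return url


url = build_odata_query(
    "https://example.org/odata",  # hypothetical service root
    "Observations",               # hypothetical entity set
    filter_expr="Magnitude lt 18",
    select=["ObjectId", "Ra", "Dec", "Magnitude"],
    top=10,
)
print(url)
```

Because the whole request is an ordinary URL, the same query can be pasted into a browser or issued from any tool that speaks HTTP, which is much of what makes a dataset consumable.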
 
The Monday evening event at the Adler Planetarium showcases how scientific data and information can be communicated to the public through amazing 3-D tours powered by Microsoft Research WorldWide Telescope (WWT) and brought to life in the planetarium’s Grainger Sky Theater. Microsoft researcher Jonathan Fay, architect of WWT, has been working with the Adler to ensure that tours originally developed to be shown in the planetarium can be taken home and experienced later. An example of the great work from the Adler is the Welcome to the Universe show and the WWT tour narrated by astronomer Mark SubbaRao. You can play the tour in your browser. You can find more tours powered by WorldWide Telescope at the Layerscape website.
 
Whether you’re attending the Microsoft eScience Workshop or just wishing you could, I encourage you to dive into these Big Data challenges.

Dan Fay, Director, Earth, Energy, and Environment; Microsoft Research Connections
