CSE 599C Scientific Data Management
Announcements
- 04/03/10: Created class mailing list and assigned discussion groups.
- 03/24/10: Welcome to CSE 599C!
- 03/24/10: If you are interested, please register for the Cloud Futures workshop at Microsoft Research. For the second lecture, we will go to the workshop instead of having a regular lecture. There are some potentially very interesting talks about cloud+science on that Friday afternoon.
Administration
Instructors: Magdalena Balazinska and Bill Howe
Meeting times: Fridays 1:30pm-4:20pm (ending just in time for TGIF!)
Location: CSE 405.
Class mailing list: https://mailman.cs.washington.edu/mailman/listinfo/cse599c
Overview
Scientists today face an avalanche of data. Oceanographers generate terabytes with daily forecasts of temperature, elevation, and velocity. Astronomers acquire hundreds of millions of images from increasingly powerful telescopes. Physicists are already discussing petabyte-scale datasets collected from particle accelerators. Biologists have sequenced the human genome, itself a large dataset, and are now describing the complex interactions between all 20,000 - 80,000 protein-encoding genes, not to mention the interactions between the proteins they encode. In all cases, scientists' ability to collect data has outpaced their ability to manage it. Complicate matters with non-standard data types, extreme performance demands, and ever-changing requirements, and you have one of the major data management challenges of today. What do these applications have in common, and why are traditional data management tools inadequate? In this course, we will investigate this question from the perspective of modern database research. We will look at what scientific datasets in different domains have in common, and what sets them apart. We will survey the literature in this area, and explore tools used in practice.
Format
Approximately two papers will be assigned for each class. Please read the papers and come prepared to discuss them.
Evaluation
The course grade will be based on participation.
Course Calendar
The course calendar is still preliminary and subject to change.
|
Topic and readings |
Discussion |
April 2 |
Topic: Data deluge in science and its implications.
Guest talks:
Jeff Gardner, UW Physics and Astronomy, UW eScience Institute [slides]
Andy Connolly, UW Physics and Astronomy [slides]
John Boyle, Director of Informatics Core, Institute for Systems Biology [slides]
Readings: All the papers below are very quick reads, except the last one, which is a bit longer.
- “Science In An Exponential World,” Jim Gray, Alex Szalay, Nature, V. 440.23, 23 March 2006.
- Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science. Jim Gray, Alexander S. Szalay. MSR-TR-2004-110, October 2004, IEEE Data Engineering Bulletin, December 2004, Vol. 27.4, pp. 3-11.
- Requirements for Science Data Bases and SciDB. Michael Stonebraker, Jacek Becla, David J. DeWitt, Kian-Tat Lim, David Maier, Oliver Ratzesberger, Stanley B. Zdonik. CIDR 2009. [Skip over the description of the array model and section 2.2]
- Try to read this one if you have time. It's not just about science: Data, data everywhere: A special report on managing information (The Economist)
|
Open |
April 9 |
Topic: Science in the cloud (part 1)
Instead of a normal lecture, we will attend the Cloud Futures 2010 workshop!
Readings: None assigned. |
None |
April 16 |
Topic: Science in the cloud (part 2)
Lecture notes: lecture3.pdf
Readings:
|
Positive: A
Negative: B
Break: C
|
April 23 |
Topic: Data intensive analytics
Lecture notes: lecture4.pdf
Readings:
|
Positive: B
Negative: C
Break: A |
April 30 |
Topic: New data types (arrays, meshes, and others)
Lecture notes:
Readings:
- Efficient Query Processing on Unstructured Tetrahedral Meshes. Stratos Papadomanolakis, Anastassia Ailamaki, Julio C. Lopez, Tiankai Tu, David R. O’Hallaron, and Gerd Heber. SIGMOD 2006.
- Requirements for Science Data Bases and SciDB. Michael Stonebraker, Jacek Becla, David J. DeWitt, Kian-Tat Lim, David Maier, Oliver Ratzesberger, Stanley B. Zdonik. CIDR 2009. [same reading as lecture 1, but this time please focus on the array data model and operator sections]
|
Positive: C
Negative: A
Break: B |
May 7 |
Topic: RDF and ontologies
Lecture notes: lecture6.pdf
Guest talk by David Jones, Department Head, Environmental & Information Systems (EIS), UW Applied Physics Lab on data management challenges in the ocean sciences.
Note that there is a third paper to read this week on the NANOOS Visualization System, which David will present.
Readings:
- Systems paper: The RDF-3X engine for scalable management of RDF data, Thomas Neumann and Gerhard Weikum. VLDBJ 19(1), February 2010. (Pay attention to the related work section.)
- Science application paper:
Ontology-supported Scientific Data Frameworks: The Virtual Solar-Terrestrial Observatory Experience, P. Fox, D. McGuinness, L. Cinquini, P. West, J. Garcia, and J. Benedict, Computers and Geosciences, special issue on Geoscience Knowledge Representation for Cyberinfrastructure, 2009
- Guest speaker: The NANOOS Visualization System: Aggregating, Displaying and Serving Data, Risien, C.M., J.C. Allan, R. Blair, A.V. Jaramillo, D. Jones, P.M. Kosro, D. Martin, E. Mayorga, J.A. Newton, T. Tanner, and S.A. Uczekaj, Oceans 2009
Other readings (not required): |
Positive: D
Negative: E
Break: F |
May 14 |
Topic: Query composition and language bindings
Lecture notes:
Readings:
|
Positive: E
Negative: F
Break: D |
May 21 |
Topic: Scientific workflows and mashups
Readings:
- Carole Goble, David De Roure. "The Impact of Workflow Tools on Data-centric Research." Book chapter, "The 4th Paradigm: Data-Intensive Scientific Discovery", Microsoft Research, Edited by Tony Hey
- Jianwu Wang, Daniel Crawl, Ilkay Altintas. "Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems." Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science (WORKS09) at Supercomputing 2009 (SC2009) Conference.
- Ohad Greenshpan, Tova Milo, Neoklis Polyzotis, "Autocompletion for Mashups", VLDB 2009.
|
Positive: F
Negative: D
Break: E |
May 28 |
Topics:
- Part 1: Workflow provenance
- Part 2: Provenance, conflicts, and curation (Guest lecture by Alexandra Meliou)
Lecture notes:
Readings:
- Please focus on the first and last chapters: James Cheney, Laura Chiticariu, Wang Chiew Tan. "Provenance in Databases: Why, How, and Where." Foundations and Trends in Databases, Volume 1, Issue 4, Pages 379-474, April 2009.
|
Positive: G
Negative: H |
June 3 |
Topic: Visualization
Lecture notes:
Readings: off this week for SIGMOD |
Positive: H
Negative: G |