CSE 599C Scientific Data Management




Instructors: Magdalena Balazinska and Bill Howe

Meeting times: Fridays 1:30pm-4:20pm (ending just in time for TGIF!)

Location: CSE 405.

Class mailing list: https://mailman.cs.washington.edu/mailman/listinfo/cse599c


Scientists today face an avalanche of data. Oceanographers generate terabytes with daily forecasts of temperature, elevation, and velocity. Astronomers acquire hundreds of millions of images from increasingly powerful telescopes. Physicists are already discussing petabyte-scale datasets collected from particle accelerators. Biologists have sequenced the human genome, itself a large dataset, and are now describing the complex interactions between all 20,000 - 80,000 protein-encoding genes, not to mention the interactions between the proteins they encode. In all cases, scientists' ability to collect data has outpaced their ability to manage it. Complicate matters with non-standard data types, extreme performance demands, and ever-changing requirements, and you have one of the major data management challenges of today.

What do these applications have in common, and why are traditional data management tools inadequate? In this course, we will investigate this question from the perspective of modern database research. We will look at what scientific datasets in different domains have in common, and what sets them apart. We will survey the literature in this area, and explore tools used in practice.


Approximately two papers will be assigned for each class. Please read the papers and come prepared to discuss them.


The course grade will be based on participation.

Course Calendar

The course calendar is still preliminary and subject to change.


Topic and readings | Discussion

April 2

Topic: Data deluge in science and its implications.

Guest talks:
Jeff Gardner, UW Physics and Astronomy, UW eScience Institute [slides]
Andy Connolly, UW Physics and Astronomy [slides]
John Boyle, Director of Informatics Core, Institute for Systems Biology [slides]

Readings: All the papers below are very quick reads, except the last one, which is a bit longer.


April 9

Topic: Science in the cloud (part 1)

Instead of a normal lecture, we will attend the Cloud Futures 2010 workshop!

Readings: None assigned.


April 16

Topic: Science in the cloud (part 2)

Lecture notes: lecture3.pdf


Positive: A
Negative: B
Break: C

April 23

Topic: Data intensive analytics

Lecture notes: lecture4.pdf


Positive: B
Negative: C
Break: A

April 30


Topic: New data types (arrays, meshes, and other)

Lecture notes:


Positive: C
Negative: A
Break: B

May 7

Topic: RDF and ontologies

Lecture notes: lecture6.pdf

Guest talk by David Jones, Department Head, Environmental & Information Systems (EIS), UW Applied Physics Lab on data management challenges in the ocean sciences.

Note that there is a third paper to read this week on the NANOOS Visualization System, which David will present.


Other readings (not required):

Positive: D
Negative: E
Break: F

May 14

Topic: Query composition and language bindings

Lecture notes:


Positive: E
Negative: F
Break: D

May 21

Topic: Scientific workflows and mashups


Positive: F
Negative: D
Break: E

May 28


  • Part 1: Workflow provenance
  • Part 2: Provenance, conflicts, and curation (Guest lecture by Alexandra Meliou)

Lecture notes:


  • Please focus on the first and last chapters: James Cheney, Laura Chiticariu, Wang Chiew Tan. "Provenance in Databases: Why, How, and Where." Foundations and Trends in Databases, Volume 1, Issue 4, Pages 379-474, April 2009.
Positive: G
Negative: H

June 3

Topic: Visualization

Lecture notes:

Readings: None this week, because of SIGMOD.

Positive: H
Negative: G