CSE 344 Lecture 01 -- Introduction ================================================================ Administration: -- Introducing the staff -- Textbook: Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom (Package ISBN-10: 0138155046 | ISBN-13: 9780138155049) -- Check website; subscribe to mailing list -- Homeworks: due on Mondays; first homework already posted, due next week. -- Webquizes on Gradience: You need the access code from the textbook, and the class token that I will give in class Due on Saturdays: first quiz already posted -- Midterm: April 29; Final: June 8 -- Grading: Homeworks: 30% Webquizes: 20% Midterm: 20% Final: 30% -- Software tools: sqlite, Saxon XQuery, postgres, Pig Latin on AWS ================================================================ Overview of a Database Management System What is a database ? Give examples of databases. What is a database management system ? Give examples of DBMS. Oracle IBM: DB2, Informix Microsoft: SQL Server, Access Sybase MySQL (Sun/Oracle) PostgreSQL SQLite What is data management ? Give examples of data management tasks ================================================================ Example: Online Bookseller Data = information on books, customers, pending orders, order histories, trends and preferences, etc. Massive data Hundreds of GBs More if keep click stream logs, images of book covers, sample pages; some application data grows by a terabyte a day. => Far too big for memory Simple commands to: find a specific book, list all books in a certain category and price range, generate an order history, produce sales figures grouped by state, etc. => Also unpredicted queries should be easy Persistent data Data outlives programs that operate on it Safe data From hardware failures, software failures, power outages, ... From malicious users Multi-user access Many people/programs accessing same database, or even same data, simultaneously => Need careful controls => Need high performance: Thousands (or more) queries/updates per second ================================================================ Multi-user example and discussion Jane and John both have ID number for bookseller gift certificate (credit) of $200 they got as a wedding gift. Jane @ her office: orders "The Selfish Gene, R. Dawkins" ($80) prompt user for credit ID; get credit from database; if credit >= 80 then credit := credit - 80; issue order to mail book (add to pending orders); if OK then put new credit into database (else handle error); else print "sorry" John @ his office: orders "Guns, Germs, and Steel, J. Diamond" ($100) prompt user for credit ID; get credit from database; if credit >= 100 then credit := credit - 100; issue order to mail book (add to pending orders); if OK then put new credit into database (else handle error); else print "sorry" Initial credit = $200 Ending credit = ??? What if the system crashes ?? Appears similar to concurrent programming problems (synchronization, semaphores, etc.) BUT: data not main-memory variables Appears similar to file system concurrent access BUT: want to control at smaller granularity Also database may be distributed, replicated ================================================================ Key concepts in data management -- Logical data model. Examples: Relational data model: tables/records/attributes. E.g. each book record has ISBN, Title, Publisher,... each author has First-name, Last-name, ... Semistructured data model: XML document E.g. each book has ISBN, Title, several authors (with first-name etc), may have review, etc. Graph data model: RDF data E.g. social network: nodes are people, edges are friendship relationships -- Schema v.s. data Schema describes how to structure the data Data is an instance of the schema -- Physical data independence: logical schema is separate from the physical schema (files, indexes, disks, servers,...) independence: changes to the physical schema should not affect programs that refer to the logical schema -- High level query language ("set at a time" v.s. "record at a time"): key for physical data independence. query language expresses WHAT we want to get from the data -- Query compiler and optimizer translates WHAT to HOW; invisible to the user -- Transactions: isolation, atomicity (ACID) ================================================================ People -- Database application developer: writes programs that query and modify data (344) -- Database designer: establishes schema (344) -- Database administrator: loads data, tunes system, keeps whole thing running (344, 444) -- Data analyst: data mining, data integration (344, 446) -- DBMS implementor: builds the system (444) ================================================================ Course outline: Relational Data Model SQL, Relational Algebra, Relational Calculus, datalog Semistructured Data Model: XML, XPath, XQuery Conceptual design: E/R diagrams, Views, Database normalization Transactions Parallel databases; Map/Reduce; Pig-Latin Data mining, data cleaning