344 25wi Lecture 1: Syllabus; Relational Model I'm James (you can call me James), and this is my first time teaching 344. Why take this course? - You want to take 444 and this is a prereq - You've heard that databases are used to build applications and you want to build applications An application is a program that runs on your phone, your laptop/desktop, or that you access via a web browsers. I'll call it a *database-backed* application if it uses a database to store its data. Most applications that need to store data persistently (ie, even if the power goes out) use a database. But what is a database (DB)? - I think the simplest answer is just "a collection of data", which sounds trivial but I'll argue there's actually something useful to get out of it. A *database management system* (DBMS) is a program to help manage a database, meaning it allows us to add, remove and change the data in the collection, and it also allows us to query (read) the data in the collection. Many people informally refer to DBMSes as "databases", which is confusing. I will probably accidentally do this too. But if we are speaking precisely, the database is just the data and the DBMS is the program that helps us manipulate it. (I also tend to informally use the phrase "databases" as another name for the field of data management.) DMBSes are pretty big and complicated programs. And there are many competing ones available. We'll look at specifics next class. CSE 444 teaches you how to build a DBMS, and is highly recommended. This course is primarily about how to (expertly) use a DMBS. -------- Aside: please read the syllabus -------- Systems and Theory I believe that good systems are founded on good theory. (And good theory problems often arise from hard systems problems!) The field of data management is a great example of this, and I will try to emphasize the connections between the theory and the practical/systems building aspects of the course. The Relational Model Back to our working definition of a database A database is a collection of data From this definition we can derive the key insight of the relational model, which is a mathematical description of how to think about a database. For the word "collection" we will substitute the concept of a *mathematical set*. What should go in the set? Some kind of elements. In the relational model, those elements are *tuples*. Recall from CSE 311 that the Cartesian product of two sets A and B is notated as "A x B" and consists of all ordered pairs of elements where the first component comes from A and the second component comes from B. For example, the Cartesian product of the set {1, 2, 3} and {a, b} is {(1, a), (2, b), (3, a), (1, b), (2, b), (3, b)} If we take the Cartesian of three sets then we get a set of triples, and so on. In general if we don't want to specify how many sets are being product-ed together, then we call the elements "tuples". Now we can officially define the relational model: The relational model says that a database is a bunch of sets of tuples. The word "bunch" is chosen purposefully to indicate that we can have more than one set of tuples (and we typically do). We will call a set of tuples a *relation*. (This is the same definition given in CSE 311!) If we want to describe which sets the elements of the tuples are drawn from, then we can use the following "relation on ......" terminology: a relation on A1, ..., An is a subset of the Cartesian product A1 x ... x An (Recall that a set X is a subset of another set Y if every element of X is also an element of Y.) Returning to our previous example, let A1 = {1, 2, 3} and A2 = {a, b} then the following are relations on A1, A2: - The whole Cartesian product A1 x A2 is, because A1 x A2 is a subset of itself - Let R = {(1, a), (2, b)}. Then R is a subset of A1 x A2, so R is a relation on A1, A2. - Let S = {} (the empty set. Then S is a subset of A1 x A2 (because the empty set is a subset of any set), so S is a relation on A1, A2. The key claim of the relational model is that a good mental model for a database is a bunch of relations. That is how users should think about the database. Tables Here is an alternate (equivalent) description of the relational model using the language of tables. A table is a 2d grid of rows and columns. The first row is special and is called the header, which labels each column with what type of data it stores. Then another way to talk about the relational model is: The relational model says that a database is a bunch of tables. For example, our relation R would correspond to a table like this {1, 2, 3} | {a, b} -----------+----------- 1 | a 2 | b It's clear that tables and relations are closely related to each other. Every relation can be nicely represented as a table. (The opposite is not quite true, which we will discuss later.) This translation yields several different interchangeable words for similar concepts - The collections of data: relations, tables - The elements of those collections: tuples, rows, records, (perhaps objects?) - The properties of those elements: attributes, columns, fields Relational DBMSes A DBMSs that encourages users to think in terms of relations/tables is called a Relational DBMS, which we abbreviated RDBMS. Empirical Law 1: All good DBMSes eventually become relational. Empirical Law 2: All good ideas in data management eventually get implemented in a relational DBMS. To use a relational DBMSes, we must think in terms of relations. So, to build a database-backed application on top of an RDBMS, we must describe our application data in terms of (several) relations. It is perhaps surprising that this is always possible. But it is, as far as I can tell. Relations versus Spreadsheets A spreadsheet user interface at first glance looks a lot like a table. Since tables are similar to relations, it is tempting to ask "are spreadsheets relations?" I think it is *possible* to use a spreadsheet as a relation, but you have to follow some rules that are not enforced by most spreadsheet applications: - each column should have a type (and all the data in that column actually has that type) - don't store multiple relations in different "places" in the spreadsheet (use a separate sheet instead) (Of course, spreadsheets have many unique and excellent features, such as data visualization, immediate recalculation of formulas, etc., so I'm not claiming they are bad. Just that they do not enforce the relational model.) Finally, note that a 2d grid has an inherent symmetry between the vertical dimension and the horizontal dimension. In other words, there is no inherent distinction in a 2d grid between a row and a column, and we could swap their roles with not loss in functionality. But the same is not true of relations! Relations are asymmetric (ie, non-symmetric) in the two dimensions. In a relation, a row is fundamentally different than a column. A row is a record. A column is the value of one attribute across all records.