344 25wi Lecture 1: Syllabus; Relational Model

I'm James (you can call me James), and this is my first time teaching 344.

Why take this course?
- You want to take 444 and this is a prereq
- You've heard that databases are used to build applications and you
  want to build applications

An application is a program that runs on your phone, your laptop/desktop, or
that you access via a web browsers. I'll call it a *database-backed* application
if it uses a database to store its data. Most applications that need to store
data persistently (ie, even if the power goes out) use a database.

But what is a database (DB)?

- I think the simplest answer is just "a collection of data", which sounds trivial
  but I'll argue there's actually something useful to get out of it.

A *database management system* (DBMS) is a program to help manage a database,
meaning it allows us to add, remove and change the data in the collection, and
it also allows us to query (read) the data in the collection.

Many people informally refer to DBMSes as "databases", which is confusing. I
will probably accidentally do this too. But if we are speaking precisely, the
database is just the data and the DBMS is the program that helps us manipulate it.

(I also tend to informally use the phrase "databases" as another name for the
field of data management.)

DMBSes are pretty big and complicated programs. And there are many competing
ones available. We'll look at specifics next class. CSE 444 teaches you how to
build a DBMS, and is highly recommended. This course is primarily about how to
(expertly) use a DMBS.

--------

Aside: please read the syllabus

--------

Systems and Theory

I believe that good systems are founded on good theory. (And good theory
problems often arise from hard systems problems!) The field of data management
is a great example of this, and I will try to emphasize the connections between
the theory and the practical/systems building aspects of the course.


The Relational Model

Back to our working definition of a database

    A database is a collection of data

From this definition we can derive the key insight of the relational model,
which is a mathematical description of how to think about a database.

For the word "collection" we will substitute the concept of a *mathematical set*.

What should go in the set? Some kind of elements. In the relational model,
those elements are *tuples*.

Recall from CSE 311 that the Cartesian product of two sets A and B is notated as
"A x B" and consists of all ordered pairs of elements where the first component
comes from A and the second component comes from B.

For example, the Cartesian product of the set {1, 2, 3} and {a, b} is

    {(1, a), (2, b), (3, a), (1, b), (2, b), (3, b)}

If we take the Cartesian of three sets then we get a set of triples, and so on.
In general if we don't want to specify how many sets are being product-ed
together, then we call the elements "tuples".

Now we can officially define the relational model:

    The relational model says that a database is a bunch of sets of tuples.

The word "bunch" is chosen purposefully to indicate that we can have more than
one set of tuples (and we typically do).

We will call a set of tuples a *relation*. (This is the same definition given in
CSE 311!)

If we want to describe which sets the elements of the tuples are drawn from, then
we can use the following "relation on ...<some sets>..." terminology:

    a relation on A1, ..., An is a subset of the Cartesian product A1 x ... x An

(Recall that a set X is a subset of another set Y if every element of X is also
an element of Y.)

Returning to our previous example, let A1 = {1, 2, 3} and A2 = {a, b} then the
following are relations on A1, A2:

- The whole Cartesian product A1 x A2 is, because A1 x A2 is a subset of itself
- Let R = {(1, a), (2, b)}. Then R is a subset of A1 x A2, so R is a relation
  on A1, A2.
- Let S = {} (the empty set. Then S is a subset of A1 x A2 (because the empty
  set is a subset of any set), so S is a relation on A1, A2.

The key claim of the relational model is that a good mental model for a database
is a bunch of relations. That is how users should think about the database.


Tables

Here is an alternate (equivalent) description of the relational model using the
language of tables.

A table is a 2d grid of rows and columns. The first row is special and is called
the header, which labels each column with what type of data it stores.

Then another way to talk about the relational model is:

    The relational model says that a database is a bunch of tables.

For example, our relation R would correspond to a table like this


     {1, 2, 3} |  {a, b}
    -----------+-----------
         1     |   a
         2     |   b


It's clear that tables and relations are closely related to each other. Every
relation can be nicely represented as a table. (The opposite is not quite true,
which we will discuss later.) This translation yields several different
interchangeable words for similar concepts

- The collections of data: relations, tables
- The elements of those collections: tuples, rows, records, (perhaps objects?)
- The properties of those elements: attributes, columns, fields


Relational DBMSes

A DBMSs that encourages users to think in terms of relations/tables is called a
Relational DBMS, which we abbreviated RDBMS.

Empirical Law 1: All good DBMSes eventually become relational.

Empirical Law 2: All good ideas in data management eventually get implemented in
a relational DBMS.

To use a relational DBMSes, we must think in terms of relations. So, to build a
database-backed application on top of an RDBMS, we must describe our application
data in terms of (several) relations. It is perhaps surprising that this is
always possible. But it is, as far as I can tell.


Relations versus Spreadsheets

A spreadsheet user interface at first glance looks a lot like a table. Since
tables are similar to relations, it is tempting to ask "are spreadsheets
relations?"

I think it is *possible* to use a spreadsheet as a relation, but you have to
follow some rules that are not enforced by most spreadsheet applications:
- each column should have a type (and all the data in that column actually has
  that type)
- don't store multiple relations in different "places" in the spreadsheet (use a
  separate sheet instead)

(Of course, spreadsheets have many unique and excellent features, such as data
visualization, immediate recalculation of formulas, etc., so I'm not claiming
they are bad. Just that they do not enforce the relational model.)

Finally, note that a 2d grid has an inherent symmetry between the vertical
dimension and the horizontal dimension. In other words, there is no inherent
distinction in a 2d grid between a row and a column, and we could swap their
roles with not loss in functionality.

But the same is not true of relations! Relations are asymmetric (ie,
non-symmetric) in the two dimensions.

In a relation, a row is fundamentally different than a column. A row is a
record. A column is the value of one attribute across all records.