Lab 8: Database Intro
FIT 100, Spring 2005

Key Words: database, database management system, DBMS, relational database management system, RDBMS, entities, relations, attributes, queries, structured query language, SQL, Microsoft Access, forms, reports, rows, tuples, table, columns, fields, data type, Text, Number, identifier, inconsistent, table design, table data, primary key, foreign key, normalize, AutoNumber, One-To-Many
 
See Wikipedia (http://www.wikipedia.org/) for a detailed description of the key words.

Preparation

You must read Chapter 13 ("Getting to First Base") before starting this Lab.  Also skim through Chapter 14 (part about "Queries: Creating Views").

Objectives


For this lab, you will create a small database to store information about music CDs.  You will be using your work from this lab in the next lab (and possibly in a short project).  So do it nice or do it twice!

Database terminology

(Suggestion:  After the lab session, print this and read it with highlighter in hand, to pick out the terms you want to remember.)

Refer to Chapter 13 in the book and Wikipedia for more detailed explanations of the database terminology.  (A caution:  The use of the word "relationship" by the text in Chaper 14, and by the software we'll be using (MS Access), can be confusing, as it sounds like it should have something to do with "relational" in "relational database".  It doesn't, and in fact, the two refer to contradictory things.  We are trying to be precise in our usage here.  Just watch out for "relationship" versus "relational" and "relation" - "relational" and "relation" are formal terms in math and relational databases; "relationship" isn't - it mainly has its normal English meaning.)

A database is a structured collection of information that allows users to store, retrieve, change, and ask questions about ("perform queries on") structured information.  A relational database is the most common type of database.  The word "relational" comes from "relational algebra", which in turn comes from the mathematical term "relation".  The relational algebra describes precisely results of doing queries that just ask questions about the data, and the changes that are made in the contents of the database as a result of doing queries that insert, delete, or update information.

If you remember your introductory math, a "relation" is a set of "tuples", and a tuple is a list of values, each of which has an assigned role in the tuple.  Tuples are usually written with parentheses around them, with the values separated by commas.  For instance, in the mathematical relation "greater than", we could express the fact that 3 is greater than 1 by writing (3,1) and putting that in a set called Greater, or perhaps more clearly, just writing Greater(3,1).  This is fine so long as we are clear that the first "slot" in the tuple means "the number on the left side of the greater than sign" and the second slot is the number on the right.  More complicated relations have more slots.

In a relational database, the information is split up into different types or categories of data, and that each type is described by attributes.  For instance, "people" might be one type, and "books" another, and "companies" a third.  Some attributes of a person might be age, gender, employer, mother, father, address.

A member of a type - an individual person or book or company, etc. - is referred to as an entity.  Each entity is described by specific values of attributes.  What binds a collection of entities together as one type is that they all have the same generic attributes.

Where this ties up to "relations" in math is that each entity's attributes form a tuple - those particular values are "related to" each other because they belong to one actual entity.  The entire collection of entities of a particular type would be a "relation" in the mathematical sense.

In a relational database, the collection of entities of one type is called a table - think of information laid out in rows and columns, where each entity is represented by one row in the table, and attributes are lined up in columns.  We put all of one sort of attribute in the same column - this is like each slot in a tuple having a fixed meaning.  Rows may also be called records, and columns are also called fields.

Some attributes are just simple values, e.g. age is a number.  But attributes can also be references to other entities.  For instance, the value of a person's "employer" attribute would be a reference to a "company" entity.  We say that the employer attribute refers to the company table.  A reference can point into the same table - "mother" and "father" are also people.

In a relational database, an attribute that refers to a row in another table actually contains some information that identifies that row.  Almost without exception, each table includes a field that is a unique identifier (usually a number) for each row - this is called the key (or primary key).  The name of the table and the value of the key together allow finding the row.  So if one table needs to refer to another, it can have a field for the key of the other table.  (An attribute that is a key belonging to some other table is called a foreign key.)

Data modeling

Given a bunch of data, we need to decide how to split it up into useful parts.  If we're going to use a relational database, those parts are tables and their fields.  Using tables and fields to describe data is called a relational model.

Sometimes it's clear how the data should be divided up into tables - other times this is a rather tricky choice.  For instance, an entity might have an attribute that clearly makes sense as an attribute, not an entity in its own right, e.g. a person and their address, but where there's some other reason why that data should be put off in its own table.  In this example, several people might share one address.  If we store the addresses in the "people" table, then we have to duplicate it for each person at that address.  Not only can this waste space, but it could also allow mistakes to creep in - if each address is typed in separately, addresses that are supposed to be the same might become inconsistent.

There is another, quite different, way of describing the structure of data, called the entity-relationship (ER) model.  Instead of merely saying that one entity refers to another, an ER model for the data gives a name to each reference, that describes the relationship between one entity and the other that is implied by the reference.  (Here, "relationship" is a formal term, but this is for ER models, not for relational models...)  For instance, for the person and their address, the relationship could be called "lives at".  ER descriptions of data are often shown as diagrams, with boxes for entities and diamonds for relationships - we'll see one below.  Here, we might write [person] --- <lives at> --- [address].

Pulling out the relationship by itself can make some structure in data more clear than if we tried to go straight to tables and fields.  For instance, think about this relationship:  [consultant] --- <consults for> --- [company].  A consultant may very well be consulting for more than one company at a time, and that number is arbitrary, so we wouldn't want to put a field in the consultant table for the company's key - we'd have to put in enough fields for the most prolific consultant.  So we might think, we'll put the consultant's id in the company table.  Wrong!  A company could have lots of consultants working for it.  Instead, what we can do is have a separate table, just for the consultant / company relationship!  It can hold pairs of a consultant id and a company id.  Now a consultant or a company can have as many references to the other type as it needs, because they're stored in the rows of a separate table.

Database software

A database management system, or DBMS, is software that allows users to store, retrieve, change, and ask questions about ("perform queries on") information stored in the database. A relational database management system, or RDBMS, which handles these tasks for relational databases, is the most common type of DBMS.

Once a database is created, you can enter data, modify the data, and view the data by combining different entities.  This is done by sending queries to the RDBMS.  There is a standard language for writing queries - Structured Query Language, or SQL, but it is also common for an RDBMS to provide a graphical user interface (GUI) that hides the actual SQL.  (Not all RDBMSs have GUIs, and if they do, they are unlikely to be much alike.)

Microsoft Access is a RDBMS that lets you manage your database using a GUI.  You can also create graphical forms to manipulate your data and create reports to show results from a search/query.

Example:  Persons could be stored in a table called tPersons (where we're including the "t" to help you remember that it is a table).  The table may have the fields Name (data type: text), BirthYear (data type: number), and ShoeSize (data type: number) that will store a person's name, birth year, and shoe size, respectively.  To uniquely identify a person, it is good practice to add another field as a key (normally data type: number) - we'll call that PersonId.  The table tAddresses is set up in a similiar manner, but with its own fields.

To hold the reference from a person to their address, tPersons includes a field, AddressId, which contains the key for an address in the table tAddresses.  It isn't the fact that the name of the AddressId field is the same in both tPersons and tAddresses that tells the RDBMS that you intend one to refer to the other - the names of the fields could be different.  Instead, the RDBMS must be told explicitly that the AddressId field in tPersons is intended to refer to the AddressId field in tAddresses.  In the Access screen shot that follows, this is shown by the line between the fields in the two tables' lists of fields.

There are several good reasons to split up the persons and the addresses into two tables.  One is that the number of columns in tPersons is greatly reduced and easier to manage.  More important is that (as mentioned above) we might have several persons living at the same address.  By putting the persons and the addresses in separate tables, the address only needs to be stored once in the table tAddresses instead of in several places in the table tPersons.  If the same address was stored in several places, one could easily enter a slightly different address by mistake, which would cause the database to become inconsistent.

Enough chat -- let's get started.

Create a Small Database for Music CDs using MS Access

But let's not start blindly - let's think about our data, to figure out how we should split it into tables, and how the tables might need to refer to each other.

First, what entities are there?  Well, CDs of course, but also the performers ("artists" in the example below) or groups who recorded the CDs, the songs on them, etc.  For classical music, we might have not only performers, but a composer.

Then, what attributes are we interested in for each?  Names, of course, but let's say (so we'll have more than one attribute per table) that we're also interested in categorizing our CDs by style (e.g. rock, jazz, classical).

To keep this simple, we won't worry about the songs on the CDs, nor about having a separate performer and composer.  And for now, if the CD is by a group, we'll just put in the name of the group as the performer.

Since we've seen the person / address example, we might guess that we won't want to lump the performer in with each of theie CDs, but instead, have a CD table that refers to performers in another table.  That way, many CDs can share one performer record.  What's less obvious is that we might want to do the same thing with the style...  We'll start out with the style and the performer lumped in with the CD, then fix up our "mistake".

Ok, now let's start.

  1. Create a lab8 directory on dante, go to the new directory and create a text file called notes.txt with your favorite text editor (e.g. Notepad, Pico, Emacs).  Put all of your comments, hints and reminders to yourself, questions and answers (if there are any in this lab ;-) in this file!

  2. Start MS Access, usually under All Programs>Microsoft Office>..., or search for "access".

  3. Create a new database:
    1. Select File>New... or the new document icon.
    2. Select Blank database....
    3. Locate the desktop and give your database file the name lab8.mdb. Do not forget to copy this file to your lab8 directory on dante when you are done with the lab!
    4. Click on Create to create the database.
    You should now see the newly opened database. On the left hand side of the window, you should see Tables, Queries, Forms etc. If you click on them, you can view the tables, queries, forms, etc in your database.

    Newly opened database in MS Access

    Become familiar with this window!  You will use it often.  If you lose it, you can get it back by selecting View>Database Objects>Tables.

    You can right click almost anywhere in MS Access to get a popup menu. Sometimes you need to select an item before right clicking.

  4. You should now create a table that will contain a list of CDs that you own.  It will be a first version to get you started. Select Tables on the left side of the dialog window.  Then double click on Create table in Design view.  A big window should appear in which you can describe the fields of your CD table. Add the following fields: CDId with the data type AutoNumber, CDName, ArtistName and StyleName all with the data type Text.   StyleName is for the style of music, e.g., Jazz, Rock, Classical, etc.  If you like, you may put a description for each one, but since this table is (currently) pretty simple, you don't have to.  (The description here serves the same purpose as comments on variable declarations in JavaScript - it tells what the field will be used for.)

    (What's AutoNumber?  We want every CDId to be unique, and one way to get that is to start off at 1, and every time we add a CD, we choose the next number for the CDId.  All RDBMSs provide a function that will remember what number we're currently at, and give us the next one.  For Access, making the type AutoNumber will automatically put in the next number each time we add a new CD.)

    Draft CD Table

  5. Before you are done with describing the CD table, you should tell Access which field is the primary key.  The primary key is a field (or combinations of fields) that will have a unique value for each CD in the CD table.  Place the cursor in the row for the CDId field.  Right click and select Primary Key (or click directly on the primary key icon Primary Key
Icon in the main MS Access window).  A key symbol should appear immediately to the left of the field name CDId.

  6. Save your table by using File>Save or Save Icon.  Name your table tDraftCDs.  After the table is saved you can close the table design window.  Under Tables in your database dialog window you should now see your tDraftCDs table.  Once a table is created, you can view it, add data to it, and modify it.

  7. Check that your tDraftCDs is correctly saved by right clicking on the table and selecting Design View.  Do not change anything.  It should look the same as it did before you closed it.  Close the design view again when you are done.

  8. Open your tDraftCDs to enter data by right clicking on the table and selecting Open.  A window with your table data will appear.  Each row corresponds to one CD.  Notice that you do not need to explicitly fill in the CDId field.  Since it is an AutoNumber field, it will be filled in automatically.  The other columns represent the name, artist and musical style for the CD.  The table should currently be empty.  Now add the following CDs (you can replace them with others later...).  Just start typing the CD name in the CDName field, and Access will automatically assign that row's CDId number, and display a new blank row below the one you're working on.

    Draft CD Table Data

    Data that is entered into a database table is saved directly after you leave the field. So there is no need (or possibility) to save it manually, like you must do in the design view. However, if you modify the style of the table, such as the width of the columns, Access will ask if you want to save the layout (which you probably want to do).


You have now created a database with information about music CDs!  There are, however, a few undesirable things to fix before you will spend your nights entering information about all your CDs:

Improve the Database Design: Normalize

Normalizing a database means to remove redundant data, that is, to remove multiple occurrencies of the same information.  Instead of writing the same information several times in one table, the table is redesigned to have a relation to another table in which the information is stored only once.  The following diagram shows what we ultimately want.

This is an entity-relationship (ER) diagram.  The rectangles are entities, the rounded boxes are their attributes, and the diamonds are relationships between them.  If you'll recall the discussion of references and relationships above, you'll likely guess that each relationship will just turn into a reference from one entity's table to the other.

Relationship Diagram

To relate each CD in a CD table to an artist in an artist table, each CD must have a field with the identifier (aka foreign key) corresponding to a particular artist in the artist table. In the artist table, an artist is identified by its value in the primary key field.

In this section you will create a better designed CD table, move the artists and styles into their own tables, and set up relationships between the three tables.  All of the data in the old CD table (tDraftCDs) will be used.  By using some features of a RDBMS, you will not have to type in the data again!

  1. Create a new table in design view.  It should look almost like the old tDraftCDs, but the artist and musical style should have slightly different names and use different data types.  It is also nice to have a comment that explains what the IDs will be used for:

    Design of the CD Table

    Do not forget to specify the primary key and to save your new table!  Save the new table as tCDs.

  2. Create a new table in design view that will contain a list of artists.  Call it tArtists.  It should have two fields: ArtistId (data type AutoNumber) and ArtistName (data type: Text).  The primary key should be ArtistId.  Do not forget to save the table!

  3. Create a new table in design view that will contain a list of musical styles.  Call it tStyles.  It should be very similiar to tArtists, but use slightly different names for the fields: StyleId (data type: AutoNumber) and StyleName (data type Text).  The primary key should be StyleId.  Do not forget to save the table!

  4. You now have three tables and it is time to "connect" them with relationships!  Open the relationship window by either selecting Tools>Relationships... from the main Access menu, or by clicking on Relationships
Icon.  You should first select what tables you want to create relationships between.  Make sure to select tArtists, tCDs and tStyles.  Do not include the old CD table.  To select multiple items in a list, you can press and hold the Ctrl key while you click on the items you would like to select. End with clicking on Add.  You should see the relationships window with three tables and their fields.  Bold fields indicate primary keys.  It is also possible to rearrange the tables if you do not like the automaticaly generated order.  Example:

    Relationships, not connected

    If you forgot a table or picked the wrong table, you can add more tables by right clicking on background and selecting Show Table...

  5. The following relationships exist between the tables: It is also nice if we can require the data in the database to obey these relationsships, that is, to never have a CD without a corresponding artist or style of music.

    Create one relationship at a time by dragging the primary key from one table and dropping it on a foreign key in another table, for example drag the ArtistId from the table tArtists to the ArtistId in the table tCDs. The following relationship window will appear:

    Artist-CD relationship

    Make sure "Enforce Referential Integrity" is checked.  This means that the ArtistId in the tCDs table must refer to an actual artist in the tArtists table.  Also note the Relationship Type at the bottom of the pop-up window: One-To-Many.  Click on Create when you are ready to add the relationship.

  6. Add the second relationship for musical style on your own.

  7. The relationships between the tables should look like the following.  The infinity symbol symbolizes "many".

    Relations, connected

    You have now successfully designed a relational database!  The next step is to extract the data from the old CD table tDraftCDs and put it into the three new tables tArtists, tStyles and tCDs!  Close the relationships window (answer yes on the "save layout" question).

Transfer the Old Data Into the New Improved Database

In your old table you had only four CDs.  In reality, the number of rows would be many more.  It would be a painful experience to retype the information.  In this section you will extract and transfer the data without any unnecessary typing!

The key is to use a query to view (or extract) certain columns from a table, and to paste the results into the new tables.

  1. In your database window, select Queries (on the left).  Then double click on Create query in Design view.  When you get the option to add tables, only add the old table tDraftCDs, since that is where all data is located.

  2. The lower half of the query design window has a large number of columns.  When you design a query, you should use one column for each field you want to do something with.  In more complicated queries you can combine fields from different tables, restrict what rows to include, and much more.  The query you will design here is quite simple.  You only want to get the field ArtistName from the table tDraftCDs. In the first column, select the desired field.

    Query design for ArtistNames

    Save the query and close the query design window (or close the query design window and save the query upon request). Give this query the name qArtistNames.  The "q" is a reminder that it is a query.

  3. Now open your newly created query!  You do this in the very same way you would open a table to enter data.  (Hint: right click...).

    Your query should give you the names of all of the artists in your database.  When you open the query you should see the rows and columns that you requested in the query.  A query is just another way of viewing your table data and is, therefore, sometimes simply called a view.  If you change any information in the query result, the original table (tDraftCDs) will be updated.

    Query result

    These names will be used later in the table tArtists.  Before you can use them, you need to get rid of unwanted duplicates.  This means you must modify the query.

  4. Close the query result window and open the query in design view.  Right click on the background in the top half of the window to bring up a menu.  Select Properties... at the very bottom of the menu.  This will show the properties of your query:

    Query properties window

    Change the settings for Unique Values (a few lines down from the top) from No to Yes.  This will give you only the unique values in the table.  When you are done, close the property window, close the query design window (save changes to the query) and open your query to see the results.  It should now have no duplicate names.

    Query result

  5. Select all of the rows from the first/only column of your query results by clicking on the column title (ArtistName). All rows should be selected (dark or black colored).  Copy by pressing Ctrl-c or by using Edit>Copy.  Once the rows are copied, you can close the query.

  6. Next you'll paste the results from the query into the tArtists table:  Open the tArtists table.  The table should be empty.  The first column corresponds to the ArtistId and will get values automatically.  The second column is where you'll add all of the rows from the query!

    Select the second column by clicking on the column title ArtistName.  Paste the previously copied rows by pressing Ctrl-v or use Edit>Paste.  Answer Yes to the question if you want to paste the rows.  After a successful paste, you can close the tArtists table (and the query if you didn't close it before).

  7. You have now copied the unique information for the artists from your old CD table.  Do the same thing for the style of music.  That is, repeat steps 2 to 6 for the StyleName field and the tStyles table! Save the query as qStyleNames.  The result should only be two rows, since there are only two styles of music in the old table (i.e., Jazz and Pop/Rock).

  8. You should now add information about all the CDs to the new CD table.  You would like to have all of the rows from the old CD table, but the artist and style names should be replaced by their respective identifiers in tArtists and tStyles.  You need a query that takes all of the rows in the tDraftCDs, finds the row in tArtists that has a matching ArtistName, finds the row in tStyles that has a matching StyleName, combines these rows, and gives the fields CDName, ArtistId and StyleId as a result.

    Open a new query in design view.  Add the tables tDraftCDs, tArtists and tStyles.  Connect the fields that should match.  That is, drag ArtistName in the tArtists to ArtistName in tDraftCDs.  Do the equivalent for StyleName.  Then make sure the query shows the three fields CDName, ArtistId and StyleId in the bottom half of the query window.  Make sure the order is correct.  Save the query design as qDraftCDs.

    Query design

  9. Close the query design window and run the query.  You should see the four CDs and the IDs for the artist and the style of music:

    Query design

    Select all rows and all columns.  The easiest way is to click and drag over the column titles.  Copy the data with Ctrl-c or Edit>Copy.  Open tCDs to enter data and select the three columns CDName, ArtistId and StyleId. Paste in the data with Ctrl-v or Edit>Paste.  Accept to insert the four rows.  Close tCDs.

  10. You are now done with your database!  You might think it was a lot of work for four rows of data, but what you just did would work for thousands of rows without any extra typing!

  11. To verify that you no data has been damaged or lost, design a query based on the tCDs, tArtists and tStyles tables.  Note that you do not need to specify any relationships or matching requirements between the tables.  They are already set up in the design. Have the query show the fields CDName, ArtistName and StyleName.  Save the query as qCDs and open it to see the query result. It should contain the following four rows:

    Query design

  12. Before you close your database or Access, select Tools>Database Utilities>Compact and Repair Database... and accept to open the lab8.mdb.  This will remove unneeded space in your database file, which otherwise can become very large!

  13. Close all your database windows, close Access and copy your lab8.mdb to your lab8 directory on dante

    Select Tools>Database Utilities>Compact and Repair Database... before you close your database, after you have worked with the database for a while.  This will decrease the size of your database file. Also close your database file and Access before copying your database file to your dante account, otherwise you may end up with an empty file!

 Ensure that all of your files are accessible from a web browser.

Check List

 

I understand the key words for this lab and can give examples of what they mean.

 

I know how to design tables and how to set up relationships between them.

 

I know how to add data to a table.

 

I know how to create a query.

 

I know how to view the results from a query.