CSE444 Glossary

These terms are all defined in the textbook, but here we've gathered them together into one place for convenience. Mostly, we add things here when you ask a question about them or ask us to add them -- we will continue to build up the glossary over the quarter. If there's something you'd like us to add to it, please just ask!

Abort: Also known as "rollback". Aborting a transaction consists of ensuring that the state of the database becomes as if the transaction had never begun. Any changes made by the transaction must be undone, and we must ensure that those changes had no effect on any concurrently executing transactions.

Assertion: An integrity constraint in SQL that is not associated with any particular table and can involve any number of tables. Assertions are not supported in many SQL implementations, and must be simulated using collections of triggers.

Attribute: A single-valued, unstructured property describing an entity (in the E/R model) or a tuple (in the relational model). In the relational model, also known as a field or column.

Avoids Cascading Aborts: A schedule avoids cascading aborts if every transaction in the schedule only reads data written by committed transactions. Thus no single transaction abort will trigger another transaction abort.

B+ Tree: A tree-structured index in which the nodes are 1 disk block in size and must be at least 50% full (except for the root). A B+ tree is characterized by its order, d: every node (except the root) must contain m index entries, where d <= m <= 2d.

Block: See page. The term "block" is usually used to refer to portions of the disk and "page" is usually used to refer to portions of main memory, but the usage isn't consistent. Also, confusingly, used to refer to a collection of physical disk blocks grouped together to process a block-oriented nested loops join.

Buffer Pool: A section of main memory reserved for holding database disk blocks (pages). The buffer pool consists of frames, each of which holds a page of data from the disk. When a requested disk page isn't in the buffer pool, a replacement policy is employed to choose which of the unpinned buffer frames to overwrite.
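The replacement machinery can be sketched in a few lines of Python (an illustrative toy, not any real DBMS's API; the class and method names are our own), here using an LRU policy over the unpinned frames:

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool with LRU replacement (illustrative names, not a real DBMS API)."""

    def __init__(self, num_frames, disk):
        self.num_frames = num_frames
        self.disk = disk                     # stand-in for disk: page_id -> page data
        self.frames = OrderedDict()          # page_id -> [data, pin_count, dirty]

    def get_page(self, page_id):
        """Return the page, reading from 'disk' and evicting an unpinned frame if needed."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)     # mark as most recently used
        else:
            if len(self.frames) >= self.num_frames:
                self._evict()
            self.frames[page_id] = [self.disk[page_id], 0, False]
        frame = self.frames[page_id]
        frame[1] += 1                            # pin the page while it is in use
        return frame[0]

    def unpin(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame[1] -= 1
        frame[2] = frame[2] or dirty

    def _evict(self):
        # Choose the least recently used *unpinned* frame for replacement.
        for page_id, (data, pins, dirty) in self.frames.items():
            if pins == 0:
                if dirty:
                    self.disk[page_id] = data    # write back before reuse
                del self.frames[page_id]
                return
        raise RuntimeError("all frames are pinned")
```

Note how a pinned page can never be chosen for replacement, and a dirty page is written back before its frame is reused -- exactly the two rules described above.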

Cascading Abort: A situation in which the abort of one transaction forces the abort of another transaction to prevent the second transaction from reading invalid (uncommitted) data. Also called "cascading rollback".

Catalog: A collection of system-maintained relations in the DBMS that contains statistics about the database's relations, indices, keys, etc. Used for query parsing and query optimization. Sometimes referred to as metadata or a data dictionary.

Check Constraint: An integrity constraint in SQL that applies to a single table. It specifies a condition that must be true of all tuples in the table at all times. In the SQL standard, the condition can refer to other tables; in many SQL implementations, it cannot.

Checkpoint: A snapshot (complete or partial) of the database state used to limit the size of the log, which otherwise would have to keep a record of all changes ever made to the database (or all changes made since the last tape backup, if those are kept). The checkpoints used in ARIES are fuzzy checkpoints, in the sense that a complete snapshot of the database is not written to disk. Instead, we force to disk (as part of the checkpoint log record) copies of the in-memory structures describing the DBMS activity at the time the checkpoint began. In this way, recovery can begin by easily reconstructing the state of main memory as of the checkpoint time instead of using the older portion of the log to reconstruct this state.

Column: see attribute.

Commit: The point at which a transaction becomes persistent (permanent) and can no longer be aborted. Any updates made by the transaction are now part of the database.

Concurrency: The interleaving, by the DBMS scheduler process, of actions from more than one transaction.

Conflict: In a schedule, two actions from different transactions conflict if they involve the same data item and one of them is a Write action. Thus the three kinds of conflicts are WR (can lead to dirty reads), RW (can lead to unrepeatable reads), and WW (can lead to lost updates).

Conflict-serializable: A schedule is conflict-serializable if it is conflict-equivalent to some serial schedule. Two schedules are conflict-equivalent if every pair of conflicting actions is ordered in the same way in both schedules.
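The standard test for conflict-serializability builds a precedence graph (one node per transaction, an edge Ti -> Tj for each conflicting pair in which Ti's action comes first) and checks it for cycles. A small Python sketch, with actions represented as our own (xact, op, item) triples:

```python
def conflicts(a1, a2):
    """Two actions conflict if they come from different xacts, touch the same
    item, and at least one of them is a write."""
    (t1, op1, x1), (t2, op2, x2) = a1, a2
    return t1 != t2 and x1 == x2 and ("W" in (op1, op2))

def is_conflict_serializable(schedule):
    """Build the precedence graph and test it for cycles via DFS."""
    graph = {}
    for i, a1 in enumerate(schedule):
        for a2 in schedule[i + 1:]:
            if conflicts(a1, a2):
                graph.setdefault(a1[0], set()).add(a2[0])   # edge Ti -> Tj

    def has_cycle(node, path, done):
        if node in path:
            return True
        if node in done:
            return False
        path.add(node)
        found = any(has_cycle(n, path, done) for n in graph.get(node, ()))
        path.remove(node)
        done.add(node)
        return found

    done = set()
    return not any(has_cycle(n, set(), done) for n in list(graph))
```

For example, the schedule R1(A) W2(A) W1(A) produces edges T1 -> T2 and T2 -> T1, a cycle, so it is not conflict-serializable.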

Conservative 2PL: A variant of strict 2PL in which all locks are requested at the beginning of the transaction. Conservative 2PL prevents deadlocks, in addition to providing all the benefits of strict 2PL.

Cylinder: The collection of all tracks in a disk having the same diameter. A cylinder contains one track per surface, and each track belongs to exactly one cylinder.

Data Entry: A record in an index file that contains a search key value and a way to reach the record(s) matching that search key value. Alternative 1 for storing a data entry is to simply have the data entry contain the actual tuple(s) matching the search key value. Alternatives 2 and 3 store physical pointers (rids) to the actual data records, which typically reside in a different file.

DDL: Data Definition Language. A language in which we describe the structure (logical and/or physical) of the database. In SQL, the DDL consists of the commands to create and alter tables, indices, triggers, etc.

Deadlock: A situation in which two or more transactions are all waiting, in a cyclical manner, for other transactions to release locks. None of the transactions will ever commit unless one of them is aborted to break the deadlock. The most common case involves two transactions, in which T1 is waiting for a lock held by T2 and T2 is waiting for a lock held by T1. Longer cycles can and do occur, however.

Decomposition: the process of transforming a relation into 2 or more smaller (in the sense of fewer attributes) relations with the goal of removing redundancies. You should ensure that every decomposition is lossless by linking the new relations with the original one via a primary key-foreign key relationship (the FK belongs in the original relation). Another goal is to ensure that the decomposition is dependency-preserving. It is always possible to achieve a 3NF, lossless, dependency-preserving decomposition. It is also always possible to achieve a BCNF, lossless decomposition, but we are not guaranteed to be able to always achieve BCNF, lossless, and dependency-preserving at the same time. A lossless decomposition is necessary for correctness of the database; a dependency-preserving decomposition is merely a performance enhancement.

Dependency-Preserving: A decomposition preserves dependencies if every non-redundant FD for the original relation can be assigned to one of the new relations. A FD can be "assigned to" a relation if that relation contains all attributes mentioned in the FD.

Dirty Read: A read action within a transaction in which the value read was written by an uncommitted transaction.

DML: Data Manipulation Language. A language in which we describe the updates to and retrievals from the database. In SQL, the insert, update, delete, and select statements form the DML.

Entity: A uniquely identifiable thing in the real world.

Entity Set: The set of all entities of the same "type". They all have the same attributes.

Field: see attribute.

Frame: a page-sized section of the buffer pool. A frame holds exactly one disk block (page). If the page in the frame is pinned, it is in use by some DML statement, and cannot be replaced (overwritten). Otherwise, it is available for replacement. If it is chosen for replacement (when a non-memory-resident page is requested), it is first written out to disk (if it had been marked "dirty" by its most recent user) or simply overwritten.

Functional Dependency: A particular kind of integrity constraint, written X → Y. In any pair of tuples containing equal values on all the X attributes, those two tuples must also contain equal values for all the Y attributes (X, Y are sets of attributes).
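A quick illustration in Python (the function name and dict-per-tuple representation are our own): checking whether a relation instance satisfies X → Y amounts to verifying that no two tuples agree on X but disagree on Y.

```python
def satisfies_fd(tuples, X, Y):
    """Check whether the relation instance 'tuples' (a list of dicts) satisfies
    the FD X -> Y: any two tuples that agree on all X attributes must also
    agree on all Y attributes."""
    seen = {}   # maps each X-value to the first Y-value observed with it
    for t in tuples:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if seen.setdefault(x_val, y_val) != y_val:
            return False   # two tuples agree on X but differ on Y
    return True
```

For instance, in a Students instance the FD sid → name holds if no two rows share a sid but carry different names.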

Heap File: A file organization for a relation in which new tuples are added at the end of the file, with new pages allocated there as needed. The pages may or may not be physically contiguous on disk, but performance is best if they are. Tuple deletion can result in file fragmentation over time, so periodic reorganization is necessary to maintain performance and space utilization. This is the most common default file organization in relational products and is the only option in some (indices are used to achieve other organizations).

Index: An associative access structure used to provide fast access to tuples specified by a condition on the search key columns of the index. Generally an index is a separate physical file from a table, although some implementations that store the actual tuples in the index maintain just one physical file. An index, specifically, provides a quick mechanism for reaching a data entry given a value for (a prefix of) the search key columns. Indices can be categorized along several dimensions:

A primary index is an index whose search key contains the primary key of the table, along with possibly other attributes. A secondary index is any non-primary index. Note that other definitions are sometimes used; there is no standard terminology here.

A clustered index is one in which the order of data records corresponds to the order of data entries in the index. This usually means the relation is sorted on the search key, usually ascending. An unclustered index is an index that is not clustered.

A dense index is an index that contains at least one data entry for every search key value in the relation. A sparse index, in the book's definition, is an index that contains one data entry for every page in the relation. A more common definition of a sparse index is that it is an index that is not dense.

Index Entry: In a tree-structured index, a record consisting of a search key value and a physical pointer (page id) to another page of the index. The referenced page contains only search key values greater than or equal to the search key value in the index entry.

Integrity Constraint: A statement about the database that must be true at all times. This means, among other things, that the truth of an IC must be checked, in principle, whenever an update is made to the database that could affect the truth of that IC.

Join: In relational algebra, a cross-product followed by a selection. When the selection condition involves comparing attributes from the two input relations, this query can be processed much more efficiently than a cross-product. We will consider only such joins in this course. A very common query, especially since it is the primary way of combining two relations. There are two important special cases:

Equijoin: A join in which the selection condition consists only of equality comparisons connected by AND operations. The result tuples have fewer columns than they would in a general join or cross-product: for every pair of fields being compared for equality, only one of these fields appears in the result.

Natural Join: An equijoin in which the only comparisons are between columns of the same name. In this case, the join symbol need not have a subscript -- no subscript implies a natural join.

Join Algorithms: Many join algorithms exist. In this course, we study nested loops join algorithms and the sort-merge join algorithm. The tuple-oriented nested loops join algorithm is almost never used because the page-oriented NL join algorithm will always perform better. With more buffer frames available, we can generalize page-oriented NL to block-oriented NL. With an index available on the join column of the inner relation, we can use the index NL algorithm. The sort-merge join has very different behavior, but can sometimes perform better, especially if the relations are already sorted or need to be sorted for a later order-by clause, grouping, or duplicate elimination.

All join algorithms can be used to process equijoins. Non-equijoins can also be processed using the tuple-, page-, and block-oriented NL algorithms but not using index NL or sort-merge.
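For illustration (the names and tuple layout are our own, not course code), here are toy versions of tuple-oriented nested loops and a simplified single-column sort-merge equijoin:

```python
def nested_loops_join(outer, inner, condition):
    """Tuple-oriented nested loops join: compare every outer tuple with every
    inner tuple. Works for any join condition, not just equality."""
    return [o + i for o in outer for i in inner if condition(o, i)]

def sort_merge_join(outer, inner, key_o, key_i):
    """Simplified sort-merge equijoin on one join column (key_o/key_i are
    column positions). Rescans the matching inner partition for each outer
    tuple; a real implementation would be more careful about duplicates."""
    outer = sorted(outer, key=lambda t: t[key_o])
    inner = sorted(inner, key=lambda t: t[key_i])
    result, j = [], 0
    for o in outer:
        # Advance the inner cursor past keys smaller than the outer key.
        while j < len(inner) and inner[j][key_i] < o[key_o]:
            j += 1
        # Scan the partition of inner tuples with an equal key.
        k = j
        while k < len(inner) and inner[k][key_i] == o[key_o]:
            result.append(o + inner[k])
            k += 1
    return result
```

On equijoin inputs both produce the same result tuples; the nested loops version additionally handles arbitrary conditions, which is why non-equijoins fall back to it.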

Key: Keys are generally divided into two forms: those used for searching and sorting (we'll get to those later in the quarter) and those used to identify entities (in the E/R model) and tuples (in the relational model).

In the E/R model, an entity set has a key, which is a set of one or more attributes used to uniquely identify the entities within the entity set. A key constraint in the E/R model is a property involving an entity set and a relationship set. If an entity set has a key constraint in a relationship, it indicates that each entity in the entity set appears at most once in the relationship set. A simple key is a key with one attribute.

In the relational model, we need to understand four kinds of keys: superkeys, candidate keys, primary keys, and foreign keys.

A superkey is a set of attributes that can be used to uniquely identify any tuple in the relation. In terms of functional dependencies, a superkey functionally determines all attributes in the relation.

A candidate key is a minimal superkey. "Minimal" means that if you take away an attribute from a candidate key, it can no longer be used to identify tuples uniquely. Another way to think of it is that a candidate key has no "useless" attributes, just attributes that are necessary in order to uniquely distinguish the tuples in the relation. When the book and notes refer to just plain "key" with no modifiers, we generally mean "candidate key". In terms of functional dependencies, a candidate key is a superkey in which no proper subset of it functionally determines all attributes in the relation. A simple key is a key with one attribute.
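These definitions can be made concrete with the attribute-closure algorithm: X is a superkey iff the closure of X under the FDs contains every attribute of the relation, and a candidate key is such a set with no qualifying proper subset. A Python sketch (the names are our own):

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure: all attributes functionally determined by 'attrs'.
    fds is a list of (X, Y) pairs, each side a set of attribute names."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, Y in fds:
            if X <= result and not Y <= result:
                result |= Y        # X -> Y fires, so Y joins the closure
                changed = True
    return result

def candidate_keys(all_attrs, fds):
    """Enumerate minimal attribute sets whose closure is the whole relation.
    Exponential in the number of attributes -- fine for exercises only."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            s = set(combo)
            # Keep s only if it is a superkey and no smaller key is inside it.
            if closure(s, fds) == set(all_attrs) and not any(k < s for k in keys):
                keys.append(s)
    return keys
```

For R(A, B, C) with A → B and B → C, the closure of {A} is {A, B, C}, so {A} is the (only) candidate key, and every superset of it is a superkey.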

A primary key is a candidate key chosen by the DBA to be the key that will be primarily used by other relations if they need to refer to this relation. This corresponds to the notion of "key" in the E/R model. A simple key is a key with one attribute.

A search key is an ordered list (not a set) of attributes used to build an index. A multi-field search key can be viewed as a single-field search key in which we form the single field by simply concatenating the values in the multiple search key fields.

Left-Deep Plan: A query plan tree in which the right (inner) input to every join is a base relation, not the result of some other operator node in the plan tree.

Lock: A logical mechanism used to control access to data items (database objects). In principle, anything can be locked (single column values, tuples, pages, relations, databases). In reality, most locking is on tuples, pages, and relations. The two basic kinds of locks are the share (S) lock and the exclusive (X) lock. Any number of transaction may hold an S lock on an item at any time, but no X locks may be held if any S locks exist on that data item. If an X lock is held on an item, no other locks of any kind may be held on that item. Before a transaction reads a data item, it obtains an S lock (read lock) on that item. After the read is complete, the lock is released (unlocked). Before a transaction writes a data item, it obtains an X lock (write lock) on that item. After the write is complete, the lock is released (unlocked). If a lock request cannot be immediately granted, the requesting transaction is placed on a queue of requests for that data item.
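A minimal sketch of the S/X compatibility rules and the wait queue in Python (illustrative only; real lock managers also handle lock upgrades, deadlock detection, and finer-grained modes):

```python
class LockTable:
    """Toy S/X lock table. Requests that cannot be granted immediately are
    appended to a FIFO wait queue for that data item."""

    def __init__(self):
        self.granted = {}   # item -> list of (xact, mode) currently holding locks
        self.waiting = {}   # item -> FIFO list of (xact, mode) requests

    def _compatible(self, item, mode):
        holders = self.granted.get(item, [])
        if not holders:
            return True
        # S is compatible only with other S locks; X is compatible with nothing.
        return mode == "S" and all(m == "S" for _, m in holders)

    def request(self, xact, item, mode):
        """Try to grant the lock; return True if granted, False if xact must wait."""
        if self._compatible(item, mode) and not self.waiting.get(item):
            self.granted.setdefault(item, []).append((xact, mode))
            return True
        self.waiting.setdefault(item, []).append((xact, mode))
        return False

    def release(self, xact, item):
        self.granted[item] = [(t, m) for t, m in self.granted.get(item, [])
                              if t != xact]
        # Grant queued requests that have become compatible, in FIFO order.
        queue = self.waiting.get(item, [])
        while queue and self._compatible(item, queue[0][1]):
            self.granted.setdefault(item, []).append(queue.pop(0))
```

Any number of S locks coexist on an item, but a single X request blocks until every current holder releases.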

Log: A disk-based record of all updates made to the database, used to recover from XACT abort, system failures, and media failures. Normally consists of one record per database update, containing the before and after images of the updated bits.

Lossless Join Property: A decomposition is lossless if we can join the new tables together and get exactly the old one back. We have not lost any information by doing the decomposition.

Lossy Join Property: A decomposition is lossy if we cannot join the new tables together and get exactly the old one back. We have lost information by doing the decomposition.
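Both properties are easy to demonstrate in Python by actually decomposing a small instance and joining it back (the helper names and row representation are our own):

```python
def project(rel, attrs):
    """Project a relation (iterable of dict-rows) onto 'attrs', with duplicate
    elimination; rows become hashable (attr, value) tuples."""
    return {tuple((a, row[a]) for a in attrs) for row in rel}

def natural_join(r1, r2):
    """Natural join of two projected relations: combine rows that agree on all
    shared attributes."""
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(tuple(sorted({**d1, **d2}.items())))
    return out

def is_lossless(rel, attrs1, attrs2):
    """Decompose 'rel' onto attrs1/attrs2, join back, compare with the original."""
    original = {tuple(sorted(row.items())) for row in rel}
    return natural_join(project(rel, attrs1), project(rel, attrs2)) == original
```

The classic lossy example: with rows (A=1, B=2, C=3) and (A=4, B=2, C=5), decomposing into AB and BC then joining on B manufactures spurious tuples (1, 2, 5) and (4, 2, 3), while decomposing into AB and AC joins back to exactly the original.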

Lost Update: A write action within a transaction in which the data item value written is later overwritten by another transaction, before the transaction making the first update commits.

Nested Query: An SQL query that contains one or more other SQL queries (called subqueries). These subqueries can appear in the FROM, WHERE, or HAVING clauses, but most commonly they appear in the WHERE clause. A subquery is correlated (with an outer query) if it uses a range variable from that outer query. Otherwise, it is uncorrelated. Any subquery that could be executed by itself is uncorrelated.

Normal Forms: A normal form is a property of a relation schema. It is used to identify the kinds of redundancies that are and are not present in the relation. For our purposes, we will only use 1NF, 2NF, 3NF, and BCNF.

First Normal Form (1NF): Every attribute value in every tuple must be a simple, scalar value, not a collection (set, list, etc.) of values. This is part of the definition of relation, so every relation is in 1NF by definition.

Second Normal Form (2NF): Relation R is in 2NF if, for every FD X → A, either (i) A is prime or (ii) X is not a proper subset of any candidate key for the relation. More intuitively (and less precisely), there are no redundancies caused by partial dependencies.

Third Normal Form (3NF): Relation R is in 3NF if, for every FD X → A, either (i) A is prime or (ii) X is a superkey for R. More intuitively (and less precisely), there are no redundancies caused by partial dependencies or transitive dependencies.

Boyce-Codd Normal Form (BCNF): Relation R is in BCNF if, for every FD X → A, X is a superkey for R. More intuitively (and less precisely), there are no redundancies caused by functional dependencies.
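A BCNF check follows directly from attribute closure: compute the closure of each FD's left side and verify it covers the whole relation. A Python sketch (the names are our own; trivial FDs are skipped):

```python
def closure(attrs, fds):
    """All attributes functionally determined by 'attrs' under the given FDs
    (each FD is an (X, Y) pair of attribute sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for X, Y in fds:
            if X <= result and not Y <= result:
                result |= Y
                changed = True
    return result

def is_bcnf(all_attrs, fds):
    """R is in BCNF if the left side of every nontrivial FD is a superkey."""
    for X, Y in fds:
        if Y <= X:
            continue                       # trivial FD: Y is inside X
        if closure(X, fds) != set(all_attrs):
            return False                   # X -> Y violates BCNF
    return True
```

For example, R(sid, dept, building) with sid → dept and dept → building is not in BCNF: the closure of {dept} is {dept, building}, so dept is not a superkey, and building values are stored redundantly, once per student in the dept.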

Page: The unit of data transfer between disk and main memory. A page on disk is a certain number of physical sectors on the disk (the same for every page on the disk). In main memory, a page resides in a frame in the buffer pool. Typical sizes are 4K and 8K.

Phantom: A data item with some property P inserted into the database in such a way that a transaction T having physical locks on all existing data items with property P never sees the new data item. Even in the presence of strict 2PL, this can lead to a nonserializable schedule because T expected to have all such items locked. Solutions include index locking or logical (predicate) locking.

Pipelining: A query processing technique in which tuples are passed from one node in the plan tree to its parent node without materializing them on disk. Pipelining is facilitated by only considering left-deep plan trees. Not all left-deep plan trees can be completely pipelined, though.

Plan Tree: An annotated relational algebra tree in which each node specifies its implementation algorithm. It is a complete plan for executing the query.

Platter: A flat, disc-shaped piece of magnetized material consisting of one or two usable surfaces.

Prime: A prime attribute is an attribute that is an element of one or more candidate keys for a relation.

Query Block: A single, unnested, select-from-where SQL query, containing no subqueries. This is the unit of query optimization in many relational optimizers.

Query Optimization: the process of taking a relational algebra tree (produced by the parser) and generating multiple plan trees, with possibly different structure than the original algebra tree, for executing the query. The cost of each plan tree is estimated and the most efficient one is executed.

Recoverable: A schedule is recoverable if every transaction in the schedule commits after all transactions it "reads from" commit. One transaction "reads from" another if it reads a value previously written by the other.

Relation: A relation is a set of tuples, each of which has the same number and types of fields (also called attributes or columns). Each field is an atomic value, meaning it is single-valued (as opposed to being a set, e.g.) and has no accessible structure (except possibly a CHAR string, which in some cases can be accessed one CHAR at a time). The number of rows in a relation is its cardinality. The number of columns in a relation is its degree or arity.

Relational Algebra: A logical query language (for retrievals only) based on set theory, selections, and projections. The five primitive operators are union, difference, cross-product, select, and project. It forms the basis for the implementation and optimization of most relational query processing engines. A functional, operational language.

Relational Calculus: A logical query language (for retrievals only) based on first-order predicate logic. Properly restricted, has the same expressive power as the relational algebra. A declarative, rather than an operational or imperative language.

Relationship: An association of two or more entities (usually just two, as most relationships are binary).

Relationship set: the set of all relationships of the same "type".

Row: see tuple.

Schedule: A sequence of actions from a set of transactions. The sequence may be interleaved. The actions of a transaction within a schedule are in the same order as when the transaction is considered separately.

Sector: A physical subdivision of a track. A typical size is 512 bytes.

Serial: A serial schedule is one in which the actions of different transactions are not interleaved. One transaction is executed from start to finish, then another one is executed from start to finish, etc., guaranteeing consistency and isolation (and poor performance).

Serializable: A serializable schedule is one in which the state of the database at the end of the schedule is guaranteed to be the same as the state of the database after some serial execution of the transactions in the schedule. This is sufficient to guarantee consistency (maintenance of integrity constraints).

SQL: Structured Query Language, a combination DDL/DML developed at IBM. It is a standard, though no two products implement it in exactly the same way. Nearly all relational DBMSs provide SQL (Oracle, DB2, SQL Server, etc.).

SQL Server: A relational DBMS product from Microsoft. We will be using it for homeworks and projects in this course.

Strict: A strict schedule is one in which a value written by a transaction T is not read or written by other transactions until after T commits.

Strict 2PL: A variant of 2PL in which all locks are held until the end of the transaction (all unlock actions occur at commit time). This guarantees that the schedule avoids cascading aborts and other anomalies associated with aborting transactions in the presence of WW conflicts.

Table: see relation.

Track: A ring-shaped portion of a disk surface, made up of a certain number of sectors and a certain number of blocks (pages).

Transaction: An atomic, consistent, isolated, durable unit of work in a DBMS. It is composed of actions, which include reads, writes, in-memory manipulations, lock and unlock requests, and waits. Every transaction eventually commits or aborts. Abbreviated "xact".

Trigger: An integrity constraint in SQL that can detect and act upon changes to a single relation. Its implementation in actual SQL products varies wildly. A trigger can detect insertions, deletions, and updates to rows and columns of a relation and can then take appropriate action, as specified by DML and other commands. Typical actions include disallowing the update that caused the trigger to be activated (fired) or propagating the update to other locations in the database.

Tuple: an element of a relation. Also called a row. See relation for more.

Two Phase Locking: A locking protocol in which no locks are requested by a transaction after it performs its first unlock action. That point in time is known as the lock point of the transaction. It is preceded by the growing phase and followed by the shrinking phase. Two phase locking (2PL) guarantees conflict serializability.

Union Compatible: In relational algebra, two relations are union-compatible if they have the same number of fields and these fields, in order, have the same domains. Most SQL implementations use the same definition. Union-compatible relations can be unioned, subtracted, and intersected.

Unrepeatable Read: A read action in a transaction that produces a different value than the previous read of that data item within the same transaction.

Write Ahead Log Protocol: A protocol that forces log records for an update out to the log before the actual updated data is written to disk. The log records are forced to disk before commit (or at the same time) to ensure durability.