|
1
|
|
|
2
|
- Comments on phase 2 of the project
- HW 3 is out.
- Today: DBMS internals, part 1
- Next week: query optimization.
|
|
3
|
- Store large amounts of data
- Process queries efficiently
- Allow multiple users to access the database concurrently and safely.
- Provide durability of the data.
- How will we do all this??
|
|
4
|
|
|
5
|
- I/O model of computation
- We only count accesses to disk.
- Indexing:
- Basic techniques: B+-tree, hash indexes
- Secondary indexes.
- Efficient operator implementations: join
- Optimization: from what to how.
|
|
6
|
|
|
7
|
- Fastest, most expensive
- Today: 512MB-2GB are common on PCs
- Many databases could fit in memory
- New industry trend: Main Memory Database
- E.g. TimesTen
- Main issue is volatility
|
|
8
|
- Disks
- Slower, cheaper than main memory
- Persistent !!!
- Used with a main memory buffer
|
|
9
|
- Data must be in RAM for DBMS to operate on it!
- Table of <frame#, pageid> pairs is maintained.
- LRU is not always good.
|
|
10
|
|
|
11
|
- Tapes or optical disks
- Extremely slow: used for long term archiving only
|
|
12
|
- Mechanical characteristics:
- Rotation speed (5400RPM)
- Number of platters (1-30)
- Number of tracks (<=10000)
- Number of bytes/track (~10^5)
|
|
13
|
- Disk latency = time between when command is issued and when data is in
memory
- Is not following Moore’s Law!
- Disk latency = seek time + rotational latency
- Seek time = time for the head to reach cylinder
- Rotational latency = time for the sector to rotate
- Rotation time = 10ms
- Average latency = 10ms/2
- Transfer rate = typically 10MB/s
- Disks read/write one block at a time (typically 4kB)
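The latency arithmetic above can be checked with a quick sketch; the 10 ms seek time is an assumed figure (the slide gives only the rotation time and transfer rate):

```python
# Back-of-the-envelope latency for reading one 4 KB block.
# The seek time below is an assumption; rotation time and transfer
# rate come from the slide.
seek_ms = 10.0
rotation_ms = 10.0
avg_rotational_ms = rotation_ms / 2        # half a rotation on average
transfer_rate_mb_s = 10.0
block_kb = 4.0

transfer_ms = block_kb / 1024 / transfer_rate_mb_s * 1000
latency_ms = seek_ms + avg_rotational_ms + transfer_ms
print(f"{latency_ms:.2f} ms per 4 KB block")   # seek + rotation dominate
```

Note that the transfer itself is a fraction of a millisecond; mechanical delays dominate, which is why we count block accesses rather than bytes.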
|
|
14
|
- In main memory algorithms we care about CPU time
- In databases time is dominated by I/O cost
- Assumption: cost is given only by I/O
- Consequence: need to redesign certain algorithms
- Will illustrate here with sorting
|
|
15
|
- Illustrates the difference in algorithm design when your data is not in
main memory:
- Problem: sort 1GB of data with 1MB of RAM.
- Arises in many places in database systems:
- Data requested in sorted order (ORDER BY)
- Needed for grouping operations
- First step in sort-merge join algorithm
- Duplicate removal
- Bulk loading of B+-tree indexes.
|
|
16
|
- Pass 1: Read a page, sort it, write it.
- only one buffer page is used
- Pass 2, 3, …, etc.: merge pairs of runs into runs twice as long.
|
|
17
|
- Each pass we read + write each page in the file.
- N pages in the file => the number of passes is ⌈log2 N⌉ + 1
- So total cost is: 2N(⌈log2 N⌉ + 1) I/Os
- Improvement: start with larger runs
- Sort 1GB with 1MB memory in 10 passes
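The cost formulas above can be sanity-checked numerically; the block size (4 KB) and memory size (1 MB) are the slide's assumptions:

```python
import math

n_pages = (1 << 30) // (4 << 10)          # 1 GB of data = 262,144 pages of 4 KB

# Plain 2-way merge sort: pass 0 produces N one-page runs; every later
# pass halves the number of runs.
passes = math.ceil(math.log2(n_pages)) + 1
total_ios = 2 * n_pages * passes          # each pass reads and writes every page

# Improvement: pass 0 fills all of memory, producing runs of B pages,
# so only ceil(log2(N / B)) merge passes remain.
buffer_pages = (1 << 20) // (4 << 10)     # 1 MB of RAM = 256 pages
merge_passes = math.ceil(math.log2(n_pages / buffer_pages))

print(passes, merge_passes)
```

Starting from memory-sized runs cuts the merge passes from 18 down to 10, which is where the "10 passes" figure comes from.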
|
|
18
|
- We have more main memory
- Should use it to improve performance
|
|
19
|
- B: Block size
- M: Size of main memory
- N: Number of records in the file
- R: Size of one record
|
|
20
|
- Phase one: load M bytes in memory, sort
- Result: runs of length M/R records
|
|
21
|
- Merge M/B – 1 runs into a new run
- Result: runs have now M/R (M/B – 1) records
|
|
22
|
- Merge M/B – 1 runs into a new run
- Result: runs have now M/R (M/B – 1)^2 records
|
|
23
|
- Number of passes: 1 + ⌈log_{M/B – 1}(NR/M)⌉
- Think differently
- Given B = 4KB, M = 64MB, R = 0.1KB
- Pass 1: runs of length M/R = 640,000 records
- Have now sorted runs of 640,000 records
- Pass 2: runs increase by a factor of M/B – 1 ≈ 16,000
- Have now sorted runs of 10,240,000,000 ≈ 10^10 records
- Pass 3: runs increase by a factor of M/B – 1 ≈ 16,000
- Have now sorted runs of ≈ 10^14 records
- Nobody has so much data!
- Can sort everything in 2 or 3 passes!
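A minimal in-memory sketch of two-phase multiway merge sort (run formation, then a single merge pass); the record values and run size are toy figures, with `mem_records` playing the role of M/R:

```python
import heapq, pickle, tempfile

def external_sort(records, mem_records):
    """Two-phase multiway merge sort sketch."""
    runs = []
    # Phase 1: fill memory, sort, spill one sorted run per temp file
    for i in range(0, len(records), mem_records):
        f = tempfile.TemporaryFile()
        pickle.dump(sorted(records[i:i + mem_records]), f)
        runs.append(f)

    def read_run(f):
        f.seek(0)
        yield from pickle.load(f)

    # Phase 2: merge all runs at once (assumes #runs <= M/B - 1)
    result = list(heapq.merge(*(read_run(f) for f in runs)))
    for f in runs:
        f.close()
    return result

print(external_sort([5, 3, 8, 1, 9, 2, 7], mem_records=3))
```

A real implementation would stream runs block by block instead of materializing them, but the two phases are the same.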
|
|
24
|
|
|
25
|
|
|
26
|
- Relational database elements:
- A tuple is represented as a record
|
|
27
|
- Information about field types same for all records in a file; stored in system
catalogs.
- Finding i’th field requires scan of record.
- Note the importance of schema information!
|
|
28
|
|
|
29
|
|
|
30
|
|
|
31
|
- Blocks have fixed size (typically 4k)
|
|
32
|
- How do we efficiently store large amounts of data?
- The appropriate storage depends on what kind of accesses we expect to
have to the data.
- We consider:
- primary storage of the data
- additional indexes (very very important).
|
|
33
|
- As a good approximation, we ignore CPU costs:
- B: The number of data pages
- R: Number of records per page
- D: (Average) time to read or
write disk page
- Measuring number of page I/O’s ignores gains of pre-fetching blocks of
pages; thus, even I/O cost is only approximated.
- Average-case analysis; based on several simplistic assumptions.
|
|
34
|
- Heap Files:
- Equality selection on key; exactly one match.
- Insert always at end of file.
- Sorted Files:
- Files compacted after deletions.
- Selections on sort field(s).
- Hashed Files:
- No overflow buckets, 80% page occupancy.
- Single record insert and delete.
|
|
35
|
|
|
36
|
- An index on a file speeds up selections on the search key fields for the
index.
- Any subset of the fields of a relation can be the search key for an
index on the relation.
- Search key is not the same as key (minimal set of fields that uniquely
identify a record in a relation).
- An index contains a collection of data entries, and supports efficient
retrieval of all data entries with a given key value k.
|
|
37
|
- Primary/secondary
- Clustered/unclustered
- Dense/sparse
- B+ tree / Hash table / …
|
|
38
|
- File is sorted on the index attribute
- Dense index: sequence of (key,pointer) pairs
|
|
39
|
|
|
40
|
|
|
41
|
- Sparse index: pointer to lowest search key in each block:
- Search for 20
|
|
42
|
- Better: pointer to lowest new search key in each block:
- Search for 20
- Search for 15? 35?
|
|
43
|
- To index other attributes than primary key
- Always dense (why?)
|
|
44
|
- Primary indexes = usually clustered
- Secondary indexes = usually unclustered
|
|
45
|
|
|
46
|
- Applications:
- index other attributes than primary key
- index unsorted files (heap files)
- index clustered data
|
|
47
|
- Clustered data
- Company(name, city), Product(pid, maker)
|
|
48
|
- Composite Search Keys: Search on a combination of fields.
- Equality query: Every field value is equal to a constant value. E.g.
wrt <sal,age> index:
- Range query: Some field value is not a constant. E.g.:
- age = 20; or age = 20 and sal > 10
|
|
49
|
- Search trees
- Idea in B Trees:
- Idea in B+ Trees:
- Make leaves into a linked list (range queries are easier)
|
|
50
|
- Parameter d = the degree
- Each node has >= d and <= 2d keys (except root)
- Each leaf has >=d and <= 2d keys:
|
|
51
|
|
|
52
|
- How large is d?
- Example:
- Key size = 4 bytes
- Pointer size = 8 bytes
- Block size = 4096 bytes
- 2d x 4 + (2d+1) x 8 <= 4096
- d = 170
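The degree calculation can be expressed directly (the function name is illustrative):

```python
def max_degree(block_bytes, key_bytes, ptr_bytes):
    # Largest d such that 2d keys and 2d+1 pointers fit in one block:
    #   2d*key + (2d+1)*ptr <= block
    # Solving for d: d <= (block - ptr) / (2*(key + ptr))
    return (block_bytes - ptr_bytes) // (2 * (key_bytes + ptr_bytes))

print(max_degree(4096, 4, 8))   # 170, matching the slide
```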
|
|
53
|
- Exact key values:
- Start at the root
- Proceed down, to the leaf
- Range queries:
- As above
- Then sequential traversal
|
|
54
|
- Typical order: 100. Typical fill-factor: 67%.
- Typical capacities:
- Height 4: 133^4 ≈ 312,900,700 records
- Height 3: 133^3 = 2,352,637 records
- Can often hold top levels in buffer pool:
- Level 1 = 1 page = 8 KBytes
- Level 2 = 133 pages = 1 MByte
- Level 3 = 17,689 pages = 133 MBytes
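These capacities follow from the assumed fanout of 133 (order 100 at ~67% fill); a quick check, noting that 133^4 is exactly 312,900,721, which the slide rounds:

```python
fanout = 133                  # ~133 children per node at 67% fill
cap3 = fanout ** 3            # records reachable from a height-3 tree
cap4 = fanout ** 4            # records reachable from a height-4 tree
print(cap3, cap4)

# Pages per level if the top of the tree is kept in the buffer pool
print(1, fanout, fanout ** 2)   # 1 page, 133 pages, 17,689 pages
```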
|
|
55
|
- Secondary storage hash tables are much like main memory ones
- Recall basics:
- There are n buckets
- A hash function h(k) maps a key k to {0, 1, …, n-1}
- Store in bucket h(k) a pointer to the record with key k
- Secondary storage: bucket = block, use overflow blocks when needed
|
|
56
|
- Assume 1 bucket (block) stores 2 keys + pointers
- h(e)=0
- h(b)=h(f)=1
- h(g)=2
- h(a)=h(c)=3
|
|
57
|
- Search for a:
- Compute h(a)=3
- Read bucket 3
- 1 disk access
|
|
58
|
- Place in right bucket, if space
- E.g. h(d)=2
|
|
59
|
- Create overflow block, if no space
- E.g. h(k)=1
- More overflow blocks may be needed
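A toy sketch of a static hash file with overflow blocks (class and method names are hypothetical, and Python's built-in `hash` stands in for h):

```python
class StaticHashFile:
    """n buckets; each block holds at most `capacity` keys, and extra
    keys for a bucket go into chained overflow blocks."""
    def __init__(self, n_buckets, capacity=2):
        self.capacity = capacity
        # each bucket is a chain of blocks, starting with one primary block
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _h(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key):
        chain = self.buckets[self._h(key)]
        if len(chain[-1]) == self.capacity:   # last block full:
            chain.append([])                  # create an overflow block
        chain[-1].append(key)

    def lookup(self, key):
        blocks_read = 0                       # proxy for disk I/Os
        for block in self.buckets[self._h(key)]:
            blocks_read += 1
            if key in block:
                return True, blocks_read
        return False, blocks_read
```

With no overflow blocks every lookup costs exactly one block read; each overflow block in a chain adds one more.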
|
|
60
|
- Excellent, if no overflow blocks
- Degrades considerably when the number of keys exceeds the number of buckets (i.e., many overflow blocks).
- Typically, we assume that a hash-lookup takes 1.2 I/Os.
|
|
61
|
- File organizations: sorted, hashed, heaps.
- Indexes: hash index, B+-tree
- Indexes can be clustered or not.
- Data can be stored in the index or not.
- Hence, when we access a relation, we can either scan or go through an
index:
|
|
62
|
- Multi-dimensional indexing:
- how do we index regions in space?
- Document collections?
- Multi-dimensional sales data
- How do we support nearest neighbor queries?
- Indexing is still a hot and unsolved problem!
|
|
63
|
- Applications: geographical databases, data cubes.
- Types of queries:
- partial match (give only a subset of the dimensions)
- range queries
- nearest neighbor
- Where am I? (DB or not DB?)
- Conventional indexes don’t work well here.
|
|
64
|
- Hash like structures:
- Grid files
- Partitioned indexing functions
- Tree like structures:
- Multiple key indexes
- kd-trees
- Quad trees
- R-trees
|
|
65
|
|
|
66
|
- A hash function produces k bits identifying the bucket.
- The bits are partitioned among the different attributes.
- Example:
- Age produces the first 3 bits of the bucket number.
- Salary produces the last 3 bits.
- Supports partial matches, but is useless for range queries.
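A sketch of such a partitioned hash function; the 3+3 bit split follows the example, while Python's `hash` stands in for the real per-attribute hash functions:

```python
def bucket(age, salary, age_bits=3, sal_bits=3):
    # age contributes the high 3 bits of the bucket number,
    # salary contributes the low 3 bits
    ha = hash(age) % (1 << age_bits)
    hs = hash(salary) % (1 << sal_bits)
    return (ha << sal_bits) | hs

# Partial match on age alone: only the 2**sal_bits buckets that share
# age's bit pattern can contain matches
candidates = {bucket(20, s) for s in range(100)}
print(sorted(candidates))
```

A range query like `age > 20` gains nothing, though: hashing scatters adjacent ages across unrelated bit patterns.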
|
|
67
|
|
|
68
|
|
|
69
|
|
|
70
|
|
|
71
|
|
|
72
|
|
|
73
|
|
|
74
|
- Table scan: iterate through the records of the relation.
- Index scan: go to the index, from there get the records in the file
(when would this be better?)
- Sorted scan: produce the relation in order. Implementation depends on
relation size.
|
|
75
|
- The iterator model. Each operation is implemented by 3 functions:
- Open: sets up the data structures and performs initializations
- GetNext: returns the next tuple of the result.
- Close: ends the operations. Cleans up the data structures.
- Enables pipelining!
- Contrast with data-driven materialize model.
- Sometimes it’s the same (e.g., sorted scan).
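A minimal sketch of the iterator model with two operators; the class names and the `None` end-of-stream convention are illustrative, not from any particular system:

```python
class Scan:
    """Table scan as an iterator-model operator (Open/GetNext/Close)."""
    def __init__(self, tuples): self.tuples = tuples
    def open(self): self.i = 0
    def get_next(self):
        if self.i >= len(self.tuples):
            return None                      # end of stream
        t = self.tuples[self.i]; self.i += 1
        return t
    def close(self): del self.i

class Select:
    """Selection; pulls one tuple at a time from its child (pipelining)."""
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def get_next(self):
        while (t := self.child.get_next()) is not None:
            if self.pred(t):
                return t
        return None
    def close(self): self.child.close()

plan = Select(Scan([1, 2, 3, 4, 5]), lambda x: x % 2 == 1)
plan.open()
out = []
while (t := plan.get_next()) is not None:
    out.append(t)
plan.close()
print(out)  # [1, 3, 5]
```

Because `Select` never materializes its input, tuples flow through the plan one at a time; that is the pipelining the slide refers to.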
|
|
76
|
- We will consider how to implement:
- Selection (σ): selects a subset of rows from a relation.
- Projection (π): deletes unwanted columns from a relation.
- Join (⋈): allows us to combine two relations.
- Set-difference: tuples in reln. 1, but not in reln. 2.
- Union: tuples in reln. 1 and in reln. 2.
- Aggregation (SUM, MIN, etc.) and GROUP BY
|
|
77
|
- Purchase:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages (i.e., 100,000 tuples, 4MB for the entire relation).
- Person:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages (i.e., 40,000 tuples, 2MB for the entire relation).
|
|
78
|
- Of the form σ_{R.attr op value}(R)
- With no index, unsorted: must essentially scan the whole relation; cost is M (#pages in R).
- With an index on selection attribute:
Use index to find qualifying data entries, then retrieve
corresponding data records. (Hash
index useful only for equality selections.)
- Result size estimation:
- (Size of R) * reduction factor.
- More on this later.
|
|
79
|
- Cost depends on #qualifying tuples, and clustering.
- Cost of finding qualifying data entries (typically small) plus cost of
retrieving records.
- In example, assuming uniform distribution of phones, about 54% of
tuples qualify (500 pages, 50000 tuples). With a clustered index, cost is
little more than 500 I/Os; if unclustered, up to 50000 I/Os!
- Important refinement for unclustered indexes:
- 1. Find and sort the rids of the qualifying data entries.
- 2. Fetch rids in order. This
ensures that each data page is looked at just once (though # of such
pages likely to be higher than with clustering).
|
|
80
|
- First approach: Find the most selective access path, retrieve tuples
using it, and apply any remaining terms that don’t match the index:
- Most selective access path: An index or file scan that we estimate will
require the fewest page I/Os.
- Consider city=“Seattle” AND phone<“543%”:
- A hash index on city can be used; then, phone<“543%” must be checked for each retrieved tuple.
- Similarly, a B-tree index on phone could be used; city=“Seattle” must then be checked.
|
|
81
|
- Second approach
- Get sets of rids of data records using each matching index.
- Then intersect these sets of rids.
- Retrieve the records and apply any remaining terms.
|
|
82
|
- Two parts:
- (1) remove unwanted
attributes,
- (2) remove duplicates from the
result.
- Refinements to duplicate removal:
- If an index on a relation contains all wanted attributes, then we can
do an index-only scan.
- If the index contains a subset of the wanted attributes, you can remove
duplicates locally.
|
|
83
|
- R ⋈ S is a common operation. The cross product is too large. Hence, performing R × S and then a selection is too inefficient.
- Assume: M pages in R, pR tuples per page, N pages in S, pS tuples per page.
- In our examples, R is Purchase and S is Person.
- Cost metric: # of I/Os. We will ignore output costs.
- Cost metric: # of I/Os. We will ignore output costs.
|
|
84
|
- How would you implement join?
|
|
85
|
- For each tuple in the outer relation R, we scan the entire inner relation S.
- Cost: M + (pR * M) * N = 1000 + 100*1000*500 I/Os (140 hours!)
- Page-oriented Nested Loops join: for each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in R-page and s is in S-page.
- Cost: M + M*N = 1000 + 1000*500 (1.4 hours)
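Page-oriented nested loops can be sketched with the I/O count made explicit; relations, page sizes, and the join predicate are toy values:

```python
def pages(relation, per_page):
    # split a relation (list of tuples) into fixed-size "pages"
    return [relation[i:i + per_page] for i in range(0, len(relation), per_page)]

def page_nested_loop_join(R, S, pr, ps, theta):
    """Read each page of R once (M I/Os); for every R-page, scan all of
    S (M*N I/Os). Returns (result, simulated I/O count)."""
    out, ios = [], 0
    for rp in pages(R, pr):
        ios += 1                      # read one R page
        for sp in pages(S, ps):
            ios += 1                  # read one S page
            out.extend((r, s) for r in rp for s in sp if theta(r, s))
    return out, ios

R = [1, 2, 3, 4]; S = [2, 3, 5]
res, ios = page_nested_loop_join(R, S, pr=2, ps=2, theta=lambda r, s: r == s)
print(res, ios)   # I/O count is M + M*N = 2 + 2*2 = 6
```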
|
|
86
|
- If there is an index on the join column of one relation (say S), can
make it the inner.
- Cost: M + ( (M*pR) *
cost of finding matching S tuples)
- For each R tuple, cost of probing S index is about 1.2 for hash index,
2-4 for B+ tree. Cost of then
finding S tuples depends on clustering.
- Clustered index: 1 I/O
(typical), unclustered: up to 1 I/O per matching S tuple.
|
|
87
|
- Hash-index on name of Person (as inner):
- Scan Purchase: 1000 page I/Os, 100*1000 tuples.
- For each Purchase tuple: 1.2 I/Os to get data entry in index, plus 1 I/O to get the (exactly one) matching Person tuple. Total: about 220,000 I/Os. (36 minutes)
- Hash-index on buyer of Purchase (as inner):
- Scan Person: 500 page I/Os, 80*500 tuples.
- For each Person tuple: 1.2 I/Os to find the index page with data entries, plus the cost of retrieving matching Purchase tuples. Assuming uniform distribution, 2.5 purchases per buyer (100,000 / 40,000). Cost of retrieving them is 1 or 2.5 I/Os depending on clustering.
|
|
88
|
- Use one page as an input buffer for scanning the inner S, one page as the output buffer, and use all remaining pages to hold a ``block’’ of the outer R.
- For each matching tuple r in R-block, s in S-page, add <r, s> to result. Then read next R-block, scan S, etc.
|
|
89
|
- Sort R and S on the join column, then scan them to do a ``merge’’ on the join column.
- Advance scan of R until current R-tuple >= current S tuple, then
advance scan of S until current S-tuple >= current R tuple; do this
until current R tuple = current S tuple.
- At this point, all R tuples with same value and all S tuples with same
value match; output <r, s>
for all pairs of such tuples.
- Then resume scanning R and S.
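The merge phase described above, including the duplicate-key group handling, can be sketched as follows (function name illustrative; both inputs must already be sorted on the join key):

```python
def merge_join(R, S, key=lambda t: t):
    """Merge phase of sort-merge join over two sorted inputs."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1                       # advance R until >= current S
        elif key(R[i]) > key(S[j]):
            j += 1                       # advance S until >= current R
        else:
            # equal keys: find the whole group on each side and
            # output all pairs, as the slide describes
            k = key(R[i])
            gi = i
            while gi < len(R) and key(R[gi]) == k:
                gi += 1
            gj = j
            while gj < len(S) and key(S[gj]) == k:
                gj += 1
            out.extend((r, s) for r in R[i:gi] for s in S[j:gj])
            i, j = gi, gj                # resume scanning past both groups
    return out

print(merge_join([1, 2, 2, 4], [2, 2, 3, 4]))
```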
|
|
90
|
- R is scanned once; each S group is scanned once per matching R tuple.
- Cost: M log M + N log N + (M+N)
- The cost of scanning, M+N, could be M*N (unlikely!)
- With 35, 100 or 300 buffer pages, both Person and Purchase can be sorted
in 2 passes; total: 7500. (75 seconds).
|
|
91
|
- Partition both relations using hash fn h: R tuples in partition i will only
match S tuples in partition i.
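A sketch of the two phases (partitioning, then per-partition matching), with Python's `hash` standing in for the partitioning function h:

```python
def hash_join(R, S, n_parts=4, key=lambda t: t):
    """Grace hash join sketch: partition both inputs with the same hash
    function, then join each pair of partitions independently."""
    h = lambda t: hash(key(t)) % n_parts
    parts_r = [[] for _ in range(n_parts)]
    parts_s = [[] for _ in range(n_parts)]
    for r in R:
        parts_r[h(r)].append(r)       # partitioning phase: read + write R
    for s in S:
        parts_s[h(s)].append(s)       # partitioning phase: read + write S
    out = []
    for pr, ps in zip(parts_r, parts_s):   # matching phase: read both again
        table = {}                         # in-memory hash table on partition of R
        for r in pr:
            table.setdefault(key(r), []).append(r)
        for s in ps:
            out.extend((r, s) for r in table.get(key(s), []))
    return out

print(sorted(hash_join([1, 2, 3, 4], [2, 3, 5])))
```

Counting one read and one write per page in the partitioning phase plus one read in the matching phase gives the 3(M+N) figure on the next slide.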
|
|
92
|
- In partitioning phase, read+write both relations; 2(M+N). In matching
phase, read both relations; M+N I/Os.
- In our running example, this is a total of 4500 I/Os. (45 seconds!)
- Sort-Merge Join vs. Hash Join:
- Given a minimum amount of memory both have a cost of 3(M+N) I/Os. Hash Join superior on this count if
relation sizes differ greatly.
Also, Hash Join shown to be highly parallelizable.
- Sort-Merge less sensitive to data skew; result is sorted.
|
|
93
|
- Hash Join
- Partially pipelined: no output until inner read
- Asymmetric (inner vs. outer): optimization requires knowledge of source behavior
|
|
94
|
- How would you build a query optimizer?
|