B-Trees
Applying the Algorithm Design Process to the worst-case height problem in Binary Search Trees, and an analysis of 2-3 / 2-3-4 / B-Trees.
Kevin Lin, with thanks to many others.
Ask questions anonymously on Piazza. Look for the pinned Lecture Questions thread.

Feedback from the Reading Quiz

Best Case and Worst Case Height
Suppose we want to build a BST out of the numbers 1, 2, 3, 4, 5, 6, 7.
Give a sequence of add operations that result in (1) a spindly tree and (2) a bushy tree.
[Diagrams: a spindly tree over the keys 1 through 7 with height 6 and average depth 3, and a bushy tree rooted at 4 with height 2 and average depth 1.43.]
Q1: Give a sequence of add operations that result in (1) a spindly tree and (2) a bushy tree.
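To make the contrast concrete, here is a small sketch (the IntBST class and its helpers are invented for illustration, not the course's implementation) that builds both trees and reports their heights:

```java
// Minimal BST sketch (hypothetical helper class, not the course's BST implementation).
class IntBST {
    private static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    // Standard BST insertion: walk down and attach the new key as a leaf.
    public void add(int key) {
        root = add(root, key);
    }

    private Node add(Node node, int key) {
        if (node == null) return new Node(key);
        if (key < node.key) node.left = add(node.left, key);
        else if (key > node.key) node.right = add(node.right, key);
        return node;
    }

    // Height of the empty tree is -1; a single node has height 0.
    public int height() {
        return height(root);
    }

    private int height(Node node) {
        if (node == null) return -1;
        return 1 + Math.max(height(node.left), height(node.right));
    }

    public static void main(String[] args) {
        IntBST spindly = new IntBST();
        for (int key : new int[] {1, 2, 3, 4, 5, 6, 7}) spindly.add(key);  // sorted order
        IntBST bushy = new IntBST();
        for (int key : new int[] {4, 2, 6, 1, 3, 5, 7}) bushy.add(key);    // root first, then subtree roots
        System.out.println(spindly.height()); // 6
        System.out.println(bushy.height());   // 2
    }
}
```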

What about the Real World?
These examples are contrived. What about real-world workloads?
An approximation: randomized insertion.
Random Insertion into a BST (Kevin Wayne/Princeton)
Random trees have Θ(log N) average depth and height.
Random trees are bushy, not spindly.
?: How far off is the randomized BST from the optimal BST in terms of the average depth of nodes in the tree?

Mathematical Analysis
Binary search tree height is in O(N).
Worst case: Θ(N).
Best case: Θ(log N).
Expected: Θ(log N) via randomized insertion.

We can also show that randomized trees remain Θ(log N) in height even when deletions are included.
The Height of a Randomized Binary Search Tree (Reed/STOC 2000)
Average Depth of a Randomized BST. If N distinct keys are inserted in random order, the expected average depth is
~ 2 ln N = Θ(log N).
Height of a Randomized BST. If N distinct keys are inserted in random order, the expected height is
~ 4.311 ln N = Θ(log N).
Note that ~ (tilde) notation is like Big-Theta, except that we keep the multiplicative constant.

We won’t discuss too much of the math behind these arguments.

?: What can we say about the average case runtime for contains given a randomized BST?




?: What can we say about the worst case runtime for contains given a randomized BST?
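These constants are easy to check empirically. Below is a rough sketch (class and method names are invented for this example, not lecture code) that inserts N distinct keys in a shuffled order into a plain BST and compares the measured average depth to 2 ln N:

```java
import java.util.Random;

// Rough empirical check: insert N distinct keys in random order into a plain BST
// and compare the average node depth to the ~2 ln N prediction.
public class RandomBSTDepth {
    static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node add(Node node, int key) {
        if (node == null) return new Node(key);
        if (key < node.key) node.left = add(node.left, key);
        else if (key > node.key) node.right = add(node.right, key);
        return node;
    }

    // Sum of depths of all nodes, where the root has depth 0.
    static long totalDepth(Node node, int depth) {
        if (node == null) return 0;
        return depth + totalDepth(node.left, depth + 1) + totalDepth(node.right, depth + 1);
    }

    public static void main(String[] args) {
        int n = 100_000;
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = i;
        Random rng = new Random(373);
        for (int i = n - 1; i > 0; i--) {          // Fisher-Yates shuffle = random insertion order
            int j = rng.nextInt(i + 1);
            int tmp = keys[i]; keys[i] = keys[j]; keys[j] = tmp;
        }
        Node root = null;
        for (int key : keys) root = add(root, key);
        double avgDepth = (double) totalDepth(root, 0) / n;
        System.out.printf("average depth = %.2f, 2 ln N = %.2f%n", avgDepth, 2 * Math.log(n));
    }
}
```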

Good News and Bad News
Good news.
BSTs have a great runtime if we insert keys randomly.
Θ(log N) per insertion.

Bad news.
We can’t always insert our keys in a random order. Why?
Q1: We can’t always insert our keys in a random order. Why?

Implementer’s Design Decision Hierarchy
Set
Abstract Data Type
Which ADT is the best fit?

Data Structure
Which data structure offers the best performance for our input/workload?

Implementation Details
How do we maintain invariants?
Binary Search Tree
Linked Nodes
For every node X in the tree:
All keys in the left subtree ≺ X’s key.
All keys in the right subtree ≻ X’s key.
Map
As the ADT implementer, we always had to keep in mind our invariants when thinking through the problem.

?: How does the Binary Search Tree Invariant affect the implementation of contains, add, and remove?
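As one concrete illustration, here is a hedged sketch of contains (illustrative names, not the course's exact code): the ordering invariant lets each comparison discard an entire subtree, and add follows the same walk before attaching the new key as a leaf.

```java
// Sketch of a BST contains driven by the ordering invariant
// (illustrative names; not the course's exact implementation).
class BSTSet<K extends Comparable<K>> {
    private class Node {
        K key;
        Node left, right;
        Node(K key) { this.key = key; }
    }

    private Node root;

    public boolean contains(K key) {
        Node node = root;
        while (node != null) {
            int cmp = key.compareTo(node.key);
            if (cmp < 0) node = node.left;       // invariant: key can only be in the left subtree
            else if (cmp > 0) node = node.right; // invariant: key can only be in the right subtree
            else return true;
        }
        return false;                            // ran off the tree: key is absent
    }
}
```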

Algorithm Design Process
Hypothesize. How do invariants affect the behavior for each operation?
Identify. What strategies have we used before? What examples can we apply?
Plan. Propose a new way from findings.
Analyze. Does the plan do the job? What are potential problems with the plan?
Create. Implement the plan.
Evaluate. Check implemented plan.
ArrayList Invariant.
data is an array of items, never null. The i-th item in the list is always stored in data[i].
ArrayQueue Invariant.
data is an array of items, never null. The i-th item in the list is always stored in data[(start + i) mod length].
Programming, Problem Solving, and Self-Awareness: Effects of Explicit Guidance (Loksa et al./CHI ‘16)
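For example, the ArrayQueue invariant dictates exactly how indexing must work. A minimal hypothetical sketch (the class and field names are made up for illustration):

```java
// Hypothetical minimal ArrayQueue sketch: the invariant
// "the i-th item is stored in data[(start + i) mod data.length]" fixes how get must index.
class ArrayQueueSketch<T> {
    private T[] data;
    private int start;  // index of the 0th item
    private int size;

    @SuppressWarnings("unchecked")
    ArrayQueueSketch(int capacity) {
        data = (T[]) new Object[capacity];
    }

    void addLast(T item) {
        data[(start + size) % data.length] = item;  // next free slot, per the invariant
        size += 1;
    }

    T get(int i) {
        if (i < 0 || i >= size) throw new IndexOutOfBoundsException();
        return data[(start + i) % data.length];     // the invariant, written as code
    }
}
```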
Iterative Refinement
Let’s zoom in on the Data Structure and Implementation Details. We need to optimize the worst case height of our binary search tree.

Iterative Refinement. As in the debugging process we learned earlier, information is key: it motivates how we improve our invariants. As with debugging, solutions are often closely tied to a particular framing of the problem. That's part of why there are so many unsolved problems in theoretical CS: often we don't have the right understanding or perspective, which is why it's so easy to get stuck.

?: How have we applied iterative refinement before?

Rewriting Invariants
Hypothesis. Worst-case height trees are spindly trees.
Identify.
Spindly tree: all nodes have either 0 children (leaf) or 1 child.
Bushy tree: all nodes have either 0 children (leaf) or 2 children.
Plan. Say we have a BST in which every node has either 0 or 2 children.
Analyze.
What is the worst case search time in this case?
What do worst case trees look like?
Say we have a BST in which every node has either 0 or 2 children.

Q1: What is the worst case search time in this case?




Q2: What do worst case trees look like?

What is the worst case search time in a BST in which every node has either 0 or 2 children?

Rewriting Invariants
H(N) ∈ Θ(N)
Examples are key to helping us learn and iterate from our initial attempts. Unfortunately, this new invariant doesn’t capture the complexity of the problem.

?: What is the tilde notation (like Big-Theta but keeping multiplicative constants) for the height of this tree?

A Different Hypothesis
Hypothesis. Unbalanced growth leads to worst-case height trees.


How does adding a new node affect the height of a tree? Explain in terms of the height of the left and right subtrees.
[Diagram: a bushy BST over the keys 1 through 7, with the keys 8, 9, and 10 added afterward so that the right side grows into a chain.]
Q1: How does adding a new node affect the height of a tree? Explain in terms of the height of the left and right subtrees.

A Different Hypothesis
Identify.
New nodes are added as leaves.
Unbalanced leaves lead to one subtree growing faster than the other.

Plan. Overstuff existing leaves to avoid adding new leaves.
If we never add new leaves, the tree can never get unbalanced.
Linear-height trees come about when keys are added to one side more frequently than the other, so one side's height grows much faster than the other's.

?: Why is it the case that all new nodes are added as leaves?

Overstuffing Leaves
Problem. New keys are added as leaves.

Avoid adding new leaves by overstuffing existing leaves.

What’s the problem with this idea?
[Diagrams: a BST with root 13, left subtree 5 (children 2 and 7), and right subtree 15 (children 14 and an overstuffed leaf holding 16 17). After adding 18, that leaf holds 16 17 18.]
?: Does this suggestion increase the height of the tree?




?: What’s the problem with this idea?

Overstuffing Leaves
Problem. New keys are added as leaves.

Avoid adding new leaves by overstuffing existing leaves.

What’s the problem with this idea?
[Diagrams: the same tree with the overstuffed leaf grown to 16 17 18 19, and then to 16 17 18 19 20 21 22 23 24.]
?: What’s the problem with this idea?

Promoting Keys
The height stays balanced, but the leaves are too full.

Set a limit L on the number of keys, e.g. L=3.
If any node has more than L keys, give a key to the parent, e.g. the left-middle key.
[Diagrams: before, the leaf 16 17 18 19 sits under its parent 15; after promoting 17 to the parent, the parent holds 15 17 and the leaf holds 16 18 19.]
?: Promoting keys introduces a new problem. What’s the problem with the bottom tree?

Promoting Keys
The height stays balanced, but the leaves are too full.

Set a limit L on the number of keys, e.g. L=3.
If any node has more than L keys, give a key to the parent, e.g. the left-middle key.

However, now 16 is to the right of 17.
Suggest a way to resolve this problem.
Q1: Suggest a way to resolve this problem.

Promoting Keys
The height stays balanced, but the leaves are too full.

Set a limit L on the number of keys, e.g. L=3.
If any node has more than L keys, give a key to the parent, e.g. the left-middle key.
Promoting a key splits the node into two new parts: left and right.
[Diagrams: promoting 17 splits the overfull leaf into two parts, 16 and 18 19; the parent node 15 17 now has children 14, 16, and 18 19.]
?: How many children does the (15, 17) node now have?
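A sketch of this promote-and-split step, assuming a node stores its keys and children in lists (the class and field names are invented for illustration, not a full B-Tree implementation): the left-middle key moves up into the parent, and the remaining keys and children are divided between the original node and a new right sibling. If the root itself becomes overfull, the same split applies, with a brand-new root created to receive the promoted key.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "promote and split" step for an overstuffed node (L = max keys per node).
// Field and method names are made up for illustration; this is not a full B-Tree.
class BTreeNodeSketch {
    List<Integer> keys = new ArrayList<>();
    List<BTreeNodeSketch> children = new ArrayList<>();  // empty for leaves

    boolean isOverfull(int L) {
        return keys.size() > L;
    }

    // Splits this node's i-th child, promoting the child's left-middle key into this node.
    void splitChild(int i) {
        BTreeNodeSketch child = children.get(i);
        int mid = (child.keys.size() - 1) / 2;       // left-middle key, e.g. 17 in {16, 17, 18, 19}
        int promoted = child.keys.get(mid);

        BTreeNodeSketch right = new BTreeNodeSketch();
        right.keys.addAll(child.keys.subList(mid + 1, child.keys.size()));
        child.keys.subList(mid, child.keys.size()).clear();   // child keeps the left part

        if (!child.children.isEmpty()) {                       // non-leaf: hand off the right children too
            right.children.addAll(child.children.subList(mid + 1, child.children.size()));
            child.children.subList(mid + 1, child.children.size()).clear();
        }

        keys.add(i, promoted);       // promoted key goes into this (parent) node
        children.add(i + 1, right);  // new right part becomes the next child
    }
}
```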

Adding More Keys
Suppose we add the keys 20 and 21.
If our cap is at most L=3 keys per node, draw the post-split tree.
[Diagram: after adding 20 and 21, the leaf under the 15 17 node is overstuffed with 18 19 20 21.]
Q1: If our cap is at most L=3 keys per node, draw the post-split tree.

Adding More Keys
Suppose we add the keys 20 and 21.
If our cap is at most L=3 keys per node, draw the post-split tree.
[Diagram: promoting 19 gives a parent node 15 17 19 with children 14, 16, 18, and 20 21.]
Q1: If our cap is at most L=3 keys per node, draw the post-split tree.

Add 25 and 26
[Diagrams: adding 25 and 26 overstuffs the leaf to 20 21 25 26; promoting 21 gives a node 15 17 19 21 with children 14, 16, 18, 20, and 25 26.]
?: Predict what will happen next.

Add 25 and 26
[Diagrams: the node 15 17 19 21 is now overfull, so 17 is promoted into the root, which becomes 13 17; the node splits into 15 (children 14 and 16) and 19 21 (children 18, 20, and 25 26).]

Overstuffing the Root Node
Draw the tree after the root is split.
[Diagrams: starting from a simplified tree with root 13 17 and children 5, 15, and 19 21: adding 22 and 23 overstuffs a leaf to 19 21 22 23; promoting 21 makes the root 13 17 21. Adding 24 and 25 overstuffs another leaf to 22 23 24 25; promoting 23 makes the root 13 17 21 23, which is now overfull.]
Q1: Draw the tree after the root is split.

Overstuffing the Root Node
Draw the tree after the root is split.
[Diagrams: the overfull root 13 17 21 23 splits. 17 is promoted into a brand-new root whose children are 13 (with children 5 and 15) and 21 23 (with children 19, 22, and 24 25), so the height of the tree grows by one.]

2-3, 2-3-4, and B-Trees
We chose a limit of L=3 keys in each node. Formally, this is called a 2-3-4 Tree: each non-leaf node can have 2, 3, or 4 children.


2-3 Tree. Choose L=2 keys. Each non-leaf node can have 2 or 3 children.

B-Trees are the generalization of this idea for any choice of L.
2-3-4 Tree. Max 3 keys and 4 non-null children per node.
2-3 Tree. Max 2 keys and 3 non-null children per node.
[Diagrams: an example 2-3-4 Tree and an example 2-3 Tree over lowercase letter keys.]
B-Trees are popular in two contexts.
Small L (L=2 or L=3). Used as a conceptually simple balanced search tree as we saw today.
Large L (in the thousands). Used in practice for databases and filesystems with very large records.
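An illustrative back-of-the-envelope calculation (the numbers below are hypothetical, not from the lecture) of why large L pays off: with about 1,000 keys per node, even a billion keys fit in a tree only about 3 levels tall, while a 2-3 Tree would need roughly 19 levels.

```java
// Illustrative numbers only: why a large L keeps B-Trees shallow.
public class BTreeHeightEstimate {
    public static void main(String[] args) {
        double n = 1e9;       // hypothetical number of keys
        int smallL = 2;       // 2-3 Tree
        int largeL = 1000;    // hypothetical database-style node capacity
        System.out.printf("L=%d: about %.1f levels%n", smallL, Math.log(n) / Math.log(smallL + 1));
        System.out.printf("L=%d: about %.1f levels%n", largeL, Math.log(n) / Math.log(largeL + 1));
    }
}
```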

B-Tree Bushy-ness
While B-Trees are perfectly balanced in height, their nodes are not uniform: in a 2-3-4 Tree, some nodes have only 2 children while others have 3 or 4.

Tree Insertion
Give an insertion order for the keys 1, 2, 3, 4, 5, 6, 7 that results in a max-height 2-3 Tree.
What about for a min-height 2-3 Tree?
?: What is the least number of keys we can stuff into a 2-3 Tree node? The greatest number of keys?




Q1: Give an insertion order for the keys 1, 2, 3, 4, 5, 6, 7 that results in a max-height 2-3 Tree.




Q2: Do the same for a min-height 2-3 Tree.

Tree Insertion
Give an insertion order for the keys 1, 2, 3, 4, 5, 6, 7 that results in a max-height 2-3 Tree.
What about for a min-height 2-3 Tree?
Max height: inserting 1, 2, 3, 4, 5, 6, 7 in sorted order gives a 2-3 Tree of height 2, with root 4 and children 2 (leaves 1 and 3) and 6 (leaves 5 and 7).
Min height: inserting 2, 3, 4, 5, 6, 1, 7 gives a 2-3 Tree of height 1, with root 3 5 and leaves 1 2, 4, and 6 7.
Demo
?: Does your insertion order yield a similar-looking tree? What characteristics affect the height?

B-Tree Invariants
All leaves must be the same depth from the root.
A non-leaf node with k keys must have exactly k + 1 non-null children.

These invariants guarantee bushy trees.
The tree to the right is not a possible B-Tree based on these invariants.
[Diagram: a tree with nodes 2 3, 5 6 7, 4, and 1 that violates the B-Tree invariants.]
?: Why is the tree to the right impossible? Which invariants does it violate?




?: Based on our algorithm design principle, explain to yourself why these invariants must be true.
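One way to internalize the invariants is to write them as a checker. The sketch below (hypothetical node shape: a list of keys plus a list of children) throws if either invariant is violated:

```java
import java.util.List;

// Sketch of an invariant checker for a B-Tree node (hypothetical node shape, not course code).
class BTreeChecker {
    static class Node {
        List<Integer> keys;
        List<Node> children;   // empty list for a leaf
        Node(List<Integer> keys, List<Node> children) {
            this.keys = keys;
            this.children = children;
        }
    }

    // Returns the depth of the leaves below this node if both invariants hold; throws otherwise.
    static int check(Node node) {
        if (node.children.isEmpty()) return 0;   // a leaf
        if (node.children.size() != node.keys.size() + 1)
            throw new IllegalStateException("a node with k keys must have exactly k + 1 children");
        int depth = -1;
        for (Node child : node.children) {
            int d = check(child);
            if (depth == -1) depth = d;
            else if (d != depth)
                throw new IllegalStateException("all leaves must be the same depth");
        }
        return depth + 1;
    }
}
```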

Height of a B-Tree with Node-Item Limit L
The largest possible height occurs when every node has just 1 key:
H(N) ~ log_2(N) ∈ Θ(log N)
The smallest possible height occurs when every node has L keys:
H(N) ~ log_{L+1}(N) ∈ Θ(log N)
[Diagrams: two example B-Trees with L=2, drawn with anonymized keys.]
N=8, L=2, H(N) = 2
N=26, L=2, H(N) = 2
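As a quick check on the right-hand example (assuming that tree is completely full, which its numbers suggest): with L = 2 keys in every node and height 2, the tree holds 2 + 2·3 + 2·3² = 2 + 6 + 18 = 26 keys, and H = log_3(N + 1) − 1 = log_3(27) − 1 = 2, matching the caption.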

Search Runtime
Worst case number of nodes to inspect: H + 1
Worst case number of keys to inspect per node: L
Overall runtime: O(HL)
Since H(N) ∈ Θ(log N) and L is a constant, overall runtime is O(log N).
?: Describe the procedure for searching in a B-Tree.
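A sketch of that procedure (hypothetical node shape, not the course's code): scan the node's sorted keys, stop on a match, or descend into the child between the two keys that bracket the target.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of B-Tree search (hypothetical node shape: sorted keys plus children; an empty
// children list marks a leaf). Each node costs at most L comparisons, so search is O(HL).
class BTreeSearchSketch {
    static class Node {
        List<Integer> keys = new ArrayList<>();      // sorted keys, at most L of them
        List<Node> children = new ArrayList<>();     // keys.size() + 1 children, or empty for a leaf
    }

    static boolean contains(Node node, int key) {
        if (node == null) return false;
        int i = 0;
        while (i < node.keys.size() && key > node.keys.get(i)) i += 1;   // first key >= target
        if (i < node.keys.size() && node.keys.get(i) == key) return true;
        if (node.children.isEmpty()) return false;                       // reached a leaf without a match
        return contains(node.children.get(i), key);                      // descend between bracketing keys
    }
}
```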

Insertion Runtime
Worst case number of nodes to inspect: H + 1
Worst case number of keys to inspect per node: L
Worst case number of split operations: H + 1
Overall runtime: O(HL)
Since H(N) ∈ Θ(log N) and L is a constant, overall runtime is O(log N).
?: How do we get the worst case number of split operations?




?: This assumes each split operation takes constant time. Give examples that demonstrate why this is the case.

Summary
The Algorithm Design Process, especially choosing a hypothesis, informed our results.

B-Trees are one type of balanced search tree: a modification of the binary search tree that avoids the Θ(N) worst case.
Nodes may contain between 1 and L keys.
Searching for a key works almost exactly as in a normal BST.
Insertion works by overstuffing leaf nodes; if a node becomes overfull, it splits.
The resulting tree is perfectly balanced, so the runtime is in O(log N).
B-Trees are more complex than BSTs, but they can handle any workload without degrading to Θ(N).