HW 4 CSE473: Machine Learning (due Nov. 29)
1. Exercise 2.4 in Mitchell (attached)
SOLUTION:
Here’s a postscript file with the solution to problem #1 (it appears as #2 on that particular assignment, so you’ll have to scroll down).
2. Suppose we want to classify whether a given balloon is inflated based on four attributes: the color of the balloon, the size of the balloon, the "act" that the person holding the balloon is engaged in, and the age of the person holding the balloon. Show the decision tree that ID3 would build to learn this classification. Display the information gain for each candidate attribute at the root of the tree (the calculations that you used to choose the first attribute to split on).
Color  | Size  | Act     | Age   | Inflated?
-------|-------|---------|-------|----------
YELLOW | SMALL | STRETCH | ADULT | T
YELLOW | SMALL | STRETCH | CHILD | T
YELLOW | SMALL | DIP     | ADULT | T
YELLOW | SMALL | DIP     | CHILD | F
YELLOW | SMALL | DIP     | CHILD | F
YELLOW | LARGE | STRETCH | ADULT | T
YELLOW | LARGE | STRETCH | CHILD | T
YELLOW | LARGE | DIP     | ADULT | T
YELLOW | LARGE | DIP     | CHILD | F
YELLOW | LARGE | DIP     | CHILD | F
PURPLE | SMALL | STRETCH | ADULT | T
PURPLE | SMALL | STRETCH | CHILD | T
PURPLE | SMALL | DIP     | ADULT | T
PURPLE | SMALL | DIP     | CHILD | F
PURPLE | SMALL | DIP     | CHILD | F
PURPLE | LARGE | STRETCH | ADULT | T
PURPLE | LARGE | STRETCH | CHILD | T
PURPLE | LARGE | DIP     | ADULT | T
PURPLE | LARGE | DIP     | CHILD | F
PURPLE | LARGE | DIP     | CHILD | F
SOLUTION:
First we calculate the information gain for splitting on each attribute at the root. The starting entropy of the full set [12+, 8-] is 0.971.
Gain(Color) = 0.971 - (10/20) 0.971 - (10/20) 0.971 = 0,
since splitting on Color produces two sets, each [6+, 4-] with entropy 0.971.
Gain(Size) = 0.971 - (10/20) 0.971 - (10/20) 0.971 = 0, for the same reason.
Gain(Act) = 0.971 - (8/20) 0 - (12/20) 0.918 = 0.42,
since splitting on Act yields a set [8+, 0-] with entropy 0 and a set [4+, 8-] with entropy 0.918.
Gain(Age) = 0.971 - (8/20) 0 - (12/20) 0.918 = 0.42.
So we can choose either Age or Act and then recurse on the new set of 12 examples…
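These root-level gains can be double-checked with a short script. This is a sketch, not code from the assignment: the dataset is rebuilt from the table above, and the helper names are my own.

```python
import math

def entropy(pos, neg):
    """Entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# Rebuild the balloon dataset from the table: each (color, size) pair
# contributes the same five (act, age, inflated) rows.
data = []
for color in ("YELLOW", "PURPLE"):
    for size in ("SMALL", "LARGE"):
        data += [(color, size, "STRETCH", "ADULT", True),
                 (color, size, "STRETCH", "CHILD", True),
                 (color, size, "DIP", "ADULT", True),
                 (color, size, "DIP", "CHILD", False),
                 (color, size, "DIP", "CHILD", False)]

def gain(rows, attr):
    """Information gain of splitting rows on the attribute at index attr."""
    pos = sum(1 for r in rows if r[-1])
    g = entropy(pos, len(rows) - pos)
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        sub_pos = sum(1 for r in subset if r[-1])
        g -= len(subset) / len(rows) * entropy(sub_pos, len(subset) - sub_pos)
    return g

for name, i in [("Color", 0), ("Size", 1), ("Act", 2), ("Age", 3)]:
    print(f"Gain({name}) = {gain(data, i):.3f}")
```

Running it prints gains of 0 for Color and Size and 0.420 for Act and Age, matching the hand calculation.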
The final tree (choosing Act):

                [12+, 8-]
                   Act?
                /       \
         Stretch         Dip
              |           |
         [8+, 0-]     [4+, 8-]
            YES          Age?
                       /      \
                  Adult        Child
                    |            |
               [4+, 0-]     [0+, 8-]
                  YES           NO
3. a) How would you modify ID3 to do a k-step lookahead (rather than making a greedy decision as nodes are added one by one to the tree, this algorithm would pick the split that leads to the best subtree of depth k)? For example, if the problem has three attributes, each with four values a1...a4, b1...b4, c1...c4, a sample set might look like:

A    B    C
------------
a2   b1   c2
a3   b1   c4
a1   b2   c3
...

When deciding which node to split on for a given set of candidates, a "2-step lookahead" would look at combinations of two attributes such as {A,B} or {A,C} rather than single attributes.
b) Give the pros and cons of this modified algorithm
with respect to basic ID3 in terms of accuracy and time complexity.
SOLUTION:
There was a huge range in terms of complexity with regards to your answer to this. I ended up giving full credit to
most people, given that you had put some thought into how this algorithm would be implemented and displayed
an intuition of how accurate/costly it would be.
a) ID3 with k-step lookahead will consider combinations of size k over the n attributes of the examples. There will be C(n, k) = n! / (k! (n-k)!) of these to consider (n(n-1)/2 when k = 2). We could imagine a simple extension of ID3 that loops over these combinations and sums the total gain of splitting on the k different attributes in each combination. The algorithm could then take the max of these and split on one of the attributes in the winning combination.
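One way this combination-scoring loop might be sketched in code (the helper functions, the decision to split on the first attribute of each combination, and the subset weighting are my own assumptions, not prescribed by the assignment):

```python
import math
from itertools import combinations

def entropy(labels):
    """Entropy of a list of class labels."""
    h = 0.0
    for v in set(labels):
        p = labels.count(v) / len(labels)
        h -= p * math.log2(p)
    return h

def split(rows, attr):
    """Partition rows by their value for the attribute at index attr."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    return list(parts.values())

def gain(rows, attr):
    """Information gain of splitting rows on attr (label is the last field)."""
    base = entropy([row[-1] for row in rows])
    return base - sum(len(s) / len(rows) * entropy([row[-1] for row in s])
                      for s in split(rows, attr))

def lookahead_choice(rows, attrs, k=2):
    """Score each size-k combination: gain of its first attribute plus the
    weighted gains of the remaining attributes on each resulting subset.
    Return the first attribute of the best-scoring combination."""
    best_attr, best_score = None, float("-inf")
    for combo in combinations(attrs, k):
        first, rest = combo[0], combo[1:]
        score = gain(rows, first)
        for subset in split(rows, first):
            for a in rest:
                score += len(subset) / len(rows) * gain(subset, a)
        if score > best_score:
            best_attr, best_score = first, score
    return best_attr
```

For example, on an XOR-style dataset, where no single attribute has positive gain on its own, this 2-step lookahead still identifies the interacting pair, while plain ID3's greedy criterion sees all attributes as equally (un)informative.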
b) Why is the k-step lookahead better? With regular ID3, a greedy decision is made whenever a node is added –
we simply split on the attribute with the highest gain. ID3 is myopic – it might make sense “locally” to split on a
certain attribute, but this would lead to a more complicated tree in the long run. By looking ahead k steps, ID3 can
avoid these sorts of “local minima/maxima.” It might not always be the case that the lookahead will lead to better
decisions. It might build the same tree at an added cost for computation. We can say that it will never be LESS
accurate than regular ID3.
In terms of complexity, the k-step lookahead will make the algorithm exponential and intractable for large k
(some of you formulated the factorial bound, but I didn’t end up requiring that for full credit).
4. For each of the learning methods below, identify an inductive bias that is inherent in the approach.
a) Decision tree learning
b) Bayesian Belief Networks
c) The CANDIDATE-ELIMINATION algorithm, where hypotheses take the form of conjunctions of attributes
d) Naive Bayes classifiers
SOLUTION:
a) Decision tree learners like ID3 favor shorter trees that place attributes with the highest information
gain near the root.
b) Bayesian Belief Networks assume conditional independence between a subset of nodes (as represented
by the directed graph).
c) It’s assumed that the concept is included in the hypothesis space (conjunctions of attributes).
d) We assume that the attribute values are independent given the classifications.
5. Consider the following task: You are asked to build a model for predicting whether or not a patient has the fatal PROCRASS-10.8 virus, which appears to be highly deadly and potentially very contagious. The virus has just been spotted in a remote African village, and you've been sent in to investigate it. Due to the recency of the outbreak, only 13 cases have been observed. Each of these samples is a vector of continuous-valued results of six diagnostic tests. What sort of learner would you use to address this problem? Why?
SOLUTION:
I accepted many solutions, hinging on the strength of the argument you made. The best answer is an instance-based
learner since there are only a handful of examples and the attribute values are continuous. It’s true that the small
amount of data is a detractor for instance-based methods, but that is true for all learning models. Trees, especially,
have a great deal of trouble when the dataset is small, since you’re removing a great deal of information at each
split (and thus decreasing the information available to the next level). This leads to an early exhaustion of a small
dataset. Also, the small size of the example set means we won’t have to worry about the possibility of an unwieldy
model with an instance-based learner (the size is small).
Many of you said CANDIDATE-ELIMINATION, which was a big surprise! If you did give this answer, it was
important that you made some provision for noise in the dataset (certainly a possibility with this problem).
CANDIDATE-ELIMINATION is a nice algorithm for proving things about version/hypothesis spaces, but it’s not
generally something we’d expect to apply in the real world…
6. Using the figure below, sketch out the Voronoi diagram induced by the Euclidean distance between query instances and the training examples. Shade in the positive regions of the graph (areas that will be classified as "+" given these training examples) that are generated by the 1-Nearest Neighbor algorithm.
SOLUTION: (scanned image)
Several people forgot to draw the boundaries between the different negative examples, and just shaded in
the positive region. The Voronoi diagram still needs to show the neighborhoods for each training example.
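Since the figure itself is a scanned image, the underlying 1-NN rule can at least be sketched: a query takes the label of its closest training example, i.e. of the Voronoi cell it falls in. The training points below are made up for illustration and are not the points from the figure.

```python
import math

def nearest_neighbor_label(query, examples):
    """Classify query by the label of the closest training example
    under Euclidean distance (the Voronoi cell the query lands in)."""
    point, label = min(examples, key=lambda e: math.dist(query, e[0]))
    return label

# Hypothetical training set: ((x, y), class)
train = [((1.0, 1.0), '+'), ((4.0, 1.0), '-'),
         ((2.5, 3.0), '+'), ((4.5, 4.0), '-')]

print(nearest_neighbor_label((1.2, 1.5), train))  # closest point is (1, 1), so '+'
```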
7. Consider the following Bayesian network, in which the variables A, B, and C are boolean:

   A
  / \
 v   v
 B   C

Suppose you want to learn the parameters for this network using the training set
{<1,0,1>, <0,0,1>, <0,?,1>, <1,1,1>},
where examples are in the form <A,B,C> and "?" indicates a missing value. Show the sequence of filled-in values and parameters produced by the EM algorithm, assuming the parameters are initialized ignoring the missing values.
SOLUTION:
I accepted answers that derived fractional terms for the missing values as well as those that used
“harsh EM” to fill them in with either 0 or 1.
Initialize the probabilities in the net:
P(A) = 0.5
P(B|A) = 0.5
P(B|~A) = 0
P(C|A) = 1
P(C|~A) = 1
Use EM to calculate P(? = 1), i.e., P(B|~A):
E STEP # 1:
P(B) = P(A)P(B|A) + P(~A)P(B|~A)
     = 0.5 * 0.5 + 0.5 * 0
     = 0.25
P(B|~A) = (0.25 + 0) / 2 = 0.125
Fill in the missing value:
<0, 0, 1> {Harsh EM}
<0, 0.125, 1> {fractional}
M STEP #1: (only P(B), P(B|~A) change, to {0.25, 0.125} )
E STEP #2 :
P(B) = P(A)P(B|A) + P(~A)P(B|~A)
= 0.5 * 0.5 + 0.5 * 0.125
= 0.3125
P(B|~A) = (0 + 0.3125) / 2 = 0.1563
Fill in the missing value:
<0,0,1> {Harsh EM, so if you did this you could have stopped here}
<0,0.1563, 1> {obviously, this method will give a finer grained answer}
M STEP #2: (only P(B), P(B|~A) change, to {0.3125, 0.1563})
E STEP #3:
P(B) = P(A)P(B|A) + P(B|~A)P(~A)
= 0.5 * 0.5 + 0.1563 * 0.5
= 0.3281
P(B|~A) = (0.3281 + 0) / 2 = 0.1641
Fill in the missing value:
<0, 0.1641, 1>
M STEP #3: (only P(B), P(B|~A) change, to {0.3281, 0.1641})
E STEP #4:
P(B) = P(A)P(B|A) + P(B|~A)P(~A)
= 0.5 * 0.5 + 0.1641 * 0.5
= 0.3320
P(B|~A) = (0.3320 + 0) / 2 = 0.1660
Fill in the missing value:
<0, 0.1660, 1>
… the algorithm converges to P(B|~A) = 1/6 ≈ 0.1667 and P(B) = 1/3 ≈ 0.333 if you use fractional values for the missing parameters.
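The fixed point can be checked numerically. The loop below follows the update pattern used in this solution (fill the missing B with the marginal-based estimate, then re-average over the two A = 0 examples); the variable names are my own.

```python
# Parameters fixed by the fully observed values:
p_a = 0.5          # P(A): 2 of the 4 examples have A = 1
p_b_given_a = 0.5  # P(B|A): of the two A = 1 examples, one has B = 1
p_b_not_a = 0.0    # P(B|~A), initialized ignoring the missing value

for _ in range(50):
    # E step: marginal P(B) under the current parameters
    p_b = p_a * p_b_given_a + (1 - p_a) * p_b_not_a
    # M step: average the observed B = 0 with the filled-in value
    p_b_not_a = (0 + p_b) / 2

print(round(p_b_not_a, 4), round(p_b, 4))  # 0.1667 0.3333
```

This reproduces the convergence values quoted above: 1/6 for P(B|~A) and 1/3 for P(B).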
8. Give an example of a learning problem that would be appropriate for the Naive Bayes classifier. Give a specific example (i.e., avoid something brief and general such as "text classification").
SOLUTION:
I accepted all solutions that showed an understanding of Naïve Bayes and the sort of assumptions that
it makes about a problem. It was nice to see some responses that veered away from the text classification
“sphere” of learning domains.