HW 4   CSE473 : Machine Learning  (due Nov. 29)

1. Exercise 2.4 in Mitchell (attached)

SOLUTION:

Here’s a postscript file with the solution to problem #1 (#2 on this particular assignment,

so you’ll have to scroll down.

2. Suppose we want to classify whether a given balloon is inflated based on

four attributes: the color of the balloon, the size of the balloon, the "act"

that the person holding the balloon is engaged in, and the age of the person

holding the balloon.  Show the decision tree that ID3 would build to learn

this classification.  Display the information gain for each candidate

attribute at the root of the tree (the calculations that you used to choose

the first attribute to split on).

 Color Size Act Age Inflated? YELLOW SMALL STRETCH ADULT T YELLOW SMALL STRETCH CHILD T YELLOW SMALL DIP ADULT T YELLOW SMALL DIP CHILD F YELLOW SMALL DIP CHILD F YELLOW LARGE STRETCH ADULT T YELLOW LARGE STRETCH CHILD T YELLOW LARGE DIP ADULT T YELLOW LARGE DIP CHILD F YELLOW LARGE DIP CHILD F PURPLE SMALL STRETCH ADULT T PURPLE SMALL STRETCH CHILD T PURPLE SMALL DIP ADULT T PURPLE SMALL DIP CHILD F PURPLE SMALL DIP CHILD F PURPLE LARGE STRETCH ADULT T PURPLE LARGE STRETCH CHILD T PURPLE LARGE DIP ADULT T PURPLE LARGE DIP CHILD F PURPLE LARGE DIP CHILD F

SOLUTION:

First we calculate the information gain for splitting each attribute at the root:  The starting entropy is 0.971

Gain( Color ) [12+, 8-] = 0.971 – (10/20) 0.971 – (10/20) 0.971 = 0,

since splitting on color produces two sets with [6+, 4-].

Gain( Size ) [12+, 8-] = 0.971 – (10/20) 0.971 – (10/20) 0.971 = 0.

Gain( Act ) [12+, 8-] = 0.971 – (8/20) 0 – (12/20) 0.918 = 0.42

since splitting on act yields a set [8+, 0-] with entropy 0, and a set [4+,8-] with entropy 0.918.

Gain( Age ) [12+, 8-] = = 0.971 – (8/20) 0 – (12/20) 0.918 = 0.42

So we can choose either Age or Act and then recurse on the new set of 12 examples…

The final tree (choosing Act):

[12+, 8-]

Act?

/     \

/         \

[8+,0-]  [4+,8-]

Stretch       Age?

/              /      \

YES     [4+,0-]  [0+,8-]

/                \

YES            NO

3. a) How would you modify ID3 to do a k-step lookahead (making a greedy decision on

how to add nodes one by one to the tree, this algorithm would pick the one that

leads to the best subtree of depth k)?  For example, if the problem has three

attributes each with four values a1...a4, b1...b4, c1...c4, a sample set might look

like:

A    B    C

---------------

a2   b1   c2

a3   b1   c4

a1   b2   c3

...

When deciding which node to split on for a given set of candidates, a "2-step

lookahead" would look at combinations of two attributes such as {A,B} or {A,C}

rather than single attributes.

b) Give the pros and cons of this modified algorithm

with respect to basic ID3 in terms of accuracy and time complexity.

SOLUTION:

There was a huge range in terms of complexity with regards to your answer to this.  I ended up giving full credit to

most people, given that you had put some thought into how this algorithm would be implemented and displayed

an intuition of how accurate/costly it would be.

a) The ID3 with k-step lookahead will consider combinations of size k over the n attributes of the examples.

There will be (n * (n-1))/k of these to consider.  We could imagine a simple extension of ID3 that loops on the

parameter K and sums the total gain of splitting on the k different attributes in each combo.  The algorithm could

then take the max of these and split on one of the attributes in the combo.

b) Why is the k-step lookahead better?  With regular ID3, a greedy decision is made whenever a node is added –

we simply split on the attribute with the highest gain.  ID3 is myopic – it might make sense “locally” to split on a

certain attribute, but this would lead to a more complicated tree in the long run.  By looking ahead k steps, ID3 can

avoid these sorts of “local minima/maxima.”  It might not always be the case that the lookahead will lead to better

decisions.  It might build the same tree at an added cost for computation.  We can say that it will never be LESS

accurate than regular ID3.

In terms of complexity, the k-step lookahead will make the algorithm exponential and intractable for large k

(some of you formulated the factorial bound, but I didn’t end up requiring that for full credit).

4. For each of the learning methods below, identify an inductive bias that is

inherent in the approach.

a) Decision tree learning

b) Bayesian Belief Networks

c) The CANDIDATE-ELIMINATiON where hypotheses take the form of conjunctions of

attributes.

d) Naive Bayes classifiers

SOLUTION:

a) Decision tree learners like ID3 favor shorter trees that place attributes with the highest information

gain near the root.

b) Bayesian Belief Networks assume conditional independence between a subset of nodes (as represented

by the directed graph).

c) It’s assumed that the concept is included in the hypothesis space (conjunctions of attributes).

d) We assume that the attribute values are independent given the classifications.

5. Consider the following task : You are asked to build a model for predicting

whether or not a patient has the fatal PROCRASS-10.8 virus, which appears to be

highly deadly and potentially very contagious.  The virus has just been spotted

in a remote African village, and you've been sent in to investigate it.  Due to

the recency of the breakout, only 13 cases have been observed. Each of these

samples is a vector of continuous-valued results of six diagnostic tests.  What

sort of learner would you use to address this problem?  Why?

SOLUTION:

I accepted many solutions, hinging on the strength of the argument you made.  The best answer is a instance-based

learner since there are only a handful of examples and the attribute values are continuous.  It’s true that the small

amount of data is a detractor for instance-based methods, but that is true for all learning models.  Trees, especially,

have a great deal of trouble when the dataset is small, since you’re removing a great deal of information at each

split (and thus decreasing the information available to the next level).  This leads to an early exhaustion of a small

dataset.  Also, the small size of the example set means we won’t have to worry about the possibility of an unwieldy

model with an instance-based learner (the size is small).

Many of you said CANDIDATE-ELIMINATION, which was a big surprise!  If you did give this answer, it was

important that you made some provision for noise in the dataset (certainly a possibility with this problem).

CANDIDATE-ELIMINATION is a nice algorithm for proving things about version/hypothesis spaces, but it’s not

generally something we’d expect to apply in the real world…

6. Using the figure below, sketch out the Voronoi diagram induced by the Euclidean

distance between query instances and the training examples.  Shade in the positive

regions of the graph (areas that will be classified as "+" given these training

examples) that are generated by the 1-Nearest Neighbor algorithm.

SOLUTION: (scanned image)

Several people forgot to draw the boundaries between the different negative examples, and just shaded in

the positive region.  The Voronoi diagram still needs to show the neighborhoods for each training example.

7. Consider the following Bayesian network, in which the variables A, B, and C are

boolean:

A

/ \

v   v

B     C

Suppose you want to learn the parameters for this network using the training set

{<1,0,1>, <0,0,1> <0,?,1> <1,1,1>}, where examples are in the form <A,B,C> and "?"

indicates a missing value.  Show the sequence of filled-in values and parameters

produced by the EM algorithm, assuming the parameters are initialized ignoring the

missing values.

SOLUTION:

I accepted answers that derived fractional terms for the missing values as well as those that used

“harsh EM” to fill them in with either 0 or 1.

Initialize the probabilities in the net:

P(A) = 0.5

P(B|A) = 0.5

P(B|~A) = 0

P(C|A) = 1

P(C|~A) = 1

Use EM to calculate P(?=1) or (B|~A)

E STEP # 1:

P(B) = P(A)P(B|A) + P(~A)P(B|~A)

= 0.5 * (0.5) + 0 (0.5)

= 0.25

P(B|~A) = 0.25 + 0 / 2 = 0.125

Fill in the missing value:

< 0,0,1>  {Harsh EM}

<0,0.125, 1>

M STEP #1: (only P(B), P(B|~A) change, to {0.25, 0.125} )

E STEP #2 :

P(B) = P(A)P(B|A) + P(~A)P(B|~A)

=  0.5 * 0.5 + 0.5 * 0.125

=  0.3125

P(B|~A) = 0 + 0.3125 / 2 = 0.1563

Fill in the missing value:

<0,0,1> {Harsh EM, so if you did this you could have stopped here}

<0,0.1563, 1> {obviously, this method will give a finer grained answer}

M STEP #2: (only P(B), P(B|~A) change, to, to {0.3125,0.1563} )

E STEP #3:

P(B) = P(A)P(B|A) + P(B|~A)P(~A)

= 0.5 * 0.5 + 0.1563 * 0.5

= 0.3281

P(B|~A) = 0.3281 + 0 / 2 = 0.1641

Fill in the missing value:

<0, 0.3281, 1>

M STEP #3:  (only P(B), P(B|~A) change, to {0.3281, 0.1641}

E STEP #4:

P(B) = P(A)P(B|A) + P(B|~A)P(~A)

= 0.5 * 0.5 + 0.1641 * 0.5

= 0.3320

P(B|~A) = 0.3320 + 0 / 2 = 0.1660

Fill in the missing value:

<0, 0.1660, 1>

the algorithm converges to 0.1666 for P(B|~A), 0.333 for P(B) if you use fractional values

for the missing parameters.

8.  Give an example of a learning problem that would be appropriate for the Naive

Bayes classifier.  Give a specific example (i.e., avoid something brief and general

such as "text classification").

SOLUTION:

I accepted all solutions that showed an understanding of Naïve Bayes and the sort of assumptions that

it makes about a problem.  It was nice to see some responses that veered away from the text classification

“sphere” of learning domains.