Homework #3

Your success at Scotland Yard has spread rapidly. Now a team of cancer specialists would like to enlist your aid in deciding whether to operate on a number of their patients. All of their patients have tumors, but the specialists are unsure whether the tumors are malignant or benign: if a tumor is malignant, the team must operate, but if it is benign there is no need for a risky operation. The specialists have kept detailed records of past cases of patients with tumors and whether these tumors turned out to be malignant. They want you to use this past information to make predictions about the possible malignancy of the tumors of the new patients. Much is riding on these predictions, since your great Aunt Myrtle is one of the new patients.

The data the specialists have collected is summarized in the table below. There are three attributes of each tumor that the cancer specialists can determine without an operation: 1) is the tumor large? 2) is the tumor hard? 3) is the tumor symmetrical? These attributes are abbreviated L, H and S respectively, and their truth values are given in the table. For the collected data, the specialists also know whether the tumor turned out to be malignant or not. This fact is recorded in the last column, labeled M.

Data For Past Patients (Training data)

      L  H  S  M
      ----------
1)    T  F  F  T  
2)    F  F  F  T
3)    F  T  T  T
4)    F  F  T  F
5)    T  F  T  F


Data for the New Patients (Test data)

      L  H  S  M
      ----------
6)    F  F  F  ?   
7)    T  T  T  ?
8)    F  F  T  ?

Problem #1:

Your first attempt at this problem is to grow a decision tree. Your splitting criterion is information gain. The stopping criterion is zero entropy (split a node unless it has zero entropy).

a) Give the decision tree which is grown on the training data. Please represent this tree in text in the following way: Call the root of the tree Root. For all other nodes, give a string of L's and R's specifying how to reach that node by traversing left or right from the root. For nodes where a question about an attribute is asked, give the abbreviation for that attribute. Assume that if the attribute is true, we go to the left child, and if false, to the right child. Also give, to two decimal places, the information gain of asking that question. For the leaf nodes, give the predicted classification.

For example, here's a sample text representation of a tree with 5 nodes. The root's left child is a leaf. The right child is a parent node which has two children which are leaf nodes:

      Root:   H?   .52
      L:      M=True
      R:      L?   .63
      RL:     M=False     
      RR:     M=True
      
**ANSWER**
      Root:  S?  .42
      L:     H?  .92
      LL:    M=True
      LR:    M=False
      R:     M=True

We calculate the information gain in the following way:

H() is the function which returns the entropy of a distribution.

Entropy(Root) = H(3+, 2-) = 0.97
Info Gain(S?) = 0.97 - (0.4 * H(2+, 0-) + 0.6 * H(1+, 2-))
                (40% of the data is in the right (S=False) node, 60% in the left (S=True) node)
              = 0.97 - (0.4 * 0 + 0.6 * 0.918)
              = 0.97 - 0.55 = 0.42

Entropy(Left) = H(1+, 2-) = 0.92
Info Gain(H?) = 0.92 - (0.33 * H(1+, 0-) + 0.67 * H(0+, 2-))
              = 0.92 - (0.33 * 0 + 0.67 * 0)
              = 0.92
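
For reference, here is a small Python sketch (my own addition, not part of the original solution) that recomputes these entropy and information-gain values directly from the training table above.

  from math import log2

  # Training data from the table above: (L, H, S, M) for patients 1-5.
  DATA = [
      (True,  False, False, True),
      (False, False, False, True),
      (False, True,  True,  True),
      (False, False, True,  False),
      (True,  False, True,  False),
  ]

  def entropy(examples):
      # Entropy of the malignancy labels (last field) of a set of examples.
      if not examples:
          return 0.0
      p = sum(1 for e in examples if e[3]) / len(examples)
      if p in (0.0, 1.0):
          return 0.0
      return -(p * log2(p) + (1 - p) * log2(1 - p))

  def info_gain(examples, attr):
      # Information gain of splitting on the boolean attribute at index attr.
      t = [e for e in examples if e[attr]]
      f = [e for e in examples if not e[attr]]
      remainder = (len(t) / len(examples)) * entropy(t) \
                + (len(f) / len(examples)) * entropy(f)
      return entropy(examples) - remainder

  # Gains at the root for L (index 0), H (index 1), S (index 2):
  print([round(info_gain(DATA, i), 2) for i in range(3)])  # [0.02, 0.17, 0.42] -> split on S
  # Gain of asking H? at the left (S = True) child, i.e. on patients 3-5:
  left = [e for e in DATA if e[2]]
  print(round(info_gain(left, 1), 2))                      # 0.92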

b) How does your decision tree classify the tumors of the new patients (6-8)?
**ANSWER**
Patient 6: M=True, Patient 7: M=True, Patient 8: M=False
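
As a sanity check, here is a tiny sketch (again my addition, not from the original answer) that applies the learned tree -- S at the root, then H on the S=True branch -- to the three new patients.

  def predict(l, h, s):
      # Learned tree: S=False -> malignant; S=True -> malignant iff H=True.
      if s:
          return h
      return True

  new_patients = [(False, False, False),   # patient 6
                  (True,  True,  True),    # patient 7
                  (False, False, True)]    # patient 8
  print([predict(*p) for p in new_patients])   # [True, True, False]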

Problem #2

Much to your chagrin, Aunt Myrtle decides to sit in on the cse592 PAC learning lecture and has some ideas about a reformulation of this learning problem. She assumes that the past and current patients are both drawn from the same probability distribution. She also gets the cancer specialists to come up with a set of 10,000 likely hypotheses. The specialists are certain that one hypothesis in this set will predict the malignancy of a tumor perfectly but don't know which one it is. Myrtle is a gambling woman, and she would be happy with an algorithm that, with probability at least 1 - .001, finds a hypothesis whose probability of error is at most .01. Aunt Myrtle's algorithm is simply to choose randomly among the specialists' hypotheses which are completely consistent with the training data. (Note: the data for this problem contains more attributes than the data in problem #1, so the 10,000 hypotheses are all unique.)

Please answer the following questions about this technique:


a) How many training examples (data on past patients) must be found in
order to guarantee Aunt Myrtle's bounds?  Show your work.

**ANSWER**
eps = .01, delta = .001
n >= (1/epsilon) * (ln|H| + ln(1/delta))
n >= (1/.01) * (ln 10,000 + ln(1/.001))
n >= 1612

Or, using Dan's bound:

n >= ln(delta/|H|) / ln(1 - epsilon)
n >= ln(.001/10,000) / ln(.99)
n >= 1604
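
The arithmetic for both bounds can be checked with a few lines of Python (a sketch I am adding for convenience; the numbers come from the answer above).

  from math import log, ceil

  eps, delta, H_size = 0.01, 0.001, 10_000

  # Standard consistent-learner bound: n >= (1/eps) * (ln|H| + ln(1/delta))
  print(ceil((1 / eps) * (log(H_size) + log(1 / delta))))   # 1612

  # Tighter bound from solving |H| * (1 - eps)^n <= delta for n
  print(ceil(log(delta / H_size) / log(1 - eps)))           # 1604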


b) Can you give an upper bound on the probability that Myrtle's
algorithm will misclassify a new tumor chosen according to the
underlying probability distribution?  If so, give the bound.  If not
explain.

**ANSWER**
P(hyp chosen by alg is correct on Patient)
  >= P(alg picks eps-good hyp AND hyp is correct on Patient)
  =  P(alg picks eps-good hyp) * P(hyp is correct on Patient | alg picks eps-good hyp)
  >= (1 - delta) * (1 - epsilon)
  =  .989

So the probability that Myrtle's algorithm misclassifies a new tumor is at most 1 - .989 = .011.
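
In numbers (a quick check I am adding, using the epsilon and delta from part a):

  eps, delta = 0.01, 0.001
  p_correct = (1 - delta) * (1 - eps)    # lower bound on P(correct) = 0.98901
  print(round(p_correct, 3))             # 0.989
  print(round(1 - p_correct, 3))         # 0.011, upper bound on misclassification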
 
c) Assume the cancer specialists are wrong and their set of hypotheses
does not contain a perfect hypothesis.  If Myrtle's algorithm finds a
hypothesis consistent with the amount of training data given in answer
a), will the required epsilon and delta bounds still hold?  Explain.

**ANSWER**
yes.  the probability of a consistent hyp not being epsilon-good is still
delta.  However we can no longer satisfy arbitrary epsilon and delta
requests since for small epsilon there may be no hyp. in H that is
epsilon-good.

Problem #3

Aunt Myrtle is also interested in VC dimensions. Please humor a woman in dire straits and answer the following questions for her. a) Aunt Myrtle believes that the VC dimension of the hypothesis set in problem #2 is <= log 10,000. Can you show this by proving that, for a set of binary classifying hypotheses H, VC(H) <= log |H|? If so, do it. If not, give a counterexample.

**ANSWER**

Let d = log |H| (log base 2). Assume VC(H) > d. Then there is a set of d+1 instances that can be shattered by H. But there are 2^(d+1) possible labelings of these d+1 instances, so shattering them requires at least 2^(d+1) distinct hypotheses, while H contains only 2^d = |H|. This is a contradiction, so VC(H) <= log |H|.

b) Aunt Myrtle is also curious about the VC dimension of the hypothesis set in Problem #1 where each instance has only 3 attributes and the set of hypotheses is all possible decision trees. Give the VC dimension of this set of all decision trees which predict malignancy given the three boolean attributes.

**ANSWER**

There are only 2^3 = 8 distinct instances, so VC(H) <= 8. Now consider the set of full trees of height 4, where the internal nodes at levels 1, 2 and 3 ask about L, H and S respectively, and where the 8 leaf nodes take on all 2^8 possible combinations of malignancy predictions. This set of decision trees shatters the set of 8 distinct instances. Hence VC(H) = 8.
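
A brute-force check (my addition, not part of the original argument): for every one of the 2^8 labelings of the 8 instances, build the full tree described above as a lookup table over (L, H, S) and verify that it realizes that labeling.

  from itertools import product

  instances = list(product([True, False], repeat=3))   # the 8 (L, H, S) combinations

  for labeling in product([True, False], repeat=8):    # all 2^8 possible labelings
      leaves = dict(zip(instances, labeling))           # the 8 leaves of the full tree
      tree = lambda l, h, s: leaves[(l, h, s)]          # evaluating the tree = leaf lookup
      assert all(tree(*x) == y for x, y in zip(instances, labeling))

  print("all 256 labelings realized, so the 8 instances are shattered")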


Jared Saia