Take Home Final for CSE 454, Autumn 2002

Note: this exam is open book and open Web, but please do not discuss any question with any mammal besides Tal or Dan.

Due by 10:30am sharp, Wed 12/18 (bring a hardcopy to class at the start of the project presentations). Typed answers are requested (if you want to use pencil and paper for math or figures, that's fine, but please be neat!). We award partial credit when appropriate, but this is only possible if we can read and understand what you did and why.

 

1)  You are building a data-mining system to analyze web logs, and you've decided to use a decision-tree learning system. You write a Perl script to process the log files and cull out the key attributes Ai, e.g., features such as whether the visitor followed a specific link, whether they registered by logging in, etc. Now it is time to learn a tree that predicts whether a visitor will make a purchase. Consider the following set of examples (+ indicates an example where the customer made a purchase):

    Instance   Classification   A1   A2   A3
       1             -          F    F    T
       2             +          T    F    T
       3             -          T    T    F
       4             -          T    T    F
       5             -          T    T    F
       6             +          F    F    F
       7             +          T    T    T
       8             +          F    T    F
       9             -          F    F    T
      10             +          F    T    F

A)    [2 pts] What is the entropy of this collection of training examples with respect to the target function classification?

B)    [3 pts] What is the information gain of A1, A2 and A3 relative to the training examples?

C)    [5 pts] Draw the decision tree that would be found by greedy (hill-climbing) search maximizing information gain, where the search continues expanding the tree until no choice provides positive information gain. If the metric doesn't discriminate between attributes, break ties by choosing the lower-numbered attribute. (A verification sketch for parts A-C appears after part D.)

D)    [1 pt] What is the predicted class for <F, T, T>?
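The hand calculations in parts A-C can be double-checked with a short script. The following is only a sketch (Python is assumed; the dataset is hard-coded from the table above and the function names are illustrative, not part of the assignment):

    # Sketch (Python assumed) for double-checking parts A-C by hand.
    from math import log2

    # (A1, A2, A3, class) tuples copied from the table above.
    DATA = [
        ('F', 'F', 'T', '-'), ('T', 'F', 'T', '+'), ('T', 'T', 'F', '-'),
        ('T', 'T', 'F', '-'), ('T', 'T', 'F', '-'), ('F', 'F', 'F', '+'),
        ('T', 'T', 'T', '+'), ('F', 'T', 'F', '+'), ('F', 'F', 'T', '-'),
        ('F', 'T', 'F', '+'),
    ]

    def entropy(examples):
        """Entropy (in bits) of the class labels, which are the last field."""
        n = len(examples)
        pos = sum(1 for e in examples if e[-1] == '+')
        return -sum(p * log2(p) for p in (pos / n, (n - pos) / n) if p > 0)

    def info_gain(examples, attr):
        """Information gain of splitting on attribute index attr (0, 1, or 2)."""
        remainder = 0.0
        for value in ('T', 'F'):
            subset = [e for e in examples if e[attr] == value]
            if subset:
                remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder

    print('H(S)     =', entropy(DATA))
    for i in range(3):
        print('Gain(A%d) =' % (i + 1), info_gain(DATA, i))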

2) [4 pts] Develop a naïve Bayes classifier using Laplace smoothing with an m-estimate of m=1 for the previous dataset. What class does it predict for <F, T, T>? Show your work.
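Here too, a short script can sanity-check the arithmetic. The sketch below assumes Python and one common reading of the m-estimate, namely P(A=v|c) = (count + m*p)/(n_c + m) with a uniform prior p over the attribute's two values; adjust it if you use a different definition of Laplace/m-estimate smoothing:

    # Sketch (Python assumed): naive Bayes with m-estimate smoothing, m = 1.
    # P(A_i = v | c) is estimated as (count(A_i = v, c) + m*p) / (count(c) + m),
    # with uniform prior p = 1/2 over the two attribute values T/F. Adjust the
    # formula if your definition of Laplace / m-estimate smoothing differs.
    M, P_UNIFORM = 1.0, 0.5

    DATA = [  # (A1, A2, A3, class), copied from the table in question 1
        ('F', 'F', 'T', '-'), ('T', 'F', 'T', '+'), ('T', 'T', 'F', '-'),
        ('T', 'T', 'F', '-'), ('T', 'T', 'F', '-'), ('F', 'F', 'F', '+'),
        ('T', 'T', 'T', '+'), ('F', 'T', 'F', '+'), ('F', 'F', 'T', '-'),
        ('F', 'T', 'F', '+'),
    ]

    def score(query, label):
        """Unnormalized P(label) * product over i of P(A_i = query[i] | label)."""
        in_class = [e for e in DATA if e[-1] == label]
        s = len(in_class) / len(DATA)            # class prior (unsmoothed here)
        for i, v in enumerate(query):
            matches = sum(1 for e in in_class if e[i] == v)
            s *= (matches + M * P_UNIFORM) / (len(in_class) + M)
        return s

    print({label: score(('F', 'T', 'T'), label) for label in ('+', '-')})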

3) [2 pts] Does SSL provide effective user authentication? If so, how? If not, why not, and what needs to be added?

 

4) [2 pts] What is the difference between a hash function and a cryptographic hash? What’s an example of the latter?

5) [4 pts] What is the vector space angle between each pair of the following sentences? Show all intermediate work.

A)      Alaska has amazing flies. 

B)      Sam flies Alaska airlines.

C)      Tent flies eat Tsetse flies.
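A sketch of the computation follows (Python assumed; it uses naive lowercase/whitespace tokenization and raw term counts, so state whatever tokenization, stemming, and weighting choices you make in your answer):

    # Sketch (Python assumed): pairwise vector-space angles between the sentences.
    # Tokenization here is naive (lowercase, split on whitespace, strip periods)
    # and raw term frequencies are used as weights.
    import math
    from collections import Counter

    SENTENCES = {
        'A': 'Alaska has amazing flies.',
        'B': 'Sam flies Alaska airlines.',
        'C': 'Tent flies eat Tsetse flies.',
    }

    def vec(sentence):
        return Counter(w.strip('.').lower() for w in sentence.split())

    def angle_deg(s1, s2):
        v1, v2 = vec(s1), vec(s2)
        dot = sum(v1[w] * v2[w] for w in v1)
        norms = math.sqrt(sum(c * c for c in v1.values())) * \
                math.sqrt(sum(c * c for c in v2.values()))
        return math.degrees(math.acos(dot / norms))

    for x, y in (('A', 'B'), ('A', 'C'), ('B', 'C')):
        print(x, y, round(angle_deg(SENTENCES[x], SENTENCES[y]), 1))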

 

6) One of the problems with Google is that it is not always easy to find specific information quickly, since only snippets and URLs are returned from queries.  Suppose that you were interested in using the Web as a data source to create a table containing people's names, email addresses, and phone numbers, which you could then query directly.  Please answer the following questions as clearly as possible (2 pts each). Think carefully before you start writing, because there is a limit of 60 words per question (no exceptions).

A.      How might you extract specific information such as phone numbers and email addresses from the web pages?

B.      How might you connect a name to other information such as a phone number?

C.      In some cases, information might be spread across multiple linked pages; e.g., a person's name might be on one page, and that person's contact info might be on a linked page. Can you think of any general (or heuristic) way to deal with this?

D.      How might you handle missing, duplicate, ambiguous, or dirty information? What if one page listed an email address for Dan Weld while another page listed a phone number for Daniel S. Weld?

E.      If you plan to solve any of these problems using supervised learning, where would you get the training data?  Are there any techniques that you could use to reduce the effort a user would have to spend manually training such a system? 

7) Consider the following term-document (T-D) matrix, which describes 6 documents in terms of 4 keywords.

 

 

                D1   D2   D3   D4   D5   D6
    Bush         5   15    7    9    7    0
    Kalahari     5    7    1    0    1    0
    Iraq         1    0    7    4    6    0
    Saddam       0    1    6    4    0    4

 


We decide to reduce the noise and dimensionality of this data through SVD analysis. According to MATLAB, the SVD of this T-D matrix is U S Vt, where U, S, and Vt are given below:

 

U =
    0.8817    0.1969   -0.0444   -0.4264
    0.2887    0.4928    0.1190    0.8122
    0.3033   -0.6652   -0.5674    0.3790
    0.2173   -0.5253    0.8136    0.1222

S =
   23.33      0         0         0         0         0
    0         9.76      0         0         0         0
    0         0         5.03      0         0         0
    0         0         0         3.27      0         0

Vt =
    0.2638    0.6627    0.4237    0.4293    0.3549    0.0373
    0.2850    0.6018   -0.6079   -0.3061   -0.2171   -0.2151
   -0.0385    0.1948    0.1425    0.1162   -0.7138    0.6460
    0.7038   -0.1795    0.3700   -0.5590    0.0308    0.1491
    0.5557   -0.3294   -0.1526    0.6077   -0.3198   -0.2965
   -0.2090    0.1411    0.5201   -0.1635   -0.4629   -0.6519

 

 

Ignore the fact that S isn't square unless you want to think about degenerate eigenvectors; focus on the diagonal in any case. Suppose we decide to keep just the top two most important dimensions after the LSI analysis. The loss is:

1 − (σ1² + σ2²) / (σ1² + σ2² + σ3² + σ4²) = 0.05331, or about 5.3%, where σ1, …, σ4 are the singular values on the diagonal of S.
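One way to check this figure is to recompute the singular values and the rank-2 loss directly; the sketch below assumes Python/numpy (MATLAB's svd works equally well, and the signs of the singular vectors may differ from those printed above):

    # Sketch (Python/numpy assumed): recompute the singular values and the rank-2 loss.
    import numpy as np

    A = np.array([[5, 15, 7, 9, 7, 0],    # Bush
                  [5,  7, 1, 0, 1, 0],    # Kalahari
                  [1,  0, 7, 4, 6, 0],    # Iraq
                  [0,  1, 6, 4, 0, 4]],   # Saddam
                 dtype=float)

    U, s, Vt = np.linalg.svd(A)            # s holds the singular values, largest first
    loss = 1 - (s[0]**2 + s[1]**2) / np.sum(s**2)
    print(s.round(2), round(float(loss), 4))   # loss should come out near 0.053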

 

A) [2 pts] Draw a bounding box around the parts of the U, S, and Vt matrices above that will be retained after this decision.  [You can answer this question by directly marking the matrices above.]

 

B) [2 pts] Suppose the two most important dimensions after LSI are called F1 and F2, respectively.  Plot the six documents as points in this vector space (use the plot on the next page).  (It is okay to place each point roughly where it will fall; no need to strain your eyes counting all the small grid lines.)

 

C) [2 pts] Comment (40-word maximum) on the way the documents appear in the plot: is their placement related in any rational way to the similarity you would intuitively attach to them?

 

D) [2 pts] What is the vector-space (cosine) similarity between D5 and D6 in the original T-D matrix?

 

E) [2 pts] What is the vector-space (cosine) similarity between D5 and D6 after the LSI transformation to the two most significant dimensions? (A verification sketch for parts D and E appears after part F.)

 

F) [1 pt] Is the change intuitively justified, or is LSI inadvertently collapsing distinct concepts, or is this a case where the 5% noise is outweighing the signal?
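For parts D and E, the sketch below (Python/numpy assumed) shows one way to set up the two cosine computations. Note that there is more than one convention for placing documents in the reduced space (unscaled rows of Vk versus rows of Vk scaled by the singular values); say which one you use.

    # Sketch (Python/numpy assumed): cosine similarity of D5 and D6 before and after
    # truncating to 2 dimensions. Reduced-space document coordinates are taken as the
    # columns of S_k * Vt_k (one common convention; unscaled rows of V_k are another).
    import numpy as np

    A = np.array([[5, 15, 7, 9, 7, 0],
                  [5,  7, 1, 0, 1, 0],
                  [1,  0, 7, 4, 6, 0],
                  [0,  1, 6, 4, 0, 4]], dtype=float)

    def cos_sim(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

    # Part D: documents are columns of the original T-D matrix (D5 = col 4, D6 = col 5).
    print('original  :', cos_sim(A[:, 4], A[:, 5]))

    # Part E: project into the 2 most significant LSI dimensions.
    U, s, Vt = np.linalg.svd(A)
    docs_2d = np.diag(s[:2]) @ Vt[:2, :]     # 2 x 6; each column is one document
    print('rank-2 LSI:', cos_sim(docs_2d[:, 4], docs_2d[:, 5]))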