Take-Home Final for CSE 454, Autumn 2002
Note: this exam is open book and open Web, but please do not discuss any question with any mammal besides Tal or Dan.

Due by 10:30am sharp, Wed 12/18 (bring a hardcopy to class at the start of the project presentations). Typed answers are requested; if you want to use pencil and paper for math or figures, that's ok, but please be neat! We award partial credit when appropriate, but this is only possible if we can read and understand what you did and why.
1) You are building a data-mining system to analyze weblogs, and you've decided to use a decision-tree learning system. You write a Perl script to process the log files and cull out the key attributes Ai, e.g., features such as whether the user has followed a specific link, whether they have registered by logging in, etc. Now comes the time to learn a tree that predicts whether they will make a purchase. Consider the following set of examples (+ indicates an example where the customer made a purchase):
Instance | Classification | A1 | A2 | A3
   1     |       -        | F  | F  | T
   2     |       +        | T  | F  | T
   3     |       -        | T  | T  | F
   4     |       -        | T  | T  | F
   5     |       -        | T  | T  | F
   6     |       +        | F  | F  | F
   7     |       +        | T  | T  | T
   8     |       +        | F  | T  | F
   9     |       -        | F  | F  | T
  10     |       +        | F  | T  | F
A) [2 pts] What is the entropy of this collection of training examples with respect to the target function classification?
B) [3 pts] What is the information gain of A1, A2 and A3 relative to the training examples?
C) [5 pts] Draw the decision tree that would be found using greedy (hill-climbing) search maximizing information gain, which continues expanding the tree until no choice provides positive information gain. If the metric doesn't discriminate between attributes, break ties by choosing the lower-numbered attribute.
D) [1 pt] What is the predicted class for <F, T, T>?
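If you want to check your arithmetic for parts A-C, here is a minimal Python sketch using the standard entropy and information-gain definitions (the tuple encoding of the table is ours):

    from math import log2

    # Training examples from the table above: (A1, A2, A3, classification).
    examples = [
        ('F', 'F', 'T', '-'), ('T', 'F', 'T', '+'), ('T', 'T', 'F', '-'),
        ('T', 'T', 'F', '-'), ('T', 'T', 'F', '-'), ('F', 'F', 'F', '+'),
        ('T', 'T', 'T', '+'), ('F', 'T', 'F', '+'), ('F', 'F', 'T', '-'),
        ('F', 'T', 'F', '+'),
    ]

    def entropy(rows):
        """Entropy of the +/- label distribution over a list of examples."""
        if not rows:
            return 0.0
        p = sum(1 for r in rows if r[-1] == '+') / len(rows)
        return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

    def info_gain(rows, attr):
        """Information gain of splitting rows on attribute index attr (0-based)."""
        total = entropy(rows)
        for v in ('T', 'F'):
            subset = [r for r in rows if r[attr] == v]
            total -= len(subset) / len(rows) * entropy(subset)
        return total

    print('H(S) =', entropy(examples))
    for i in range(3):
        print(f'Gain(A{i + 1}) =', info_gain(examples, i))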
2) [4 pts] Develop a naïve Bayes classifier using Laplace smoothing with an m-estimate of m=1 for the previous dataset. What class does it predict for <F, T, T>? Show your work.
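One reasonable reading of "Laplace smoothing with an m-estimate of m=1" is the m-estimate (n_c + m*p)/(n + m) with a uniform prior p = 1/2 over the two attribute values; a sketch under that assumption (the table encoding is the same as in the previous sketch):

    # Naive Bayes with an m-estimate (m = 1, uniform prior p = 1/2 over {T, F}).
    # Rows are (A1, A2, A3, classification), copied from the table in question 1.
    examples = [
        ('F', 'F', 'T', '-'), ('T', 'F', 'T', '+'), ('T', 'T', 'F', '-'),
        ('T', 'T', 'F', '-'), ('T', 'T', 'F', '-'), ('F', 'F', 'F', '+'),
        ('T', 'T', 'T', '+'), ('F', 'T', 'F', '+'), ('F', 'F', 'T', '-'),
        ('F', 'T', 'F', '+'),
    ]

    def m_estimate(rows, attr, value, m=1.0, p=0.5):
        """Smoothed P(attr = value | class): (n_c + m*p) / (n + m)."""
        n_c = sum(1 for r in rows if r[attr] == value)
        return (n_c + m * p) / (len(rows) + m)

    query = ('F', 'T', 'T')
    scores = {}
    for label in ('+', '-'):
        rows = [r for r in examples if r[-1] == label]
        score = len(rows) / len(examples)    # class prior (left unsmoothed here)
        for attr, value in enumerate(query):
            score *= m_estimate(rows, attr, value)
        scores[label] = score

    print(scores, '->', max(scores, key=scores.get))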
3) [2 pts] Does SSL provide effective user authentication? If so, how? If not, why not, and what needs to be added?
4) [2 pts] What is the difference between a hash function and a cryptographic hash? What's an example of the latter?
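To make the distinction concrete, a small Python illustration: the built-in hash() is an ordinary (non-cryptographic) hash function, while SHA-256 from hashlib is a cryptographic one (the particular algorithm is just an illustrative choice):

    import hashlib

    s = b'CSE 454 take-home final'

    # Ordinary hash function: fast, designed for hash-table bucketing;
    # collisions are easy to construct deliberately, so it is fine for
    # dictionaries but useless for security.
    print(hash(s))

    # Cryptographic hash: designed so that finding collisions or preimages
    # is computationally infeasible.
    print(hashlib.sha256(s).hexdigest())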
5) [4 pts] What is the vector space angle between each pair of the following sentences? Show all intermediate work.
A) Alaska has amazing flies.
B) Sam flies Alaska airlines.
C) Tent flies eat Tsetse flies.
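A minimal sketch of the angle computation, assuming lowercased whitespace tokenization, raw term counts, and no stop-word removal (state your own preprocessing choices if they differ):

    from collections import Counter
    from math import acos, degrees, sqrt

    sentences = {
        'A': 'Alaska has amazing flies',
        'B': 'Sam flies Alaska airlines',
        'C': 'Tent flies eat Tsetse flies',
    }

    def vec(text):
        """Raw term-count vector for a sentence."""
        return Counter(text.lower().split())

    def angle(u, v):
        """Angle in degrees between two sparse term-count vectors."""
        dot = sum(u[t] * v[t] for t in u)   # Counter returns 0 for missing terms
        norm = lambda w: sqrt(sum(c * c for c in w.values()))
        return degrees(acos(dot / (norm(u) * norm(v))))

    for a, b in (('A', 'B'), ('A', 'C'), ('B', 'C')):
        print(a, b, round(angle(vec(sentences[a]), vec(sentences[b])), 1))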
6) One of the problems with Google is that it is not always easy to find specific information quickly, since queries return only snippets and URLs. Suppose that you were interested in using the Web as a data source to create a table of people's names, emails, and phone numbers that you could query directly. Please answer the following questions as clearly as possible (2 pts each). Think carefully before you start writing, because there is a limit of 60 words per question (no exceptions).
A. How might you extract specific information, such as phone numbers and email addresses, from the web pages? (A simplified regex sketch follows part E.)
B. How might you connect a name to other information, such as a phone number?
C. In some cases information might be spread across multiple linked pages; e.g., a person's name might be on one page, and that person's contact info might be on a linked page. Can you think of any general (or heuristic) way to deal with this?
D. How might you handle missing, duplicate, ambiguous, or dirty information? What if one page listed Dan Weld's email as one thing and another page listed Daniel S. Weld's phone number as another?
E. If you plan to solve any of these problems using supervised learning, where would you get the training data? Are there any techniques that you could use to reduce the effort a user would have to spend manually training such a system?
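For part A, a sketch of the pattern-matching approach using deliberately simplified regular expressions and a made-up page snippet; real phone and email formats are far messier than these patterns admit:

    import re

    # Illustrative patterns only -- production extractors need far more
    # robust rules (international numbers, obfuscated addresses, etc.).
    EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
    PHONE = re.compile(r'(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}')

    page = 'Contact Jane Doe at jane.doe@example.com or (555) 123-4567.'
    print(EMAIL.findall(page))   # ['jane.doe@example.com']
    print(PHONE.findall(page))   # ['(555) 123-4567']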
7) Consider the following term-document (T-D) matrix, which defines six documents in terms of four keywords.
         | D1 | D2 | D3 | D4 | D5 | D6
Bush     |  5 | 15 |  7 |  9 |  7 |  0
Kalahari |  5 |  7 |  1 |  0 |  1 |  0
Iraq     |  1 |  0 |  7 |  4 |  6 |  0
Saddam   |  0 |  1 |  6 |  4 |  0 |  4
We decide to reduce the noise and dimensionality of this data through SVD analysis. The SVD of this T-D matrix, according to MATLAB, is T = U S Vt, where U, S, and Vt are:
U =
  0.8817   0.1969  -0.0444  -0.4264
  0.2887   0.4928   0.1190   0.8122
  0.3033  -0.6652  -0.5674   0.3790
  0.2173  -0.5253   0.8136   0.1222

S =
 23.33   0      0      0     0   0
  0      9.76   0      0     0   0
  0      0      5.03   0     0   0
  0      0      0      3.27  0   0

Vt =
  0.2638   0.6627   0.4237   0.4293   0.3549   0.0373
  0.2850   0.6018  -0.6079  -0.3061  -0.2171  -0.2151
 -0.0385   0.1948   0.1425   0.1162  -0.7138   0.6460
  0.7038  -0.1795   0.3700  -0.5590   0.0308   0.1491
  0.5557  -0.3294  -0.1526   0.6077  -0.3198  -0.2965
 -0.2090   0.1411   0.5201  -0.1635  -0.4629  -0.6519
Ignore the fact that S isn't square unless you want to think about degenerate eigenvectors; focus on the diagonal in any case. Suppose we decide to keep just the top two most important dimensions after the LSI analysis. The loss is:

1 - (σ1² + σ2²)/(σ1² + σ2² + σ3² + σ4²) = 1 - (23.33² + 9.76²)/(23.33² + 9.76² + 5.03² + 3.27²) = 0.05331, or about 5.3%.
A) [2 pts] Draw a bounding box around the parts of the U, S, and Vt matrices above that will be retained after this decision. [You can answer this question by directly marking the matrices above.]
B) [2 pts] Suppose the two most important dimensions after LSI are called F1 and F2, respectively. Plot the six documents as points in the F1-F2 vector space (use the plot on the next page). (It is okay to place each point roughly where it belongs; no need to spoil your eyesight counting all the small grid lines.)
C) [2 pts] Comment (40-word maximum) on the way the documents appear in the plot. Is their placement related in any rational way to the similarity you would intuitively attach to them?
D) [2 pts] What is the vector space (cosine) similarity between D5 and D6 in the original T-D matrix?
E) [2 pts] What is the vector space similarity between D5 and D6 after the LSI transformation to the two most significant dimensions?
F) [1 pt] Is the change intuitively justified, or is LSI inadvertently collapsing distinct concepts, or is this a case where the 5% noise is outweighing the signal?
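For checking parts A, D, and E numerically, a short NumPy sketch (a verification aid, not part of the graded answer); note that NumPy's sign conventions for singular vectors may differ from MATLAB's, but the loss and the cosines are unaffected:

    import numpy as np

    # T-D matrix from question 7 (rows: Bush, Kalahari, Iraq, Saddam).
    T = np.array([[5, 15, 7, 9, 7, 0],
                  [5,  7, 1, 0, 1, 0],
                  [1,  0, 7, 4, 6, 0],
                  [0,  1, 6, 4, 0, 4]], dtype=float)

    U, s, Vt = np.linalg.svd(T, full_matrices=False)

    # Fraction of squared singular-value mass lost by keeping k = 2 dimensions.
    k = 2
    print('loss =', 1 - (s[:k] ** 2).sum() / (s ** 2).sum())   # ~0.053

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # D) similarity of D5 and D6 in the original matrix (columns 4 and 5).
    print('original:', cos(T[:, 4], T[:, 5]))

    # E) similarity after projecting documents into the top-2 LSI space:
    # document i maps to s[:k] * Vt[:k, i].
    docs2 = s[:k, None] * Vt[:k, :]
    print('rank-2:  ', cos(docs2[:, 4], docs2[:, 5]))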