CSE 546: Machine Learning (Winter 2010)

Assignment 3: Bayesian Learning and Neural Networks

Due Date: Thu, Feb 18, 2010 (the 7th week). Turn in a hard copy of the report in class and submit your code online.

  1. Bayesian learning

    • 1.1. [Based on a real experience] UltraClear 2009W is a best-selling LCD monitor model from the company Kell. It uses top-quality IPS LCD panels -- the same technology used in the iPad. Due to the bad economy, in an effort to reduce costs, Kell decides to switch most of its UltraClear 2009W monitors over to cheaper but lower-quality PVA panels. As a result, only 10% of UltraClear 2009W monitors now use IPS panels. Consumers are definitely not happy; they want the IPS panels. However, the difference between the two types of panels is too subtle for the eye to tell. Fortunately, there are some other clues. We now know that 80% of the IPS version are made in Mexico and only 10% of the PVA version are made in Mexico. So, if you see an UltraClear 2009W in a store that was made in Mexico, what is the probability that it uses an IPS panel?

      Another clue is the product serial number. We know that 99% of the IPS version have serial numbers ending with the letter "L" and only 1% of the PVA version have such serial numbers. Suppose the same monitor that you see in the store (made in Mexico) has the serial number "IERY2009123457L"; what is the probability that it uses an IPS panel? Assume that the serial number and the country of origin are independent of each other given the panel type. (A worked Bayes' rule sketch appears at the end of this section.)

    • 1.2. Consider the following Bayesian network:

      • Is A independent of B?
      • Is D independent of E?
      • Is D independent of A given C?
      • Is A independent of B given E?
      • Is D independent of E given C?

      Justify your answers. (A d-separation checker sketch appears at the end of this section.)

    • 1.3. Consider a Bayesian network with four Boolean nodes A, B, C, and D, where A is the parent of B, and B is the parent of C and D. Suppose you have a training set composed of the following examples in the form (A,B,C,D), with "?" indicating a missing value: (0,1,1,1), (1,1,0,0), (1,0,0,0), (1,0,1,1), (1,?,0,1). Show the first iteration of the EM algorithm (initial parameters, E-step, M-step), assuming the parameters are initialized ignoring missing values. (An EM sketch appears below.)
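
      For Problem 1.1, the short Python sketch below applies Bayes' rule to the numbers stated in the problem. It is only a sanity check for the derivation you are asked to show by hand, and the variable names are our own.

        # Bayes' rule with the quantities from Problem 1.1 (sanity check only).
        p_ips, p_pva = 0.10, 0.90            # priors P(IPS), P(PVA)
        p_mex_ips, p_mex_pva = 0.80, 0.10    # P(made in Mexico | panel type)
        p_l_ips, p_l_pva = 0.99, 0.01        # P(serial ends in "L" | panel type)

        # Condition on the country of origin alone.
        post_mex = (p_mex_ips * p_ips) / (p_mex_ips * p_ips + p_mex_pva * p_pva)

        # The serial number and the country of origin are conditionally
        # independent given the panel type, so the likelihoods multiply.
        num = p_mex_ips * p_l_ips * p_ips
        post_both = num / (num + p_mex_pva * p_l_pva * p_pva)

        print(f"P(IPS | Mexico) = {post_mex:.4f}")
        print(f"P(IPS | Mexico, serial ending in L) = {post_both:.4f}")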
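
      For Problem 1.2, you can sanity-check your answers with the standard moral-graph test for d-separation: X and Y are independent given Z exactly when X and Y are disconnected after restricting the graph to the ancestors of X, Y, and Z, moralizing, and deleting Z. The sketch below implements that test; the edge list is a hypothetical placeholder, not the network from the figure, so substitute the actual structure before trusting its output.

        # Moral-graph d-separation test (sketch). EDGES is a hypothetical
        # placeholder network -- replace it with the structure from the figure.
        from itertools import combinations

        EDGES = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]  # parent -> child

        def ancestors(nodes, edges):
            """`nodes` plus every node with a directed path into `nodes`."""
            parents = {}
            for u, v in edges:
                parents.setdefault(v, set()).add(u)
            seen, stack = set(nodes), list(nodes)
            while stack:
                for p in parents.get(stack.pop(), ()):
                    if p not in seen:
                        seen.add(p)
                        stack.append(p)
            return seen

        def d_separated(x, y, given, edges):
            """True iff x and y are d-separated by the set `given`."""
            keep = ancestors({x, y} | set(given), edges)
            und = {n: set() for n in keep}          # undirected (moralized) graph
            parents = {}
            for u, v in edges:
                if u in keep and v in keep:
                    und[u].add(v); und[v].add(u)
                    parents.setdefault(v, set()).add(u)
            for ps in parents.values():             # marry co-parents
                for a, b in combinations(ps, 2):
                    und[a].add(b); und[b].add(a)
            seen, stack = {x}, [x]                  # search, avoiding `given`
            while stack:
                for n in und[stack.pop()] - set(given):
                    if n == y:
                        return False
                    if n not in seen:
                        seen.add(n); stack.append(n)
            return True

        print(d_separated("A", "B", set(), EDGES))   # is A independent of B?
        print(d_separated("D", "E", {"C"}, EDGES))   # is D independent of E given C?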
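
      For Problem 1.3, the sketch below mirrors the single EM iteration you are asked to work out by hand for the network A -> B -> {C, D}: CPTs initialized ignoring the record with the missing B, an E-step that fills in the expected value of that B, and an M-step over the resulting fractional counts. Again, it is a check on your derivation, not a substitute for showing it.

        # One EM iteration on the five examples from Problem 1.3 (sketch).
        data = [(0,1,1,1), (1,1,0,0), (1,0,0,0), (1,0,1,1), (1,None,0,1)]

        def cpt(pairs):
            """P(child=1 | parent=pv) from fully observed (parent, child) pairs."""
            out = {}
            for pv in (0, 1):
                vals = [c for p, c in pairs if p == pv]
                out[pv] = sum(vals) / len(vals)
            return out

        # Initialization: ignore the record with the missing B.
        obs = [t for t in data if t[1] is not None]
        pA = sum(t[0] for t in data) / len(data)        # A is always observed
        pB = cpt([(a, b) for a, b, c, d in obs])        # P(B=1 | A)
        pC = cpt([(b, c) for a, b, c, d in obs])        # P(C=1 | B)
        pD = cpt([(b, d) for a, b, c, d in obs])        # P(D=1 | B)
        print("initial:", pA, pB, pC, pD)

        # E-step: posterior of the missing B given its Markov blanket (A, C, D).
        def expect_b(a, c, d):
            def lik(b):
                return ((pB[a] if b else 1 - pB[a]) *
                        (pC[b] if c else 1 - pC[b]) *
                        (pD[b] if d else 1 - pD[b]))
            return lik(1) / (lik(0) + lik(1))

        eb = [b if b is not None else expect_b(a, c, d) for a, b, c, d in data]

        # M-step: re-estimate each CPT using the fractional count for B.
        def wcpt(parent_w, child):
            """P(child=1 | parent=pv), where parent_w[i] = P(parent_i = 1)."""
            out = {}
            for pv in (0, 1):
                w = [pw if pv == 1 else 1 - pw for pw in parent_w]
                out[pv] = sum(wi * ci for wi, ci in zip(w, child)) / sum(w)
            return out

        A = [t[0] for t in data]
        C = [t[2] for t in data]
        D = [t[3] for t in data]
        print("P(A=1)   =", sum(A) / len(A))
        print("P(B=1|A) =", wcpt(A, eb))
        print("P(C=1|B) =", wcpt(eb, C))
        print("P(D=1|B) =", wcpt(eb, D))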

  2. Neural networks

    • 2.1. Mitchell problem 4.8.

    • 2.2. Mitchell problem 4.10.

  3. Spam filtering.

    • The dataset we will be using is a subset of the 2005 TREC Public Spam Corpus. You can download it here. It contains a training set and a test set. Both files use the same format: each line represents the space-delimited properties of an email, with the first being the email ID, the second being whether it is spam or ham (non-spam), and the rest being words and their occurrence counts in the email. In preprocessing, I removed non-word characters and conducted feature selection similar to what Mehran Sahami did in his original paper using Naive Bayes to classify spam, though with larger cut-offs since our corpus is larger. (A parsing sketch appears at the end of this section.)

    • Implement both the multinomial naive Bayes algorithm and the perceptron algorithm to classify spam. Use each algorithm to learn from the training set and report its accuracy on the test set. You may program in C, C++, Java or Matlab. If you'd like to use another language, ask the TA first. (Illustrative sketches of both learners appear at the end of this section.)

    • Try various smoothing parameters for the Naive Bayes learner and different learning rates for the perceptron. Which parameters work best? How many iterations does it take for perceptron learning to converge? Which algorithm performs better? Which one learns faster?

    • (Extra credit 10%) Our feature selection makes learning much easier, but it also throws out useful information. For example, the exclamation mark (!) often occurs in spam. Even the format of the email sender matters: when an email address appears in the address book, a typical email client will replace it with the contact name, which means that the email is unlikely to be spam (unless, of course, you are a friend of the spammer :-). Sahami's paper describes a few such features he used in his classifier. For extra credit, you can play with the original files and come up with useful features that improve your classifier. Here are the lists of the files used in training/testing.
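
      Both learners need to read the data format described above. Here is a minimal parsing sketch in Python; the file name "train" is a placeholder, and the exact spam/ham label strings should be checked against the actual files.

        # Parse the space-delimited format: id, label, then word/count pairs.
        def load(path):
            examples = []
            with open(path) as f:
                for line in f:
                    tokens = line.split()
                    email_id, label = tokens[0], tokens[1]   # label: spam or ham
                    counts = {tokens[i]: int(tokens[i + 1])  # word -> occurrences
                              for i in range(2, len(tokens), 2)}
                    examples.append((email_id, label, counts))
            return examples

        train = load("train")   # placeholder path -- use the downloaded file name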
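
      The following multinomial naive Bayes sketch (with add-alpha smoothing; alpha = 1 is Laplace smoothing) operates on the parsed examples from above. It is written in Python purely for illustration; remember that submissions must be in C, C++, Java or Matlab unless the TA approves another language.

        # Multinomial naive Bayes with add-alpha smoothing (sketch).
        import math
        from collections import defaultdict

        def train_nb(examples, alpha=1.0):
            n_docs = defaultdict(int)                       # class -> #emails
            word = defaultdict(lambda: defaultdict(int))    # class -> word -> count
            total = defaultdict(int)                        # class -> total words
            vocab = set()
            for _id, label, counts in examples:
                n_docs[label] += 1
                for w, n in counts.items():
                    word[label][w] += n
                    total[label] += n
                    vocab.add(w)
            return n_docs, word, total, vocab, alpha

        def classify_nb(model, counts):
            n_docs, word, total, vocab, alpha = model
            all_docs = sum(n_docs.values())
            def score(c):
                s = math.log(n_docs[c] / all_docs)          # log prior
                for w, n in counts.items():                 # log likelihoods
                    p = (word[c][w] + alpha) / (total[c] + alpha * len(vocab))
                    s += n * math.log(p)
                return s
            return max(n_docs, key=score)

        # Example: train, then measure accuracy on the test set.
        # model = train_nb(load("train"))
        # test = load("test")
        # acc = sum(classify_nb(model, c) == lab for _i, lab, c in test) / len(test)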
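
      A matching perceptron sketch follows; the learning rate eta is the parameter the question above asks you to vary. The -1/+1 label mapping, the bias term, and the label string "spam" are our assumptions, not requirements.

        # Perceptron on raw word counts (sketch).
        from collections import defaultdict

        def train_perceptron(examples, eta=0.1, max_epochs=50):
            w = defaultdict(float)                  # weights; bias under "__bias__"
            for epoch in range(max_epochs):
                mistakes = 0
                for _id, label, counts in examples:
                    y = 1 if label == "spam" else -1   # check the label encoding
                    score = w["__bias__"] + sum(w[t] * n for t, n in counts.items())
                    if y * score <= 0:                 # misclassified: update
                        mistakes += 1
                        w["__bias__"] += eta * y
                        for t, n in counts.items():
                            w[t] += eta * y * n
                if mistakes == 0:                      # converged on the training set
                    return w, epoch + 1
            return w, max_epochs

        def classify_perceptron(w, counts):
            s = w["__bias__"] + sum(w[t] * n for t, n in counts.items())
            return "spam" if s > 0 else "ham"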

  4. Turn in the following:

    • Your code for Problem 3. Submit your files here. The due date is 12 noon, Feb 18, 2010. Your code should contain appropriate comments to facilitate understanding.

    • A report with your answers to the homework problems. For Problem 3, as usual, you should describe at a high level how your code works and report the accuracy of your results; if your accuracy is low, analyze the errors and explain what you think the cause might be and what changes could be made for improvement.

Good luck, and have fun!