CSE P 544 Homework 1
To understand and be able to write simple SQL queries.
6.1 - 6.4
Number of points:
Monday, April 12th, 2004 5:45pm
Tools for the assignment:
For this assignment, you can use your SQL Server account on IISQLSRV.cs.washington.edu. Login using SQL Server authentication, and your UW CSE login name. Your password is your UW student id. Once you are logged into SQL Server, use Query -> Change Database to view the databases on the server. The database to be used for this assignment is called SPAM - please do not alter it in any way, you may only execute queries. Click here for additional instructions, including how to change your SQL Server password (which you should do promptly).
Prepare a file named queries.txt containing your sql queries along with the output they produced when you ran them. Also include in this file any explanation as to why you may have done the query in a peculiar manner. Make sure your name appears at the top of the file. Once you are ready click here to turn in the file. The turn-in server will be closed 15 minutes before the beginning of lecture, Monday April 12th. Alternatively, you can hand in a printout at the beginning of class if you prefer.
The database you will be working with holds information about many thousands of actual spam emails. The database has nine tables. The following are the tables you should use:
statCountries(domainName, countryName, msgCount)
unicodeBody(id, body, uniqueid, domain, langid)
unicodeHeaderV2(id, fromName, fromAddr, ..., subject, domainName)
- dictionaries is a list of words that have been used within the
bodies of the spam messages. Each word has a unique id.
- domains is a list of Internet domains. Domain names are used on
the Internet to specify what sector or region a particular page or email address is from or represents. For example '.edu' represents an educational institution and '.it' represents the country of Italy. This table is a mapping between domains and the name of the country in which the domain is primarily used.
- languages is a list of languages that have been used in our spams.
Each language has a unique id.
- statCountries contains a count of the number of spams that have
been sent from each domain.
- unicodeBody contains the body of every spam in this database.
Each tuple contains an id that is a foreign key to the body's header, the body itself, a uniqueid for the body, the domain of the email address from which the spam was sent, and the id of the language that the spam was written in.
- unicodeHeader contains quite a bit of information, most of which you will not use. Some of the attributes you will use are: a unique id for each individual spam message, the name of the person that sent the message, the email address of the person that sent the message (the spammer), the subject of the spam and the domain of the email address from which the spam was sent. The way to determine which body goes with which header, again, is to use the id attribute.
As you do these problems, please remember to save your SQL queries somewhere as you definitely be handing the queries themselves in. If we do not specify what to call a particular column, you may choose whatever label you which. If you have questions or concerns about what we are asking you to do, please email us or come to our office hours.
- [6 points] List how many spams were sent from the Cayman Islands.
- [6 points] List how many spams were sent from countries whose names begin with 'D'.
- [6 points] List the top 10 domains (and the corresponding countries) from which spams were sent. List the domains/countries in decreasing order of number of spams sent.
- [6 points] List the top 10 spammers in this database. List them in decreasing order of the number of spams they sent. The table returned should have two columns: one named FromAddr and the other named NumSpams.
- [11 points] List the set of all countries from which spam was sent with both the words 'free' and 'game' in the subject.
- [11 points] List the number of spams sent in each (non-null) language. The table you return should have two columns: one named Language and the other named NumSpams.
- [11 points] List the number of spams sent in each (non-null) language by the spammer with the from address of 'Mail'.
- [16 points] List the set of all domains from which the database has no spams. You may not use the statCountries table to answer this question.
- [16 points] List the set of spammers (their from address only) that sent spams in both English and French.
- [11 points] Write your own unique query. Make it at least as interesting/difficult as problems 5, 6, and 7.
Sample Query Output