What datasets can we use for the course project?
You can use any dataset of your choice as long as it enables an interesting and non-trivial project as described here. You can consider the list of publicly available datasets below (originally based on Stanford CS224W).
- You can download all Reddit comments from 2009 through 2018 through Google BigQuery or pushshift.io
- Covers a diverse spectrum of topics, including politics (e.g. /r/politics and hundreds more), mental health (e.g. /r/SuicideWatch), and altruism (e.g. /r/randomactsofpizza).
- Various datasets and networks can be constructed from this data.
- Very rich metadata (comment text, upvote/downvote scores, time); great dataset for projects that combine network analysis with natural language processing.
- The entire dataset is massive (~1 TB), but you can download all 2014 comments from the /r/politics subreddit here (0.5 GB uncompressed).
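As a rough illustration of how a network might be constructed from the comment data, the sketch below builds a weighted user-to-user reply graph. It assumes one JSON comment per line with `id`, `author`, and `parent_id` fields, where a `parent_id` starting with `t1_` refers to another comment. These field names are an assumption about the dump format and should be checked against the files you actually download. The function name `build_reply_graph` and the sample records are made up for illustration.

```python
import json
from collections import Counter


def build_reply_graph(lines):
    """Return a Counter mapping (replier, parent_author) -> number of replies.

    Assumes one JSON comment per line with 'id', 'author', and 'parent_id'
    fields (an assumption about the dump format, not a documented schema).
    """
    comments = [json.loads(line) for line in lines if line.strip()]
    author_of = {c["id"]: c["author"] for c in comments}

    edges = Counter()
    for c in comments:
        parent = c.get("parent_id", "")
        # Keep only comment-to-comment replies; top-level comments reply to a post.
        if not parent.startswith("t1_"):
            continue
        parent_author = author_of.get(parent[3:])
        # Skip replies whose parent is missing from the sample, and self-replies.
        if parent_author is None or parent_author == c["author"]:
            continue
        edges[(c["author"], parent_author)] += 1
    return edges


# Hypothetical sample records, not real Reddit data.
sample = [
    '{"id": "c1", "author": "alice", "parent_id": "t3_post"}',
    '{"id": "c2", "author": "bob", "parent_id": "t1_c1"}',
    '{"id": "c3", "author": "alice", "parent_id": "t1_c2"}',
    '{"id": "c4", "author": "bob", "parent_id": "t1_c1"}',
]
reply_graph = build_reply_graph(sample)
```

Each key is a directed edge from the replying user to the user being replied to, and the count can serve as an edge weight. For the full dump you would stream the file rather than hold every comment in memory.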
Sage BioNetworks
- Open mobile health datasets
- Examples: mPower mobile Parkinson Disease study (tapping, voice, walking, ...), and Asthma Mobile Health study (demographics, zip code, surveys, medical history, ...)
- Application for data takes ~30 min.
1 Million Jupyter Notebooks
- 1M notebooks from 200k repositories
- For instance, this data can be used to study the data science process
- Description and Download here
Gab Social Network
- Gab is a social media website, known for its mainly far-right user base
- You can download the dataset here.
- You can find a list of public Twitter datasets here.
- Example datasets can be found here.
Kiva Microlending
DonorsChoose Education Crowdfunding
- Example datasets can be found here.
Kaggle Competition Datasets
- Example datasets can be found here.
Food Webs
- Food web data selected from the Ecosystem Network Analysis site and from ATLSS - Network Analysis of Trophic Dynamics in South Florida Ecosystems. (Metadata)
Wolfe Primates Interaction
- The dataset represents 3 months of interactions among a troop of monkeys.
- Vertex attributes: (1) ID number of the animal; (2) age in years; (3) sex; (4) rank in the troop.
Trade Networks
- Import and export data of goods between countries by the Food and Agriculture Organization of the UN.
Stack Exchange
Microfinance
- Financial datasets by Stanford Economics professor Matthew Jackson
Interpersonal expertise overlap within a company
- Interpersonal expertise dataset
- Within a company, employees were asked to respond to this question: “For each person in the list below, please show how strongly you agree or disagree with the following statement: In general, this person has expertise in areas that are important in the kind of work I do.”
- Data types: Origin node, destination node, weight of connection (1-5)
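A minimal sketch of loading rows in the (origin node, destination node, weight) format into a directed, weighted adjacency map. It assumes comma-separated rows with no header, which may not match the actual file. The function names and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_weighted_edges(text, delimiter=","):
    """Parse 'origin,destination,weight' rows into {origin: {destination: weight}}."""
    graph = defaultdict(dict)
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue
        origin, destination = row[0].strip(), row[1].strip()
        weight = int(row[2])
        # The source describes weights on a 1-5 scale; reject anything else.
        if not 1 <= weight <= 5:
            raise ValueError(f"weight out of range 1-5: {weight}")
        graph[origin][destination] = weight
    return dict(graph)


def mean_incoming_rating(graph, node):
    """Average weight of edges pointing at `node`, or None if there are none."""
    ratings = [nbrs[node] for nbrs in graph.values() if node in nbrs]
    return sum(ratings) / len(ratings) if ratings else None


# Hypothetical sample rows, not taken from the dataset.
sample = "1,2,4\n1,3,2\n2,3,5\n3,1,3\n"
expertise = load_weighted_edges(sample)
```

Because each rating is one employee's view of another, the graph is directed: `expertise["1"]["2"]` need not equal `expertise["2"]["1"]`. The mean incoming rating is one simple way to summarize how colleagues rate a person's expertise.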
Moviegalaxies
- Social networks of 200 movies where each network represents how characters interact in one movie
Bitcoin
- Dataset of bitcoin transactions.
- More information on bitcoin related topics below
Neural Network of a Caenorhabditis elegans worm
- Dataset
- Format of Data: Origin node (Neuron), destination node (Neuron), weight of link
Airports in the United States
- Dataset
- Description: Flights between US airports in 2002 (undirected), weighted by the number of seats available on flights between the two airports over the course of the year.
- Type of Data: Airport 1, Airport 2, number of seats across the entire year that were available
- Additional flight data can be found here.
Author Citation Networks
- DBLP and ACM Citations Network dataset
- Microsoft Academic Graph
- Description: A set of roughly 630,000 papers, and their respective authors
- Type of Data (would require some text processing to extract): Name of paper, index of paper, authors
.uk Domain Network
Python Dependencies on PyPI
- Dataset
- Description: Which libraries depend on which other libraries in the PyPI package index
- Format: name of dependency, version extracted, json string of other dependencies
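The row format above could be turned into a dependency graph along the following lines. This is a sketch under a stated assumption: the JSON string in the third field is taken to be a JSON list of package names, which may not match the actual encoding. The function names and sample rows are hypothetical.

```python
import json
from collections import defaultdict


def build_dependency_graph(rows):
    """Map each package name to the sorted list of packages it depends on.

    `rows` is an iterable of (name, version, deps_json) tuples, where deps_json
    is ASSUMED to be a JSON list of package names.
    """
    depends_on = {}
    for name, _version, deps_json in rows:
        deps = json.loads(deps_json) if deps_json else []
        depends_on[name] = sorted(set(deps))
    return depends_on


def reverse_dependencies(depends_on):
    """Invert the graph: map each package to the packages that depend on it."""
    used_by = defaultdict(set)
    for package, deps in depends_on.items():
        for dep in deps:
            used_by[dep].add(package)
    return {dep: sorted(pkgs) for dep, pkgs in used_by.items()}


# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    ("pkg_a", "1.0", '["pkg_b", "pkg_c"]'),
    ("pkg_b", "0.3", '["pkg_c"]'),
    ("pkg_c", "2.1", "[]"),
]
deps = build_dependency_graph(sample_rows)
used_by = reverse_dependencies(deps)
```

The reversed map gives a package's in-degree in the dependency network, a natural starting point for finding the libraries that much of the ecosystem relies on.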
Stanford Large Network Dataset Collection
Coauthorship and Citation Networks
- DBLP: Collaboration network of computer scientists
- KDD Cup Dataset
Internet Topology
- AS Graphs: AS-level connectivities inferred from Oregon route-views, Looking glass data and Routing registry data
Stack Overflow
Yelp Data
- Yelp Review Data: reviews of the 250 closest businesses for 30 universities for students and academics to explore and research
YouTube dataset
- YouTube data: YouTube videos as nodes. An edge a->b means video b is in the related video list (first 20 only) of video a.
Amazon product copurchasing networks and metadata
- Amazon Data: The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes).
Wikipedia
- Wikipedia page to page link data: A list of all page-to-page links in Wikipedia
- DBpedia: The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia.
- Edits and talks: Complete edit history (all revisions, all pages) of Wikipedia since its inception till January 2008.
Movie Ratings
- IMDB database: Movie ratings from IMDB
- User rating data: Movie ratings from MovieLens
Who trusts whom data at Trustlet
- Trust network datasets: Includes trust/distrust edges and Epinions product reviews/review ratings
Mark Newman's pointers
- Network data: More than 20 network datasets
Reality Commons data
- Mobile data: Several mobile data sets that contain the dynamics of several communities of about 100 people each.
Google Local Dataset
- The dataset contains ratings and reviews of local businesses obtained from Google, courtesy Julian McAuley.
Bitcoin
- Bitcoin is a digital currency invented in 2008 that operates on a peer-to-peer system for transaction validation. This decentralized currency attempts to mimic physical currencies in that there is a limited supply of Bitcoins in the world, each Bitcoin must be "mined", and each transaction can be verified for authenticity. Bitcoins are used to exchange everyday goods and services, but the currency also has known ties to black markets, illicit drugs, and illegal gambling transactions. Activity in the dataset is also heavily geared toward anonymization, though true anonymization is rarely achieved.
- The Bitcoin dataset captures transaction-level information. For each transaction, there can be multiple senders and multiple receivers, as detailed here. The dataset is challenging because multiple addresses are usually associated with a single entity or person. However, some initial work has been done to associate keys with a single user by looking at transactions that are linked to each other (for example, if a transaction has multiple public keys as inputs, then a single user owns all of the corresponding private keys). The dataset provides these known associations by grouping the addresses under a single UserId (which maps to the set of all associated addresses).
- Key Challenge Questions:
- Can we detect bulk Bitcoin thefts by hackers? Can we track where the money went after thefts?
- Can we detect illicit transactions based on Bitcoin transaction behavior? What sort of graph patterns emerge?
- Can we detect attempts at money laundering (called a "mixing service" in Bitcoin)?
- Can we detect money laundering attempts and the people who use them? Note: Current Bitcoin mixing services tend to mix Bitcoins amongst all the people who bother to use a mixing service, so does the mixing service actually obfuscate anything?
- Can we trace back the originator of these laundering attempts?
- Can we detect currency manipulation (hackers trying to destabilize Bitcoin currency exchanges to deflate prices)?
- Is Bitcoin gaining traction or losing traction among the regular population for use as a regular digital currency?
- It is Bitcoin best practice to generate and use a new address with every transaction. Is this practice followed? If not, then what can we learn from this?
- Can we identify and extract organizational behavior amidst the Bitcoin transactions?
- Can we determine which Bitcoin addresses belong to a single entity? While the initial pass over the data has yielded some resolution of entities, can we further improve this mapping?
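The grouping idea mentioned in the dataset description (addresses that appear together as inputs to one transaction are attributed to one user) can be sketched with a union-find structure. This is only an illustration of that heuristic, not necessarily how the dataset's UserId mapping was produced. Representing each transaction as a list of its input addresses is an assumption, and `cluster_addresses` and the sample data are hypothetical.

```python
def cluster_addresses(transactions):
    """Group addresses that co-occur as inputs of the same transaction.

    `transactions` is an iterable of lists of input addresses (an assumed
    representation). Returns a sorted list of sorted address clusters.
    """
    parent = {}

    def find(a):
        # Register unseen addresses as their own cluster root.
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving keeps trees shallow
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue
        first = inputs[0]
        find(first)  # make sure single-input transactions are recorded too
        for other in inputs[1:]:
            union(first, other)

    clusters = {}
    for address in list(parent):
        clusters.setdefault(find(address), set()).add(address)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical input-address lists, one per transaction.
sample_transactions = [["A", "B"], ["B", "C"], ["D"], ["E", "F"]]
entities = cluster_addresses(sample_transactions)
```

Addresses A and C never appear in the same transaction here, yet they end up in one cluster because both co-occur with B. That transitivity is what makes the heuristic effective, and it is also why one incorrect link can merge two unrelated entities.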
MOOC Forums Dataset
- All data from Stanford's courses on Coursera and NovoEd is available. For Coursera format details see this page. For an explanation of data available from Stanford courses offered on our OpenEdX platform, see Datastage. To request any of the data, fill in this form. For more details, please contact Jure.
- A number of (relatively) new OpenEdX datasets are now available on datastage.stanford.edu. These include both data that the OpenEdX platform collects and tables that result from computations over that base data. In addition, processes are now in place to keep the data current on a daily to weekly basis (Coursera and NovoEd data is integrated at the end of each course).
- In summary, the additions are:
- ActivityGrade: Assignment grades. Includes right/wrong for each problem part, the learners' solution choice for each answer, and the first and last solution submission times.
- Cumulative assignment performance per learner
- 'Raw' final grades, updated at the end of courses.
- Demographic information: country, gender, year_of_birth, and level_of_education. This information is not fully populated because providing it is optional.
- A much slimmed view of the OpenEdX tracking log events. The view only includes fields that are currently in use by the platform.
- An anonymized record of the forum from each class.
- The country of origin of each class participant (by IP address).