{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Spark Word Count Colab.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "kPt5q27L5557" }, "source": [ "# CSE 481DS - Spark Colab\n", "## Wordcount in Spark\n", "\n", "Adapted from Stanford CS246" ] }, { "cell_type": "markdown", "metadata": { "id": "p0-YhEpP_Ds-" }, "source": [ "### Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "Zsj5WYpR9QId" }, "source": [ "Let's set up Spark in your Colab environment. Run the cell below!" ] }, { "cell_type": "code", "metadata": { "id": "k-qHai2252mI" }, "source": [ "!pip install pyspark\n", "!pip install -U -q PyDrive\n", "!apt install openjdk-8-jdk-headless -qq\n", "import os\n", "# Point Spark at the Java 8 installation\n", "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "-CJ71AKe91eh" }, "source": [ "Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.\n", "\n", "**Make sure to follow the interactive instructions.**" ] }, { "cell_type": "code", "metadata": { "id": "5K93ABEy9Zlo" }, "source": [ "from pydrive.auth import GoogleAuth\n", "from pydrive.drive import GoogleDrive\n", "from google.colab import auth\n", "from oauth2client.client import GoogleCredentials\n", "\n", "# Authenticate and create the PyDrive client\n", "auth.authenticate_user()\n", "gauth = GoogleAuth()\n", "gauth.credentials = GoogleCredentials.get_application_default()\n", "drive = GoogleDrive(gauth)" ], "execution_count": 2, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0orRvrc1-545" }, "source": [ "# Named file_id to avoid shadowing Python's built-in id()\n", "file_id = '158kRwe6PPT4kub20hR_FlV8qpZsdW_rm'  # this file is in the course Google Drive with public visibility\n", "downloaded = drive.CreateFile({'id': file_id})\n", "downloaded.GetContentFile('pg100.txt')" ], 
"execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "qwtlO4_m_LbQ" }, "source": [ "If you executed the cells above, you should be able to see the file *pg100.txt* under the \"Files\" tab on the left panel." ] }, { "cell_type": "markdown", "metadata": { "id": "CRaF2A_j_nC7" }, "source": [ "### Your task" ] }, { "cell_type": "markdown", "metadata": { "id": "ebLNUxP0_8x3" }, "source": [ "If you ran the setup stage successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.\n", "\n", "Write a Spark application that outputs the number of words that start with each letter. This means that for every letter we want to count the total number of (non-unique) words that start with that letter. In your implementation **ignore the letter case**, i.e., treat all words as lower case. Also, you can ignore all words **starting** with a non-alphabetic character.\n", "\n", "For this task we ask you to use the [**RDD MapReduce API**](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) from Spark (map, reduceByKey, flatMap, etc.) instead of the **DataFrame API**." ] }, { "cell_type": "code", "metadata": { "id": "xu-e7Ph2_ruG" }, "source": [ "from pyspark.sql import *\n", "from pyspark.sql.functions import *\n", "from pyspark import SparkContext\n", "import pandas as pd\n", "\n", "# create the Spark Session\n", "spark = SparkSession.builder.getOrCreate()\n", "\n", "# create the Spark Context\n", "sc = spark.sparkContext" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "AuAxGFPFB43Y" }, "source": [ "# YOUR CODE HERE" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "SIrXJyVNP2AI" }, "source": [ "Once you have obtained the desired results, **head over to Canvas and submit your solution for this Colab**!\n", "\n", "Please paste your code and your results (the word counts for each of the 26 letters) into a human-readable file (PDF or plain text) and submit it to Canvas." ] }, { "cell_type": "code", "metadata": { "id": "w1pDavlAEDE0" }, "source": [ "" ], "execution_count": null, "outputs": [] } ] }