{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Spark Word Count Colab.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kPt5q27L5557"
      },
      "source": [
        "# CSE 481DS - Spark Colab\n",
        "## Wordcount in Spark\n",
        "\n",
        "Adapted From Stanford CS246"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p0-YhEpP_Ds-"
      },
      "source": [
        "### Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Zsj5WYpR9QId"
      },
      "source": [
        "Let's setup Spark on your Colab environment.  Run the cell below!"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "k-qHai2252mI"
      },
      "source": [
        "!pip install pyspark\n",
        "!pip install -U -q PyDrive\n",
        "!apt install openjdk-8-jdk-headless -qq\n",
        "import os\n",
        "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\""
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-CJ71AKe91eh"
      },
      "source": [
        "Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.\n",
        "\n",
        "**Make sure to follow the interactive instructions.**"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5K93ABEy9Zlo"
      },
      "source": [
        "from pydrive.auth import GoogleAuth\n",
        "from pydrive.drive import GoogleDrive\n",
        "from google.colab import auth\n",
        "from oauth2client.client import GoogleCredentials\n",
        "\n",
        "# Authenticate and create the PyDrive client\n",
        "auth.authenticate_user()\n",
        "gauth = GoogleAuth()\n",
        "gauth.credentials = GoogleCredentials.get_application_default()\n",
        "drive = GoogleDrive(gauth)"
      ],
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0orRvrc1-545"
      },
      "source": [
        "id='158kRwe6PPT4kub20hR_FlV8qpZsdW_rm' # this file is in the course google drive with public visibility \n",
        "downloaded = drive.CreateFile({'id': id})\n",
        "downloaded.GetContentFile('pg100.txt')"
      ],
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qwtlO4_m_LbQ"
      },
      "source": [
        "If you executed the cells above, you should be able to see the file *pg100.txt* under the \"Files\" tab on the left panel."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CRaF2A_j_nC7"
      },
      "source": [
        "### Your task"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ebLNUxP0_8x3"
      },
      "source": [
        "If you run successfully the setup stage, you are ready to work on the *pg100.txt* file which contains a copy of the complete works of Shakespeare.\n",
        "\n",
        "Write a Spark application which outputs the number of words that start with each letter. This means that for every letter we want to count the total number of (non-unique) words that start with a specific letter. In your implementation **ignore the letter case**, i.e., consider all words as lower case. Also, you can ignore all the words **starting** with a non-alphabetic character.\n",
        "\n",
        "For this task we ask you to the [**RDD MapReduce API**](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) from spark (map, reduceByKey, flatMap, etc.) instead of **DataFrame API**. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xu-e7Ph2_ruG"
      },
      "source": [
        "from pyspark.sql import *\n",
        "from pyspark.sql.functions import *\n",
        "from pyspark import SparkContext\n",
        "import pandas as pd\n",
        "\n",
        "# create the Spark Session\n",
        "spark = SparkSession.builder.getOrCreate()\n",
        "\n",
        "# create the Spark Context\n",
        "sc = spark.sparkContext"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AuAxGFPFB43Y"
      },
      "source": [
        "# YOUR CODE HERE"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SIrXJyVNP2AI"
      },
      "source": [
        "Once you obtained the desired results, **head over to Canvas and submit your solution for this Colab**!\n",
        "\n",
        "Please paste your code and your results (word counts for each of the 26 letters) in some human-readable format (PDF or text file) and submit this to canvas."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "w1pDavlAEDE0"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}