{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 6: Data Frames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objectives\n", "\n", "Last week we started to see how to process these CSV files. We used the list of dictionaries representation to help us answer these questions. While the list of dictionaries format was helpful, it was still a bit tedious to write all that code. This week, we will introduce pandas, a popular library that supports data scientists. A **library** is code someone else wrote and shared with you to help solve problems. By the end of this lesson, students will be able to:\n", "\n", "1. Import values and functions from another module using import and from statements.\n", "2. Select individual columns from a `pandas` `DataFrame` and apply element-wise computations.\n", "3. Filter a `pandas` `DataFrame` or `Series` with a mask.\n", "\n", "The last two learning objectives are particularly ambitious. It will probably take more time, practice, and opportunities to engage with it before you feel fully comfortable. But we're just at the start of learning about data frames." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports\n", "\n", "When we write code in a Python file like `main.py`, we're actually defining a Python **module**. You can treat the words \"module\" and \"file\" interchangeably.\n", "\n", "* A module can be **executed** as a standalone program from the terminal, such as python `main.py`.\n", "* A module can be **imported** so that another module can access its values and function definitions.\n", "\n", "> Remember the main-method pattern? The **main-method pattern** ensures that certain code is only run when executed as a standalone program as opposed to when it's imported from another module. More details at the end of this slide.\n", "\n", "We use importing to use values or functions defined inside one module so they can be used in another module. You have already been using this on your homework! We defined a module `cse163_utils` and in order to use the function `parse` defined in that module, we imported it.\n", "\n", "There are 2 primary ways to import in Python that we will explore in this slide. For the following examples, assume we have defined the module `module_a` as the file `module_a.py`.\n", "\n", "```python\n", "# Contents of: module_a.py\n", "def fun1() -> None:\n", " print(\"Calling a's fun1\")\n", " print(\"Ending a's fun1\")\n", "\n", "\n", "def fun2() -> None:\n", " print(\"Calling a's fun2\")\n", " fun1()\n", " print(\"Ending a's fun2\")\n", "```\n", "\n", "Our goal is to call `fun2` inside another module, `module_b`. To do this, we need to import the module to use its functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `import module_a`\n", "\n", "The simplest syntax simply uses the `import` statement to **import** a module. 
The following snippet shows the contents of `module_b`, a short program that uses `fun2` defined in `module_a`.\n", "\n", "```python\n", "# Contents of: module_b.py\n", "import module_a\n", "\n", "\n", "def fun1() -> None:\n", "    print(\"Calling b's fun1\")\n", "    print(\"Ending b's fun1\")\n", "\n", "\n", "def fun2() -> None:\n", "    print(\"Calling b's fun2\")\n", "    print(\"Ending b's fun2\")\n", "\n", "\n", "def main():\n", "    fun2()\n", "    module_a.fun2()\n", "\n", "\n", "if __name__ == '__main__':\n", "    main()\n", "```\n", "\n", "When you import a module, you are importing the whole thing, including all of the code and values defined within it. To keep things organized, Python puts all of the values defined in `module_a` in a different **namespace**. All of the code in `module_a` resides in its own namespace, while all of the code in `module_b` resides in another. Notice how we call `module_a.fun2()` in order to access `fun2()` in `module_a` rather than the `fun2()` that we defined right above the `main` method.\n", "\n", "**Food for thought:** What is the output of running `python module_b.py`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `import module_a as m`\n", "\n", "A minor alternative to the above is to import a module and also define a shorthand name for it. In this case, we assigned `module_a` the shorthand name `m`, so we can just say `m.fun2()`. The output and behavior of the program are exactly the same.\n", "\n", "```python\n", "# Contents of: module_b.py\n", "import module_a as m\n", "\n", "\n", "def fun1() -> None:\n", "    print(\"Calling b's fun1\")\n", "    print(\"Ending b's fun1\")\n", "\n", "\n", "def fun2() -> None:\n", "    print(\"Calling b's fun2\")\n", "    print(\"Ending b's fun2\")\n", "\n", "\n", "def main():\n", "    fun2()\n", "    m.fun2()  # Notice m instead of module_a\n", "\n", "\n", "if __name__ == '__main__':\n", "    main()\n", "```\n", "\n", "**Food for thought:** When might it be preferable to use this syntax instead of importing the module under its given name?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `from module_a import fun2`\n", "\n", "Sometimes, we only want to use a few functions from another module. Python provides another syntax that lets you import specific functions: `from module_a import fun2` only imports the function `fun2` from `module_a`. When this syntax is used, Python adds `fun2` directly to your current namespace. This means you don't need to call it with `module_a.fun2()`; you can just say `fun2()`.\n", "\n", "```python\n", "# Contents of: module_b.py\n", "from module_a import fun2\n", "\n", "\n", "def main():\n", "    fun2()  # Calling module_a's fun2\n", "\n", "\n", "if __name__ == '__main__':\n", "    main()\n", "```\n", "\n", "What would happen if we defined a `fun2` in `module_b.py`? How would Python know which `fun2` to call if there was one that came from `module_a` and one from `module_b` when both share the same namespace?\n", "\n", "Python treats these import and function definitions exactly like assignment statements, so the value of a name always reflects its **most recent assignment**." ] }
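, { "cell_type": "markdown", "metadata": {}, "source": [ "To make that concrete, here is a sketch of a hypothetical module (we made up the name `module_c.py`; it is not part of the course files) that imports `fun2` and then defines its own `fun2`. Since the `def` statement runs after the `import`, the name `fun2` is reassigned to the local definition.\n", "\n", "```python\n", "# Contents of: module_c.py (a hypothetical module for illustration)\n", "from module_a import fun2\n", "\n", "\n", "def fun2() -> None:\n", "    # This def runs after the import above, so the name fun2\n", "    # now refers to this function instead of module_a's fun2.\n", "    print(\"Calling c's fun2\")\n", "\n", "\n", "def main():\n", "    fun2()  # Prints \"Calling c's fun2\"\n", "\n", "\n", "if __name__ == '__main__':\n", "    main()\n", "```" ] }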
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why the main-method pattern?\n", "\n", "We use the main-method pattern so that others can import from our code without having to worry about running all of our analysis.\n", "\n", "Imagine you're writing a program to do some sort of analysis, and it takes about 2 hours to run from start to end. You realize that some helper functions in this program would be super useful for another project you're working on, so you decide to import those functions (using one of the ways shown above) to use them from another module. But if you didn't use the main-method pattern, part of the import process would be actually running your 2-hour analysis! That means you would have to wait 2 hours every time you run your new project, just to import a function!!! 😱\n", "\n", "By using the main-method pattern, we make our modules runnable when we want them to run (by using the `python` command or pressing the Run button) but not when we just want to import a function from them!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas Context\n", "\n", "`pandas`, not 🐼s! `pandas` is a library that we will use to simplify much of our CSV processing. When it comes to data science in Python, `pandas` is the most popular way of interacting with your data. (The name derives from \"**pan**el **da**ta\", which is a type of longitudinal data where you might have multiple observations for the same individuals.)\n", "\n", "A **library** is some code that someone else wrote that you can use in your code so you don't have to write everything from scratch. `pandas` is one such library that we will learn this quarter, but we will learn many more before the quarter is done!\n", "\n", "`pandas` uses several advanced Python features. There will be a lot of new concepts, but also a lot of new syntax. It may feel very different from all (or almost all) of your programming experience here at UW!\n", "\n", "The `pandas` library could be described as a **declarative-style library**. Rather than writing `if` statements, `while` loops, and `for` loops to process all data element-by-element, `pandas` wraps all of those operations inside its own functions and syntax. Like the count-unique problem that we could solve as `len(set(...))`, learning `pandas` also means learning a new way of composing programs in this manner.\n", "\n", "*Things will probably be overwhelming at first. We're introducing new ideas and new syntax at the same time. You will get better with practice; that's why our in-class guided practice is so important!*\n", "\n", "Like our approach to learning Python, we aren't going to show you *everything* you can possibly do with the library. Instead, we'll show you foundational patterns and examples for adapting to different problems. And for problems that don't fit the taught examples, we hope that this will still bootstrap your understanding so that you know what and how to search for relevant information online.\n", "\n", "So after reading this, **we recommend making your own reference sheet of the ideas and syntax you learned and actually using that reference sheet when trying to solve practice problems.** Reorganizing and reconfiguring the ideas by writing them down can really help. If you find your reference sheet was missing something important that made it difficult to solve a particular problem, go back and add that to the reference sheet. It's your external brain to help you make sense of how everything fits together!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing we'll have to do is **import** the `pandas` library.
The convention is to abbreviate it to `pd`:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will load the data from the CSV file `tas.csv` that has the example data we were working with before. We will save it in a variable called `df` (short for data frame, a common `pandas` term). We do this with a function provided by `pandas` called `read_csv`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('tas.csv')\n", "df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this shows the CSV in a tabular format! What is `df`? It's a `pandas` object called a **`DataFrame`** which stores a table of values, much like an Excel table.\n", "\n", "Notice that the top row shows the names of the columns (`Name` and `Salary`) and the left-most side shows an index for each row (`0`, `1`, and `2`).\n", "\n", "`DataFrame`s are powerful because they provide lots of ways to access and perform computations on your data without you having to write much code!\n", "\n", "# Accessing a Column\n", "For example, you can get all of the TAs' names with the following call." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df['Name']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`df['Name']` returns another `pandas` object called a **`Series`** that represents a single column or row of a `DataFrame`. A `Series` is very similar to a `list` from Python, but has many extra features that we will explore later.\n", "\n", "Students sometimes get a little confused because this looks like `df` is a `dict` and it is trying to access a key named `Name`. This is not the case! One of the reasons Python is so powerful is that it lets people who program libraries \"hook into\" the syntax of the language to give the `[]` syntax their own custom meaning! `df` in this cell is really a special object defined by `pandas` called a `DataFrame`.\n", "\n", "\n", "## Problem 1\n", "In the cell below, write the code to access the `Salary` column of the data!" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Your answer here!" ] }
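, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "If you're ever unsure which of these types you're working with, one quick check is Python's built-in `type` function. This cell is purely illustrative; you don't need it to use `pandas`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "print(type(df))          # <class 'pandas.core.frame.DataFrame'>\n", "print(type(df['Name']))  # <class 'pandas.core.series.Series'>" ] }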
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` isn't useful just because it lets you access this data conveniently; it also lets you perform computations on the data.\n", "\n", "A `Series` object has many methods you can call to perform computations. Here is a list of some of the most useful ones:\n", "* `mean`: Calculates the average value of the `Series`\n", "* `min`: Calculates the minimum value of the `Series`\n", "* `max`: Calculates the maximum value of the `Series`\n", "* `idxmin`: Calculates the index of the minimum value of the `Series`\n", "* `idxmax`: Calculates the index of the maximum value of the `Series`\n", "* `count`: Calculates the number of values in the `Series`\n", "* `unique`: Returns a new `Series` with all the unique values from the `Series`\n", "* And many more!\n", "\n", "For example, if I wanted to compute the average `Salary` of the TAs, I would write:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "average_salary = df['Salary'].mean()\n", "average_salary" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Reminder: Types matter\n", "When first learning `pandas`, it's easy to mix up `DataFrame` and `Series`.\n", "* A `DataFrame` is a 2-dimensional structure (it has rows and columns like a grid)\n", "* A `Series` is 1-dimensional (it only has \"one direction\" like a single row or a single column).\n", "\n", "When you access a single column (or as we will see later, a single row) of a `DataFrame`, it returns a `Series`.\n", "\n", "## Problem 2\n", "For this problem, you should compute the \"range\" of TA salaries (`the maximum value - the minimum value`).\n", "\n", "*Hint: You might need to make two separate calls to `pandas` to compute this since you need both the min and the max.*" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Your answer here!" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Element-wise Operations\n", "For the rest of this slide, let's consider a slightly more complex dataset that has a few more columns. This dataset tracks the emissions for cities around the world (but only has a few rows)." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2 = pd.read_csv('emissions.csv')\n", "df2" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted to access the emissions column, we could write:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2['emissions']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the `dtype` of this `Series` is `int64`, meaning that every element in the `Series` is an integer.\n", "\n", "If we wanted to access the population column, we could write:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2['population']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "One useful feature of `pandas` is that it lets you combine values from different `Series`. For example, if we wanted to, we could add the values of the emissions column and the population column." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2['emissions'] + df2['population']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this returns a new `Series` that represents the sum of those two columns. The first value in the `Series` is the sum of the two columns' first values, the second is the sum of their second values, etc. It does not modify any of the columns of the dataset (you will need to do an assignment to change a value)." ] }
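, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For example, to actually store this sum in the `DataFrame`, we could assign it to a new column. A minimal sketch, assuming we call the new column `total` (a name made up for this example); we drop the column again at the end so later cells see the original data." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2['total'] = df2['emissions'] + df2['population']  # assignment adds a new column\n", "print(df2['total'])  # df2 now contains this column\n", "df2 = df2.drop(columns='total')  # clean up the made-up column" ] }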
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Since we are performing an element-wise operation, we still need to respect concatenation rules. Taking a step back from `pandas`, recall that `\"cat\" + 3` will raise a `TypeError` and we must instead write `\"cat\" + str(3)`. The analog of casting in `pandas` is to use the `astype` method to change the `dtype` of a `Series`. For example," ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2['city'] + df2['population'].astype('str')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here, our use of `astype` is telling `pandas` to perform an element-wise operation that casts each item in `df2['population']` into a `str`. Notice that if you remove the call to `astype` in the previous code snippet, `pandas` will be unable to concatenate the two `Series` of different types." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 3\n", "In the cell below, find the maximum \"emissions per capita\" (emissions divided by population). Start by computing this value for each city and then find the maximum value of that `Series` (using one of the `Series` methods shown above).\n", "\n", "*Note: You can save a `Series` in a variable! It's just like any other Python value!*" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Your answer here!\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "These element-wise computations also work if one of the values is a single value rather than a `Series`. For example, the following cell adds 4 to each of the populations. Notice this doesn't modify the original `DataFrame`; it just returns a new `Series` with the old values plus 4." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2['population'] + 4" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note here that the output of the `Series` actually tells you a bit about the values to help you out! The `dtype` property tells you the type of the data. In this case it uses a specialized integer type called `int64`, but for all intents and purposes that's really just like an `int`. As a minor detail, it also stores the name of the column the `Series` came from for reference.\n", "\n", "Another useful case for something like this is to compare the values of a column to a value. For example, the following cell computes which cities have an emissions value of 200 or more. Notice that the `dtype` here is `bool` since each value is `True`/`False`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2['emissions'] >= 200" ] }
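, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, one handy trick this enables: because `True` counts as 1 in arithmetic, calling `sum` on a `bool` `Series` counts how many rows satisfy the condition. A minimal sketch:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "(df2['emissions'] >= 200).sum()  # how many cities have emissions of 200 or more" ] }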
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df3 = df2[df2['emissions'] >= 200]\n", "df3['city']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "That's pretty cool how we can get this result without having to write any loops!\n", "\n", "Notice the return value has type `DataFrame`, so we can than use the syntax we learned at the beginning to grab a single column from that `DataFrame` (thus returning a `Series`). \n", "\n", "\n", "The way this works is the indexing-notation for `DataFrames` has special cases for which type of value you pass it.\n", "* If you pass it a `str` (e.g., `df2['emissions']`), it returns that column as a `Series`.\n", "* If you pass it a `Series` with `dtype=bool` (e.g., `df2[df2['emissions'] >= 200]`), it will return a `DataFrame` of all the rows that `Series` had a `True` value for!\n", "\n", "There is no magic with this, they just wrote an if-statement in their code to do different things based on the type provided!\n", "\n", "We commonly call a `Series` with `dtype=bool` used for this context a **mask**. It usually makes your program more readable to save those masks in a variable. The following cell shows the exact same example, but adding a variable for readability for the mask." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "high_emissions = df2['emissions'] >= 200\n", "df3 = df2[high_emissions]\n", "df3['city']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering on Multiple Conditions\n", "You can combine masks using logical operators to make complex queries. There are three logical operators for masks (like `and`, `or`, and `not` but with different symbols).\n", "* `&` does an element-wise `and` to combine two masks\n", "* `|` does an element-wise `or` to combine two masks\n", "* `~` does an element-wise `not` of a single mask\n", "\n", "For example, if you want to find all cities that have high emissions or are in the US, you would probably try writing the following (but you'll run into a bug).\n", "\n", "```python\n", "df2[df2['emissions'] >= 200 | df2['country'] == 'USA']\n", "```\n", "\n", "The problem comes from **precedence** (order of operations). Just like how `*` gets evaluated before `+`, `|` gets evaluated first because it has the highest precedence (so does `&`). This makes Python interpret the first sub-expression as (`200 | df['country']`), which causes an error since this operator is not defined for these types.\n", "\n", "Whenever you run into ambiguities from precedence, on way you can always fix it is to the sub-expressions in parentheses like in the following cell." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2[(df2['emissions'] >= 200) | (df2['country'] == 'USA')]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A much more readable solution involves saving each mask in a variable so you don't have to worry about this precedence. This has an added benefit of giving each condition a human-readable name if you use good variable names!" 
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "high_emissions = df2['emissions'] >= 200\n", "is_usa = df2['country'] == 'USA'\n", "df2[high_emissions | is_usa]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 4\n", "In the cell below, write code to select all rows from the dataset that are in France and have a population greater than 50." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Your answer here!\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Location\n", "We've shown you how to select specific columns or select specific rows based on a mask. In some sense, it's a little confusiong that `df[val]` can be used to grab columns or rows depending on what is passed. This is because this syntax we have shown below, is really just special cases of a more generic syntax that lets you specific some location in the `DataFrame`. `pandas` provides this shorthand for convencience in some cases, but this more general syntax below works in many more!\n", "\n", "In its most general form, the `loc` property lets you specify a **row indexer** and a **column indexer** to specify which rows/columns you want. The syntax looks like the following (where things in `<...>` are placeholders)\n", "\n", "```\n", "df.loc[, ]\n", "```\n", "\n", "The row indexer refers to the index of the `DataFrame`. Recall, when we display a `DataFrame`, it shows values to the left of each row to identify each row in the `DataFrame`.\n", "\n", "It turns out the the column indexer is optional, so you can leave that out. For example, if I want to get the first row (row with index 0), I could write:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2.loc[0]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, this actually returns a `Series`! It looks different than the `Series` returned from something like `df['name']` since now it has an index that are the column names themselves! This means I could index into a specifc column by doing something like:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "s = df2.loc[0]\n", "s['city']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now this was a bit tedious to have to use double `[]` to acess the column as well, which is exactly why `loc` lets you specify a column as a \"column indexer\". Instead, it's more common to write:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2.loc[0, 'city']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You might be wondering: I've used the word \"indexer\" a few times but haven't defined what that means! By indexer, I mean some value to indicate which rows/columns you want. So far, I have shown how to specify a single value as an indexer, but there are actually many options to chose from! 
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You might be wondering: I've used the word \"indexer\" a few times but haven't defined what that means! By indexer, I mean some value that indicates which rows/columns you want. So far, I have shown how to specify a single value as an indexer, but there are actually many options to choose from! You can always mix-and-match these and use different ones for the rows/columns.\n", "\n", "### List of indices and slices\n", "For example, you can use a list of values as an indexer to select many rows or many columns:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2.loc[[1, 2, 3], ['city', 'country', 'emissions']]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that it now returns a `DataFrame` instead of a single value.\n", "\n", "You can also use slice syntax like you could for `list`/`str` to access a range of values. There are a couple of oddities about this:\n", "* The start/stop points are **both inclusive**, which is different from `list`/`str`, where the stop point is exclusive.\n", "* They do some fancy \"magic\" that lets you use ranges with strings to get a range of column names.\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2.loc[1:3, 'city':'emissions']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The way to read this `loc` access is \"all the rows starting at index 1 and going to index 3 (both inclusive), and all the columns starting at city and going to emissions (both inclusive)\".\n", "\n", "How does it define the \"range of strings\"? It uses the order of the columns in the `DataFrame`.\n", "\n", "### Mask\n", "\n", "You can also use a `bool` `Series` as an indexer to grab all the rows or columns that are marked `True`. This is similar to the masking we saw before, but now the mask is one possible indexer." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "high_emissions = df2['emissions'] >= 200\n", "is_usa = df2['country'] == 'USA'\n", "df2.loc[high_emissions | is_usa]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that in the last cell, I left out the column indexer and it gave me all the columns (that is the default for the column indexer).\n", "\n", "### `:` for everything\n", "\n", "Instead of relying on defaults, you can explicitly ask for \"all of the columns\" using the special range `:`. This is a common syntax for many numerical processing libraries, so `pandas` adopts it too. It looks like the following:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2.loc[[0, 4, 2], :]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can do this for the rows as well!" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df2.loc[:, 'city']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A tip to help you read these in your head is to read `:` by itself as \"all\".\n", "\n", "## Recap Indexers\n", "So we saw that the `.loc` property is a universal way of asking for your data. You specify a row indexer and a column indexer to select your data. We saw the following things used as indexers:\n", "* A single value (a row index for rows, a column name for columns)\n", "* A list of values or a slice (row indices for rows, column names for columns)\n", "* A mask\n", "* `:` to select all values\n", "\n", "\n", "## Return Values\n", "One thing that is also complex about `.loc` is that the type of the value it returns depends on the types of the indexers. Recall that a `pandas` `DataFrame` is a 2-dimensional structure (rows and columns) while a `Series` is a single row or a single column.\n", "\n", "To tell what the return type of a `.loc` call is, you need to look for the \"single value\" type of indexer.\n", "* If both the row and column indexers are single values, it returns a single value. This will be whatever value is at that location, so its type will be the same as the `dtype` of the column it comes from.\n", "* If only one of the row and column indexers is a single value (meaning the other is multiple values), it returns a `Series`.\n", "* If neither of the row and column indexers is a single value (meaning both are multiple values), it returns a `DataFrame`." ] }
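, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch that exercises all three cases on our emissions data (this cell is just illustrative):" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "print(type(df2.loc[0, 'city']))               # single row, single column -> plain value (str)\n", "print(type(df2.loc[0, ['city', 'country']]))  # single row, multiple columns -> Series\n", "print(type(df2.loc[[0, 1], ['city']]))        # multiple rows, multiple columns -> DataFrame" ] }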
], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }