CSE 373, Winter 2019: Git Usage Guide

Preface

This page provides a general overview of basic Git concepts, independent of how you actually run the commands. For instructions on running the commands described on this page, see the resource page for the IDE you use, which describes the graphical user interface (GUI) for Git in that IDE.

What is Git?

Git is a version control system (VCS): a tool for tracking changes between versions of code and sharing those changes with other people. A VCS allows you to not only see changes from old code, but also revert back to older versions or create alternate versions, which you can later merge back together.

Git is one of the most widely used VCSs. This is due in part to its flexibility, but its surrounding ecosystem also plays a large role in its popularity: many of you who aren't familiar with VCSs have probably still heard of GitHub, which has become a leading platform for open-source projects.

So, what is GitHub? GitHub is essentially a website that provides servers for Git repositories (repos). In the context of VCSs, a repository is a copy of your code and its history—it's a data structure representing your project. When working with multiple users in a Git project, each user has their own copy of the code repository on their machine. To share code, all users sync their local repositories with a single remote repository on some server—these repositories are what GitHub provides.

Now that we've gotten that introduction out of the way, we won't actually be using GitHub in this class. Instead, we'll be using GitLab, a Git-repository hosting service similar to GitHub, but open source so that it can run on our CSE servers.

Some notes before we start

We mentioned before that Git is very flexible; this has a couple of important implications:

  1. We won't be able to cover everything you can do with Git in this guide. Git has many features, and they can be combined in many different workflows; for simplicity, we'll be describing only the bare minimum set of features necessary to get started with Git. You may use other Git functionality, but we will not be providing guides for it. You can check out the official Git documentation for more information or search for other resources. (GitHub, GitLab, and Atlassian tend to have some pretty nice stuff, since they all have repo hosting services.)

  2. You should be careful when using Git. Git is a powerful tool, and some of the more powerful functionality may destroy your local repository if used incorrectly. None of the functionality we go over should allow you to do this, but you should definitely understand what you're doing before running random Git commands you find on the internet.

Fortunately, GitLab should prevent you from destroying the remote repo, so in the worst case, you can just download the code from the remote repo again. You can also come in to office hours for help with Git. (The discussion board tends not to work too well for debugging issues with Git though.)

Obtaining the repository

In this class, the remote repositories on GitLab will be created for you beforehand with the starting code, so you just need to clone that remote repo onto your machine.

The resulting directory on your machine will contain the current version of the repo's files, along with an extra .git directory that contains all the extra data Git stores, such as the history of changes. This .git directory is the actual local repository. We refer to the directory containing the regular files as the working directory. (However, typically, whenever we refer to the directory or path of a Git repository on your machine, we mean the working directory and not the .git directory inside; this is a minor inconsistency in terminology, but do ask if it's not clear which we're referring to.)

This means that there are really two versions of code on your machine: the working directory will contain the copy of the code that you're actively working on as real files, whereas the local repository will store the entire history of the project as a series of changes between sequential versions.
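To make the distinction between the working directory and the .git directory concrete, here is a minimal Python sketch that locates a working directory by looking for a .git directory inside it. The function names are hypothetical helpers made up for this guide, and the sketch assumes an ordinary clone in which .git is a plain directory.

```python
from pathlib import Path

def is_working_directory(path):
    """Return True if `path` directly contains a .git directory."""
    return (Path(path) / ".git").is_dir()

def find_working_directory(start):
    """Walk upward from `start` until a directory containing .git is found.

    Returns the working directory as a Path, or None if there is none.
    """
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if is_working_directory(candidate):
            return candidate
    return None
```

Calling find_working_directory on any folder inside your cloned project should return the top-level project folder, which is what we usually mean by "the repository" on your machine.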

Basic workflow

Regular Git usage involves syncing changes between the working directory and the local repository, and between the local repository and the remote repository. Syncing changes in Git always works in a single direction at a time: new changes get copied from one place to another. Here's a diagram with the names for basic commands that move changes in each direction:

As you can see, there are three main commands:

  • Committing adds changes from the working directory to the local repo.
  • Pushing copies changes from the local repo to the remote repo.
  • Pulling is a little more complicated: it first copies changes from the remote repo to the local repo, then applies those changes to the working directory.
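The three commands above can be sketched as a toy Python model in which each place is just an ordered list of change descriptions. This is only an illustration of the direction in which changes move; the Place class and the function names are hypothetical, and the sketch assumes the destination never has changes that the source lacks.

```python
class Place:
    """One of the three places where changes can live (toy model)."""

    def __init__(self, name):
        self.name = name
        self.changes = []  # ordered list of change descriptions

def copy_new_changes(source, destination):
    """Copy the changes that `destination` is missing from `source`.

    Assumes destination.changes is a prefix of source.changes.
    """
    new_changes = source.changes[len(destination.changes):]
    destination.changes.extend(new_changes)
    return new_changes

def commit(working, local):
    return copy_new_changes(working, local)

def push(local, remote):
    return copy_new_changes(local, remote)

def pull(remote, local, working):
    # Pulling is two steps: remote -> local, then local -> working directory.
    pulled = copy_new_changes(remote, local)
    copy_new_changes(local, working)
    return pulled

working = Place("working directory")
local = Place("local repo")
remote = Place("remote repo")

working.changes.append("implement ArrayStack")  # you edit a file
commit(working, local)
push(local, remote)

remote.changes.append("teammate: fix typo")     # a teammate pushes
pull(remote, local, working)
```

After the last line, all three places hold the same two changes in the same order.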

Committing

Committing is the main way that changes enter Git repos (and the only way in this basic workflow), since pushing and pulling only move existing changes between repos or from the local repo into the working directory. Whenever you commit, Git compares the working directory to the local repo and adds the set of changes to the local repo. This set of changes is referred to as a commit. Since a commit also stores a reference to the previous (parent) commit, kind of like a linked list, we can trace the commits all the way back to the beginning of the commit history. In essence, this means that a commit also specifies a particular version of the code.
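The linked-list idea can be sketched in Python as follows. This follows the simplified model used in this guide (a commit is a set of changes plus a parent reference); it is not a description of how Git actually stores data on disk, and the Commit class and files_at function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Commit:
    # filename -> new contents; None means the file was deleted
    changes: dict
    parent: Optional["Commit"] = None

def files_at(commit):
    """Rebuild the version of the code that `commit` specifies."""
    # Follow parent links back to the first commit, like walking a linked list.
    chain = []
    while commit is not None:
        chain.append(commit)
        commit = commit.parent
    # Replay the changes from oldest to newest.
    files = {}
    for c in reversed(chain):
        for name, contents in c.changes.items():
            if contents is None:
                files.pop(name, None)
            else:
                files[name] = contents
    return files

first = Commit({"Stack.java": "v1"})
second = Commit({"Stack.java": "v2", "Queue.java": "v1"}, parent=first)
third = Commit({"Queue.java": None}, parent=second)
```

Here files_at(third) contains only Stack.java at "v2", while files_at(first) still gives the original "v1": each commit identifies one complete version.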

Now, referring to commits by their full list of changes is pretty awkward, and referring to them by the contents of all files is even worse, so Git lets us assign a commit message when we're committing to briefly describe the changes made. Also, internally, Git generates a commit hash for each commit—a string of hexadecimal characters that acts as an ID for the commit. The commit hash is usually not very useful for humans, but in this course, we include it in our project feedback to indicate which commit we graded.
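As a rough illustration of how an ID can be derived from a commit's contents, the sketch below hashes a commit's message, changes, and parent ID with SHA-1. The commit_id function and the layout of the hashed text are assumptions made up for this example; the exact data Git hashes is different.

```python
import hashlib

def commit_id(message, changes, parent_id=""):
    """Return a hexadecimal ID computed from the commit's contents.

    Including the parent's ID means the ID depends on the whole history.
    """
    payload = "\n".join([parent_id, message, *changes]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()

root = commit_id("Add starter code", ["+ Main.java"])
child = commit_id("Implement ArrayStack", ["~ ArrayStack.java"], parent_id=root)
print(child[:8])  # tools often show only the first few characters
```

Because the ID is computed from the contents, the same commit always gets the same ID, and two commits with different parents get different IDs even if their messages match.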

There's actually a little bit more to committing than this: Git allows us to choose exactly which changes in the working directory to commit, which involves keeping another version of code between the working directory and the local repo; however, we'll be using a GUI that manages that for us, so we'll ignore it.

Pushing

Pushing is relatively simple, since it just involves copying commits from the local repo to the remote repo.

The only thing to note here is that this copying process doesn't try to do anything fancy and will fail if there are changes on the remote repo that aren't in the local repo, i.e., if the local and remote repos are desynced because someone pushed to the remote repo after the last time you pulled changes from it. In this case, Git may not be able to figure out what to do with your changes, so it requires you to first pull those changes and resolve the desync by merging before you push.
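The rule for when a push fails can be sketched in Python by treating each repo as a set of commit IDs. This is a simplified model under that assumption, with hypothetical function names and made-up IDs; it is not how Git performs the check internally.

```python
def can_push(local_commits, remote_commits):
    """Return True if the remote has no commits that the local repo lacks."""
    return not (set(remote_commits) - set(local_commits))

def push(local_commits, remote_commits):
    """Return the remote's commits after a push, or raise if it is rejected."""
    if not can_push(local_commits, remote_commits):
        raise RuntimeError("push rejected: pull and merge first")
    return set(remote_commits) | set(local_commits)

local = {"a1", "b2", "c3"}          # you committed c3 locally
in_sync_remote = {"a1", "b2"}       # nobody else has pushed
desynced_remote = {"a1", "b2", "d4"}  # a teammate pushed d4

print(can_push(local, in_sync_remote))   # True
print(can_push(local, desynced_remote))  # False
```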

Pulling

Pulling is fairly simple for the same reason that pushing is: usually, it just copies new commits from the remote repo to the local repo, then applies the changes from those commits to the working directory.

However, two issues may come up during this process. First, if you have changes in the working directory that would get overwritten by the newly pulled changes, the pull will fail, and you'll need to either commit those changes or undo them before you can continue. Second, your local repo may be desynced from the remote repo, so that each has changes the other doesn't. In that case, the commit histories in the local and remote repos have diverged at some point, so you'll need to merge the two sets of changes before you can continue.
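The possible outcomes of a pull can be summarized with a small Python sketch. It models repos as sets of commit IDs and files as sets of names; the plan_pull function, its return strings, and the order of the checks are assumptions chosen for illustration.

```python
def plan_pull(local_commits, remote_commits, uncommitted_files, incoming_files):
    """Describe what a pull would need to do in this simplified model."""
    only_local = set(local_commits) - set(remote_commits)
    only_remote = set(remote_commits) - set(local_commits)

    if not only_remote:
        return "nothing new to pull"
    # Uncommitted edits would be overwritten by the incoming changes.
    if set(uncommitted_files) & set(incoming_files):
        return "blocked: commit or undo local edits first"
    # Both sides have commits the other lacks: the histories have diverged.
    if only_local:
        return "diverged: merge required"
    return "simple: copy commits and apply changes"
```

For example, if you have uncommitted edits to Stack.java and the incoming commits also change Stack.java, the first issue applies; if you have instead committed your own work since your last pull, the second one does.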

Merging

Merging is the process of combining two versions of code that branched off from some common ancestor commit. Git will try to automatically handle a merge: if the two sets of changes involve different files, it will just apply both sets; additionally, if both sets of changes alter different parts of the same file, Git will attempt to integrate both sets of changes automatically.

However, if both sets of changes touch the same line, you'll need to manually resolve the merge; this is called a merge conflict. In this case, for each altered part of the files with conflicts, Git will include and mark the results of both sets of changes directly in the files in the working directory. (Many GUIs for Git will include functionality to display this nicely; otherwise, you'll need to edit the files manually.) The merge finishes after all conflicts are resolved and the changes are committed.
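The merging rules described above can be sketched in Python for a single file. This sketch makes a strong simplifying assumption: all three versions have the same number of lines, so they can be compared line by line (a real merge first has to work out which lines correspond). The function name and the "ours"/"theirs" labels are illustrative; the marker lines imitate the style Git writes into conflicted files.

```python
def merge_lines(base, ours, theirs):
    """Merge two edited versions of a file against their common ancestor.

    Returns (merged_lines, conflict_line_numbers).
    """
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:      # both sides agree (or neither changed the line)
            merged.append(o)
        elif o == b:    # only their side changed the line
            merged.append(t)
        elif t == b:    # only our side changed the line
            merged.append(o)
        else:           # both changed the same line differently: conflict
            conflicts.append(i)
            merged.extend(["<<<<<<< ours", o, "=======", t, ">>>>>>> theirs"])
    return merged, conflicts

base = ["int size;", "int[] data;", "int top;"]
ours = ["int count;", "int[] data;", "int top;"]
theirs = ["int size;", "int[] data;", "int head;"]

print(merge_lines(base, ours, theirs))
```

In this example the two sides changed different lines, so the merge succeeds automatically and keeps both edits. If both had renamed the first line to different names, that line would be reported as a conflict and both versions would appear between the markers.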

As a side note, the fact that merges are commits means that commits can have multiple parent commits, so the Git history isn't quite a linked list. Also, because there must be "branching" in order for the merge operation to make sense, the history isn't quite a tree either; we call this kind of data structure a graph. Graphs are a very interesting and flexible data structure, and we'll be going over them later in the course.
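Here is a small Python sketch of a commit history as a graph, where each commit maps to a list of its parents. The commit names and function names are made up for illustration. It also shows one way to find the common ancestors that a merge starts from.

```python
# A and its two children B and C form the "branching";
# M is a merge commit with two parents.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "M": ["B", "C"],
}

def reachable(parents, start):
    """Return every commit reachable from `start` via parent links."""
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen

def common_ancestors(parents, x, y):
    """Return the commits that appear in the histories of both x and y."""
    return reachable(parents, x) & reachable(parents, y)

print(common_ancestors(parents, "B", "C"))  # {'A'}
```

Note that A is reachable from M along two different paths (through B and through C), which is exactly what cannot happen in a linked list or a tree.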