CSE 374, Lecture 17: git

We've been talking about tools that you can use in the terminal and for programming, and today we'll continue with version control.

Version Control

Programmers use a group of technologies called "version control" to make them more productive and effective. Version control was developed to address three problems:

Backups. In order to prevent loss of data when your computer crashes or dies, you should back up your work somewhere else.
Collaboration. If you're working on a large project with someone else, how do you collaborate? Do you email files back and forth? Do you save things in a common Google Drive folder? Neither of these is a very scalable solution if you have a large project with many files and many collaborators. How do you deal with conflicting changes from different people? How do you keep track of which version is the "final" version?
Version log. Have you ever made a mistake and tried to undo it with CTRL-Z, but been unable to do so because Word or whatever program you were working with wouldn't go back far enough? Version control helps with this problem by keeping a log of all previous changes and allowing you to retrieve those versions whenever you like.

Version control solves these three problems by managing files and coordinating how they are shared across computers. We'll discuss the theory behind it and then how to use it.

There are many different programs that do version control: git, Subversion, Mercurial, Perforce, and others. Each of these works in a slightly different way, but the concepts are similar and carry over from one to the other. We'll use git in this class.

None of these version control systems is language-specific or file-type-specific. Commonly people store source code in a version control system, but you can store whatever type of files you like! We use a git repository for storing course administrative files related to CSE 374, for example, which consists of source code, html files, PowerPoint presentations, pdfs, Word documents, and text files. Some people use source control for everything they do. (Note however that version control systems were originally built and optimized for text files - while you can store photos and videos in them, this may be less efficient due to how version control stores changes).

Finally, it's totally ok not to memorize all of the commands that we'll talk about! Know the concepts and the basics, and look up the rest as you need it.

Theory

The most general type of version control system is called a "distributed version control system." In this model:

A project lives in a collection called a "repository" (essentially a folder).
Each user has their own copy of the repository (in the following diagram notated with "R").
A user "commits" changes to their copy of the repository to save them.
Other users can "pull" changes from that repository into their own local repository.
The repository's history of commits ends up getting represented as a directed acyclic graph (a "DAG") rather than a single line, because different people's versions of the repository can fork from each other and later be merged back together.

 -------       -------
| Alice |     |  Bob  |
 -------       -------
  | R | <-----> | R |
   ---           ---
    ^             ^
    |             |
    |     ---     |
     --> | R | <--
        -------
       | Carol |
        -------

Distributed version control
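
The distributed model can be tried out on a single machine. The following is a minimal sketch, assuming git is installed; the directories "alice" and "bob", the file names, the example.com addresses, and the branch name "main" are all made up for illustration, and two local directories stand in for two separate computers.

```shell
# Throwaway sandbox; "alice" and "bob" are hypothetical users on one machine.
sandbox=$(mktemp -d)
cd "$sandbox"

# Alice creates a repository and makes the first commit.
git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/main   # call the branch "main" whatever git's default is
git config user.name "Alice"
git config user.email "alice@example.com"
echo "hello" > a.txt
git add a.txt
git commit -q -m "add a.txt"

# Bob gets his own full copy of the repository directly from Alice.
cd "$sandbox"
git clone -q alice bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"

# Both commit independently, so their histories fork.
echo "from bob" > b.txt
git add b.txt
git commit -q -m "add b.txt"

cd "$sandbox/alice"
echo "more from alice" >> a.txt
git commit -q -a -m "extend a.txt"

# Bob pulls Alice's changes; git joins the two histories with a merge commit.
cd "$sandbox/bob"
git pull -q --no-rebase --no-edit ../alice main

# The history now has a fork and a join instead of being a single line.
git log --graph --oneline
```

The merge commit at the top of Bob's log has two parents, one from each person's line of work, which is what makes the history a graph.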

The distributed version control system is powerful, but in large projects with a lot of collaborators, it can be infeasible for every person to pull changes from every other person. As an alternative to distributed version control, many projects use a "centralized version control" system:

Users have a shared repository (called "origin" in git) which lives on some central server.
Each user "clones" the repository to create a "local" copy.
A user "commits" changes to their copy of the repository to save them.
To share changes, a user "pushes" their local changes to the origin.
All users "pull" from the central server periodically to get changes (instead of from each other).
We call the central repository the "remote" repository - to access the remote repository, you'll need authentication to validate your permission to access the shared repository.
Because everyone's changes pass through the central server, the history of commits there usually stays close to linear instead of forking widely.

We'll use a central repository model for CSE 374, using a service called "GitLab" to maintain the central shared repository.

           --------------------
          | Central repository |
          |      (GitLab)      |
           --------------------
         -------> | R | <-------
        |          ---          |
        |           ^           |
        |           |           |
        |           |           |
        v           v           v
       ---         ---         ---
      | R |       | R |       | R |
     -------     -------     -------
    | Alice |   |  Bob  |   | Carol |
     -------     -------     -------

       Centralized version control
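
The central model can also be simulated locally. This is a rough sketch, assuming git is installed: a "bare" repository in a temporary directory stands in for the GitLab server, and the names "central.git", "alice", "bob", the files, and the branch name "main" are hypothetical. With a real server, only the address given to "git clone" would differ.

```shell
# Throwaway sandbox: a bare repository plays the role of the central server.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare central.git
git --git-dir=central.git symbolic-ref HEAD refs/heads/main

# Alice clones the (still empty) central repository, commits, and pushes.
git clone -q central.git alice      # git warns that the repository is empty
cd alice
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice"
git config user.email "alice@example.com"
echo "int main(void) { return 0; }" > main.c
git add main.c
git commit -q -m "add main.c"
git push -q -u origin main          # -u remembers origin/main for later pulls

# Bob clones afterwards, so he starts out with Alice's commit.
cd "$sandbox"
git clone -q central.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo "CSE 374 demo project" > README
git add README
git commit -q -m "add README"
git push -q

# Alice pulls from the central repository, never directly from Bob.
cd "$sandbox/alice"
git pull -q
git log --oneline
```

After the final pull, Alice's log lists both commits even though she and Bob never exchanged anything directly.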

Typical terminology for tasks that you'll accomplish with version control:

Create a ...
- new repository/project - this is fairly rare and only occurs when you're starting a new collaboration.
- new branch - we won't use branches in CSE 374, but they are used for independent development on subprojects while still allowing for collaboration.
- new commit - a single change, which should be significant but not too large. This will happen daily or more.
Push - regularly, whenever you have made commits.
Pull - also regularly, so that you can make the best choices given the work that other people have been doing.

git

In CSE 374, we will be using git as our version control system, with GitLab as our central repository (very similar to GitHub).

There are three main steps to getting a repository set up in git:

Create a repository. In GitLab this can be accomplished by selecting the "+ New Project" button in the GitLab UI.
Set up authentication. In order to use git to collaborate, you need to have a way to prove that you are you and are allowed to access the repository (as opposed to some other member of the class). We use ssh keys to authenticate (same technology as we use to connect to klaatu via SSH), and you'll have to create your own keys to use with GitLab - instructions are linked on the course webpage.

Clone the repository onto your local computer using the "git clone" command.

    $ cd where-you-want-to-put-it
    $ git clone git@gitlab.cs.washington.edu:path/to/repo

A typical workflow for working on code in a git repository is as follows:

    # Get the latest version of the code from the central repository.
    # Pull often to prevent merge conflicts (see below).
    $ git pull

    # Edit the files
    $ emacs main.c

    # Check the status - what files have changed?
    $ git status

    # View the line-by-line differences between the last commit and
    # any uncommitted local changes. Do this before "git add": once a
    # change has been added, plain "git diff" no longer shows it.
    $ git diff

    # Mark the file "main.c" as ready for the next commit.
    $ git add main.c

    # Actually commit the change to git - "save" it. Commit messages
    # should be descriptive of what you changed to help others understand
    # what changed.
    $ git commit -m "increased max line length from 100 to 200"

    # View the history of all commits in the repository
    $ git log

    # Push the new commit to the central repository.
    $ git push
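
How "git status", "git diff", "git add", and "git commit" relate to each other is easiest to see in a throwaway repository. The sketch below assumes git is installed; the directory "demo", the file "notes.txt", and the user details are invented for the example, and no remote repository is involved, so pull and push are left out.

```shell
# Throwaway repository; all names here are hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Student"
git config user.email "student@example.com"

printf 'line one\n' > notes.txt
git add notes.txt
git commit -q -m "add notes.txt"

# Edit the file: the change is "unstaged" until it is added.
printf 'line two\n' >> notes.txt
git status --short          # " M notes.txt": modified, not yet staged
git diff                    # shows the new line

# Stage it: plain "git diff" is now empty, "git diff --staged" shows the change.
git add notes.txt
git status --short          # "M  notes.txt": staged for the next commit
git diff --staged

# Commit it; the working directory is clean again.
git commit -q -m "add a second line to notes.txt"
git status --short          # prints nothing
git log --oneline
```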

Gotchas and more advanced things (use "man git" to learn more):

git needs to be told when files are moved or removed: if you run a normal "rm" or "mv", git will report the old file as deleted and the new one as untracked until you stage those changes yourself. The git-specific "git mv" and "git rm" commands move or remove the file and stage the change in one step.
Sometimes you've been making a bunch of changes, but you decide that you don't like them and want to undo them (you haven't made any commits yet). To reset your working copy to the last commit, you can run "git reset --hard HEAD". Here "HEAD" refers to the most recent commit. Be careful: this discards all uncommitted changes to tracked files, and they cannot be recovered.
If one of your past commits was bad, you can undo it using "git revert". For example, if the second-to-last commit was bad, you can undo it with "git revert HEAD~1", where HEAD is the most recent commit and "~1" means one commit before it. This creates a NEW commit that is the opposite of the original commit; the original commit stays in the history.
Commits aren't completely static and permanent. If you make a commit but then realize you forgot one little thing, you can "amend"/modify your previous commit rather than creating a brand new commit using "git commit --amend". Only do this before you have pushed the commit, since amending replaces the old commit with a new one.
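
These commands are safest to practice in a scratch repository where nothing can be lost. A minimal sketch, assuming git is installed; the repository "demo", the files a.txt through d.txt, and the commit messages are placeholders invented for the example.

```shell
# Scratch repository; every name below is hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Student"
git config user.email "student@example.com"

# Three small commits to experiment on.
echo "one" > a.txt;   git add a.txt; git commit -q -m "add a.txt"
echo "two" > b.txt;   git add b.txt; git commit -q -m "add b.txt"
echo "three" > c.txt; git add c.txt; git commit -q -m "add c.txt"

# git mv: rename a file and stage the rename in one step.
git mv a.txt first.txt
git commit -q -m "rename a.txt to first.txt"

# git reset --hard HEAD: throw away uncommitted edits to tracked files.
echo "scratch work" >> c.txt
git reset -q --hard HEAD
cat c.txt                      # back to just "three"

# git revert HEAD~1: add a new commit that undoes the second-to-last one
# (here "add c.txt"); the older commits stay in the history.
git revert --no-edit HEAD~1
ls                             # c.txt is gone; b.txt and first.txt remain

# git commit --amend: fold a forgotten change into the most recent commit.
echo "four" > d.txt
git add d.txt
git commit -q -m "add d.txt"
echo "forgotten line" >> d.txt
git add d.txt
git commit -q --amend --no-edit

git log --oneline
```

The final log shows six commits: the revert added one, while the amend replaced the "add d.txt" commit instead of adding another.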

Merging

git works easily if there's only one person working on a repository, but whenever more than one person is working on code at once, you run the risk of "merge conflicts". A merge conflict occurs when two people make changes to their own working copies and then try to push those changes to the central repository. The first person's push will succeed, but when the second person does "git pull" prior to pushing their code, they may encounter a merge conflict.

If two commits were made independently (neither one builds on the other) and they touch the same file, git will do its best to merge them on its own. As long as the two commits didn't change the same lines, the merge should succeed automatically. But if the commits did change the same lines of code, you will have to fix the conflict manually.

git will tell you which files had merge conflicts (use git status to see conflicts), and the files will be edited to identify the conflict:

    <<<<<<< HEAD
    for (int i=0; i<10; i++)
    =======
    for (int i=0; i<=10; i++)
    >>>>>>> master

You must edit the section to contain the code you want, remove the three marker lines that git inserted, then save, add, and commit the merge.
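
A conflict like this can be reproduced on purpose, which is a low-stakes way to practice resolving one. The sketch below assumes git is installed and uses a local bare repository in place of the central server; "central.git", "alice", "bob", "loop.c", and the branch name "main" are hypothetical names chosen for the example.

```shell
# Throwaway sandbox: a bare repository stands in for the central server.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare central.git
git --git-dir=central.git symbolic-ref HEAD refs/heads/main

# Alice seeds the central repository with one file.
git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice"
git config user.email "alice@example.com"
echo 'for (int i=0; i<5; i++)' > loop.c
git add loop.c
git commit -q -m "add loop"
git remote add origin "$sandbox/central.git"
git push -q -u origin main

# Bob clones, changes the line, and pushes first.
cd "$sandbox"
git clone -q central.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
echo 'for (int i=0; i<=10; i++)' > loop.c
git commit -q -a -m "loop through 10 inclusive"
git push -q

# Alice changes the same line without having pulled Bob's commit.
cd "$sandbox/alice"
echo 'for (int i=0; i<10; i++)' > loop.c
git commit -q -a -m "loop up to 10"

# Alice pulls before pushing; the pull stops with a conflict (non-zero exit).
git pull --no-rebase || echo "pull reported a conflict"
git status --short        # "UU loop.c": both sides modified the file
cat loop.c                # shows both versions between the conflict markers

# Resolve by writing the line that should survive, then add, commit, and push.
echo 'for (int i=0; i<10; i++)' > loop.c
git add loop.c
git commit -q -m "merge Bob's change, keep exclusive upper bound"
git push -q
```

Overwriting the whole file works here only because it has a single line; in a real file you would edit just the marked section in your editor.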

.gitignore

You can store any files that you'd like in git, but there is a certain class of files that you should NOT store in git, because they are unnecessary and pollute the environment.

Temporary files. Emacs stores automatic backups in files whose names end in "~" (e.g. "foo.txt~"). There is no need to share these with collaborators.
System-specific files. For instance, OS X (Mac) will create a ".DS_Store" file for information that is used in the Finder program. If the git repository is cloned onto a Linux or Windows machine, this file is meaningless.
Temporary compilation files, like .o files or executables. These types of files change often, are computable from the source files, and can vary slightly from computer to computer. Run "make clean" before committing files to remove these compiler artifacts.

Since it can be a pain to have to remember not to add these files to your commit, git allows you to create a file called ".gitignore", which is stored in the root directory of your git repository and contains a list of file names or patterns (using "*" as a wildcard) to ignore:

    # emacs backup files
    *~

    # OS X finder info files
    .DS_Store

    # built object files
    *.o
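
To check that the patterns do what you expect, you can compare "git status" before and after creating the file. A small sketch, assuming git is installed; the "demo" directory and the file names are hypothetical, and the clutter files are created empty just for the demonstration.

```shell
# Throwaway repository; names are made up for the example.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo

# One real source file plus the kinds of clutter described above.
echo "int main(void) { return 0; }" > main.c
touch main.o "main.c~" .DS_Store

# Before: git lists every file as untracked.
git status --short

# The same three patterns as in the example .gitignore.
printf '%s\n' '*~' '.DS_Store' '*.o' > .gitignore

# After: only main.c and .gitignore itself are listed.
git status --short

# Ask git which pattern causes a particular file to be ignored.
git check-ignore -v main.o
```

The ".gitignore" file itself is normally added and committed like any other file, so that collaborators get the same ignore rules.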

Summary

Git is another tool for letting the computer do what it's good at.
- MUCH better than manually emailing files, adding dates to filenames, etc.
- Managing versions, storing the differences between versions.
- Keeping source-code safe (so local crashes won't lose data).
- Coordinating concurrent work, detecting conflicts with collaborators.
We've put a git/GitLab tutorial for CSE 374 on the course website.
Full git docs and a book are online, free and downloadable, but beware of complexity - much of what they describe is beyond the scope of CSE 374. Keep it simple!
Ask for help when you need it - git has a steep learning curve.