CSE 374, Lecture 8: Intro to C

Java vs C

Today we'll be diving into C programming. A lot of this will look familiar from your experience with Java:

Control flow: if, while, for, switch
Curly braces to denote blocks of code (scope): {}
A semicolon at the end of every statement;
Familiar types and operators

However, there are some notable differences that will become important as we go along:

Operations in C are more similar to what the computer is actually doing (compared to Java). This is what people mean when they say C is a "lower-level" language. It also means you'll have to take care of more things yourself.
Procedural programming. Java is based on an object-oriented design, which means that all the code you write resides inside classes (objects) which interact with each other. C does not have any objects, and all computation takes place within plain old functions. This type of programming is called "procedural".
C is unsafe. An incorrect program might do anything. Java programs are run within a virtual environment (the "JVM") and are "sandboxed", meaning that anything that goes wrong within the program computation should not be able to affect anything in the global environment. C is not sandboxed in this way.
The standard library in C is much smaller and more limited compared with the extended Java libraries.

The differences are significant enough that we will have to learn a new view of programming.

References for C

In addition to the Kernighan & Ritchie book which is recommended for this class, there are a few more C resources linked from the course webpage. As we move through the next few weeks, these are great links to use to solidify your understanding of the material and to find out more.

Stanford's "Essential C" guide: http://cslibrary.stanford.edu/101/
- Good short intro to C
cplusplus.com's reference section: http://www.cplusplus.com/reference/
- Good current reference for the C standard libraries.

However, as with bash programming, simply reading the guides and documentation won't be sufficient; you should try writing little programs and experimenting in order to learn the material.

Bits vs bytes

All information in a computer is ultimately stored as a sequence of 0s and 1s. This is because computers are electronic machines and use different voltage levels (high and low) to actually store the data.

A single 0 or 1 is called a "bit" and is the smallest unit of information in a computer.
Eight bits combined together form a unit called a "byte". A byte is the typical unit of measure for our purposes. Since it has 8 bits, a byte can take 2^8 = 256 different values, so it can also be represented as an integer in the range of 0 to 2^8 - 1 (or 255).

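The data types that we have seen in Java and which we also have in C take up different numbers of bytes. The sizes below are typical for a 64-bit Linux system; unlike Java, C does not guarantee exact sizes for most types, only that a char is 1 byte: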

char: 1 byte
int
- short int: 2 bytes
- normal int: 4 bytes
- long int: 8 bytes
- ints of all kinds can be either signed or unsigned. This changes the range of the values of each type (more later).
floating point
- double: 8 bytes
- long double: 16 bytes
- float: 4 bytes

Computer memory

Computers consist of a few components that you may have heard of.

The CPU (central processing unit): this is a highly complicated bit of circuitry that performs program instructions.
The storage or hard disk: where data is persisted for long periods of time and can persist across restarts.
Memory (or RAM): the computer's memory stores data and code related to a running process for as long as that process is running. We'll be diving into how memory works in this class as it helps us understand how C works.

Each computer has a finite amount of memory available to run processes. All processes must share the computer's memory; it is the operating system's job to decide how to split the memory among all of the processes. The operating system will provide an illusion to all processes that they own all of the memory, although this is not really the case (processes only have access to their own chunks given to them by the operating system). We call this illusion "virtual memory" and it prevents processes from corrupting each other's memory. In the following sections, we will look at memory from the view of a single process. The way that this virtual memory is translated into physical memory is beyond the scope of this class; for our purposes, we can imagine that each process owns all of the computer's memory.

Address space

Within a running process, memory is represented as one huge array of bytes. This includes everything pertaining to the running process, including the code and data. We refer to this array as the "address space". Each index or position in the array is referred to as an "address".

                     -------------------------------------------------------------------------------------
    "low addresses" | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "high addresses"
                     -------------------------------------------------------------------------------------
                     ^                                                                                   ^
              address/index = 0                                                                 address/index = 2^64 - 1

How many elements are in this array of bytes? It depends on your system. If you have a "64-bit" system, then there will be 2^64 entries (bytes) in the array (from 0 to 2^64 - 1). If you have a "32-bit" system, then there will be 2^32 entries (from 0 to 2^32 - 1).

(Side note: How many total bits are in the address space in each case? Since each byte is 8 = 2^3 bits, a 64-bit system will have (2^64)*(2^3) = 2^67 total bits, and a 32-bit system will have (2^32)*(2^3) = 2^35 total bits.)

The address space is organized to separate out the pieces of information that a process needs to run. A simplified picture of the location of information in the address space:

         ------------------------------------------------------------
    low |  |  code  |  globals  |  heap ->                 <- stack  | high
         ------------------------------------------------------------

1) On the low end of the address space is the process's code: the actual instructions that the program is executing in order to perform its task.
2) Next come the "global" variables. These are things like static constants in Java: variables that are declared in the code itself and exist for the entire run of the program. This region also holds string literals such as "Hello world\n" in our upcoming hello-world example.
3) The "heap" is a section of memory that stores variables that the program allocates dynamically. In Java, these will be objects that are created with the "new" keyword. We create variables on the heap in C with "malloc" (future lecture). The heap is not a defined size; it starts at lower addresses and grows towards higher addresses as new variables are added.
4) The "stack" grows from high memory addresses towards the lower end of memory (opposite of the heap). The stack stores a sequence of "frames". Each frame holds information about the execution of a particular function: a "return address" (when this function is done, where is the next code to execute?), a "previous frame pointer" (where is the previous frame located on the stack?), parameters to the function, and any local variables that the function has created (those not created on the heap). Each new function call adds a new "frame" to the stack.

If the heap and the stack ever grow so large that they run into each other, the process will crash with an "out of memory" condition. This does sometimes happen.

Pointers

We call an index for the address space array an "address", and we also call it a "pointer." You can think of a C pointer as a literal pointer or arrow. If we declare an integer x, which stores the value 4 at a particular location in the address space, we can find the address of that variable with the ampersand operator (&). The type of the thing that will store the address is a pointer, and we declare pointers using a star (*):

    int x = 4;
    int *xPtr = &x;
    int xCopy = *xPtr;

           ---
        x | 4 |    address = 3488
           ---
            ^
            |
          --|---
    xPtr | 3488 |  address = 3872
          ------
           ---
    xCopy | 4 |     address = 8471
           ---

In this example, we see that the TYPE of xPtr is "int*": a pointer to an integer, or the address of an integer. We accessed the address of x by using "&x". To get the actual value out of the pointer again ("follow the pointer"), we use the "*" operator on the pointer's name, which accesses the value stored at the pointer's address. When we store that value in a variable of type int, we perform a COPY of the value and store it in xCopy.

POINTERS are different from actual values. We can see this in the following example.

    int x = 4;
    int y = x; // makes a COPY of x - y is a separate location in memory where we store the value 4.
    x = 3;     // y's value doesn't change because it made a copy of x before x was changed.

       ---        ---
    x | 3 |    y | 4 |
       ---        ---

With pointers, however, we can modify the actual value of x:

    int x = 4;
    int *xPtr = &x;
    *xPtr = 3;
    // x now stores 3 - we followed xPtr to the value and changed it.
           ---
        x | 3 |    address = 3488
           ---
            ^
            |
          --|---
    xPtr | 3488 |  address = 3872
          ------

Gotcha: You can create a pointer to any address, even one that holds no valid data.

    int* xPtr = 0;
    int xCopy = *xPtr; // THIS IS BAD

Trying to access some position that doesn't have data (or not the data you think it should have), such as whatever is stored at position 0 in the address space in this case, will usually cause something called a "segmentation fault" or "segfault", which results in an immediate crash of your program. We'll learn more about segfaults as we learn more about C. These reveal bugs in your code.

Hello world

Traditionally, programmers start a new language by writing a program that prints "Hello, World!" We can do that for C in a program stored in "hello.c":

    #include <stdio.h>

    int main(int argc, char **argv) {
      printf("Hello, World!\n");
      return 0;
    }

Just like Java, C must be compiled before you can run it. You compile a file using a program called "gcc".

    $ gcc -o hello hello.c

And then to run the program, we execute the compiled program:

    $ ./hello
    Hello, World!

Intuitively, just like in Java, this basic program works by running the "main" function with any arguments that are given to the program on the command line and it exits when the function returns. However, since there's a lot going on here, we'll break the program down line-by-line.

Libraries

The first line of the file is this "include" directive:

    #include <stdio.h>

Any line in a C program that starts with a "#" character is a command to the "preprocessor", which is essentially part of the compiler. We'll talk a lot more about the preprocessor in a later lecture, but think of this line as finding the file stdio.h and then "copying and pasting" its contents into hello.c. How does the compiler know where to find this file stdio.h? For standard library files like this one, there are a few standard places on the Linux system where they are stored, and the compiler looks there. In this case, check out the /usr/include/ directory.

Functions

In Java, we have "methods" which are part of a class. In C, we call these "functions", and they exist outside of any class (i.e., there is no "this" in C). In the hello world example, we have exactly one function with the name main, and it takes two parameters. The name "main" indicates that this function is the special function that should run when you execute the program (the same way that Java works).

Next time

Next time we'll talk about arrays, strings, and the rest of the hello world program, along with a few new examples.