Lecture: Linux scalability

An Analysis of Linux Scalability to Many Cores, OSDI 2010

Question

Below is a simple sloppy counter implementation based on Section 4.3. Suppose that (1) four threads execute the worker function on four different cores, and (2) all four threads start only after the init function has returned. Consider the counter update function. Do you think the implementation is correct?

#define THREAD_NUM 4
#define SLOPPY_THRESHOLD 8

typedef struct _count_t {
    int global_cnt;
    int local_cnt[THREAD_NUM];
    int threshold;
} count_t;

count_t ref_cnt;

void init(count_t *c)
{
    int i;

    c->global_cnt = 0;
    for (i = 0; i < THREAD_NUM; i++)
        c->local_cnt[i]  = 0;
    c->threshold = SLOPPY_THRESHOLD;
}

void update(count_t *c, int tid, int cnt)
{
    c->local_cnt[tid] += cnt;
    if (c->local_cnt[tid] >= c->threshold) {
        c->global_cnt += c->local_cnt[tid];
        c->local_cnt[tid] = 0;
    }
}

void *worker(void *arg)
{
    int i, tid = *(int *)arg;

    for (i = 0; i < COUNT_CHECK; i++)
        update(&ref_cnt, tid, i % SLOPPY_THRESHOLD);
    return NULL;
}
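
To make the intended flush behavior concrete, here is a minimal single-threaded trace of the same init and update logic. It is only a sketch: read_exact and trace are hypothetical helpers that do not appear in the listing above, and the trace deliberately says nothing about what happens when several cores call update at the same time.

```c
#include <stdio.h>

#define THREAD_NUM 4
#define SLOPPY_THRESHOLD 8

typedef struct {
    int global_cnt;
    int local_cnt[THREAD_NUM];
    int threshold;
} count_t;

static void init(count_t *c)
{
    int i;

    c->global_cnt = 0;
    for (i = 0; i < THREAD_NUM; i++)
        c->local_cnt[i] = 0;
    c->threshold = SLOPPY_THRESHOLD;
}

/* Same logic as the listing: accumulate locally, flush at the threshold. */
static void update(count_t *c, int tid, int cnt)
{
    c->local_cnt[tid] += cnt;
    if (c->local_cnt[tid] >= c->threshold) {
        c->global_cnt += c->local_cnt[tid];
        c->local_cnt[tid] = 0;
    }
}

/* Hypothetical helper: the exact value is the global count plus every
 * local count that has not been flushed yet. */
static int read_exact(const count_t *c)
{
    int i, sum = c->global_cnt;

    for (i = 0; i < THREAD_NUM; i++)
        sum += c->local_cnt[i];
    return sum;
}

/* Hypothetical single-threaded trace: three updates of 3 on slot 0.
 * The local count goes 3, 6, then reaches 9 >= 8 and is flushed. */
static void trace(count_t *c)
{
    int step;

    init(c);
    for (step = 0; step < 3; step++) {
        update(c, 0, 3);
        printf("step %d: local=%d global=%d exact=%d\n",
               step, c->local_cnt[0], c->global_cnt, read_exact(c));
    }
}
```

Calling trace on a count_t from main shows the global count staying at 0 for the first two updates and jumping to 9 on the third, while read_exact tracks the true total throughout.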

Question

Below is a conditional-set function implemented on top of a lock-free hashtable. It applies a version-based lock-free protocol and allows concurrent updates, whereas the scheme described in Section 4.4 permits only a single exclusive writer. What can go wrong with this implementation?

typedef struct _ht_entry_t {
    char *key;
    char *val;
    uint64_t version_cnt;
    // other fields
} ht_entry_t;

ht_entry_t* myht; // a hashtable
ht_entry_t* ht_get(ht_entry_t *ht, char *key); // atomic lock-free
ht_entry_t* ht_set(ht_entry_t *ht, char *key, char *val); // atomic lock-free

void my_cond_set(char *key)
{
    uint64_t local_version;
    ht_entry_t *e;
    char *new_val;

    while (1) {
        e = ht_get(myht, key);
        local_version = e->version_cnt;

        // perform computation
        // new_val is set

        if (_compare_and_swap(&e->version_cnt, local_version, local_version + 1)) {
            ht_set(myht, key, new_val);
            break;
        }
    }
}
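
The listing does not define _compare_and_swap. As a reading aid, the sketch below assumes it behaves like a C11 atomic compare-exchange on the version counter: the store happens only if the counter still holds the expected value. version_cas and demo_stale_writer are hypothetical names introduced for illustration, and the demo is single-threaded, so it shows only the semantics of the primitive, not the behavior of the full protocol under contention.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for _compare_and_swap: atomically replace *addr
 * with desired if and only if *addr still equals expected.
 * Returns true when the swap took place. */
static bool version_cas(_Atomic uint64_t *addr, uint64_t expected,
                        uint64_t desired)
{
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

/* Two writers snapshot the same version, then each tries to bump it.
 * Only the first bump succeeds; the second sees a stale version and
 * would have to retry. Returns the number of successful bumps and
 * stores the final version through final_version. */
static int demo_stale_writer(uint64_t *final_version)
{
    _Atomic uint64_t version = 5;
    uint64_t seen_by_a = atomic_load(&version);
    uint64_t seen_by_b = atomic_load(&version);
    int wins = 0;

    if (version_cas(&version, seen_by_a, seen_by_a + 1))
        wins++;
    if (version_cas(&version, seen_by_b, seen_by_b + 1))
        wins++;

    *final_version = atomic_load(&version);
    return wins;
}
```

Note that the primitive only guards the version counter itself; whatever a caller does after a successful swap is a separate step.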

Question

As Figure 3 shows, gmake scales well as the number of cores increases. However, it still falls short of perfect scalability. Why do you think that’s the case?

Question

Provide a list of questions you would like to discuss in class. Feel free to provide any comments on the paper and related topics (e.g., which parts you like and which parts you find confusing).