We have an unrelenting need for better data structures and algorithms -- we have more data than we know what to do with. For example, Earth's surface is about 5*10^8 sq km, which is 5*10^14 samples at a resolution of one square meter. Toss in 6 different dimensions (say, wavelengths) and 365 days of the year, and now we're up to roughly 10^18 samples. What could this be used for? How about weather prediction. Or look at genetics: humans have about 3.2 billion base pairs. Imagine that for every person, and then add the field of epigenetics, where the same DNA expresses itself differently in different cells. Processors and hard disks are growing at similar rates, but for superlinear (slower than O(n)) algorithms the runtimes grow faster than the hardware can keep up. Hence designing efficient data structures and algorithms will always remain a top priority.

We'd like to analyze how much time or space an algorithm takes to run. One tactic is to implement it, run it, and time it, perhaps averaging several trials (a small timing sketch appears at the end of these notes).

Some pros:
- we get to find out how the system's configuration affects performance
- we can stress-test things to see how they perform in a dynamic, distributed environment
- we don't have to do any complex math

Some cons:
- we need to implement the code
- it can be hard to extrapolate performance
- comparisons of two different algorithms need to be done with everything else identical (same computer, OS, processor, system load, etc.)

The other tactic, which we'll look at, is to describe the algorithm by the rate of its growth as the problem size grows.

Some pros:
- independent of system-specific information such as processor speed
- good for extrapolating
- no need to implement code

Some cons:
- won't give you info on cache efficiency or exact runtimes
- only gives useful information for large problem sizes

Most of the time we'll be concerned with the worst-case performance of the algorithm, e.g. if we're searching for something, we don't find it up front. Sometimes, however, we'll look at average performance or comment on the best-case performance.

One way to see how long an algorithm runs is to count the individual operations that happen. We can assume that operations such as comparisons, assignments, arithmetic, array accesses, and calling/returning take constant time. Let's look at indexOf(int[] arr, int val) for an unsorted array:

    int indexOf(int[] arr, int val) {
        for(int i=0; i<arr.length; i++)   // up to n iterations, n = arr.length
            if(arr[i]==val)               // one array access and one comparison each time
                return i;
        return -1;                        // worst case: val isn't in the array at all
    }

In the worst case (val is not in the array) the loop runs n times and each iteration does a constant number of operations, so the total is some constant times n plus a constant. Big-Oh notation captures this growth rate while ignoring those constants: we say f(n) is O(g(n)) if there exist positive constants c and n0 such that (for all n > n0) f(n) <= c*g(n). Basically what this says is that g(n) is an upper bound for the growth rate of f(n), and it ignores lower-order terms and constant multipliers. Another way of writing the definition is that lim_{n->oo} f(n)/g(n) is finite.

One catch about big-Oh is that technically a linear function is also O(n^2) and O(2^n). It's like saying I can sell you this car for cheaper than a million dollars, or a billion dollars, when I can really say something more useful, like cheaper than 10000 dollars. So when we state the runtime of an algorithm using big-Oh, we give the strictest, most informative answer we can. Such a tight bound is really what Theta notation expresses, but we still say big-Oh. When we give a runtime in terms of big-Oh, we also want it to be as concise and simple as possible. So while I could say an algorithm is O(2*n+7), it is equivalent and much simpler to say O(n).
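As a quick worked check of the definition (this particular function and these constants are my own illustration, not from the notes above): take f(n) = 2*n+7 and g(n) = n. For n >= 7 we have 7 <= n, so 2*n+7 <= 2*n+n = 3*n. Choosing c = 3 and n0 = 7 satisfies the definition, so 2*n+7 is O(n). Equivalently, lim_{n->oo} (2*n+7)/n = 2, which is finite.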
Another way of seeing that lower-order terms and constants play no role when n is large is to look at the growth rates. Let's look at how n^2 grows when n is doubled, and then how 3*n^2+7*n+18 grows when n is doubled, letting n get arbitrarily large.

    lim_{n->oo} (2n)^2 / n^2 = 4*n^2 / n^2 = 4

    lim_{n->oo} (3*(2n)^2 + 7*(2n) + 18) / (3*n^2 + 7*n + 18)
        = lim_{n->oo} (12*n^2 + 14*n + 18) / (3*n^2 + 7*n + 18)

Now let's multiply the numerator and denominator by 1/n^2:

        = lim_{n->oo} (12 + 14/n + 18/n^2) / (3 + 7/n + 18/n^2)

You can see that the terms with n or n^2 in the denominator quickly approach 0, and the limit is 12/3 = 4. Doubling n quadruples both functions, so the lower-order terms 7*n and 18 make no difference once n is large.

The following are the most common Big-Ohs:

O(1) - constant, e.g. adding two numbers, cracking bad passwds
    doubling the problem size has no effect on runtime
O(log n) - logarithmic, e.g. binary search, Euclid's algorithm, fast exponentiation
    squaring the problem size doubles the runtime; doubling the problem size (when n is large) is a very slight increase
O(n) - linear, e.g. linear search, computing an average, dot product
    doubling the problem size doubles the runtime
O(n log n) - e.g. good sorts (merge sort), Fast Fourier Transform
    doubling the problem size increases the runtime by a little more than double when n is large
O(n^2) - quadratic, e.g. insertion sort, matrix add
    2x problem size means 4x runtime
O(n^3) - cubic, e.g. matrix multiply (a dot product per element)
    2x problem size means 8x runtime
    higher-order polynomials suffer from the curse of dimensionality
O(2^n) - exponential, e.g. factoring n-bit numbers, cracking good passwds, game searches
    exponential algorithms suffer from combinatorial explosion
    increasing the problem size by +1 means 2x runtime

Some other facts:

    O(1) < O(log log n) < O(log n) < O(n^0.0001)
    O(2^n) < O(n!)

Note that O(log_2 n) = O(log_10 n) because log_b n = (log n)/(log b), which only differs by a constant factor. Similarly, O(log (n^2)) = O(log n) because log (n^2) = 2*log n. Also, O(f(n)) + O(g(n)) = O(f(n) + g(n)) -- the faster-growing term dominates -- and O(f(n)) * O(g(n)) = O(f(n)*g(n)).

Let's look at code for seeing if an unsorted array has duplicates. We need to check every pair to see if there's a match:

    for(int i=0; i<arr.length; i++)
        for(int j=i+1; j<arr.length; j++)   // each pair (i, j) is checked exactly once
            if(arr[i]==arr[j])
                return true;                // found a duplicate
    return false;                           // no pair matched, no duplicates
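A quick count for this duplicates check (my own back-of-the-envelope, not part of the original notes): with j starting at i+1, the inner comparison runs (n-1) + (n-2) + ... + 1 = n*(n-1)/2 times in the worst case (no duplicates at all), and each comparison takes constant time, so the check is O(n^2). Doubling the array size roughly quadruples the runtime, just like the 3*n^2+7*n+18 example above.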
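Finally, here is the timing sketch promised near the top of these notes, illustrating the "implement it, run it, and time it" tactic on indexOf. This is only a minimal sketch under my own assumptions -- the class name, array sizes, and trial count are arbitrary choices, and it measures the worst case by searching for a value that is never present. Real measurements would also need warm-up runs and identical system conditions, as discussed above.

    import java.util.Random;

    public class TimeIndexOf {
        // same linear search as earlier in these notes
        static int indexOf(int[] arr, int val) {
            for (int i = 0; i < arr.length; i++)
                if (arr[i] == val)
                    return i;
            return -1;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            int trials = 100;                          // average several trials
            long sink = 0;                             // keep results so the JIT can't discard the calls
            for (int n = 1_000_000; n <= 8_000_000; n *= 2) {
                int[] arr = new int[n];
                for (int i = 0; i < n; i++)
                    arr[i] = rng.nextInt(1000);        // random values 0..999
                long start = System.nanoTime();
                for (int t = 0; t < trials; t++)
                    sink += indexOf(arr, -1);          // -1 never occurs: worst-case full scan
                long elapsed = System.nanoTime() - start;
                System.out.println("n = " + n + ", avg time = "
                        + (elapsed / trials / 1e6) + " ms");
            }
            System.out.println("(ignore) " + sink);
        }
    }

If the analysis is right, doubling n should roughly double the measured time, matching the O(n) prediction for linear search.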