Storing Lots of Stuff: Databases and Data Silos

The essence of information technology, information management, information retrieval, information processing, etc., is having information easily available; hence the need for some device that can store vast amounts of information permanently and also provide quick retrieval of any single piece of information.

The platters of a hard drive store information as magnetic patterns. A stack of platters rotates at high speed while read/write heads fly just over the top and bottom surfaces of each platter. Each platter has its information recorded in concentric circles called tracks, which are further divided into smaller sections called sectors.

The very first production hard disk was the IBM 350 disk storage unit, shipped as part of the IBM 305 RAMAC (Random Access Method of Accounting and Control) system introduced on September 13, 1956. It stored 5 million characters on 50 disks, each 24 inches in diameter. The data transfer rate of the RAMAC was 8,800 characters per second.

Something to think about: So what are you going to do? Just dump your data onto your hard drive the way you throw beer cans on the ground? If you are going to retrieve data, it will have to be organized, i.e., given an address where the hard drive can find it. [This is organization of data at the physical layer, i.e., addresses and pointers.]
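
One way to picture a physical address is as a single block number computed from a track, a platter surface, and a sector. The sketch below uses the conventional cylinder-head-sector arithmetic; the geometry numbers and function names are hypothetical, chosen only for illustration, and do not describe any particular drive.

```python
# Hypothetical drive geometry, chosen only for illustration.
SURFACES = 4             # one read/write head per surface: 2 platters, top and bottom
SECTORS_PER_TRACK = 63   # sectors on each track


def block_address(track, surface, sector):
    """Map (track, surface, sector) to a single block number.

    Sectors are numbered from 1 on each track, following the
    conventional cylinder-head-sector scheme.
    """
    if not (0 <= surface < SURFACES and 1 <= sector <= SECTORS_PER_TRACK):
        raise ValueError("surface or sector out of range")
    return (track * SURFACES + surface) * SECTORS_PER_TRACK + (sector - 1)


def locate(block):
    """Inverse mapping: recover (track, surface, sector) from a block number."""
    track_and_surface, sector_index = divmod(block, SECTORS_PER_TRACK)
    track, surface = divmod(track_and_surface, SURFACES)
    return track, surface, sector_index + 1


address = block_address(track=2, surface=1, sector=5)
print(address, locate(address))
```

The point is only that "organized" at the physical layer means every piece of data can be reduced to a number the drive can seek to, and that the number can be turned back into a place on a platter.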

But if you were really clever, you would write a computer program that enhanced your power to retrieve data, e.g., "Give me all the Chevrolets that are blue in color and that have black leather interiors." [This implies that you can point to "all the Chevrolets" and then point to "all the Chevrolets AND color is blue", etc. Hence your data must have some logical organization. Jargon alert! Your data has a "schema".]
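
As a minimal sketch of such a retrieval program, assuming Python's built-in sqlite3 module, the Chevrolet request might look like the following. The table name, column names, and sample rows are all invented for illustration; the CREATE TABLE statement is the "schema" in miniature.

```python
import sqlite3

# Hypothetical schema and data: names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (id INTEGER PRIMARY KEY, make TEXT, color TEXT, interior TEXT)"
)
conn.executemany(
    "INSERT INTO cars (make, color, interior) VALUES (?, ?, ?)",
    [
        ("Chevrolet", "blue", "black leather"),
        ("Chevrolet", "red", "black leather"),
        ("Ford", "blue", "black leather"),
        ("Chevrolet", "blue", "tan cloth"),
    ],
)


def find_cars(conn, make, color, interior):
    """Return the ids of cars matching all three values, lowest id first."""
    rows = conn.execute(
        "SELECT id FROM cars "
        "WHERE make = ? AND color = ? AND interior = ? "
        "ORDER BY id",
        (make, color, interior),
    )
    return [row[0] for row in rows]


matches = find_cars(conn, "Chevrolet", "blue", "black leather")
print(matches)
```

Each AND in the WHERE clause narrows the set being pointed to, which is the "all the Chevrolets AND color is blue" step described above.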

Let's call such a program a database management system (DBMS).

So What Metaphor Will You Use To Model Your Data? More Jargon Alert! Your DBMS has a Schema Level

In 1961, Charles Bachman at General Electric Co. developed the first successful database management system. Bachman's Integrated Data Store (IDS) featured data schemas and logging. But it ran only on GE mainframes, could use only a single file for the database, and required all generation of data tables to be hand-coded.

Something to read: "The Programmer as Navigator" by Charles Bachman, 1973 ACM Turing Award Lecture

My proposition today is that it is time for the application programmer to abandon the memory-centered view, and to accept the challenge and opportunity of navigation within the n-dimensional data space. Charles Bachman, 1973

The Table Metaphor: Relational Database Systems

In the early 1970s, databases used either a hierarchical structure or a complex set of pointers to the physical location of data. This strategy was efficient for specific queries, but new queries required complex reprogramming, and adding new types of data forced a total redesign of the database. During 1970-72, Ted Codd proposed the relational model, which disconnects the schema (logical organization) of a database from the physical storage methods.

Codd's idea was that relationships between data items should be based on the items' values, and not on their physical addresses.
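
A minimal sketch of what "based on values" means, again assuming Python's sqlite3 module and two invented tables (owners and cars, with made-up rows). The only link between a car and its owner is a matching owner_id value, so the same question gets the same answer even when the rows are loaded, and therefore stored, in a different order.

```python
import sqlite3


def build(owner_rows, car_rows):
    """Create a fresh in-memory database and load rows in the order given."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE owners (owner_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE cars (make TEXT, owner_id INTEGER)")
    conn.executemany("INSERT INTO owners VALUES (?, ?)", owner_rows)
    conn.executemany("INSERT INTO cars VALUES (?, ?)", car_rows)
    return conn


def cars_with_owners(conn):
    """Relate cars to owners purely by matching owner_id values."""
    rows = conn.execute(
        "SELECT owners.name, cars.make FROM cars "
        "JOIN owners ON cars.owner_id = owners.owner_id "
        "ORDER BY owners.name, cars.make"
    )
    return rows.fetchall()


# Hypothetical sample data.
owners = [(1, "Ada"), (2, "Grace")]
cars = [("Chevrolet", 2), ("Ford", 1), ("Chevrolet", 1)]

first = cars_with_owners(build(owners, cars))
second = cars_with_owners(build(owners[::-1], cars[::-1]))  # reversed load order
print(first)
print(first == second)
```

Nothing in the query mentions where a row lives; a pointer-based design would have had to be rebuilt when the storage order changed.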

Relational Tables Conform To The "Normalization Process"

Significance check for the inattentive: Normalization means that the "construction of information" becomes a science! (or perhaps, an engineering science) (or at least, a mechanical process) (Ok, so now we have some "rules") (Well, the rules that engineers would use) (But, you have to admit, some rules are better than no rules) (Well, maybe)

But would you let an engineer design your information?

Normalizing is Abnormal 
By Jeff Fowler, President, Decision Software, Inc.
 
 
If I had to choose the single most common mistake I’ve seen in marketing database design, 
it would be over-normalization; or, as we dummies say: too many tables.
Generally speaking, the more tables there are in a database, the more difficult it becomes 
for marketing people to ask marketing questions.

Database administrators like to normalize because that’s what good DBAs do.  There are 
formal rules governing relational database design, and breaking one is as inconceivable 
to them as watching a sporting event without drinking beer is to me.  So why do these 
maddeningly-precise DBAs like taking a perfectly good file and splitting it up into 
different tables?  For two very good reasons: to keep your database clean and accurate, 
and to save storage space.

Let’s delve further into our scenario and take a single customer file, household it, 
append demographics, and turn it over to our trusty DBA.  Meticulously applying the 
normalization logic we saw before, he takes the file and proceeds to create a household 
table, a customer table, a household demographic table, and a customer demographic table.  
Why?  Because clearly if a household has three people in it, we waste space by repeating 
the address three times, when in fact only their names are different.  And after all, 
don’t some demographics describe the household rather than the individual?  Also, because we had about a 70% match rate,
many of the demographic fields are missing, influencing the decision to store demographic 
data in separate tables.

Next, being a thorough and diligent fellow, our DBA finds out that we’ve got some business 
data mixed into our file, and – you guessed it – suddenly we find ourselves with a separate 
business table along with a business demographic table, complete with data elements like
SIC code, number of employees, and annual revenue.  Before we know it, the customer file 
we started with has become seven database tables!  Now let’s talk about some of the consequences 
of all this efficiency: slower performance, greater complexity, and increased risk of error.

The trick is to find the proper balance between marketing’s desire to make one Great Big Table 
and a DBA’s tendency to make a hundred.  Now that you can buy zillion gigabyte drives for $3, 
saving storage space simply isn’t that much of an issue.
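
Fowler's household example can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's sqlite3 module, the names and addresses are invented, and it splits the flat file into just two of the tables he describes (household and customer).

```python
import sqlite3

# Hypothetical flat customer file: one row per person, address repeated.
flat_file = [
    ("Pat Smith", "12 Elm St"),
    ("Lee Smith", "12 Elm St"),
    ("Kim Smith", "12 Elm St"),
    ("Ray Jones", "9 Oak Ave"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE household (household_id INTEGER PRIMARY KEY, address TEXT UNIQUE)"
)
conn.execute(
    "CREATE TABLE customer ("
    "name TEXT, household_id INTEGER REFERENCES household (household_id))"
)

# Normalize: store each address once, and point each customer at it by value.
for name, address in flat_file:
    conn.execute("INSERT OR IGNORE INTO household (address) VALUES (?)", (address,))
    (household_id,) = conn.execute(
        "SELECT household_id FROM household WHERE address = ?", (address,)
    ).fetchone()
    conn.execute("INSERT INTO customer VALUES (?, ?)", (name, household_id))

address_count = conn.execute("SELECT COUNT(*) FROM household").fetchone()[0]

# The cost: getting the original one-table view back now takes a join.
rebuilt = conn.execute(
    "SELECT customer.name, household.address FROM customer "
    "JOIN household ON customer.household_id = household.household_id "
    "ORDER BY customer.rowid"
).fetchall()

print(address_count)
print(rebuilt == flat_file)
```

The address is stored twice instead of four times, which is the DBA's gain; every marketing question that needs a name and an address together now has to know about the join, which is Fowler's complaint.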

Database Has Been An Enormously Successful Computer Application

World’s Largest Database reaches 500,000 Gigabytes

Last week at the Stanford Linear Accelerator Center (SLAC), the BABAR experiment's database stored its 500,000th Gigabyte - a milestone that makes it the largest known database in the world. April 12, 2002

In 2000, Oracle owned 33.8% of the worldwide database market while IBM commanded 30.1%, according to Gartner Dataquest. Microsoft was a distant third at 14.9%, according to the firm, though strong sales of its SQL Server database product have been hurting Oracle as well. ("Oracle Finds Competition at the Core," TheStreet.com, 10/26/2001)

Oracle chief executive Larry Ellison--Gates’ longtime nemesis in the ego-driven technology industry--now holds the top spot. Based on the closing prices of Oracle and Microsoft stock, Ellison is worth $53 billion, compared to Gates’ $51.75 billion. ("Gates loses title as world's richest man," News.com, April 28, 2000)

Database Has Been Facilitated By Ever Cheaper Memory

Consider the price of a SECOND hard drive.
Dell Computer, Summer 2003

Finally Everyone Can Have His/Her Own Database (Or Six Of Them)

But is more necessarily better? Can there be too much database? Consider the Size Effect when you can't remember all the stuff that's supposed to be in this database, or that database, or what it was called, and when you (or your assistant) put it in there, and maybe it's under another name, etc., etc. Have you considered getting an assistant for your assistant? How about a database for your assistant to keep track of your database?

Rummaging Around In A Database (Or Six Of Them)

Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. (Principles of Data Mining, David Hand et al., 2001)

Example of data mining: Don R. Swanson "Undiscovered Public Knowledge"

Example of data mining: Bill James "Sabermetrics"

I first published an article about a statistical analysis of stolen bases against catchers, something I puzzled over as a kid.

Before computers, the records were lousy, but with the computerization of record keeping, statistics about catchers are now something cited as a matter of course. I like to think that my work demystified the art of catching a little bit.
	
I also looked at the throwing arms of right fielders. Thirty years ago people thought that an outfielder could play a major role in keeping the number of runs down by being able to hold runners on base.

And clearly, the identity of the right fielder was important in keeping people from trying to go from first to third base on a base hit.

But it turns out that the number of times requiring a right fielder to throw in this situation was less than 100 times a season, and that they held fewer than 20 runners on base. This averaged out to three runs per season. It turns out not to be a big deal and actually diminished the perception of the value of right fielders.

Today, because of computers, a lot of new things are monitored in baseball.

Something to think about:

If you are forced to rummage around in one (or six) databases, is this an indication of your failure to plan your database, or a grand opportunity to discover the non-obvious? What's your opinion? Parallel question that is closer to home: your clothes closet. Does it represent your failure to plan your wardrobe, or a grand opportunity to "rediscover" things you've forgotten you own? Your garage, etc., etc.

Are the preceding examples of Swanson and James extraordinary, or are they the sort of thing that could happen every day if we spent our time rummaging around in our databases? What's your opinion?

If you are a rummager, why not save time by planning your database in the first place?

In A Distributed World, Database Becomes Data Silo

"Redundant data, wrong data, missing data, miscoded data. Every company has some of each, probably residing in IT nooks that don't communicate much."

"For example, different sales, inventory or manufacturing systems at a clothing retailer might track the same item by different names. A central database - if there is one - might include "extra large," "XL" and "TG" (for the French term tres grande). But they all refer to the same thing."

"And then there's the attic problem familiar to most homeowners: Toss in enough boxes of seasonal clothes, holiday trim, family history documents and other important items, and soon there's a stored mess that's too big to manage. That can happen at companies, too. Multiple operating units, manufacturing plants and other facilities may all run different vendors' applications to do sales, human resources and other tasks. That mix of disparate data makes for a mass of unsorted and unreconciled information." Merging Data Silos Computerworld, April 15, 2002

Something to think about:

Your collaborator in Transylvania wants to share his information with you, but you've never heard of the DBMS he uses. Your collaborator in Dog Patch, USA, can only accept data specially formatted for his home-made DBMS. Your collaborator down the hall wants to give you data, but she's using an older version of your DBMS that will require reformatting before you load it in. Plus if your System Administrator shifts everyone in your company to a new operating system next year, you will have to jump to a different DBMS. Sound familiar?
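
Reconciling silos usually starts with small translation tables. The sketch below returns to the retailer's size labels quoted above ("extra large", "XL", "TG"); the choice of "XL" as the canonical code, the function names, and the sample inventories are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup table: treating "XL" as the canonical code is an assumption.
SIZE_SYNONYMS = {
    "extra large": "XL",
    "xl": "XL",
    "tg": "XL",
}


def canonical_size(label):
    """Map a silo's size label to the canonical code; unknown labels pass through."""
    key = " ".join(label.lower().split())  # ignore case and stray whitespace
    return SIZE_SYNONYMS.get(key, label)


def merge_inventory(*silos):
    """Add up quantities across silos after translating each label."""
    totals = Counter()
    for silo in silos:
        for label, quantity in silo:
            totals[canonical_size(label)] += quantity
    return dict(totals)


# Three invented systems tracking the same item under different names.
sales = [("extra large", 4)]
warehouse = [("XL", 10), ("M", 3)]
montreal = [("TG", 2)]

merged = merge_inventory(sales, warehouse, montreal)
print(merged)
```

The code is trivial; the hard part in practice is discovering that the three labels mean the same thing in the first place.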

Don't Believe It? Terry Brooks wrote a DBMS for himself in 1992 in Turbo Prolog, and he uses it every day to keep track of book reviews for JASIST. He no longer has the source code for his DBMS (don't bother to ask how this happened), and furthermore, he no longer even owns a working copy of Turbo Prolog. (And that's another mystery, but even if I were to find the floppies, my current office machine doesn't have a floppy drive.) Question: Is Terry Brooks living on the edge?