Storing Lots of Stuff: Databases and Data Silos

The essence of information technology, information management, information retrieval, information processing, etc., is having information easily available; hence the need for some device that can store vast amounts of information permanently and also retrieve any single piece of it quickly.

The platters of a hard drive store information as magnetic patterns. A stack of platters rotates at high speed while read/write heads fly just over the top and bottom surface of each platter. Each platter has its information recorded in concentric circles called tracks, which are further divided into smaller sections called sectors.
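To make the track-and-sector picture concrete, here is a minimal sketch of how a location on the disk can be turned into a single number and back. The geometry values and function names are made up for illustration, and "cylinder" (the set of same-numbered tracks across all platter surfaces) is a term the text above does not use. Modern drives hide their real geometry, so treat this as a model of the idea, not a description of any particular drive.

```python
# Hypothetical geometry, chosen only for illustration.
HEADS_PER_CYLINDER = 4    # e.g., two platters, one head per surface
SECTORS_PER_TRACK = 63    # sectors are conventionally numbered from 1


def chs_to_lba(cylinder, head, sector):
    """Map a (cylinder, head, sector) triple to one linear block number."""
    if cylinder < 0:
        raise ValueError("cylinder out of range")
    if not 0 <= head < HEADS_PER_CYLINDER:
        raise ValueError("head out of range")
    if not 1 <= sector <= SECTORS_PER_TRACK:
        raise ValueError("sector out of range")
    track_index = cylinder * HEADS_PER_CYLINDER + head
    return track_index * SECTORS_PER_TRACK + (sector - 1)


def lba_to_chs(lba):
    """Invert chs_to_lba: recover (cylinder, head, sector) from a block number."""
    track_index, sector_index = divmod(lba, SECTORS_PER_TRACK)
    cylinder, head = divmod(track_index, HEADS_PER_CYLINDER)
    return cylinder, head, sector_index + 1
```

With this made-up geometry, `chs_to_lba(1, 0, 1)` gives 252, and `lba_to_chs(252)` gives back `(1, 0, 1)`.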

The very first production hard disk was the IBM 305 RAMAC (Random Access Method of Accounting and Control), introduced on September 13, 1956. It stored 5 million characters on 50 disks, each 24 inches in diameter. The data transfer rate of the RAMAC was 8,800 characters per second.

Something to think about: So what are you going to do? Just dump your data onto your hard drive the way you throw beer cans on the ground? If you are going to retrieve data, it will have to be organized, i.e., given an address where the hard drive can find it. [This is organization of data at the physical layer, i.e., addresses and pointers.]

But if you were really clever, you would write a computer program that enhanced your power to retrieve data, e.g., "Give me all the Chevrolets that are blue in color and that have black leather interiors." [This implies that you can point to "all the Chevrolets" and then point to "all the Chevrolets AND color is blue", etc. Hence your data must have some logical organization. Jargon alert! Your data has a "schema".]

Let's call such a program a database management program, better known as a database management system (DBMS).
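Here is one way the blue-Chevrolet request might look in practice. This is a minimal sketch using Python's built-in sqlite3 module. The table name `cars`, its columns, and the sample rows are all invented for illustration; the point is only that the schema gives you names such as make, color, and interior to ask questions with.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema: the logical organization of the data (hypothetical columns).
conn.execute(
    "CREATE TABLE cars ("
    "id INTEGER PRIMARY KEY, make TEXT, color TEXT, interior TEXT)"
)
conn.executemany(
    "INSERT INTO cars (make, color, interior) VALUES (?, ?, ?)",
    [
        ("Chevrolet", "blue", "black leather"),
        ("Chevrolet", "red", "black leather"),
        ("Ford", "blue", "black leather"),
        ("Chevrolet", "blue", "gray cloth"),
    ],
)


def find_cars(make, color, interior):
    """Return every car that matches all three values."""
    cursor = conn.execute(
        "SELECT id, make, color, interior FROM cars "
        "WHERE make = ? AND color = ? AND interior = ? ORDER BY id",
        (make, color, interior),
    )
    return cursor.fetchall()


matches = find_cars("Chevrolet", "blue", "black leather")
```

Notice that the query never mentions where on the disk anything lives. It names values, and the DBMS does the finding.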

So What Metaphor Will You Use To Model Your Data?

More Jargon Alert! Your DBMS has a Schema Level

In 1961, Charles Bachman at General Electric Co. developed the first successful database management system. Bachman's Integrated Data Store (IDS) featured data schemas and logging. But it ran only on GE mainframes, could use only a single file for the database, and required all generation of data tables to be hand-coded.

Something to read: "The Programmer as Navigator" by Charles Bachman, 1973 ACM Turing Award Lecture

 

"My proposition today is that it is time for the application programmer to abandon the memory-centered view, and to accept the challenge and opportunity of navigation within the n-dimensional data space." Charles Bachman, 1973

 

The Table Metaphor: Relational Database Systems

In the early 1970s, databases used either a hierarchical structure or a complex set of pointers to the physical location of data. This strategy was efficient for the specific queries it was built for, but new queries required complex reprogramming, and adding new types of data forced a total redesign of the database. During 1970-72, Ted Codd proposed the relational model, which disconnects the schema (logical organization) of a database from the physical storage methods.

Codd's idea was that relationships between data items should be based on the items' values, and not on the items' physical addresses.
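A small sketch of what "relationships based on values" can look like, again using Python's built-in sqlite3 module. The `owners` and `cars` tables and their contents are invented for illustration. A car is tied to its owner only because the two rows carry the same `owner_id` value; no row stores a pointer to where the other row lives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE owners (owner_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cars (car_id INTEGER PRIMARY KEY, make TEXT, owner_id INTEGER);
    INSERT INTO owners VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO cars VALUES (10, 'Chevrolet', 2), (11, 'Ford', 1), (12, 'Chevrolet', 1);
    """
)


def cars_with_owners():
    """Pair each car with its owner by matching owner_id values."""
    cursor = conn.execute(
        "SELECT owners.name, cars.make "
        "FROM cars JOIN owners ON cars.owner_id = owners.owner_id "
        "ORDER BY cars.car_id"
    )
    return cursor.fetchall()


pairs = cars_with_owners()
```

If the DBMS reorganized its files tomorrow, the same query would still work, because it depends on the values and not on where they are stored.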

 

Relational Tables Conform To The "Normalization Process"

Significance check for the inattentive: Normalization means that the "construction of information" becomes a science! (or perhaps, an engineering science) (or at least, a mechanical process) (Ok, so now we have some "rules") (Well, the rules that engineers would use) (But, you have to admit, some rules are better than no rules) (Well, maybe)
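For a taste of how mechanical the process can be, here is a minimal sketch of one step in the spirit of normalization: a fact that is repeated in every row (an owner's city) gets stored exactly once. The field names and sample data are invented for illustration, and the formal normal forms involve more than this single step.

```python
# A flat table: the owner's city is repeated for every car the owner has.
flat_rows = [
    {"owner": "Ada", "owner_city": "London", "make": "Ford"},
    {"owner": "Ada", "owner_city": "London", "make": "Chevrolet"},
    {"owner": "Grace", "owner_city": "New York", "make": "Chevrolet"},
]


def normalize(rows):
    """Split flat rows into an owners table and a cars table."""
    owners = {}  # owner name -> city, stored once per owner
    cars = []    # (owner name, make) pairs that refer to owners by value
    for row in rows:
        known_city = owners.setdefault(row["owner"], row["owner_city"])
        if known_city != row["owner_city"]:
            # The flat table contradicted itself, which repetition invites.
            raise ValueError(f"conflicting cities for {row['owner']}")
        cars.append((row["owner"], row["make"]))
    return owners, cars


owners, cars = normalize(flat_rows)
```

After the split, changing Ada's city means changing one entry instead of hunting down every row that mentions her.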

But would you let an engineer design your information?

 

Database Has Been An Enormously Successful Computer Application

World’s Largest Database reaches 500,000 Gigabytes

Last week at the Stanford Linear Accelerator Center (SLAC), the BABAR experiment's database stored its 500,000th Gigabyte - a milestone that makes it the largest known database in the world. April 12, 2002

In 2000, Oracle owned 33.8% of the worldwide database market while IBM commanded 30.1%, according to Gartner Dataquest. Microsoft was a distant third at 14.9%, according to the firm, though strong sales of its SQL Server database product have been hurting Oracle as well. "Oracle Finds Competition at the Core," TheStreet.com, 10/26/2001

Oracle chief executive Larry Ellison--Gates’ longtime nemesis in the ego-driven technology industry--now holds the top spot. Based on the closing prices of Oracle and Microsoft stock, Ellison is worth $53 billion, compared to Gates’ $51.75 billion. "Gates loses title as world's richest man," News.com, April 28, 2000

 

 

Database Has Been Facilitated By Ever Cheaper Memory

 

Consider the price of a SECOND hard drive.
Dell Computer, Summer 2003


 

Finally Everyone Can Have His/Her Own Database (Or Six Of Them)

But is more necessarily better? Can there be too much database? Consider the "Size Effect": you can't remember all the stuff that's supposed to be in this database, or that database, or what it was called, or when you (or your assistant) put it in there, and maybe it's under another name, etc., etc. Have you considered getting an assistant for your assistant? How about a database for your assistant to keep track of your database?

 

Rummaging Around In A Database (Or Six Of Them)

"Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." Principles of Data Mining, David Hand et al., 2001

Example of data mining: Don R. Swanson, "Undiscovered Public Knowledge"

Example of data mining: Bill James, "Sabermetrics"

 

Something to think about:

If you are forced to rummage around in one (or six) databases, is this an indication of your failure to plan your database, or a grand opportunity to discover the non-obvious? What's your opinion? Parallel question that is closer to home: Your clothes closet. Does it represent your failure to plan your wardrobe, or a grand opportunity to "rediscover" things you've forgotten you own? Your garage, etc., etc.

Are the preceding examples of Swanson and James extraordinary, or are they the sort of thing that could happen every day if we spent our time rummaging around in our databases? What's your opinion?

If you are a rummager, why not save time by planning your database in the first place?

In A Distributed World, Database Becomes Data Silo

"Redundant data, wrong data, missing data, miscoded data. Every company has some of each, probably residing in IT nooks that don't communicate much."

"For example, different sales, inventory or manufacturing systems at a clothing retailer might track the same item by different names. A central database - if there is one - might include "extra large," "XL" and "TG" (for the French term tres grande). But they all refer to the same thing."

"And then there's the attic problem familiar to most homeowners: Toss in enough boxes of seasonal clothes, holiday trim, family history documents and other important items, and soon there's a stored mess that's too big to manage. That can happen at companies, too. Multiple operating units, manufacturing plants and other facilities may all run different vendors' applications to do sales, human resources and other tasks. That mix of disparate data makes for a mass of unsorted and unreconciled information." Merging Data Silos  Computerworld, April 15, 2002
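The "extra large" / "XL" / "TG" tangle in the quote above can be sketched in a few lines. This is a minimal illustration, assuming "XL" is chosen as the one canonical label. The mapping table, the function name, and the choice of which silo uses which label are all hypothetical.

```python
# Every known spelling (lowercased) maps to one canonical label.
CANONICAL_SIZE = {
    "extra large": "XL",
    "xl": "XL",
    "tg": "XL",
}


def canonical_size(label):
    """Translate one silo's size label into the shared canonical label."""
    # Ignore case and stray whitespace before looking the label up.
    key = " ".join(label.strip().lower().split())
    try:
        return CANONICAL_SIZE[key]
    except KeyError:
        # Refuse to guess: an unknown label needs a human decision.
        raise ValueError(f"unrecognized size label: {label!r}") from None


# Hypothetical records from three systems that never talked to each other.
silo_records = [
    ("sales", "extra large"),
    ("inventory", "XL"),
    ("manufacturing", "TG"),
]
merged = {canonical_size(label) for _, label in silo_records}
```

Here `merged` ends up holding the single label "XL". The code is the easy part; the hard part is getting every silo's owner to agree on what goes in the mapping table.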


Something to think about:

Your collaborator in Transylvania wants to share his information with you, but you've never heard of the DBMS he uses. Your collaborator in Dog Patch, USA, can only accept data specially formatted for his home-made DBMS. Your collaborator down the hall wants to give you data, but she's using an older version of your DBMS, so the data will need reformatting before you can load it. Plus, if your System Administrator shifts everyone in your company to a new operating system next year, you will have to jump to a different DBMS. Sound familiar?

Don't Believe It? Terry Brooks wrote a DBMS for himself in 1992 in Turbo Prolog, and he uses it every day to keep track of book reviews for JASIST. He no longer has the source code for his DBMS (don't bother to ask how this happened), and furthermore, he no longer even owns a working copy of Turbo Prolog. (And that's another mystery, but even if I were to find the floppies, my current office machine doesn't have a floppy drive.) Question: Is Terry Brooks living on the edge?