Storing Lots of Stuff: Databases and Data Silos
The essence of information technology, information management, information retrieval, information processing, etc., is having information easily available; hence the need for a device that can store vast amounts of information permanently and still retrieve any single piece of it quickly.
The very first production hard disk shipped as part of the IBM 305 RAMAC (Random Access Method of Accounting and Control), introduced on September 13, 1956. It stored 5 million characters on 50 disks, each 24 inches in diameter. The data transfer rate of the RAMAC was 8,800 characters per second.
Something to think about:
So what are you going to do? Just dump your data onto your hard drive the way you throw beer cans on the ground? If you are going to retrieve data, it will have to be organized, i.e., given an address where the hard drive can find it. [This is organization of data at the physical layer, i.e., addresses and pointers.]
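To make "given an address" concrete, here is a minimal sketch in Python. It assumes a made-up fixed-width record layout (12 bytes for the make, 8 for the color) and uses an in-memory buffer as a stand-in for the disk; the names `RECORD`, `write_records` and `read_record` are hypothetical, and no real drive or DBMS is claimed to work exactly this way.

```python
import io
import struct

# Hypothetical fixed-width layout: 12-byte make, 8-byte color (an assumption).
RECORD = struct.Struct("12s8s")

def write_records(f, cars):
    """Write each (make, color) pair as one fixed-width record."""
    for make, color in cars:
        f.write(RECORD.pack(make.encode("ascii"), color.encode("ascii")))

def read_record(f, n):
    """Fetch record n by computing its byte address and seeking straight to it."""
    f.seek(n * RECORD.size)
    make, color = RECORD.unpack(f.read(RECORD.size))
    # struct pads short strings with null bytes; strip them before decoding.
    return make.rstrip(b"\0").decode("ascii"), color.rstrip(b"\0").decode("ascii")

storage = io.BytesIO()  # stands in for a disk file
write_records(storage, [("Chevrolet", "blue"), ("Ford", "red"), ("Chevrolet", "black")])
third = read_record(storage, 2)  # direct access: no need to read records 0 and 1
```

Note what the address buys you and what it does not: record number 2 comes back immediately, but nothing here can answer a question about the contents of the records without reading every one of them.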
But if you were really clever, you would write a computer program that enhanced your power to retrieve data, e.g., "Give me all the Chevrolets that are blue in color and that have black leather interiors", etc. [This implies that you can point to "all the Chevrolets" and then point to "all the Chevrolets AND color is blue", etc. Hence your data must have some logical organization. Jargon alert! Your data has a "schema".] Let's call such a program a database management program.
So What Metaphor Will You Use To Model Your Data?
More Jargon Alert! Your DBMS has a Schema Level
In 1961, Charles Bachman at General Electric Co. developed the first successful database management system. Bachman's Integrated Data Store (IDS) featured data schemas and logging. But it ran only on GE mainframes, could use only a single file for the database, and all generation of data tables had to be hand-coded.
Something to read: "The Programmer as Navigator" by Charles Bachman, 1973 ACM Turing Award Lecture
The Table Metaphor: Relational Database Systems
In the early 1970s, databases used either a hierarchical structure or a complex set of pointers to the physical location of data. This strategy was efficient for the specific queries it was built for, but new queries required complex reprogramming, and adding new types of data forced a total redesign of the database. During 1970-72 Ted Codd proposed the relational model, which disconnects the schema (logical organization) of a database from the physical storage methods. Codd's idea was that relationships between data items should be based on the items' values, and not on the items' physical addresses.
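Here is a rough sketch of what "relationships based on values" looks like in practice, using the blue-Chevrolet request from earlier. It uses Python's standard `sqlite3` module with an in-memory database; the table names, column names and sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE cars (vin TEXT PRIMARY KEY, make TEXT, color TEXT, interior TEXT);
    CREATE TABLE owners (name TEXT, vin TEXT);
""")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?, ?)", [
    ("V1", "Chevrolet", "blue", "black leather"),
    ("V2", "Chevrolet", "red", "black leather"),
    ("V3", "Ford", "blue", "gray cloth"),
])
conn.executemany("INSERT INTO owners VALUES (?, ?)", [("Ada", "V1"), ("Bob", "V3")])

# "All the Chevrolets that are blue and have black leather interiors":
# the query names values, never a disk address.
blue_chevys = conn.execute(
    "SELECT vin FROM cars WHERE make = ? AND color = ? AND interior = ?",
    ("Chevrolet", "blue", "black leather"),
).fetchall()

# Two tables are related only because they share a value (the vin).
owners_of_blue = conn.execute(
    "SELECT owners.name FROM owners JOIN cars ON owners.vin = cars.vin "
    "WHERE cars.color = 'blue' ORDER BY owners.name"
).fetchall()
```

Neither query says where a row lives on disk. A new question means writing a new query, not reprogramming pointers.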
Relational Tables Conform To The "Normalization Process"
Significance check for the inattentive: Normalization means that the "construction of information" becomes a science! (or perhaps, an engineering science) (or at least, a mechanical process) (Ok, so now we have some "rules") (Well, the rules that engineers would use) (But, you have to admit, some rules are better than no rules) (Well, maybe)
But would you let an engineer design your information?
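For a taste of the "mechanical process", here is a small sketch of one normalization step in plain Python. The rows and field names are invented, and this shows only a single rule (a fact that depends on the owner, not on the car, moves to its own table), not a complete normalization procedure.

```python
# Flat, unnormalized rows: the owner's city is repeated on every car row.
flat = [
    {"vin": "V1", "make": "Chevrolet", "owner": "Ada", "owner_city": "Seattle"},
    {"vin": "V2", "make": "Ford", "owner": "Ada", "owner_city": "Seattle"},
    {"vin": "V3", "make": "Chevrolet", "owner": "Bob", "owner_city": "Tacoma"},
]

def normalize(rows):
    """Split rows so each fact is stored once: owner -> city, vin -> (make, owner)."""
    owners = {}
    cars = {}
    for r in rows:
        owners[r["owner"]] = r["owner_city"]   # city is a fact about the owner
        cars[r["vin"]] = (r["make"], r["owner"])  # make and owner are facts about the car
    return owners, cars

owners, cars = normalize(flat)
owners["Ada"] = "Portland"  # one update covers both of Ada's cars
```

In the flat version, Ada's move would have to be corrected on two rows, and missing one of them leaves the data contradicting itself. That is the kind of trouble the rules are meant to prevent.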
Database Has Been An Enormously Successful Computer Application
In 2000, Oracle owned 33.8% of the worldwide database market while IBM commanded 30.1%, according to Gartner Dataquest. Microsoft was a distant third at 14.9%, according to the firm, though strong sales of its SQL Server database product have been hurting Oracle as well.
"Oracle Finds Competition at the Core," TheStreet.com, 10/26/2001
Oracle chief executive Larry Ellison--Gates’ longtime nemesis in the ego-driven technology industry--now holds the top spot. Based on the closing prices of Oracle and Microsoft stock, Ellison is worth $53 billion, compared to Gates’ $51.75 billion.
"Gates loses title as world's richest man," News.com, April 28, 2000
Database Has Been Facilitated By Ever Cheaper Memory
Consider the price of a SECOND hard drive.
Finally Everyone Can Have His/Her Own Database (Or Six Of Them)
But is more necessarily better? Can there be too much database? Consider the Size Effect when you can't remember all the stuff that's supposed to be in this database, or that database, or what it was called, and when you (or your assistant) put it in there, and maybe it's under another name, etc., etc. Have you considered getting an assistant for your assistant? How about a database for your assistant to keep track of your database?
Rummaging Around In A Database (Or Six Of Them)
Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
Principles of Data Mining, David Hand et al., 2001
Example of data mining: Don R. Swanson, "Undiscovered Public Knowledge"
Example of data mining: Bill James, "Sabermetrics"
Something to think about:
If you are forced to rummage around in one (or six) databases, is this an indication of your failure to plan your database, or a grand opportunity to discover the non-obvious? What's your opinion?
Parallel question that is closer to home: Your clothes closet: does it represent your failure to plan your wardrobe, or a grand opportunity to "rediscover" things you've forgotten you own? Your garage, etc., etc.
Are the preceding examples of Swanson and James extraordinary, or are they the sort of thing that could happen every day if we spent our time rummaging around in our databases? What's your opinion?
If you are a rummager, why not save time by planning your database in the first place?
In A Distributed World, Database Becomes Data Silo
"Redundant data, wrong data, missing data, miscoded data. Every company has some of each, probably residing in IT nooks that don't communicate much."
"For example, different sales, inventory or manufacturing systems at a clothing retailer might track the same item by different names. A central database - if there is one - might include "extra large," "XL" and "TG" (for the French term tres grande). But they all refer to the same thing."
"And then there's the attic problem familiar to most homeowners: Toss in enough boxes of seasonal clothes, holiday trim, family history documents and other important items, and soon there's a stored mess that's too big to manage. That can happen at companies, too. Multiple operating units, manufacturing plants and other facilities may all run different vendors' applications to do sales, human resources and other tasks. That mix of disparate data makes for a mass of unsorted and unreconciled information."
"Merging Data Silos," Computerworld, April 15, 2002
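The "extra large" / "XL" / "TG" problem in the Computerworld passage can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the mapping table, the function names, the three pretend silos and their stock counts. A real merge would also have to decide who maintains the mapping and what to do with labels nobody recognizes.

```python
# Hypothetical mapping from each silo's label to one canonical code.
CANONICAL_SIZE = {"extra large": "XL", "xl": "XL", "tg": "XL"}

def canonical_size(label):
    """Return the shared code for a size label, or None if it is unknown."""
    return CANONICAL_SIZE.get(label.strip().lower())

def merge_counts(silos):
    """Sum stock counts across silos after mapping labels to canonical codes."""
    totals = {}
    unknown = []  # labels no one has reconciled yet
    for silo in silos:
        for label, count in silo.items():
            code = canonical_size(label)
            if code is None:
                unknown.append(label)
                continue
            totals[code] = totals.get(code, 0) + count
    return totals, unknown

# Three pretend systems tracking the same item under different names.
totals, unknown = merge_counts([{"extra large": 4}, {"XL": 7}, {"TG": 2, "jumbo": 1}])
```

The sketch reports unrecognized labels rather than guessing at them, since a silent wrong match just adds to the "wrong data" pile.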
Something to think about:
Your collaborator in Transylvania wants to share his information with you, but you've never heard of the DBMS he uses. Your collaborator in Dog Patch, USA, can only accept data specially formatted for his home-made DBMS. Your collaborator down the hall wants to give you data, but she's using an older version of your DBMS, so the data will need reformatting before you can load it. Plus, if your System Administrator shifts everyone in your company to a new operating system next year, you will have to jump to a different DBMS. Sound familiar?
Don't Believe It?
Terry Brooks wrote a DBMS for himself in 1992 in Turbo Prolog and he uses it every day to keep track of book reviews for JASIST. He no longer has the source code for his DBMS (don't bother to ask how this happened), and furthermore, he no longer even owns a working copy of Turbo Prolog. (And that's another mystery, but even if I were to find the floppies, my current office machine doesn't have a floppy drive.)
Question: Is Terry Brooks living on the edge?