Tuesday, 17 July 2012

Edgar Frank Codd and his Rules

Theories of how we should organize databases are thin on the ground. The one exception is the work of E.F. Codd, the originator of the commandment-like “Codd’s Rules”.

Database is the strangest of computer applications. It is vital - every computer manages some sort of database, it is difficult - real world database quickly exceeds any reasonable complexity - and it is lucrative - more programmers are occupied with database work than any other.
Strangely, though, theories of how we should organize databases have been very thin on the ground. For such an important subject it has been relatively neglected.
The one exception is perhaps the work of E.F. Codd. Immortalized to a generation or two of database creators, he is revered as the originator of the commandment-like “Codd’s Rules”.

Edgar Frank Codd (1923-2003)

Codd was a mathematician and this is an important fact that explains much about his work.
He was born in the UK and studied math at Oxford before moving to the States in 1949 to do some work at the University of Tennessee. Before very long he was seduced by the new subject of computing and was working for IBM as a mathematician and programmer.
The first machine that he worked on was the SSEC - one of the early valve machines. He moved on to help with the logic design of the IBM 701, one of the first modern computers, and then the 702. At this point (1953) he took a four-year leave and went to study for his PhD in Canada, at the University of Michigan.
Codd’s work at Michigan was on a relatively new area, Cellular Automata. A cellular automaton is a collection of relatively simple computing units, all of which run the same program and interact with their neighbors on a rectangular grid. This was highly theoretical stuff at the time but it was connected with areas of AI and Artificial Life that are now very important. His thesis was on self reproduction of cellular automata.
Later the work was published as a book, “Cellular Automata”, in 1968. Many theoretical computer scientists know this book but often never suspect that the Codd who wrote it is the same one who did the eminently practical database work!

Database relationships

However Codd’s main work began in 1969 when he became interested in database software.
Being a mathematician he saw database in a very different way to the average programmer. A programmer tends to see the physical way that data is coded and stored. He sees records composed of fields of different types and the whole lot is stored as a file with some type of structure. The operations that are performed on a database are seen as a general programming problem. To work with a database you write a program that reads the file and works with it in any way that the user/customer considers necessary. If you need to access two records from different parts of the file at that same time it is up to the programmer to get inside and dig around the file in anyway they like.
This is a recipe for complexity to get out of hand! You can well imagine that the database file could be come altered in ways that are incorrect - even to the extent of it not being a valid database file any longer!
What Codd did was to see that the simplest example of a database was nothing more than a table of values. The rows of the table are records and the columns fields and the table itself can be implemented as a file. This change in viewpoint is, however, more than just a relabeling of the parts. A table, or set of multiple values in mathematical terms, is known as a “relationship”.
For example, suppose I ask you to tabulate the relationship x<y for the integers 1 to 5. What you end up with is a table of pairs - or in general “tuples”:
  x  y
  1  2
  1  3
  1  4
  1  5
  2  3
  2  4
  2  5
  3  4
and so on…
Any relationship, not just the nice neat regular ones like =, <>, < etc., can be represented by a set of tuples and this is exactly the way that Codd saw a database table and hence the name of the new approach “relational database”.

Relational not related

It is amusing to note that over the years lots of proponents of Codd’s methods have misunderstood the use of the term “relational”, thinking that it comes from the way separate tables are “related” in some way.
If you know some math then the definition “a relation is a subset of the Cartesian product of a number of sets”, will say it all.

Of course just calling a database a table or relation isn’t enough to change the way programmers work. What Codd did next is to work out a theory of how traditional database operations could be implemented using an algebra of sets rather than programs. The first thing is that becomes apparent is that, as a relation is a set, you can’t have duplicates.
Also to make relations more like real databases you need to add the idea of key fields, i.e. a column in the table that identifies each record uniquely that is a given key value belongs to one and only one record.
According to Codd a relational database consists of more than one relation or table and our familiar notion of “relating” tables comes from the idea that one table has a key field that is a field in another table - this is known as a “foreign key”.
For example, Name might be a field in an invoice database and a key field in an address database. Records in each table are loosely speaking “connected” by the foreign key.
As an example of a database operation, consider the union of two tables. A union of two tables is just the table that results if you write out the rows of the second table after the rows of the first. Notice that you can form the union between two tables with different fields. Similarly the intersection between two tables is just the table that consists of the rows that they have in common.
You can see that you could carry on like this and define operations that combine one table with another to give you a third. It is this way of thinking that changes the “For first record To last Do xxx” way of processing a database. Now we are thinking non-procedural A+B=C and this is what the relational model and algebra is all about.
This description would be incomplete without mentioning the operation that really makes real world database manipulations possible - the join. If you have two database tables A and B to form the join A*B you first create the Cartesian product A X B. The Cartesian product is just the table which has the columns of A and B and lines that are created by taking every possible combination of a line from A and a line from B. For example, if A is:
 x     y
 tom   29
 fred  30
 codd  32
and B is:
 x     z
 tom   france
 codd  spain
then the Cartesian product of the two tables A X B is:
 x     x’    y   z  
 tom   tom   29  france
 tom   codd  29  spain
 fred  tom   30  france
 fred  codd  30  spain
 codd  tom   32  france
 codd  codd  32  spain
You should be able to see how this works but notice that the field x from table B has been renamed x’ to make it distinct.
The second step in forming the join is to remove all of the lines that do not have identical values in the common fields, i.e. x and x’. So A*B is
  x    x’   y   z
  tom  tom  29  france
  codd codd 32  spain
If you also take out the duplicate x’ column then you get what has come to be called an “equi-join” if you don't take out the duplicates you get a “natural-join” or just a “join”.
By using joins it is possible to break a database down into a number of smaller tables which can be put back together. Exactly how you break a database down is a question of which “normal form” you opt to use and here we start getting into the depths of database design.
Needless to say Codd’s approach was seen as very attractive - although not at first by his employer IBM. In 1982 IBM finally caught on and announced SQL and their new database, DB2, both based on Codd’s relational theories. The whole subject became so heated and confused that in 1985 he published his (in)famous 12 rules which were the principles that a relational database should obey. Interestingly Codd’s rules have become a stick that the “database thought police” use to beat the innocent programmers rather than a guiding light - for this reason you will find them reproduced on the next page.
His book, The Relational Model for Data Base Management" covers the practical aspects of the design of relational databases and defines the twelve rules and the systems that need to be followed in order to be described as truly relational with the motivation behind these rules in over 500 pages.
Codd attempted to remove the “procedural” approach from database and many think that this isn’t possible using a theory based on relations. Even more radical, some go so far as to think that it isn’t desirable and the mathematician’s phobia of procedure, i.e. dynamic processes, shouldn’t be foisted onto the programmer.
In 1981 Codd was awarded the Turing Award and in 1982 the ACM chose his 1970 paper as one of the 25 most important contributions to the industry. Whatever the true and long term value of the relational model, Codd never gave up the 12-rule approach and defined 12 rules for On-line Analytical Processing (OLAP)! He retired from IBM in 1984 and set up two companies to provide consultancy to the database world.
Of the 12 rules only the first 6 have to be satisfied for a database to be called “relational” but there is a rule 0 which has to be obeyed - perhaps they should be called Codd’s 13 rules?

Rule 0: Relational Database Management
For any system that is advertised as, or claimed to be, a relational database management system that system must be able to manage database entirely through its relational capabilities.

Rule 1: Representation of information
All information in a relational database is represented explicitly at the logical level and in exactly one way - by values in tables.

Rule 2: Guaranteed logical accessibility
Each and every datum in a relational database is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

Rule 3: Systematic representation of missing information
Null values (distinct from the empty character string or a string of blank characters and distinct from zero or any other number) are supported in fully relational database management systems for representing missing information and inapplicable information in a systematic way, independent of the data type.

Rule 4: Dynamic online catalog
The database description is represented at the local level in the same way as ordinary data, so that authorized users can apply the same relational language to its interrogation as they apply to the regular data.

Rule 5: Comprehensive data sub-language
A relation system may support several languages and various modes of terminal use (for example, the “fill in the blanks mode”). There must be, however, at least one language whose statements are expressible, per some well defined syntax, as character strings, and that is comprehensive in supporting all of the following items:
  • Data definition
  • View definition
  • Data manipulation
  • Integrity constraints
  • Authorization
  • Transaction boundaries (begin, commit, rollback)
Rule 6: Updatable views
All views that are theoretically updatable are also updatable by the system.

Rule 7: High level insert, update and delete
The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update and the deletion of data.

Rule 8: Physical data independence
Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods.

Rule 9: Logical data independence
Application program and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables.

Rule 10: Integrity independence
Integrity constraints specific to a particular database must be definable in the relational data sub language and storage in the catalog, not in the applications program.

Rule 11: Distributed independence
Whether or not a system supports database distribution, it must have a data sublanguage that can support distributed database without impairing application programs or terminal activities.

Rule 12: The nonsubversion rule
If a relational system has a low-level (single-record-at-a-time) language, that low-level cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher level relational language (multiple-records-at-a-time)

No comments:

Post a Comment