Database normalization
Database normalization is the process of eliminating duplicated data in a relational database. The key idea is to store data in one location, and provide links to it wherever needed. Well-normalized databases have a design that reflects the true dependencies between tracked quantities—allowing quick updates to data with little risk of introducing inconsistencies.
There are formal methods for quantifying "how normalized" a relational database is, and these classifications are called normal forms (or NF). Though algorithms exist for converting a given database between forms, these may require splitting existing tables into ones that are re-joined each time a query is issued. This can lead to poor performance, so intentional denormalization is sometimes preferred—especially in systems that do not use the relational model (such as OLAP).
History
Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form in his paper A Relational Model of Data for Large Shared Data Banks Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387[1].
Codd stated:
- There is, in fact, a very simple elimination* procedure which we shall call normalization. Through decomposition nonsimple domains are replaced by "domains whose elements are atomic (nondecomposable) values."
- * His term eliminate is misleading, as nothing is "lost" in normalization.
In his paper, Codd used the term "nonsimple" domains to describe a heterogeneous data structure, but later researchers would refer to such a structure as an abstract data type.
Normal Forms
One can only describe a database as having a normal form if the relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood, but most database designers model the relationships in terms of an "idealized schema". (The mathematical supports come back into play in proofs regarding the process of transforming from one form to another.)
Edgar Frank Codd originally established three normal forms: 1NF, 2NF and 3NF. There are now others that are generally accepted, but 3NF is widely considered to be sufficient for many practical applications. Most tables when reaching 3NF are also in BCNF. 4NF and 5NF are further extensions, and 6NF only applies to temporal databases. Normalizing beyond 3NF can be tricky with current SQL technology as of 2005, but a non-fully normalized database may be vulnerable to data corruption (referred to as update anomalies). Full normalization, even when not fully implemented in the target technology, is considered a good exercise to help discover all potential internal database consistency problems.
Non First Normal Form (NF²)
The non first normal form extends the first normal form as it "allows sets and sets of sets to be attribute domains" (Schek 1982). It is therefore non first normal as the first normal form states that attribute domains must be atomic. This extension introduces hierarchies in relations.
Consider the following table:
Person | Favorite Colors
| |
---|---|---|
Bob | blue red | |
Jane | green yellow red |
Assume a Person has several favorite colors. Obviously favorite colors consist of a set of colors which is modelled by the given table.
To transform this NF² table into a 1NF an "unnest" operator is required which extends the relational algebra of the higher normal forms. The reverse operator is called "nest" which is not always the mathematical inverse of "unnest", although "unnest" is the mathematical inverse to "nest". Another constraint is required for the operators to be bijective which is covered by the Partitioned Normal Form (PNF)
First normal form
To understand first normal form (1NF), consider these two examples of things you might know:
- "What is your favorite color?"
- "What food will you not eat?"
A difference between these two questions is that, while you can have only one favorite color, there may be many foods you do not eat.
Data that has a single value such as "person's favorite color" is inherently in first normal form. Such data can be stored in a single table with a simple key/value combination. Data that has multiple values, however, must be stored differently.
Codd argued that there was one best way to keep multi-valued data such as "food a person will not eat." He suggested that the database should contain a separate table for the multi-value data and then store each food as a separate row in that table. Known as first normal form, this approach has been a standard for decades. An example of data in proper first normal form follows:
Person | Favorite Color |
---|---|
Bob | blue |
Jane | green |
Person | Foods Not Eaten |
---|---|
Bob | okra |
Bob | brussel sprouts |
Jane | peas |
Database systems when Codd was writing were relatively primitive. Newer databases now support abstract data types and other data-storage methods that usually offer better performance for the management of such data. However, such methods could be considered denormalization, as they often require the hierarchical model, which was abandoned mostly due to its inflexibility, to be reintroduced into the architecture of the system.
It is almost never a good idea to store data like this:
Person | Favorite Color | Foods Not Eaten 1 | Foods Not Eaten 2 | Foods Not Eaten 3 |
---|
This schema is not in the 1NF, and does not accurately represent the relationship as it exists in the real world. Even if the database designer believes that users will not need to store more than three foods not eaten, this system makes it difficult to add a fourth. For instance:
Person | Website | AIM | Yahoo! Screenname | ICQ |
---|
This schema is not in the 1NF since people can have more than one AIM screenname. Most websites choose to ignore this, however, and only allow a user to register one screenname with their account.
Second normal form
Second normal form (2NF) prescribes full functional dependency on the primary key. It most commonly applies to tables that have composite primary keys, where two or more attributes comprise the primary key. It requires that there are no non-trivial functional dependencies of a non-key attribute on a part (subset) of a candidate key. A table is said to be in the 2NF if and only if it is in the 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e. not partially dependent on candidate key).
Consider a table named part
describing machine parts with the following attributes:
PART_NUMBER (PRIMARY KEY) SUPPLIER_NAME (PRIMARY KEY) PRICE SUPPLIER_ADDRESS
The PART_NUMBER
and SUPPLIER_NAME
form the composite primary key, because the same part can be supplied by multiple suppliers. In this example, PRICE
is correctly placed on the part
table, because it is fully dependent on the primary key i.e. different suppliers will charge a different price for the same part.
SUPPLIER_ADDRESS
, however, is only dependent on the SUPPLIER_NAME
, and therefore this table breaks 2NF. This attribute should be placed on a second table named supplier
comprising:
SUPPLIER_NAME (PRIMARY KEY) SUPPLIER_ADDRESS
In order to find if a table is in 2NF, ask whether any of the non-key attributes of the table could be derived from a subset of the composite key, rather than the whole composite key. If the answer is yes, it's not in 2NF. This is solved sometimes by using a correlation file, such as the supplier
table above.
Easily understood definition: A unique key. A column of values that uniquely identify each row in each table.
Third normal form
- See main article Third normal form
Third normal form (3NF) requires that the table is in 2NF, and that there are no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. A table is in 3NF if none of the non-primary key attributes is a fact about any other non-primary key attribute. In summary, all non-key attributes are mutually independent (i.e. there should not be transitive dependencies).
Consider a table that defines a machine part as having the following attributes.
PART_NUMBER (PRIMARY KEY) MANUFACTURER_NAME MANUFACTURER_ADDRESS
In this case, the manufacturer address does not belong on this table, because it is a fact about the manufacturer of the part, rather than the part itself. MANUFACTURER_ADDRESS
should therefore be moved into a separate table with the attributes:
MANUFACTURER_NAME (PRIMARY KEY) MANUFACTURER_ADDRESS
...and the original table should be redefined as:
PART_NUMBER (PRIMARY KEY) MANUFACTURER_NAME
The problem with a table not being in 3NF is that for every MANUFACTURER_NAME we have to maintain a redundant MANUFACTURER_ADDRESS (i.e. an address for each part_number, rather than one for each MANUFACTURER_NAME).
Easily understood definition: Ensures that each table contains unique data. In other words, it ensures that a table of customer identification data does not contain order data, and so on.
Boyce-Codd normal form
Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys.
Example of a relation that is in 3NF form but not in BCNF:
Relation: {A,B,C,D} AB is a candidate key, BC is candidate key and A->C.
Fourth normal form
Fourth normal form (or 4NF) requires that there are no non-trivial multi-valued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.
Consider a case where we need the record of an employee with any professional qualifications they have gained, and internal training courses they have been on (an employee may have none, one or multiple of each of these). One way to record this information would be to create a relation as follows:
EMPLOYEE_ID QUALIFICATION_ID TRAINING_COURSE_ID
The problem with this design is that we might have to enter employee's qualification id more than once (or leave the qualification field blank) if they have been on more than one training course, which causes ambiguity on data storage and retrieval. Therefore this table is not in 4NF.
There are actually two distinct many-to-many relationships here: one between Employee and Qualification ID, and one between Employee and Training Course ID. So two separate tables should be created to capture these.
employee_qualification table: EMPLOYEE_ID QUALIFICATION_ID employee_training_course table: EMPLOYEE_ID TRAINING_COURSE_ID
This example has assumed that the two fields being recorded for the employees are independent; contrast a different case where for example the information to be recorded is the academic qualifications possessed by each employee, and the academic institution where the qualification was achieved. In this case the two values are not independent - it is necessary to record both the qualification and the institution (as a pair of values) in each case, and so a relation such as the following would be correct:
EMPLOYEE_ID DEGREE_ID UNIVERSITY_ID
And this would require no changes to fit the fourth normal form requirements.
Ronald Fagin demonstrated that it is always possible to achieve 4NF (but not always desirable). Rissanen's theorem is also applicable on multi-valued dependencies.
Fifth normal form
Fifth normal form (5NF and also PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.
Fifth Normal form Example
(Adapted from "A Simple Guide to Five Normal Forms in Relational Database Theory" see Further Reading )
Consider the following example:
Musician | Instrument | Genre |
---|---|---|
James | Piano | Classical |
James | Trumpet | Classical |
Kate | Drums | Jazz |
Kate | Piano | Jazz |
Kate | Trumpet | Jazz |
Kate | Clarinet | Jazz |
Lois | Saxophone | Jazz |
Lois | Piano | Classical |
Lois | Violin | Classical |
Lois | Guitar | Rock |
Here we have a list of musicians and the instruments they play under a certain genre. We can take as a model that a musician plays a certain kind of music in a certain genre only if the following three conditions hold:
- The musician plays the instrument.
- The musician plays the genre.
- The instrument is part of that genre.
With these constraints it is possible to split the relation into three parts.
|
|
|
|
|
Joining these three tables together will return the original relation.
Note how this setup helps to remove redundancy. Suppose that Kate learns Classical. In the previous setup we would have to add four new entries since each of the four instruments that Kate plays (Drums, Piano, Trumpet, and Clarinet) are all classical instruments, while with the new setup we only need to make one addition.
Kent's paper provides more details
Domain/key normal form
Domain/key normal form (or DKNF) requires that each key uniquely identifies each row in a table. A domain is the set of permissible values for an attribute. By enforcing key and domain restrictions, the database is assured of being freed from modification anomalies.
While sometimes called the 6NF, the DKNF should not be considered together with the seven other normal forms (1–6 and Boyce-Codd), because contrary to them it is not always achievable; furthermore, tables in the real 6NF are not always in the DKNF.
Sixth normal form
This normal form was, as of 2005, only recently conjectured: the sixth normal form (6NF) was only defined when extending the relational model to take into account the temporal dimension. Unfortunately, most current SQL technologies as of 2005 do not take into account this work, and most temporal extensions to SQL are not relational.
Further Reading
- Rules Of Data Normalization
- Date, C. J., & Lorentzos, N., & Darwen, H. (2002). Temporal Data & the Relational Model (1st ed.). Morgan Kaufmann. ISBN 1-55860-855-9.
- Date, C. J. (1999), An Introduction to Database Systems (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
- Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of the ACM, vol. 26, pp. 120-125 (
- Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings
- H.-J. Schek, P.Pistor Data Structures for an Integrated Data Base Management and Information Retrival System