Database normalization

Database normalization is the process of eliminating duplicated data in a relational database. The key idea is to store data in one location, and provide links to it wherever needed. Well-normalized databases have a design that reflects the true dependencies between tracked quantities—allowing quick updates to data with little risk of introducing inconsistencies.

There are formal methods for quantifying "how normalized" a relational database is, and these classifications are called normal forms (or NF). Though algorithms exist for converting a given database between forms, these may require splitting existing tables into ones that are re-joined each time a query is issued. This can lead to poor performance, so intentional denormalization is sometimes preferred—especially in systems that do not use the relational model (such as OLAP).

History

Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form in his paper A Relational Model of Data for Large Shared Data Banks Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387[1].

Codd stated:

There is, in fact, a very simple elimination* procedure which we shall call normalization. Through decomposition nonsimple domains are replaced by "domains whose elements are atomic (nondecomposable) values."

* His term eliminate is misleading, as nothing is "lost" in normalization.

In his paper, Codd used the term "nonsimple" domains to describe a heterogeneous data structure, but later researchers would refer to such a structure as an abstract data type.

Normal Forms

One can only describe a database as having a normal form if the relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood, but most database designers model the relationships in terms of an "idealized schema". (The mathematical supports come back into play in proofs regarding the process of transforming from one form to another.)

Edgar Frank Codd originally established three normal forms: 1NF, 2NF and 3NF. There are now others that are generally accepted, but 3NF is widely considered to be sufficient for many practical applications. Most tables when reaching 3NF are also in BCNF. 4NF and 5NF are further extensions, and 6NF only applies to temporal databases. Normalizing beyond 3NF can be tricky with current SQL technology as of 2005, but a non-fully normalized database may be vulnerable to data corruption (referred to as update anomalies). Full normalization, even when not fully implemented in the target technology, is considered a good exercise to help discover all potential internal database consistency problems.

Non First Normal Form (NF²)

The non first normal form extends the first normal form as it "allows sets and sets of sets to be attribute domains" (Schek 1982). It is therefore non first normal as the first normal form states that attribute domains must be atomic. This extension introduces hierarchies in relations.

Consider the following table:

Non First Normal Form

Person

Favorite Colors

Color

Bob

blue
red

Jane

green
yellow
red

Assume a Person has several favorite colors. Obviously favorite colors consist of a set of colors which is modelled by the given table.

To transform this NF² table into a 1NF an "unnest" operator is required which extends the relational algebra of the higher normal forms. The reverse operator is called "nest" which is not always the mathematical inverse of "unnest", although "unnest" is the mathematical inverse to "nest". Another constraint is required for the operators to be bijective which is covered by the Partitioned Normal Form (PNF)

First normal form

To understand first normal form (1NF), consider these two examples of things you might know:

"What is your favorite color?"

"What food will you not eat?"

A difference between these two questions is that, while you can have only one favorite color, there may be many foods you do not eat.

Data that has a single value such as "person's favorite color" is inherently in first normal form. Such data can be stored in a single table with a simple key/value combination. Data that has multiple values, however, must be stored differently.

Codd argued that there was one best way to keep multi-valued data such as "food a person will not eat." He suggested that the database should contain a separate table for the multi-value data and then store each food as a separate row in that table. Known as first normal form, this approach has been a standard for decades. An example of data in proper first normal form follows:

Table 1
Person	Favorite Color
Bob	blue
Jane	green

Table 2
Person	Foods Not Eaten
Bob	okra
Bob	brussel sprouts
Jane	peas

Database systems when Codd was writing were relatively primitive. Newer databases now support abstract data types and other data-storage methods that usually offer better performance for the management of such data. However, such methods could be considered denormalization, as they often require the hierarchical model, which was abandoned mostly due to its inflexibility, to be reintroduced into the architecture of the system.

It is almost never a good idea to store data like this:

Table 3
Person	Favorite Color	Foods Not Eaten 1	Foods Not Eaten 2	Foods Not Eaten 3

This schema is not in the 1NF, and does not accurately represent the relationship as it exists in the real world. Even if the database designer believes that users will not need to store more than three foods not eaten, this system makes it difficult to add a fourth. For instance:

Table 4
Person	Website	AIM	Yahoo! Screenname	ICQ

This schema is not in the 1NF since people can have more than one AIM screenname. Most websites choose to ignore this, however, and only allow a user to register one screenname with their account.

Second normal form

Second normal form (2NF) prescribes full functional dependency on the primary key. It most commonly applies to tables that have composite primary keys, where two or more attributes comprise the primary key. It requires that there are no non-trivial functional dependencies of a non-key attribute on a part (subset) of a candidate key. A table is said to be in the 2NF if and only if it is in the 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e. not partially dependent on candidate key).

Consider a table named part describing machine parts with the following attributes:

PART_NUMBER (PRIMARY KEY)
SUPPLIER_NAME (PRIMARY KEY)
PRICE
SUPPLIER_ADDRESS

The PART_NUMBER and SUPPLIER_NAME form the composite primary key, because the same part can be supplied by multiple suppliers. In this example, PRICE is correctly placed on the part table, because it is fully dependent on the primary key i.e. different suppliers will charge a different price for the same part.

SUPPLIER_ADDRESS, however, is only dependent on the SUPPLIER_NAME, and therefore this table breaks 2NF. This attribute should be placed on a second table named supplier comprising:

SUPPLIER_NAME (PRIMARY KEY)
SUPPLIER_ADDRESS

In order to find if a table is in 2NF, ask whether any of the non-key attributes of the table could be derived from a subset of the composite key, rather than the whole composite key. If the answer is yes, it's not in 2NF. This is solved sometimes by using a correlation file, such as the supplier table above.

Easily understood definition: A unique key. A column of values that uniquely identify each row in each table.

Third normal form

See main article Third normal form

Third normal form (3NF) requires that the table is in 2NF, and that there are no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. A table is in 3NF if none of the non-primary key attributes is a fact about any other non-primary key attribute. In summary, all non-key attributes are mutually independent (i.e. there should not be transitive dependencies).

Consider a table that defines a machine part as having the following attributes.

PART_NUMBER (PRIMARY KEY)
MANUFACTURER_NAME
MANUFACTURER_ADDRESS

In this case, the manufacturer address does not belong on this table, because it is a fact about the manufacturer of the part, rather than the part itself. MANUFACTURER_ADDRESS should therefore be moved into a separate table with the attributes:

MANUFACTURER_NAME (PRIMARY KEY)
MANUFACTURER_ADDRESS

...and the original table should be redefined as:

PART_NUMBER (PRIMARY KEY)
MANUFACTURER_NAME

The problem with a table not being in 3NF is that for every MANUFACTURER_NAME we have to maintain a redundant MANUFACTURER_ADDRESS (i.e. an address for each part_number, rather than one for each MANUFACTURER_NAME).

Easily understood definition: Ensures that each table contains unique data. In other words, it ensures that a table of customer identification data does not contain order data, and so on.

Boyce-Codd normal form

Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys.

Example of a relation that is in 3NF form but not in BCNF:

Relation: {A,B,C,D} AB is a candidate key, BC is candidate key and A->C.

Fourth normal form

Fourth normal form (or 4NF) requires that there are no non-trivial multi-valued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.

Consider a case where we need the record of an employee with any professional qualifications they have gained, and internal training courses they have been on (an employee may have none, one or multiple of each of these). One way to record this information would be to create a relation as follows:

EMPLOYEE_ID
QUALIFICATION_ID
TRAINING_COURSE_ID

The problem with this design is that we might have to enter employee's qualification id more than once (or leave the qualification field blank) if they have been on more than one training course, which causes ambiguity on data storage and retrieval. Therefore this table is not in 4NF.

There are actually two distinct many-to-many relationships here: one between Employee and Qualification ID, and one between Employee and Training Course ID. So two separate tables should be created to capture these.

employee_qualification table:
EMPLOYEE_ID
QUALIFICATION_ID

employee_training_course table:
EMPLOYEE_ID
TRAINING_COURSE_ID

This example has assumed that the two fields being recorded for the employees are independent; contrast a different case where for example the information to be recorded is the academic qualifications possessed by each employee, and the academic institution where the qualification was achieved. In this case the two values are not independent - it is necessary to record both the qualification and the institution (as a pair of values) in each case, and so a relation such as the following would be correct:

EMPLOYEE_ID
DEGREE_ID
UNIVERSITY_ID

And this would require no changes to fit the fourth normal form requirements.

Ronald Fagin demonstrated that it is always possible to achieve 4NF (but not always desirable). Rissanen's theorem is also applicable on multi-valued dependencies.

Fifth normal form

Fifth normal form (5NF and also PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.

Fifth Normal form Example

(Adapted from "A Simple Guide to Five Normal Forms in Relational Database Theory" see Further Reading )

Consider the following example:


Musician	Instrument	Genre
James	Piano	Classical
James	Trumpet	Classical
Kate	Drums	Jazz
Kate	Piano	Jazz
Kate	Trumpet	Jazz
Kate	Clarinet	Jazz
Lois	Saxophone	Jazz
Lois	Piano	Classical
Lois	Violin	Classical
Lois	Guitar	Rock

Here we have a list of musicians and the instruments they play under a certain genre. We can take as a model that a musician plays a certain kind of music in a certain genre only if the following three conditions hold:

The musician plays the instrument.
The musician plays the genre.
The instrument is part of that genre.

With these constraints it is possible to split the relation into three parts.


Musician	Instrument
James	Piano
James	Trumpet
Kate	Drums
Kate	Piano
Kate	Trumpet
Kate	Clarinet
Lois	Saxophone
Lois	Piano
Lois	Violin
Lois	Guitar


Musician	Genre
James	Classical
Kate	Jazz
Lois	Jazz
Lois	Classical
Lois	Rock


Instrument	Genre
Piano	Classical
Piano	Jazz
Trumpet	Classical
Trumpet	Jazz
Drums	Classical
Drums	Jazz
Drums	Rock
Clarinet	Classical
Clarinet	Jazz
Saxophone	Jazz
Guitar	Rock

Joining these three tables together will return the original relation.

Note how this setup helps to remove redundancy. Suppose that Kate learns Classical. In the previous setup we would have to add four new entries since each of the four instruments that Kate plays (Drums, Piano, Trumpet, and Clarinet) are all classical instruments, while with the new setup we only need to make one addition.

Kent's paper provides more details

Domain/key normal form

Domain/key normal form (or DKNF) requires that each key uniquely identifies each row in a table. A domain is the set of permissible values for an attribute. By enforcing key and domain restrictions, the database is assured of being freed from modification anomalies.

While sometimes called the 6NF, the DKNF should not be considered together with the seven other normal forms (1–6 and Boyce-Codd), because contrary to them it is not always achievable; furthermore, tables in the real 6NF are not always in the DKNF.

Sixth normal form

This normal form was, as of 2005, only recently conjectured: the sixth normal form (6NF) was only defined when extending the relational model to take into account the temporal dimension. Unfortunately, most current SQL technologies as of 2005 do not take into account this work, and most temporal extensions to SQL are not relational.