Database normalization: Difference between revisions
No edit summary |
|||
Line 1: | Line 1: | ||
{{expert}} |
{{expert}} |
||
In [[relational database]]s, '''normalization''' is a process that eliminates [[redundancy]], organizes data efficiently, and reduces the potential for anomalies during [[CRUD (acronym)|data operations]] and improves data consistency. The formal classifications for quantifying "how normalized" a relational database are called '''normal forms''' (abbrev. '''NF'''). |
In [[relational database]]s, '''normalization''' is a process that eliminates [[redundancy]], organizes data efficiently, and reduces the potential for anomalies during [[CRUD (acronym)|data operations]] and improves data consistency. The formal classifications used for quantifying "how normalized" a relational database are called '''normal forms''' (abbrev. '''NF'''). |
||
A non-normalized database is vulnerable to data anomalies because it stores data redundantly. If data is stored in two locations, but later is updated in only one of the locations, then the data is inconsistent; this is referred to as an "update anomaly". A normalized database stores non-[[Primary key|primary key]] data in only one location. |
A non-normalized database is vulnerable to data anomalies because it stores data redundantly. If data is stored in two locations, but later is updated in only one of the locations, then the data is inconsistent; this is referred to as an "update anomaly". A normalized database stores non-[[Primary key|primary key]] data in only one location. |
Revision as of 09:24, 30 May 2006
This article needs attention from an expert on the subject. Please add a reason or a talk parameter to this template to explain the issue with the article. |
In relational databases, normalization is a process that eliminates redundancy, organizes data efficiently, and reduces the potential for anomalies during data operations and improves data consistency. The formal classifications used for quantifying "how normalized" a relational database are called normal forms (abbrev. NF).
A non-normalized database is vulnerable to data anomalies because it stores data redundantly. If data is stored in two locations, but later is updated in only one of the locations, then the data is inconsistent; this is referred to as an "update anomaly". A normalized database stores non-primary key data in only one location.
Normalized databases have a design that reflects the true dependencies between tracked quantities, allowing quick updates to data with little risk of introducing inconsistencies. Instead of attempting to lump all information into one table, data is spread out logically into many tables.
Denormalization
Databases intended for Online Transaction Processing (OLTP) are normalized. OLTP Applications are characterized by a large number of small transactions like updating a sales record at a super market checkout counter. The expectation is to complete the transaction as quickly as possible. By contrast, databases intended for On Line Analytical Processing (OLAP) operations are primarily "read only" databases. OLAP applications tend to extract historical data, that has accumulated in the project for quite a long time. For such databases, redundant or "denormalized" data may facilitate Business Intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a consistent state. The normalized alternative to the star schema is the snowflake schema.
Also denormalization is often used to improve performance on smaller computers as in computerized cash-registers. Since these use the data for look-up only, no changes are to be made in the table (PLU) and a swift response is crucial. Therefore these computers use a "flat file", i.e. a denormalized database (or at least, a part of one).
History
Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form in his paper A Relational Model of Data for Large Shared Data Banks Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387[1].
Codd stated:
- There is, in fact, a very simple elimination* procedure which we shall call normalization. Through decomposition nonsimple domains are replaced by "domains whose elements are atomic (nondecomposable) values."
- * His term eliminate is misleading, as nothing is "lost" in normalization. He probably described it in a mathematical sense to mean elimination of complexity
In his paper, Codd used the term "nonsimple" domains to describe a heterogeneous data structure, but later researchers would refer to such a structure as an abstract data type.
Normal Forms
One can only describe a database as having a normal form if the relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood, but most database designers model the relationships in terms of an "idealized schema". (The mathematical support came back into play in proofs regarding the process of transforming from one form to another.)
Edgar Frank Codd originally established three normal forms:
- First normal form (1NF)
- Second normal form (2NF)
- Third normal form (3NF)
The first normal form basically states that an attribute can only store one value. The second and third normal forms deal with the relationship of non-key attributes to the primary key.
In practice, most applications in 3NF are fully normalized. However, later research identified potential update anomalies in 3NF databases. BCNF is one such further refinement.
The fourth and fifth normal forms (4NF and 5NF) deal specifically with the relationship of attributes comprising a multi-attribute primary key. Sixth normal form (6NF) only applies to temporal databases.
Non-First Normal Form (NF²)
The non-first normal form extends the first normal form as it "allows sets and sets of sets to be attribute domains" (Schek 1982). It is therefore non-first normal as the first normal form states that attribute domains must be atomic. This extension introduces hierarchies in relations.
Consider the following table:
Non-First Normal Form | |
Person | Favorite Colors |
---|---|
Bob | blue, red |
Jane | green, yellow, red |
Assume a Person has several favorite colors. Obviously favorite colors consist of a set of colors which is modelled by the given table.
To transform this NF² table into a 1NF an "unnest" operator is required which extends the relational algebra of the higher normal forms. The reverse operator is called "nest" which is not always the mathematical inverse of "unnest", although "unnest" is the mathematical inverse to "nest". Another constraint is required for the operators to be bijective which is covered by the Partitioned Normal Form (PNF)
First normal form
What is 1NF?
- The domain of attribute must include only atomic (simple, indivisible) values.
To understand first normal form (1NF), consider these two examples of things you might know:
- "What is your favorite color?"
- "What food will you not eat?"
A difference between these two questions is that, while you can have only one favorite color, there may be many foods you do not eat.
In "1NF"; every attribute in a relation must be atomic. That is to say that there can be no composite attributes in the relation. Data that has a single value such as "person's favorite color" is inherently in first normal form. Such data can be stored in a single table with a simple key/value combination. Data that has multiple values, however, must be stored differently.
Codd argued that there was one best way to keep multi-valued data such as "food a person will not eat." He suggested that the database should contain a separate table for the multi-value data and then store each food as a separate row in that table. Known as first normal form, this approach has been a standard for decades. An example of data in proper first normal form follows:
Person | Favorite Color |
---|---|
Bob | blue |
Jane | green |
Person | Foods Not Eaten |
---|---|
Bob | okra |
Bob | brussels sprouts |
Jane | peas |
Database systems when Codd was writing were relatively primitive. Newer databases now support abstract data types and other data-storage methods that usually offer better performance for the management of such data. However, such methods could be considered denormalization, as they often require the hierarchical model, which was abandoned mostly due to its inflexibility, to be reintroduced into the architecture of the system.
It is almost never a good idea to store data like this:
Person | Favorite Color | Foods Not Eaten 1 | Foods Not Eaten 2 | Foods Not Eaten 3 |
---|
This schema is not in the 1NF, and does not accurately represent the relationship as it exists in the real world. Even if the database designer believes that users will not need to store more than three foods not eaten, this system makes it difficult to add a fourth. For instance:
Person | Website | AIM | Yahoo! Screenname | ICQ |
---|
This schema is not in the 1NF since people can have more than one AIM screenname. Most websites choose to ignore this, however, and only allow a user to register one screenname with their account.
Second normal form
- A relation schema R is in 2NF if it is in 1NF and every non-prime attribute A in R is fully functionally dependent on primary key.
Second normal form (2NF) describes full functional dependency on the primary key. It most commonly applies to tables that have composite primary keys, where two or more attributes comprise the primary key. It requires that there are no non-trivial functional dependencies of a non-key attribute on a part (subset) of a candidate key. A table is said to be in the 2NF if and only if it is in the 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e. not partially dependent on candidate key). Thus, if all keys are singleton keys, the relation is guaranteed to be in at least 2NF.
Consider a table describing the source of machine parts, with the following attributes:
Part's ID (primary key) | Supplier's ID (primary key) | Supplier's name | Price | Supplier's address |
---|---|---|---|---|
65 | 2 | Stylized Parts | $59.99 | VA |
73 | 2 | Stylized Parts | $20.00 | VA |
65 | 1 | ACME Industries | $59.99 | CA |
This table is in the first normal form, as all values in the table are atomic (although, in a production system, the currency sign would not be included). The part's ID and the supplier's ID form the composite primary key because the same part can be supplied by multiple suppliers.
The relationship of the primary keys to price is correct: price is fully dependent on the primary key since different suppliers may charge a different price for the same part.
However, the supplier's name and address are only dependent on the supplier's ID, and thus this table breaks 2NF. Note the redundancy, where the name "Stylized Parts" and address "VA" is recorded twice. What if they relocate or merge with another company? This attribute should be placed in a second table like:
ID | Name | Address |
---|---|---|
1 | ACME Industries | CA |
2 | Stylized Parts | VA |
In order to find if a table is in 2NF, examine each non-key column of the table and ask whether or not it is fully dependent on every key in the composite key. If it isn't, it should be moved to its own table. Note that the next step, making sure that each non-key column is not dependent on any other non-key column, constitutes the third normal form.
Third normal form
- See main article Third normal form
Third normal form (3NF) requires that the table be in 2NF, and that there be no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. A table is in 3NF if none of the non-primary key attributes is a fact about any other non-primary key attribute. In summary, all non-key attributes are mutually independent (i.e. there should not be transitive dependencies). Thus, any relation in which all the attributes are prime attributes (part of some key) is guaranteed to be in at least 3NF.
Consider a table (in 2NF) that defines a machine part as having the following attributes.
PART_NUMBER (PRIMARY KEY) MANUFACTURER_NAME MANUFACTURER_ADDRESS
In this case, the manufacturer address does not belong on this table, because it is a fact about the manufacturer of the part, rather than the part itself. MANUFACTURER_ADDRESS
should therefore be moved into a separate table with the attributes:
MANUFACTURER_NAME (PRIMARY KEY) MANUFACTURER_ADDRESS
...and the original table should be redefined as:
PART_NUMBER (PRIMARY KEY) MANUFACTURER_NAME (FOREIGN KEY)
The field MANUFACTURER_NAME in this table is a PRIMARY KEY in another table and is called FOREIGN KEY in this table
The problem with a table not being in 3NF is that for every MANUFACTURER_NAME we have to maintain a redundant MANUFACTURER_ADDRESS (i.e. an address for each part_number, rather than one for each MANUFACTURER_NAME).
Boyce-Codd normal form
Boyce-Codd normal form (or BCNF) requires that there be no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys.
Still more memorable is the formulation: The key, the whole key, and nothing but the key - so help me Codd.
Example of a relation that is in 3NF form but not in BCNF:
- Relation: {A,B,C,D} AB is a candidate key, BC is candidate key and A->C.
Example of a relation that is in 3NF form and in BCNF:
- The relation R(A,B) is guaranteed to be in BCNF since its only possible functional dependencies are A->B, B->A and/or the trivial AB->AB.
Fourth normal form
Fourth normal form (or 4NF) requires that there be no non-trivial multi-valued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.
Consider a case where we need the record of an employee with any professional qualifications they have gained, and internal training courses they have been on (an employee may have none, one or multiple of each of these). One way to record this information would be to create a relation as follows:
EMPLOYEE_ID QUALIFICATION_ID TRAINING_COURSE_ID
The problem with this design is that we might have to enter employee's qualification id more than once (or leave the qualification field blank) if they have been on more than one training course, which causes ambiguity on data storage and retrieval. Therefore this table is not in 4NF.
There are actually two distinct many-to-many relationships here: one between Employee and Qualification ID, and one between Employee and Training Course ID. So two separate tables should be created to capture these.
employee_qualification table: EMPLOYEE_ID QUALIFICATION_ID
employee_training_course table: EMPLOYEE_ID TRAINING_COURSE_ID
This example has assumed that the two fields being recorded for the employees are independent; contrast a different case where for example the information to be recorded is the academic qualifications possessed by each employee, and the academic institution where the qualification was achieved. In this case the two values are not independent - it is necessary to record both the qualification and the institution (as a pair of values) in each case, and so a relation such as the following would be correct:
EMPLOYEE_ID DEGREE_ID UNIVERSITY_ID
And this would require no changes to fit the fourth normal form requirements.
Ronald Fagin demonstrated that it is always possible to achieve 4NF (but not always desirable). Rissanen's theorem is also applicable on multi-valued dependencies.
Fifth normal form
Fifth normal form (5NF and also PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.
Fifth Normal form Example
(Adapted from "A Simple Guide to Five Normal Forms in Relational Database Theory"; see Further Reading)
Consider the following example:
Musician | Instrument | Genre |
---|---|---|
James | Piano | Classical |
James | Trumpet | Classical |
Kate | Drums | Jazz |
Kate | Piano | Jazz |
Kate | Trumpet | Jazz |
Kate | Clarinet | Jazz |
Lois | Saxophone | Jazz |
Lois | Piano | Classical |
Lois | Violin | Classical |
Lois | Guitar | Rock |
Here we have a list of musicians and the instruments they play under a certain genre. We can take as a model that a musician plays a certain kind of music in a certain genre only if the following three conditions hold:
- The musician plays the instrument.
- The musician plays the genre.
- The instrument is part of that genre.
With these constraints it is possible to split the relation into three parts.
|
|
|
Joining these three tables together will return the original relation.
Note how this setup helps to remove redundancy. Suppose that Kate learns Jazz. In the previous setup we would have to add four new entries since each of the four instruments that Kate plays (Drums, Piano, Trumpet, and Clarinet) are all jazz instruments, while with the new setup we only need to make one addition.
(Kent's paper provides more details.)
Domain/key normal form
Domain/key normal form (or DKNF) requires that each key uniquely identifies each row in a table. A domain is the set of permissible values for an attribute. By enforcing key and domain restrictions, the database is assured of being freed from modification anomalies.
While sometimes called the 6NF, the DKNF should not be considered together with the seven other normal forms (1–6 and Boyce-Codd), because contrary to them it is not always achievable; furthermore, tables in the real 6NF are not always in the DKNF.
Sixth normal form
This normal form was, as of 2005, only recently proposed: the sixth normal form (6NF) was only defined when extending the relational model to take into account the temporal dimension. Unfortunately, most current SQL technologies as of 2005 do not take into account this work, and most temporal extensions to SQL are not relational. See work by Date, Darwen and Lorentzos [2] for a relational temporal extension, or see TSQL2 for a different approach.
Further reading
- Litt's Tips: Normalization
- Rules Of Data Normalization
- Date, C. J., & Lorentzos, N., & Darwen, H. (2002). Temporal Data & the Relational Model (1st ed.). Morgan Kaufmann. ISBN 1-55860-855-9.
- Date, C. J. (1999), An Introduction to Database Systems (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
- Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of the ACM, vol. 26, pp. 120-125
- Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings
- H.-J. Schek, P.Pistor Data Structures for an Integrated Data Base Management and Information Retrieval System
External links
- Database Normalization Basics by Mike Chapple (About.com)
- An Introduction to Database Normalization by Mike Hillyer.
- Normalization by ITS, University of Texas.
- Rules of Data Normalization by Data Model.org