Introduction

One of the more complicated topics in the area of database management is the process of normalizing the tables in a relational database. These notes are intended to provide you with an overview of this topic, which I hope will be helpful to you after you have gained some familiarity with the ideas of, and techniques used in, normalization.

The underlying ideas in normalization are simple enough. Through normalization we want to design for our relational database a set of files that (1) contain all the data necessary for the purposes that the database is to serve, (2) have as little redundancy as possible, (3) accommodate multiple values for types of data that require them, (4) permit efficient updates of the data in the database, and (5) avoid the danger of losing data unknowingly.

The primary reason for normalizing databases to at least the level of the 3rd Normal Form (the levels are explained below) is that normalization is a potent weapon against the possible corruption of databases stemming from what are called "insertion anomalies," "deletion anomalies," and "update anomalies." These types of error can creep into databases that are insufficiently normalized.

An "insertion anomaly" is a failure to place information about a new database entry into all the places in the database where information about that new entry needs to be stored. In a properly normalized database, information about a new entry needs to be inserted into only one place in the database; in an inadequately normalized database, information about a new entry may need to be inserted into more than one place, and, human fallibility being what it is, some of the needed additional insertions may be missed.

A "deletion anomaly" is a failure to remove information about an existing database entry when it is time to remove that entry. In a properly normalized database, information about an old, to-be-gotten-rid-of entry needs to be deleted from only one place in the database; in an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and, human fallibility being what it is, some of the needed additional deletions may be missed.

An update of a database involves modifications that may be additions, deletions, or both. Thus "update anomalies" can be either of the kinds of anomalies discussed above.

All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption of the database. Properly normalized databases are much less susceptible to corruption than are unnormalized databases.

Normalization can be viewed as a series of steps (i.e., levels) designed, one after another, to deal with ways in which tables can be "too complicated for their own good". The purpose of normalization is to reduce the chances for anomalies to occur in a database. The definitions of the various levels of normalization illustrate complications to be eliminated in order to reduce the chances of anomalies. At all levels and in every case of a table with a complication, the resolution of the problem turns out to be the establishment of two or more simpler tables which, as a group, contain the same information as the original table but which, because of their simpler individual structures, lack the complication.

Single-Theme Tables

In practice, accomplishing normalization is often fairly simple. Confining the entries in any one table to data related to a single set of related attributes--what I like to call "single-theme tables"--will usually do the job. By a "single-theme table" I mean a table that concentrates either on one concept in the situation or on one relationship in the situation. The examples later in this lesson concern a hypothetical discussion of how to set up a database dealing with puppies, kennels, and tricks performed by the puppies. In terms of these examples, you will see single-theme tables dealing with one concept, (e.g., with just puppies and their names, with just tricks and the names of tricks) and with one relationship (e.g., pairings of puppies and tricks). You will also see some tables that are not single-theme tables, and you will see some of the problems that ensue from their failing to concentrate on a single theme.

If, instead of using the single-theme approach, you set out to normalize the tables in a database via a definitional approach (i.e., carefully examining tables in terms of the definitions of the various levels of normal forms), you may encounter some difficulty, or at least some tedium, in achieving complete surety that you have achieved a high level of normalization. The formal rules that follow provide a summary of the normalization process, but you will need to study them carefully and to work through several examples before you can start to feel comfortable in your understanding of normalization. Here are the formal rules of normalization, presented primarily for reference and as a summary, useful after you have learned about normalization through more detailed discussions elsewhere.

Note: The examples below (in the section entitled "An Outline of Normalization by Marc Rettig," following the section on "Formal Definitions of the Normal Forms") are intended to present a simplified introduction to the ideas of, and arguments for, normalization of databases. For a more detailed discussion of normalization, one source is my lesson entitled Steps in Normalization.

Formal Definitions of the Normal Forms

1st Normal Form (1NF)

Def: A table (relation) is in 1NF if

1. There are no duplicated rows in the table.

2. Each cell is single-valued (i.e., there are no repeating groups or arrays).

3. Entries in a column (attribute, field) are of the same kind.

Note: The order of the rows is immaterial; the order of the columns is immaterial.

Note: The requirement that there be no duplicated rows in the table means that the table has a key (although the key might be made up of more than one column--even, possibly, of all the columns).

2nd Normal Form (2NF)

Def: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on all of the key.

Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of the (composite) key, the definition of 2NF is sometimes phrased as, "A table is in 2NF if it is in 1NF and if it has no partial dependencies."

3rd Normal Form (3NF)

Def: A table is in 3NF if it is in 2NF and if it has no transitive dependencies.

Boyce-Codd Normal Form (BCNF)

Def: A table is in BCNF if it is in 3NF and if every determinant is a candidate key.

4th Normal Form (4NF)

Def: A table is in 4NF if it is in BCNF and if it has no multi-valued dependencies.

5th Normal Form (5NF)

Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and if every join dependency in the table is a consequence of the candidate keys of the table.

Domain-Key Normal Form (DKNF)

Def: A table is in DKNF if every constraint on the table is a logical consequence of the definition of keys and domains.

An Outline of Normalization by Marc Rettig

Here is an outline of the ideas of normalization. It is adapted from a presentation prepared by Marc Rettig, Technical Editor, Database Programming & Design magazine. He uses examples from a hypothetical discussion of how to set up a database dealing with puppies, kennels, and tricks performed by the puppies. The words that follow are Mr. Rettig's, except for my comments in square brackets.

Rules of Normalization

Rule 1. Eliminate Repeating Groups. Make a separate table for each set of related attributes, and give each table a primary key.

Unnormalized Data Items for Puppies

Puppy number
Puppy name
Kennel Code
Kennel Name
Kennel location
Trick ID 1...n
Trick Name 1...n
Trick Where Learned 1...n
Skill Level 1...n

 

 

 

 

 

 

 

 

 

 

 

 

In the original list of data, each puppy's description is followed by a list of tricks the puppy has learned. Some might know ten tricks, some might not know any. To answer the question, "Can Fifi roll over?", we need first to find Fifi's puppy record, then scan the list of tricks asociated with the record. This is awkward, inefficient, and extremely untidy.

Moving the tricks into a separate table helps considerably. Separating the repeating groups of tricks from the puppy information results in first normal form. The puppy number in the trick table matches the primary key in the puppy table, providing a foreign key for relating the two tables with a join operation. Now we can answer our question with a direct retrieval: look to see if Fifi's puppy number and the trick ID for "roll over" appear together in the trick table.

FIRST NORMAL FORM

Puppy Table
Puppy Number Primary Key
Puppy Name Every puppy gets a unique number.
Kennel Name
Kennel Location

Trick Table
Puppy Number Primary Key [composite]
Trick ID
Trick Name We'll add a row for every trick
Trick Where Learned learned by every puppy.
Skill Level

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[Note that in the Trick Table, the primary key is a multi-valued, or "composite," key. It consists of both Puppy Number and Trick ID.]

Rule 2. Eliminate Redundant Data. If an attribute depends on only part of a multi-valued key, remove it to a separate table.

TRICK TABLE

Puppy Number

Trick ID

Trick Name

Where Learned

Skill Level

52

27

Roll Over

16

9

53

16

Nose Stand

9

9

54

27

Roll Over

9

5

 

 

 

 

 

 

 

 

The trick name (e.g., "Roll Over") appears redundantly for every puppy that knows it. Just Trick ID would do.

[Note that Trick Name depends on only a part (the Trick ID) of the multi-valued, i.e., composite, key.]

SECOND NORMAL FORM

Puppy Table
Puppy Number
Puppy Name
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

In the Trick Table, the primary key is made up of the puppy number and the trick ID. This makes sense for the "Where Learned" and "Skill Level" attributes, since they will be different for every puppy-trick combination. But the trick name depends only on the Trick ID. The same name will appear redundantly every time its associated ID appears in the Trick Table.

Suppose you want to reclassify a trick, i.e., to give it a different trick ID. The change has to be made for every puppy that knows the trick! If you miss some of the changes, you will have several puppies with the same trick under different IDs. This is an update anomaly.

Or suppose the last puppy knowing a particular trick gets run over by a car. His records will be removed from the database, and the trick will not be stored anywhere! This is a delete anomaly. To avoid these problems, we need the second normal form.

To achieve this, separate the attributes depending on both parts of the key from those depending only on the Trick ID. This results in two tables: "Tricks," which gives the name for each Trick ID; and "Puppy Tricks," which lists the tricks learned by each puppy.

["Tricks" is clearly a single-theme table. "Puppy Tricks" is a single-theme table because it pairs puppies and their tricks, i.e., it deals with the relationships between individual puppies and the tricks they know. In "Puppy Tricks," the "Trick Where Learned" attribute is clearly dependent on both parts of the key, since the attribute is based not only on which particular trick is being referred to but also on where the particular puppy learned that trick. The same is true of the "Skill Level" attribute, since this attribute is based not only on which particular trick is being referred to but also on what the particular puppy's level of skill on that trick is.]

Now we can reclassify a trick in a single operation: Look up the Trick ID in the "Tricks" table and change its name. The result will instantly be available throughout the application.

Rule 3. Eliminate Columns Not Dependent on Key. If attributes do not contribute to a description of the key, remove them to a separate table.

Puppy Table
Puppy Number
Puppy Name
Kennel Code
Kennel Name

 

 

 

 

 

 

 

The Puppy Table satisfies the first normal form, since it contains no repeating groups. It satisfies the second normal form, since it does not have a multivalued key. But the key is Puppy Number, and the kennel name and kennel location describe only a kennel, not a puppy. To achieve the third normal form, they must be moved into a separate table. Since they describe a kennel, Kennel Code becomes the key of the new "Kennels" table.

THIRD NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID

Trick Where Learned
Skill Level

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

The motivation for this is the same as for the second normal form: We want to avoid update and delete anomalies. For example, suppose no puppies from the Daisy Hill Puppy Farm were currently stored in the database. With the previous design, there would be no record of its existence!

["Kennels" is clearly a single-theme table. "Puppies" is close to a single-theme table, in that it concentrates on puppies and their numbers. If the real-life situation is--which is likely--that any given puppy can be identified with one and only one kennel, then we can consider "Puppies" to be a single-theme table. But if--which is not likely--puppies sometimes start life in one kennel and are then transferred, as puppies, to another kennel, we would not be able to consider "Puppies" to be single-theme because we would have to make further modifications to handle the mixed origins.]

Third Normal Form is sufficient for most situations. But if that isn't normal enough for you ...

Rule 4. Isolate Independent Multiple Relationships. No table may contain two or more l:n or n:m relationships that are not directly related.

This applies only to designs that include one-to-many and many-to-many relationships. An example of a one-to-many relationship is that one kennel can hold many puppies. An example of a many-to-many relationship is that a puppy can know many tricks and several puppies can know the same trick.

Puppy Tricks and Costumes
Puppy Number
Trick ID
Trick Where Learned
Skill Level
Costume

 

 

 

 

 

 

 

 

Suppose we want to add a new attribute to the Puppy-Trick table, "Costume." This way we can look for puppies that can both "sit-up-and-beg" and wear a Groucho Marx mask, for example. The fourth normal form dictates against this (i.e., against using the Puppy-Tricks table, not against begging while wearing a Groucho mask). The two attributes do not share a meaningful relationship. A puppy may be able to wear a wet suit. This does not mean it can simultaneously sit up and beg. How will you represent this if you store both attributes in the same table?

FOURTH NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

Costumes
Costume Number
Costume Name

Puppy Costumes
Puppy Number
Costume Number

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[At left is the Fourth Normal Form of the database. It consists of the Third Normal Form tables plus two new tables, Costumes and Puppy Costumes. Note that the Puppy Costumes table has a composite primary key that is made up of two foreign keys, from the Puppies and Costumes tables, respectively.]

Rule 5. Isolate Semantically Related Multiple Relationships. There may be practical constraints on information that justify separating logically related many-to-many relationships.

Usually, related attributes belong together. For example, if we really wanted to record which tricks each puppy could do in which costume, we would want to keep the Costume attribute in the Puppy-Tricks table. But there are times when special characteristics of the data make it more efficient to separate even logically related attributes.

Imagine that our database will record when breeds are available in each kennel, and which breeder supplies dogs to those kennels. This suggests a Kennel-Breeder-Breeds table, which satisfies fourth normal form. As long as any kennel can supply any breed from any breeder, this works fine.

 

 

 

 

 

 

 

 

Kennel-Breeder-Breeds
Kennel Number
Breeder
Breed

Kennel-Breeder-Breeds

Kennel Number

Breeder

Breed

5

Acme

Spaniel

5

Acme

Dachshund

5

Acme

Banana-Biter

5

Puppy Factory

Spaniel

5

Puppy Factory

Dachshund

5

Puppy Factory

Banana-Biter

5

Whatapuppy

Spaniel

5

Whatapuppy

Dachshund

5

Whatapuppy

Banana-Biter

Now suppose a law is passed to prevent exclusive arrangements: a kennel selling any breed must offer that breed from all breeders it deals with. In other words, if Khabul Khennels sells Afghans and wants to sell any Daisy Hill puppies, it must sell Daisy Hill Afghans.

The need for fifth normal form becomes clear when we consider inserts and deletes. Suppose that a kennel (whose number in the database happens to be 5) decides to offer three new breeds: Spaniels, Dachshunds, and West Indian Banana-Biters. Suppose further that this kennel already deals with three breeders that can supply those breeds. This will require nine new rows in the database, one for each breeder-and-breed combination.

Breaking up the table reduces the number of inserts to six. Here are the tables necessary for fifth normal form, shown with the six newly inserted rows.

FIFTH NORMAL FORM

Puppies
Puppy Number
Puppy Name
Kennel Code

Kennels
Kennel Code
Kennel Name
Kennel Location

Tricks
Trick ID
Trick Name

Puppy Tricks
Puppy Number
Trick ID
Trick Where Learned
Skill Level

Costumes
Costume Number
Costume Name

Puppy Costumes
Puppy Number
Costume Number

Kennel-Breed
Kennel Number
Breed

Kennel-Breeder
Kennel Number
Breeder

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

If an application involves significant update activity, fifth normal form can mean important savings. Note that these combination tables develop naturally out of entity-relationship analysis.

Kennel-Breed

Kennel Number

Breed

5

Spaniel

5

Dachshund

5

Banana-Biter

Kennel-Breeder

Kennel Number

Breeder

5

Acme

5

Puppy Factory

5

Whatapuppy