MySQL Database Design
Date: Feb 7, 2003
Return to the article
The first step in creating and using a database is to establish its structure. The techniques you will learn in this sample chapter will help to ensure the viability, performance, and reliability of your databases.
Whenever you are working with a relational database management system such as
MySQL, the first step in creating and using a database is to establish its
structure. Database design, aka data modeling, is crucial for successful
long-term management of your information. Using a process called
normalization, you carefully eliminate redundancies and other problems
that will undermine the integrity of your data.
The techniques you will learn in this chapter will help to ensure the
viability, performance, and reliability of your databases. The example I will
usea record of business transactions such as invoices and
expenseswill be referred to in later chapters, but the principles of
normalization apply to any database application you might create.
Normalization
Normalization was developed by an IBM researcher named E.F. Codd in the early
1970s (he also invented the relational database). A relational database is
merely a collection of data, organized in a particular manner, and Dr. Codd
created a series of rules called normal forms that help define that
organization. In this chapter I will discuss the first three of the normal
forms, which is sufficient for most database designs.
Before you begin normalizing your database, you must define the role of the
application being developed. Whether it means that you thoroughly discuss the
subject with a client or figure it out for yourself, understanding how the
information will be accessed dictates the modeling. Thus, this chapter will
require paper and pen, rather than the MySQL software itself (for the record,
database design is applicable to any relational database, not just MySQL).
Database design texts commonly use examples such as music or book collections
(indeed, I used the latter in my book PHP Advanced for the World Wide Web:
Visual QuickPro Guide), but I will create a more business-oriented
accounting database here. The primary purpose of the database will be to track
invoices and expenses, but it could easily be modified to log work hours on
projects and so forth. I have created a preliminary listing of the data to
record in Table 3.1.
Table 3.1 Based on my intended usage of this database, all of the
requisite information to be recorded is listed here.
Accounting Database
|
Item
|
Example
|
Invoice Number
|
1
|
Invoice Date
|
4/20/2002
|
Invoice Amount
|
$30.28
|
Invoice Description
|
HTML design Date
|
Invoice Paid
|
5/11/2002
|
Client Information
|
Acme Industries, 100 Main Street, Anytown, NY, 11111,
(800) 555-1234
|
Expense Amount
|
$100.00
|
Expense Category & Description
|
Web Hosting Fees-Annual contract for hosting www.DMCinsights.com
|
Expense Date
|
1/26/2002
|
Keys
Keys are pieces of data that help to identify a row of information in a table
(a row is also called a record). There are two types of keys you will
deal with: primary and foreign. A primary key is a unique
identifier that has to abide by certain rules. They must
Always have a value (it cannot be NULL)
Have a value that remains the same (never changes)
Have a unique value for each record in the table
The best real-world example of a primary key is the U.S. Social Security
number. Although I have heard stories of duplicate numbers being assigned, the
principle is that each individual has a unique Social Security number and that
the number never changes. Just as the Social Security number is an artificial
construct used to identify people, you'll frequently find creating an
arbitrary primary key for each table to be the best design practice.
The second type of keys are foreign keys. Foreign keys are the representation
of the primary key from Table A in Table B. If you have a movies database
with a movie table and a director table, the primary key from
director would be linked as a foreign key in movie. You'll
see better how this works as the normalization process continues.
Currently, MySQL formally implements foreign keys only when using the InnoDB
table type (see Chapter 11, "Advanced MySQL," for more information on
the different table types) but generally ignores their existence otherwise.
Hence, foreign keys in MySQL are more of a theoretical presence than a binding
one, although this should change in later versions of the software.
The accounting database is just a simple table as it stands, but to
start off the normalization process, I'll want to ensure at least one
primary key (the foreign keys will come in later steps).
To assign a primary key:
- Look for any fields that meet the three tests for a primary key.
In this example, the only data that will always be unique, have a value, and
whose value will never change should be the Invoice Number. Mark this
field as the primary key using the (PK) notation (Figure 3.1).
- If no logical primary key exists, invent one.
Frequently you will need to create a primary key because no good solution
presents itself. Even with Social Security numbers and book ISBNs (International
Standardized Book Number)which ought to meet the criteriacreating a
dummy field expressly for being the primary key is a solid idea.
Relationships
When I speak of database relationships, I specifically mean how the data in
one table relates to the data in another. A relationship between two tables can
be one-to-one, one-to-many, or many-to-many.
The relationship is one-to-one if one and only one item in Table A applies to
one and only one item in Table B (e.g., each U.S. citizen has only one Social
Security number, and each Social Security number applies to only one U.S.
citizen; no citizen can have two Social Security numbers, and no Social Security
number can refer to two citizens).
A relationship is one-to-many if one item in Table A can apply to multiple
items in Table B. The terms female and male will apply to many
people, but each person can be only one or the other. A one-to-many relationship
is the most common one between tables in databases.
Finally, a relationship is many-to-many if multiple items in Table A can
apply to multiple items in Table B. For example, a record album can contain
songs by multiple artists and artists can make multiple albums. You should try
to avoid many-to-many relationships in your design because they lead to data
redundancy and integrity problems.
Relationships and keys work together in that a key in one table will normally
relate to a field in another, as I mentioned earlier. Once you grasp the basics
of unique identifiers and relationships, you can begin to normalize your
database.
First Normal Form
For a database to be in First Normal Form (1NF), each column must contain
only one value (this is sometimes described as being atomic). A table
containing one field for an address would not be in 1NF because it stores the
street address, city, state, ZIP code, and possibly countryfive different
bits of informationin one field. Similarly, a field containing a
person's first and last name would also fail this test (although some would
suggest that a person's full name is sufficiently atomic as is).
I'll continue the normalization process by checking the existing
structure for 1NF compliance.
To make a database 1NF compliant:
- Identify any field that contains multiple pieces
of information.
Looking back at Table 3.1, two columns are not 1NF compliant: Client
Information and Expense Category & Description. The date fields
contain a day, month, and a year, but subdividing past that level of specificity
is really not warranted.
- Break up any fields found in step 1 into separate
fields (Figure 3.3).
To fix this problem, I'll separate Client Information into Client Name,
Client Street Address, Client City, Client State, Client Zip, and Client Phone.
Next, I'll turn Expense Category & Description into Expense Category
and Expense Description.
Figure 3.3 After running through the 1NF rules, I've
separated two fields into more logical subfields.
Double-check that all new fields created in step 2 pass the 1NF
test.
Second Normal Form
In simplest terms, for a database to be in Second Normal Form (2NF), the
database must already be in 1NF (you must normalize in order), and every column
in a table that is not a key has to relate only to the primary key. The most
obvious indication that a database is not 2NF is if multiple records in a table
might have the exact same value for a column. As an example, if you listed a
producer along with each record album, this value could be repeated over the
course of the album's table.
Looking at the accounting database (Figure 3.3), there are a number of
problems. For starters, the client information will not necessarily be
particular to any one invoice (a client could be billed several times). Second,
the expense information is not tied to the invoices either.
To put this database into 2NF, I'll need to separate out these columns
into their own tables, where each value will be represented only once. In fact,
normalization could be summarized as the process of creating more and more
tables until potential redundancies have been eliminated.
To make a database 2NF compliant:
- Identify any fields that do not relate directly to the primary
key.
As I stated above, all of the client information and expense information are
not Second Normal Form invoice-particular.
- Create new tables accordingly (Figure 3.4).
The most logical modification for the existing structure is to make separate
Clients, Invoices, and Expenses tables. In my visual
representation of the database, I create a box for each table, with the table
name as a header and all of its columns (or attributes) underneath.
- Assign or create new primary keys (Figure
3.5).
Using the techniques described earlier in the chapter, ensure that each new
table has a primary key. Because both the Clients and Expenses
tables do not have good unique identifiers, I'll create artificial
ones: Client ID and Expense ID. Arguably, the Client Name
field should be unique and therefore could be the primary key, but it's
always best to use integers for this purpose.
- Repeat steps 13.
Since I've created new tables with new primary keys, I should
double-check to see if there are any 2NF problems. In the example (Figure
3.5),
there is one glaring issuethe Expense Category field may apply to
multiple expenses. Therefore, I'll make a new Expense Categories
table (Figure 3.6).
- Create the requisite foreign keys indicating the relationships
(Figure 3.7). The final step in achieving 2NF compliance is to
incorporate foreign keys and relationships to identify how all of the data and
tables are associated. Remember that a primary key in one table will most likely
be a foreign key in another. If you find that the primary key in one table is
not represented as a foreign key in another, you may have missed something (but
not necessarily).
Third Normal Form
A database is in Third Normal Form (3NF) if it is in 2NF and every nonkey
column is independent of every other nonkey column. In other words, the fields
of a table other than the keys should be mutually independent.
If you followed the first two normalization steps properly, you will not
necessarily need to make any changes at this stage. However, if you made a
number of changes along the way (as can happen), this could be a last check. For
example, say I wanted to record a contact name and email address with each
invoice (Figure 3.8). The problem is that this information relates not to
an invoice but to the client and, therefore, the database would fail the 3NF
test. The correct structure would be to add these fields to the Clients
table (Figure 3.9).
MySQL Data Types
Once you have identified all of the tables and columns that the database will
need, you should determine each field's MySQL data type. When creating the
database, as you will do in the next chapter, MySQL requires that you define
what sort of information each field will contain. There are three primary
categories, which is true for almost every database software:
- Text
- Numbers
- Dates and times
Within each of these, there are a number of variantssome of which are
MySQL-specificyou can use. Choosing your column types correctly not only
dictates what information can be stored and how, but also affects the
database's overall performance. Table 3.2 lists most of the
available types for MySQL, how much space they take up, and a brief description.
Table 3.2 Here are most of the available column types for use with
MySQL databases.
MySQL Datatypes
|
Ty p e
|
S i z e
|
D e s c r i p t i o n
|
CHAR[Length]
|
Length bytes
|
A fixed-length field from 0 to 255 characters long.
|
VARCHAR(Length)
|
String length + 1 bytes
|
A fixed-length field from 0 to 255 characters long.
|
TINYTEXT
|
String length + 1 bytes
|
A string with a maximum length of 255 characters.
|
TEXT
|
String length + 2 bytes
|
A string with a maximum length of 65,535 characters.
|
MEDIUMTEXT
|
String length + 3 bytes
|
A string with a maximum length of 16,777,215 characters.
|
LONGTEXT
|
String length + 4 bytes
|
A string with a maximum length of 4,294,967,295 characters.
|
TINYINT[Length]
|
1 byte
|
Range of -128 to 127 or 0 to 255 unsigned.
|
SMALLINT[Length]
|
2 bytes
|
Range of -32,768 to 32,767 or 0 to 65535 unsigned.
|
MEDIUMINT[Length]
|
3 bytes
|
Range of -8,388,608 to 8,388,607 or 0 to 16,777,215
unsigned.
|
INT[Length]
|
4 bytes
|
Range of -2,147,483,648 to 2,147,483,647 or 0 to 4,294,967,295
unsigned.
|
BIGINT[Length]
|
8 bytes
|
Range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
or 0 to 18,446,744,073,709,551,615 unsigned.
|
FLOAT
|
4 bytes
|
A small number with a floating decimal point.
|
DOUBLE[Length, Decimals]
|
8 bytes
|
A large number with a floating decimal point.
|
DECIMAL[Length, Decimals]
|
Length + 1 or Length + 2 bytes
|
A DOUBLE stored as a string, allowing for a fixed decimal
point.
|
DATE
|
3 bytes
|
In the format of YYYY-MM-DD.
|
DATETIME
|
8 bytes
|
In the format of YYYY-MM-DD HH:MM:SS.
|
TIMESTAMP
|
4 bytes
|
In the format of YYYYMMDDHHMMSS; acceptable range ends
inthe year 2037.
|
TIME
|
3 bytes
|
In the format of HH:MM:SS
|
ENUM
|
1 or 2 bytes
|
Short for enumeration, which means that each column
can haveone of several possible values.
|
SET
|
1, 2, 3, 4, or 8 bytes
|
Like ENUM except that each column can have more than
one ofseveral possible values.
|
Many of the types can take an optional Length attribute, limiting
their size (the square brackets, [], indicate an optional parameter to be put in
parentheses, while parentheses themselves indicate required arguments). Further,
the number types can be UNSIGNEDlimiting the column to positive numbers or
zeroor be defined as ZEROFILL, which means that any extra room will be
padded with zeroes (ZEROFILLs are also automatically UNSIGNED). The various date
types have all sorts of unique behaviors, which are documented in the manual at
www.mysql.com/doc/D/A/DATETIME.html. You'll primarily use the DATE and
TIME fields without modification, so you need not worry too much about their
intricacies. There are also two extensions of the TEXT types that result in a
different behaviorENUM and SETwhich allow you to define a series of
acceptable values when creating the table. An ENUM field can have only one of a
possible several thousand values, while SET allows for several of up to 64
possible values. There are two caveats with ENUM and SET: These types are not
supported by other databases, and their usage undermines normalization.
To choose your data types:
Identify whether a column should be a text,
number, or date type.
This is normally an easy and obvious step. You will find that numbers such as
ZIP codes and dollar amounts should be text fields if you include their
corresponding punctuation (dollar signs, commas, and hyphens), but you'll
get better results if you store them as numbers and address the formatting
elsewhere.
Choose the most appropriate subtype for each
column.
For improved performance, keep in mind two considerations:
Fixed-length fields (such as CHAR) are generally faster than
variable-length fields (such as VARCHAR), but they also take up more disk space.
See the side-bar for more information.
The size of any field should be restricted to the smallest possible
value, based upon what the largest possible input could be. For example, if the
largest a number such as Client ID could be is in the hundreds, set the column
as an unsigned three-digit SMALLINT (allowing for up to 999 values).
You should keep in mind that if you insert a string five characters long
into a CHAR(2) field, the final three characters will be truncated. This is true
for any field in which the length is set (CHAR, VARCHAR, INT, etc.).
Set the maximum length for text and number columns
as well as other attributes such as UNSIGNED (Table 3.3).
Rather than going over how I defined all 21 columns and why, I've listed
the properties I came up with in Table 3.3. Different developers have different
preferences, but the most important factor is to tailor each setting to the
information at hand rather than using generic (and inefficient) TEXT and INT
types at all times.
Table 3.3 An often overlooked aspect of database design is defining
the optimal type for each field.
Accounting Database
|
C o l um n N a m e
|
Ta b l e
|
Co lum n Ty pe
|
Invoice Number
|
Invoices
|
SMALLINT(4) UNSIGNED
|
Client ID
|
Invoices
|
SMALLINT(3) UNSIGNED
|
Invoice Date
|
Invoices
|
DATE
|
Invoice Amount
|
Invoices
|
DECIMAL(10,2) UNSIGNED
|
Invoice Description
|
Invoices
|
TINYTEXT
|
Client ID
|
Clients
|
SMALLINT(3) UNSIGNED
|
Client Name
|
Clients
|
VARCHAR(40)
|
Client Street Address
|
Clients
|
VARCHAR(80)
|
Client City
|
Clients
|
VARCHAR(30)
|
Client State
|
Clients
|
CHAR(2)
|
Client Zip
|
Clients
|
MEDIUMINT(5) UNSIGNED
|
Client Phone
|
Clients
|
VARCHAR(14)
|
Contact Name
|
Clients
|
VARCHAR(40)
|
Contact Email Address
|
Clients
|
VARCHAR(60)
|
Expense ID
|
Expenses
|
SMALLINT(4) UNSIGNED
|
Expense Category ID
|
Expenses
|
TINYINT(3) UNSIGNED
|
Expense Amount
|
Expenses
|
DECIMAL(10,2) UNSIGNED
|
Expense Description
|
Expenses
|
TINYTEXT
|
Expense Date
|
Expenses
|
DATE
|
Expense Category ID
|
Expense Categories
|
TINYINT(3) UNSIGNED
|
Expense Category
|
Expense Categories
|
VARCHAR(30)
|
NULL and Default Values
As you have already seen, there are a few attributes you can assign when
defining your data types, including UNSIGNED and ZEROFILL. Two more options are
to dictate whether or not the value of a column can be NULL and to set a default
value.
The NULL value, in databases and programming, is the equivalent of saying
that the field has no value (or it is unknown). Ideally, every record in a
database should have value, but that is rarely the case in practicality. To
enforce this limitation on a field, you add the NOT NULL description to its
column type. For example, a primary key might now be described as client_id
SMALLINT(3) UNSIGNED NOT NULL and Default Values NULL
When creating a table you can also specify a default value. In cases where a
large portion of the records will have the same contents, presetting a default
will save you from having to specify a value when inserting new rows, unless
that value is different from the norm. One example might be gender
ENUM('M', 'F') DEFAULT 'F'
Table 3.4 incorporates these two new ideas.
Table 3.4 I've added NOT NULL descriptions and DEFAULT values
for a few of my columns to further improve the database design.
Accounting Database
|
Column Name
|
Table
|
Column Type
|
Invoice Number
|
Invoices
|
SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0
|
Client ID
|
Invoices
|
SMALLINT(3) UNSIGNED
|
Invoice Date
|
Invoices
|
DATE NOT NULL
|
Invoice Amount
|
Invoices
|
DECIMAL(10,2) UNSIGNED NOT NULL
|
Invoice Description
|
Invoices
|
TINYTEXT
|
Client ID
|
Clients
|
SMALLINT(3) UNSIGNED NOT NULL DEFAULT 0
|
Client Name
|
Clients
|
VARCHAR(40) NOT NULL
|
Client Street Address
|
Clients
|
VARCHAR(80)
|
Client City
|
Clients
|
VARCHAR(30)
|
Client State
|
Clients
|
CHAR(2)
|
Client Zip
|
Clients
|
MEDIUMINT(5) UNSIGNED
|
Client Phone
|
Clients
|
VARCHAR(14)
|
Contact Name
|
Clients
|
VARCHAR(40)
|
Contact Email Address
|
Clients
|
VARCHAR(60)
|
Expense ID
|
Expenses
|
SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0
|
Expense Category ID
|
Expenses
|
TINYINT(3) UNSIGNED
|
Expense Amount
|
Expenses
|
DECIMAL(10,2) UNSIGNED NOT NULL
|
Expense Description
|
Expenses
|
TINYTEXT
|
Expense Date
|
Expenses
|
DATE
|
Expense Category ID
|
Expense Categories
|
TINYINT(3) UNSIGNED
|
Expense Category
|
Expense Categories
|
VARCHAR(30)
|
Indexes
Indexes are a special system that databases use to improve the overall
performance. By setting indexes on your tables, you are telling MySQL to pay
particular attention to that column (in layman's terms). In fact, MySQL
creates extra files to store and track indexes efficiently.
MySQL allows for up to 32 indexes for each table, and each index can
incorporate up to 16 columns. While a multicolumn index may not seem obvious, it
will come in handy for searches frequently performed on the same set of multiple
columns (e.g., first and last name, city and state, etc.)
On the other hand, one should not go overboard with indexing. While it does
improve the speed of reading from databases, it slows down the process of
altering data in a database (because the changes need to be recorded in the
index). Indexes are best used on columns
- That are frequently used in the WHERE part of a query
- That are frequently used in an ORDER BY part of a query
- That have many different values (columns with numerous repeating values
ought not to be indexed)
Note that, in MySQL, a primary key column is automatically indexed for
efficiency.
MySQL has three types of indexes: INDEX, UNIQUE (which requires each row to
have a unique value), and PRIMARY KEY (which is just a particular UNIQUE index).
Table 3.5 lists the indexes I propose for the accounting database.
Table 3.5 To improve the performance of my database, I add a few (but
not too many) indexes to help MySQL access the stored information.
Accounting Indexes
|
Column
|
Index Type
|
Invoice Number
|
PRIMARY KEY
|
Client ID
|
PRIMARY KEY
|
Expense ID
|
PRIMARY KEY
|
Expense Category ID
|
PRIMARY KEY
|
Invoice Date
|
INDEX
|
Client Name
|
INDEX (or UNIQUE)
|
One final attribute a column can have that frequently works in conjunction
with an index is AUTO_INCREMENT. When you define a field with this property
using client_id SMALLINT(3) UNSIGNED NOT NULL AUTO_INCREMENT you are effectively
telling MySQL to set the value of this column to the next logical value in the
series. If the column is an integer, the next highest integer will be used when
no value is set when a new record is inserted.
Final Design Steps
The final step in designing your database is to adhere to certain naming
conventions. While MySQL is very flexible on how you name your databases,
tables, and columns, here are some good rules to go by (some of which are
required):
Use alphanumeric characters.
Limit yourself to less than 64 characters (this is a MySQL
restriction).
Use the underscore (_) to separate words.
Use entirely lowercase words (this is definitely a personal preference
rather than a rule).
Use plural table names (to indicate multiple values stored) and singular
column names.
End primary and foreign key columns with id (or ID).
List the primary key first in a table, followed by foreign keys.
Field names should be descriptive.
Field names should be unique across every table, except for the
keys.
These are largely my recommendations and are therefore not absolute, except
for limiting yourself to alphanumeric names without spaces. Some developers
prefer to use capital letters to break up words (instead of underscores). Others
like to indicate the column type in its name. The most important consideration
is that you remain consistent with your conventions.
Table 3.6 shows the final database design, which will be created in
the next chapter.
Table 3.6 The final database design step incorporates certain naming
conventions that I try to adhere to.
accounting
|
Column Name
|
Table
|
Column Type
|
invoice_id
|
invoices
|
SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0
|
client_id
|
invoices
|
SMALLINT(3) UNSIGNED
|
invoice_date
|
invoices
|
DATE NOT NULL
|
invoice_amount
|
invoices
|
DECIMAL(10,2) UNSIGNED NOT NULL
|
invoice_description
|
invoices
|
TINYTEXT
|
client_id
|
clients
|
SMALLINT(3) UNSIGNED NOT NULL DEFAULT 0
|
client_name
|
clients
|
VARCHAR(40) NOT NULL
|
client_street
|
clients
|
VARCHAR(80)
|
client_city
|
clients
|
VARCHAR(30)
|
client_state
|
clients
|
CHAR(2)
|
client_zip
|
clients
|
MEDIUMINT(5) UNSIGNED
|
client_phone
|
clients
|
VARCHAR(14)
|
contact_name
|
clients
|
VARCHAR(40)
|
contact_email
|
clients
|
VARCHAR(60)
|
expense_id
|
expenses
|
SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0
|
expense_category_id
|
expenses
|
TINYINT(3) UNSIGNED
|
expense_amount
|
Expenses
|
DECIMAL(10,2) UNSIGNED NOT NULL
|
expense_description
|
expenses
|
TINYTEXT
|
expense_date
|
Expenses
|
DATE
|
expense_category_id
|
expense_categories
|
TINYINT(3) UNSIGNED
|
expense_category
|
expense_categories
|
VARCHAR(30)
|
Tips
Database and table names are case-sensitive on Unix systems but
insensitive under Windows. Column names are always case-insensitive.
By strictly adhering to any set of database design principles, you
minimize errors that could occur when programming a database interface, as you
will in Chapters 6, 7, and 8.