Peachpit Press

MySQL Database Design

Date: Feb 7, 2003

Return to the article

The first step in creating and using a database is to establish its structure. The techniques you will learn in this sample chapter will help to ensure the viability, performance, and reliability of your databases.

Whenever you are working with a relational database management system such as MySQL, the first step in creating and using a database is to establish its structure. Database design, aka data modeling, is crucial for successful long-term management of your information. Using a process called normalization, you carefully eliminate redundancies and other problems that will undermine the integrity of your data.

The techniques you will learn in this chapter will help to ensure the viability, performance, and reliability of your databases. The example I will use—a record of business transactions such as invoices and expenses—will be referred to in later chapters, but the principles of normalization apply to any database application you might create.

Normalization

Normalization was developed by an IBM researcher named E.F. Codd in the early 1970s (he also invented the relational database). A relational database is merely a collection of data, organized in a particular manner, and Dr. Codd created a series of rules called normal forms that help define that organization. In this chapter I will discuss the first three of the normal forms, which is sufficient for most database designs.

Before you begin normalizing your database, you must define the role of the application being developed. Whether it means that you thoroughly discuss the subject with a client or figure it out for yourself, understanding how the information will be accessed dictates the modeling. Thus, this chapter will require paper and pen, rather than the MySQL software itself (for the record, database design is applicable to any relational database, not just MySQL).

Database design texts commonly use examples such as music or book collections (indeed, I used the latter in my book PHP Advanced for the World Wide Web: Visual QuickPro Guide), but I will create a more business-oriented accounting database here. The primary purpose of the database will be to track invoices and expenses, but it could easily be modified to log work hours on projects and so forth. I have created a preliminary listing of the data to record in Table 3.1.

Table 3.1 Based on my intended usage of this database, all of the requisite information to be recorded is listed here.

Accounting Database

Item

Example

Invoice Number

1

Invoice Date

4/20/2002

Invoice Amount

$30.28

Invoice Description

HTML design Date

Invoice Paid

5/11/2002

Client Information

Acme Industries, 100 Main Street, Anytown, NY, 11111, (800) 555-1234

Expense Amount

$100.00

Expense Category & Description

Web Hosting Fees-Annual contract for hosting www.DMCinsights.com

Expense Date

1/26/2002


Keys

Keys are pieces of data that help to identify a row of information in a table (a row is also called a record). There are two types of keys you will deal with: primary and foreign. A primary key is a unique identifier that has to abide by certain rules. They must

The best real-world example of a primary key is the U.S. Social Security number. Although I have heard stories of duplicate numbers being assigned, the principle is that each individual has a unique Social Security number and that the number never changes. Just as the Social Security number is an artificial construct used to identify people, you'll frequently find creating an arbitrary primary key for each table to be the best design practice.

The second type of keys are foreign keys. Foreign keys are the representation of the primary key from Table A in Table B. If you have a movies database with a movie table and a director table, the primary key from director would be linked as a foreign key in movie. You'll see better how this works as the normalization process continues.

Currently, MySQL formally implements foreign keys only when using the InnoDB table type (see Chapter 11, "Advanced MySQL," for more information on the different table types) but generally ignores their existence otherwise. Hence, foreign keys in MySQL are more of a theoretical presence than a binding one, although this should change in later versions of the software.

The accounting database is just a simple table as it stands, but to start off the normalization process, I'll want to ensure at least one primary key (the foreign keys will come in later steps).

To assign a primary key:

  1. Look for any fields that meet the three tests for a primary key.
  2. In this example, the only data that will always be unique, have a value, and whose value will never change should be the Invoice Number. Mark this field as the primary key using the (PK) notation (Figure 3.1).

    Figure 3.1

    Figure 3.1 The first step I took in normalizing my database was to create the initial primary key—the Invoice Number.

  3. If no logical primary key exists, invent one. Frequently you will need to create a primary key because no good solution presents itself. Even with Social Security numbers and book ISBNs (International Standardized Book Number)—which ought to meet the criteria—creating a dummy field expressly for being the primary key is a solid idea.

Relationships

When I speak of database relationships, I specifically mean how the data in one table relates to the data in another. A relationship between two tables can be one-to-one, one-to-many, or many-to-many.

The relationship is one-to-one if one and only one item in Table A applies to one and only one item in Table B (e.g., each U.S. citizen has only one Social Security number, and each Social Security number applies to only one U.S. citizen; no citizen can have two Social Security numbers, and no Social Security number can refer to two citizens).

A relationship is one-to-many if one item in Table A can apply to multiple items in Table B. The terms female and male will apply to many people, but each person can be only one or the other. A one-to-many relationship is the most common one between tables in databases.

Finally, a relationship is many-to-many if multiple items in Table A can apply to multiple items in Table B. For example, a record album can contain songs by multiple artists and artists can make multiple albums. You should try to avoid many-to-many relationships in your design because they lead to data redundancy and integrity problems.

Relationships and keys work together in that a key in one table will normally relate to a field in another, as I mentioned earlier. Once you grasp the basics of unique identifiers and relationships, you can begin to normalize your database.

First Normal Form

For a database to be in First Normal Form (1NF), each column must contain only one value (this is sometimes described as being atomic). A table containing one field for an address would not be in 1NF because it stores the street address, city, state, ZIP code, and possibly country—five different bits of information—in one field. Similarly, a field containing a person's first and last name would also fail this test (although some would suggest that a person's full name is sufficiently atomic as is).

I'll continue the normalization process by checking the existing structure for 1NF compliance.

To make a database 1NF compliant:

  1. Identify any field that contains multiple pieces of information.

    Looking back at Table 3.1, two columns are not 1NF compliant: Client Information and Expense Category & Description. The date fields contain a day, month, and a year, but subdividing past that level of specificity is really not warranted.

  2. Break up any fields found in step 1 into separate fields (Figure 3.3).

    To fix this problem, I'll separate Client Information into Client Name, Client Street Address, Client City, Client State, Client Zip, and Client Phone. Next, I'll turn Expense Category & Description into Expense Category and Expense Description.

    Figure 3.3

    Figure 3.3 After running through the 1NF rules, I've separated two fields into more logical subfields.

  • Double-check that all new fields created in step 2 pass the 1NF test.
  • Second Normal Form

    In simplest terms, for a database to be in Second Normal Form (2NF), the database must already be in 1NF (you must normalize in order), and every column in a table that is not a key has to relate only to the primary key. The most obvious indication that a database is not 2NF is if multiple records in a table might have the exact same value for a column. As an example, if you listed a producer along with each record album, this value could be repeated over the course of the album's table.

    Looking at the accounting database (Figure 3.3), there are a number of problems. For starters, the client information will not necessarily be particular to any one invoice (a client could be billed several times). Second, the expense information is not tied to the invoices either.

    To put this database into 2NF, I'll need to separate out these columns into their own tables, where each value will be represented only once. In fact, normalization could be summarized as the process of creating more and more tables until potential redundancies have been eliminated.

    To make a database 2NF compliant:

    1. Identify any fields that do not relate directly to the primary key.

      As I stated above, all of the client information and expense information are not Second Normal Form invoice-particular.

    2. Create new tables accordingly (Figure 3.4). The most logical modification for the existing structure is to make separate Clients, Invoices, and Expenses tables. In my visual representation of the database, I create a box for each table, with the table name as a header and all of its columns (or attributes) underneath.
      Figure 3.4

      Figure 3.4 To normalize the database, I must move redundant information—such as the client and expense data—to their own tables.


    3. Assign or create new primary keys (Figure 3.5).
      Figure 3.5

      Figure 3.5 Each table in the database should have its own primary key, whether it's a dummy field such as Client ID or a necessary one such as Invoice Number.


      Using the techniques described earlier in the chapter, ensure that each new table has a primary key. Because both the Clients and Expenses tables do not have good unique identifiers, I'll create artificial ones: Client ID and Expense ID. Arguably, the Client Name field should be unique and therefore could be the primary key, but it's always best to use integers for this purpose.

    4. Repeat steps 1–3.

      Since I've created new tables with new primary keys, I should double-check to see if there are any 2NF problems. In the example (Figure 3.5), there is one glaring issue—the Expense Category field may apply to multiple expenses. Therefore, I'll make a new Expense Categories table (Figure 3.6).

      Figure 3.6

      Figure 3.6 The Expense Category field, which was part of Expenses, should be its own table as well.

    5. Create the requisite foreign keys indicating the relationships (Figure 3.7). The final step in achieving 2NF compliance is to incorporate foreign keys and relationships to identify how all of the data and tables are associated. Remember that a primary key in one table will most likely be a foreign key in another. If you find that the primary key in one table is not represented as a foreign key in another, you may have missed something (but not necessarily).
      Figure 3.7

      Figure 3.7 For the new primary keys, I've added corresponding foreign keys and indicated the relationships (both one-to-many).

    Third Normal Form

    A database is in Third Normal Form (3NF) if it is in 2NF and every nonkey column is independent of every other nonkey column. In other words, the fields of a table other than the keys should be mutually independent.

    If you followed the first two normalization steps properly, you will not necessarily need to make any changes at this stage. However, if you made a number of changes along the way (as can happen), this could be a last check. For example, say I wanted to record a contact name and email address with each invoice (Figure 3.8). The problem is that this information relates not to an invoice but to the client and, therefore, the database would fail the 3NF test. The correct structure would be to add these fields to the Clients table (Figure 3.9).

    Figure 3.8

    Figure 3.8 ltering the requirements of the database can muddle the design, as it is now no longer normalized with the Contact Name and Email Address additions.


    Figure 3.9

    Figure 3.9 To correctly incorporate the new information (Figure 3.8), I've moved it to the Clients table.

    MySQL Data Types

    Once you have identified all of the tables and columns that the database will need, you should determine each field's MySQL data type. When creating the database, as you will do in the next chapter, MySQL requires that you define what sort of information each field will contain. There are three primary categories, which is true for almost every database software:

    Within each of these, there are a number of variants—some of which are MySQL-specific—you can use. Choosing your column types correctly not only dictates what information can be stored and how, but also affects the database's overall performance. Table 3.2 lists most of the available types for MySQL, how much space they take up, and a brief description.

    Table 3.2 Here are most of the available column types for use with MySQL databases.

    MySQL Datatypes

    Ty p e

    S i z e

    D e s c r i p t i o n

    CHAR[Length]

    Length bytes

    A fixed-length field from 0 to 255 characters long.

    VARCHAR(Length)

    String length + 1 bytes

    A fixed-length field from 0 to 255 characters long.

    TINYTEXT

    String length + 1 bytes

    A string with a maximum length of 255 characters.

    TEXT

    String length + 2 bytes

    A string with a maximum length of 65,535 characters.

    MEDIUMTEXT

    String length + 3 bytes

    A string with a maximum length of 16,777,215 characters.

    LONGTEXT

    String length + 4 bytes

    A string with a maximum length of 4,294,967,295 characters.

    TINYINT[Length]

    1 byte

    Range of -128 to 127 or 0 to 255 unsigned.

    SMALLINT[Length]

    2 bytes

    Range of -32,768 to 32,767 or 0 to 65535 unsigned.

    MEDIUMINT[Length]

    3 bytes

    Range of -8,388,608 to 8,388,607 or 0 to 16,777,215 unsigned.

    INT[Length]

    4 bytes

    Range of -2,147,483,648 to 2,147,483,647 or 0 to 4,294,967,295 unsigned.

    BIGINT[Length]

    8 bytes

    Range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 or 0 to 18,446,744,073,709,551,615 unsigned.

    FLOAT

    4 bytes

    A small number with a floating decimal point.

    DOUBLE[Length, Decimals]

    8 bytes

    A large number with a floating decimal point.

    DECIMAL[Length, Decimals]

    Length + 1 or Length + 2 bytes

    A DOUBLE stored as a string, allowing for a fixed decimal point.

    DATE

    3 bytes

    In the format of YYYY-MM-DD.

    DATETIME

    8 bytes

    In the format of YYYY-MM-DD HH:MM:SS.

    TIMESTAMP

    4 bytes

    In the format of YYYYMMDDHHMMSS; acceptable range ends inthe year 2037.

    TIME

    3 bytes

    In the format of HH:MM:SS

    ENUM

    1 or 2 bytes

    Short for enumeration, which means that each column can haveone of several possible values.

    SET

    1, 2, 3, 4, or 8 bytes

    Like ENUM except that each column can have more than one ofseveral possible values.


    Many of the types can take an optional Length attribute, limiting their size (the square brackets, [], indicate an optional parameter to be put in parentheses, while parentheses themselves indicate required arguments). Further, the number types can be UNSIGNED—limiting the column to positive numbers or zero—or be defined as ZEROFILL, which means that any extra room will be padded with zeroes (ZEROFILLs are also automatically UNSIGNED). The various date types have all sorts of unique behaviors, which are documented in the manual at www.mysql.com/doc/D/A/DATETIME.html. You'll primarily use the DATE and TIME fields without modification, so you need not worry too much about their intricacies. There are also two extensions of the TEXT types that result in a different behavior—ENUM and SET—which allow you to define a series of acceptable values when creating the table. An ENUM field can have only one of a possible several thousand values, while SET allows for several of up to 64 possible values. There are two caveats with ENUM and SET: These types are not supported by other databases, and their usage undermines normalization.

    To choose your data types:

    1. Identify whether a column should be a text, number, or date type.

    2. This is normally an easy and obvious step. You will find that numbers such as ZIP codes and dollar amounts should be text fields if you include their corresponding punctuation (dollar signs, commas, and hyphens), but you'll get better results if you store them as numbers and address the formatting elsewhere.

    3. Choose the most appropriate subtype for each column.

    4. For improved performance, keep in mind two considerations:

    5. Set the maximum length for text and number columns as well as other attributes such as UNSIGNED (Table 3.3).

      Rather than going over how I defined all 21 columns and why, I've listed the properties I came up with in Table 3.3. Different developers have different preferences, but the most important factor is to tailor each setting to the information at hand rather than using generic (and inefficient) TEXT and INT types at all times.

    Table 3.3 An often overlooked aspect of database design is defining the optimal type for each field.

    Accounting Database

    C o l um n N a m e

    Ta b l e

    Co lum n Ty pe

    Invoice Number

    Invoices

    SMALLINT(4) UNSIGNED

    Client ID

    Invoices

    SMALLINT(3) UNSIGNED

    Invoice Date

    Invoices

    DATE

    Invoice Amount

    Invoices

    DECIMAL(10,2) UNSIGNED

    Invoice Description

    Invoices

    TINYTEXT

    Client ID

    Clients

    SMALLINT(3) UNSIGNED

    Client Name

    Clients

    VARCHAR(40)

    Client Street Address

    Clients

    VARCHAR(80)

    Client City

    Clients

    VARCHAR(30)

    Client State

    Clients

    CHAR(2)

    Client Zip

    Clients

    MEDIUMINT(5) UNSIGNED

    Client Phone

    Clients

    VARCHAR(14)

    Contact Name

    Clients

    VARCHAR(40)

    Contact Email Address

    Clients

    VARCHAR(60)

    Expense ID

    Expenses

    SMALLINT(4) UNSIGNED

    Expense Category ID

    Expenses

    TINYINT(3) UNSIGNED

    Expense Amount

    Expenses

    DECIMAL(10,2) UNSIGNED

    Expense Description

    Expenses

    TINYTEXT

    Expense Date

    Expenses

    DATE

    Expense Category ID

    Expense Categories

    TINYINT(3) UNSIGNED

    Expense Category

    Expense Categories

    VARCHAR(30)


    NULL and Default Values

    As you have already seen, there are a few attributes you can assign when defining your data types, including UNSIGNED and ZEROFILL. Two more options are to dictate whether or not the value of a column can be NULL and to set a default value.

    The NULL value, in databases and programming, is the equivalent of saying that the field has no value (or it is unknown). Ideally, every record in a database should have value, but that is rarely the case in practicality. To enforce this limitation on a field, you add the NOT NULL description to its column type. For example, a primary key might now be described as client_id SMALLINT(3) UNSIGNED NOT NULL and Default Values NULL

    When creating a table you can also specify a default value. In cases where a large portion of the records will have the same contents, presetting a default will save you from having to specify a value when inserting new rows, unless that value is different from the norm. One example might be gender ENUM('M', 'F') DEFAULT 'F'

    Table 3.4 incorporates these two new ideas.

    Table 3.4 I've added NOT NULL descriptions and DEFAULT values for a few of my columns to further improve the database design.

    Accounting Database

    Column Name

    Table

    Column Type

    Invoice Number

    Invoices

    SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0

    Client ID

    Invoices

    SMALLINT(3) UNSIGNED

    Invoice Date

    Invoices

    DATE NOT NULL

    Invoice Amount

    Invoices

    DECIMAL(10,2) UNSIGNED NOT NULL

    Invoice Description

    Invoices

    TINYTEXT

    Client ID

    Clients

    SMALLINT(3) UNSIGNED NOT NULL DEFAULT 0

    Client Name

    Clients

    VARCHAR(40) NOT NULL

    Client Street Address

    Clients

    VARCHAR(80)

    Client City

    Clients

    VARCHAR(30)

    Client State

    Clients

    CHAR(2)

    Client Zip

    Clients

    MEDIUMINT(5) UNSIGNED

    Client Phone

    Clients

    VARCHAR(14)

    Contact Name

    Clients

    VARCHAR(40)

    Contact Email Address

    Clients

    VARCHAR(60)

    Expense ID

    Expenses

    SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0

    Expense Category ID

    Expenses

    TINYINT(3) UNSIGNED

    Expense Amount

    Expenses

    DECIMAL(10,2) UNSIGNED NOT NULL

    Expense Description

    Expenses

    TINYTEXT

    Expense Date

    Expenses

    DATE

    Expense Category ID

    Expense Categories

    TINYINT(3) UNSIGNED

    Expense Category

    Expense Categories

    VARCHAR(30)


    Indexes

    Indexes are a special system that databases use to improve the overall performance. By setting indexes on your tables, you are telling MySQL to pay particular attention to that column (in layman's terms). In fact, MySQL creates extra files to store and track indexes efficiently.

    MySQL allows for up to 32 indexes for each table, and each index can incorporate up to 16 columns. While a multicolumn index may not seem obvious, it will come in handy for searches frequently performed on the same set of multiple columns (e.g., first and last name, city and state, etc.)

    On the other hand, one should not go overboard with indexing. While it does improve the speed of reading from databases, it slows down the process of altering data in a database (because the changes need to be recorded in the index). Indexes are best used on columns

    Note that, in MySQL, a primary key column is automatically indexed for efficiency.

    MySQL has three types of indexes: INDEX, UNIQUE (which requires each row to have a unique value), and PRIMARY KEY (which is just a particular UNIQUE index). Table 3.5 lists the indexes I propose for the accounting database.

    Table 3.5 To improve the performance of my database, I add a few (but not too many) indexes to help MySQL access the stored information.

    Accounting Indexes

    Column

    Index Type

    Invoice Number

    PRIMARY KEY

    Client ID

    PRIMARY KEY

    Expense ID

    PRIMARY KEY

    Expense Category ID

    PRIMARY KEY

    Invoice Date

    INDEX

    Client Name

    INDEX (or UNIQUE)


    One final attribute a column can have that frequently works in conjunction with an index is AUTO_INCREMENT. When you define a field with this property using client_id SMALLINT(3) UNSIGNED NOT NULL AUTO_INCREMENT you are effectively telling MySQL to set the value of this column to the next logical value in the series. If the column is an integer, the next highest integer will be used when no value is set when a new record is inserted.

    Final Design Steps

    The final step in designing your database is to adhere to certain naming conventions. While MySQL is very flexible on how you name your databases, tables, and columns, here are some good rules to go by (some of which are required):

    These are largely my recommendations and are therefore not absolute, except for limiting yourself to alphanumeric names without spaces. Some developers prefer to use capital letters to break up words (instead of underscores). Others like to indicate the column type in its name. The most important consideration is that you remain consistent with your conventions.

    Table 3.6 shows the final database design, which will be created in the next chapter.

    Table 3.6 The final database design step incorporates certain naming conventions that I try to adhere to.

    accounting

    Column Name

    Table

    Column Type

    invoice_id

    invoices

    SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0

    client_id

    invoices

    SMALLINT(3) UNSIGNED

    invoice_date

    invoices

    DATE NOT NULL

    invoice_amount

    invoices

    DECIMAL(10,2) UNSIGNED NOT NULL

    invoice_description

    invoices

    TINYTEXT

    client_id

    clients

    SMALLINT(3) UNSIGNED NOT NULL DEFAULT 0

    client_name

    clients

    VARCHAR(40) NOT NULL

    client_street

    clients

    VARCHAR(80)

    client_city

    clients

    VARCHAR(30)

    client_state

    clients

    CHAR(2)

    client_zip

    clients

    MEDIUMINT(5) UNSIGNED

    client_phone

    clients

    VARCHAR(14)

    contact_name

    clients

    VARCHAR(40)

    contact_email

    clients

    VARCHAR(60)

    expense_id

    expenses

    SMALLINT(4) UNSIGNED NOT NULL DEFAULT 0

    expense_category_id

    expenses

    TINYINT(3) UNSIGNED

    expense_amount

    Expenses

    DECIMAL(10,2) UNSIGNED NOT NULL

    expense_description

    expenses

    TINYTEXT

    expense_date

    Expenses

    DATE

    expense_category_id

    expense_categories

    TINYINT(3) UNSIGNED

    expense_category

    expense_categories

    VARCHAR(30)


    Tips

    1301 Sansome Street, San Francisco, CA 94111