Peachpit Press

Summarizing and Grouping Data in SQL

Date: Jan 31, 2003

Return to the article

This sample chapter introduces SQL’s aggregate functions, or set functions, which operate on a group of values to produce a single, summarizing value.

The preceding chapter described scalar functions, which operate on individual row values. This chapter introduces SQL’s aggregate functions, or set functions, which operate on a group of values to produce a single, summarizing value. You apply an aggregate to a set of rows, which may be:

No matter how many rows the set contains, an aggregate function returns a single statistic: a sum, minimum, or average, for example.

In this chapter, I’ll also cover the SELECT statement’s GROUP BY clause, which groups rows, and HAVING clause, which filters groups.

Using Aggregate Functions

Table 6.1 lists SQL’s standard aggregate functions. The important characteristics of the aggregate functions are:

Table 6.1 Aggregate Functions

Function Returns
MIN(expr) Minimum value in expr
MAX(expr) Maximum value in expr
SUM(expr) Sum of the values in expr
AVG(expr) Average (arithmetic mean) of the values in expr
COUNT(expr) The number of non-null values in expr
COUNT(*) The number of rows in a table or set

Finding a Minimum with MIN()

Use the aggregate function MIN() to find the minimum of a set of values.

To find the minimum of a set of values:

Listing 6.1 and Figure 6.1 show some queries that involve MIN(). The first query returns the price of the lowest-priced book. The second query returns the earliest publication date. The third query returns the number of pages in the shortest history book.

Listing 6.1 Some MIN() queries. See Figure 6.1 for the results.

SELECT MIN(price) AS "Min price"
  FROM titles;


SELECT MIN(pubdate) AS "Earliest pubdate"
  FROM titles;


SELECT MIN(pages) AS "Min history pages"
  FROM titles
  WHERE type = 'history';

Figure 6.1Figure 6.1 Results of Listing 6.1.


TIPS

Finding a Maximum with MAX()

Use the aggregate function MAX() to find the maximum of a set of values.

To find the maximum of a set of values:

Listing 6.2 and Figure 6.2 show some queries that involve MAX(). The first query returns the author’s last name that is last alphabetically. The second query returns the prices of the cheapest and most expensive books, and the price range. The third query returns the highest revenue (= price ´ sales) among the history books.

Listing 6.2 Some MAX() queries. See Figure 6.2 for the results.

SELECT MAX(au_lname) AS "Max last name"
  FROM authors;


SELECT
    MIN(price) AS "Min price",
    MAX(price) AS "Max price",
    MAX(price) - MIN(price) AS "Range"
  FROM titles;


SELECT MAX(price * sales)
         AS "Max history revenue"
  FROM titles
  WHERE type = 'history';

Figure 6.2Figure 6.2 Results of Listing 6.2.


TIPS

  • MAX() works with character, numeric, and datetime data types.

  • With character data columns, MAX() finds the value that is highest in the sort sequence; see “Sorting Rows with ORDER BY” in Chapter 4.

  • DISTINCT isn’t meaningful with MAX(); see “Aggregating Distinct Values with DISTINCT” later in this chapter.

  • String comparisons may be case-insensitive or case-sensitive, depending on your DBMS; see the DBMS Tip in “Filtering Rows with WHERE” in Chapter 4.

    When comparing two VARCHAR strings for equality, your DBMS may right-pad the shorter string with spaces and compare the strings position-by-position. In this case, the strings ‘Jack’ and ‘Jack ‘ are equal. Refer to your DBMS documentation (or experiment) to determine which string MAX() returns.

Calculating a Sum with SUM()

Use the aggregate function SUM() to find the sum (total) of a set of values.

To calculate the sum of a set of values:

  • Type:

    SUM(expr)

    expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.

Listing 6.3 and Figure 6.3 show some queries that involve SUM(). The first query returns the total advances paid to all authors. The second query returns the total sales of books published in 2000. The third query returns the total price, sales, and revenue (= price ´ sales) of all books. Note a mathematical chestnut in action here: “The sum of the products doesn’t (necessarily) equal the product of the sums.”

Listing 6.3 Some SUM() queries. See Figure 6.3 for the results.

SELECT SUM(advance) AS "Total advances"
  FROM royalties;


SELECT SUM(sales)
         AS "Total book sales for 2000"
  FROM titles
  WHERE pubdate
    BETWEEN DATE '2000-01-01'
      AND DATE '2000-12-31';


SELECT
    SUM(price) AS "Total price",
    SUM(sales) AS "Total sales",
    SUM(price * sales) AS "Total revenue"
  FROM titles;

Figure 6.3Figure 6.3 Results of Listing 6.3.


TIPS

  • SUM() works with only numeric data types.

  • The sum of no rows is null—not zero as you might expect.

  • In Microsoft Access date literals, omit the DATE keyword, and surround the literal with # characters instead of quotes. To run Listing 6.3 in Access, change the date literals in the second query to #2000-01-01# and #2000-12-31#. In Microsoft SQL Server date literals, omit the DATE keyword. To run Listing 6.3 in SQL Server, change the date literals to ‘2000-01-01’ and ‘2000-12-31’.

Finding an Average with AVG()

Use the aggregate function AVG() to find the average, or arithmetic mean, of a set of values. The arithmetic mean is the sum of a set of quantities divided by the number of quantities in the set.

To calculate the average of a set of values:

  • Type:

    AVG(expr)

    expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.

Listing 6.4 and Figure 6.4 shows some queries that involve AVG(). The first query returns the average price of all books if prices were doubled. The second query returns the average and total sales for business books; both calculations are null (not zero), because the table contains no business books. The third query uses a subquery (see Chapter 8) to list the books with above-average sales.

Listing 6.4 Some AVG() queries. See Figure 6.4 for the results.

SELECT AVG(price * 2) AS "AVG(price * 2)"
FROM titles;
SELECT AVG(sales) AS "AVG(sales)",
SUM(sales) AS "SUM(sales)"
FROM titles
WHERE type = 'business';
SELECT title_id, sales
FROM titles
WHERE sales >
(SELECT AVG(sales) FROM titles)
ORDER BY sales DESC;

Figure 6.4Figure 6.4 Results of Listing 6.4.


TIPS

  • AVG() works with only numeric data types.

  • The average of no rows is null—not zero, as you might expect.

  • If you’ve used, say, 0 or –1 instead of null to represent missing values, the inclusion of those numbers in AVG() calculations yields an incorrect result. Use NULLIF() to convert the missing-value numbers to nulls, so they’ll be excluded from calculations; see “Comparing Expressions with NULLIF()” in Chapter 5.

  • MySQL 4.0 and earlier versions lack subquery support and won’t run the third query in Listing 6.4.

Counting Rows with COUNT()

Use the aggregate function COUNT() to count the number of rows in a set of values. COUNT() has two forms:

  • COUNT(expr) returns the number of rows in which expr is not null.

  • COUNT(*) returns the count of all rows in a set, including nulls and duplicates.

To count non-null rows:

  • Type:

    COUNT(expr)

    expr is a column name, literal, or expression. The result is an integer greater than or equal to zero.

To count all rows, including nulls:

  • Type:

    COUNT(*)

    COUNT(*) returns an integer greater than or equal to zero.

Listing 6.5 and Figure 6.5 show some queries that involve COUNT(expr) and COUNT(*). The three queries count rows in the table titles and are identical except for the WHERE clause. The row counts in the first query differ because the column price contains a null. In the second query, the row counts are identical because the WHERE clause eliminates the row with the null price before the count. The third query shows the row-count differences between the results of first two queries.

Listing 6.5 Some COUNT() queries. See Figure 6.5 for the results.

SELECT
    COUNT(title_id) AS "COUNT(title_id)",
    COUNT(price) AS "COUNT(price)",
    COUNT(*) AS "COUNT(*)"
  FROM titles;


SELECT
    COUNT(title_id) AS "COUNT(title_id)",
    COUNT(price) AS "COUNT(price)",
    COUNT(*) AS "COUNT(*)"
  FROM titles
  WHERE price IS NOT NULL;


SELECT
    COUNT(title_id) AS "COUNT(title_id)",
    COUNT(price) AS "COUNT(price)",
    COUNT(*) AS ""COUNT(*)"
  FROM titles
  WHERE price IS NULL;

Figure 6.5Figure 6.5 Results of Listing 6.5.


TIPS

  • COUNT(expr) and COUNT(*) work with alldata types.

  • COUNT(expr) and COUNT(*) neverreturn null.

  • DISTINCT isn’t meaningful with COUNT(*); see “Aggregating Distinct Values with DISTINCT” later in this chapter.

Aggregating Distinct Values with DISTINCT

You can use DISTINCT to eliminate duplicate values in aggregate function calculations; see “Eliminating Duplicate Rows with DISTINCT” in Chapter 4. The general syntax of an aggregate function is:

agg_func([ALL | DISTINCT] expr)

agg_func is MIN, MAX, SUM, AVG, or COUNT. expr is a column name, literal, or expression. ALL applies the aggregate function to all values, and DISTINCT specifies that each unique value is considered. ALL is the default and rarely is seen in practice.

With SUM(), AVG(), and COUNT(expr), DISTINCT eliminates duplicate values before the sum, average, or count is calculated. DISTINCT isn’t meaningful with MIN() and MAX(); you can use it, but it won’t change the result. You can’t use DISTINCT with COUNT(*).

To calculate the sum of a set of distinct values:

  • Type:

    SUM(DISTINCT expr)

    expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.

To calculate the average of a set of distinct values:

  • Type:

    AVG(DISTINCT expr)

    expr is a column name, literal, or numeric expression. The result’s data type is at least as precise as the most precise data type used in expr.

To count distinct non-null rows:

  • Type:

    COUNT(DISTINCT expr)

    expr is a column name, literal, or expression. The result is an integer greater than or equal to zero.

The queries in Listing 6.6 return the count, sum, and average of book prices. The non-DISTINCT and DISTINCT results in Figure 6.6 differ because the DISTINCT results eliminate the duplicates of prices $12.99 and $19.95 from calculations.

Listing 6.6 Some DISTINCT aggregate queries. See Figure 6.6 for the results.

SELECT
    COUNT(*) AS "COUNT(*)",
    COUNT(price) AS "COUNT(price)",
    SUM(price) AS "SUM(price)",
    AVG(price) AS "AVG(price)"
  FROM titles;


SELECT
    COUNT(DISTINCT price)
      AS "COUNT(DISTINCT)",
    SUM(DISTINCT price)
      AS "SUM(DISTINCT)",
    AVG(DISTINCT price)
      AS "AVG(DISTINCT)"
  FROM titles;

Figure 6.6Figure 6.6 Results of Listing 6.6.

TIPS

  • DISTINCT in a SELECT clause andDISTINCT in an aggregate function don’t return the same result.

    The three queries in Listing 6.7 count the author IDs in the tabletitle_authors. Figure 6.7 shows theresults. The first query counts all the author IDs in the table. The second query returns the same result as the first query because COUNT() already has done its work and returned a value in single row before DISTINCT is applied. In the third query, DISTINCT is applied to the author IDs before COUNT() starts counting.

    Listing 6.7 DISTINCT in a SELECT clause and DISTINCT in an aggregate function differ in meaning. See Figure 6.7 for the results.

    SELECT COUNT(au_id)
             AS "COUNT(au_id)"
      FROM title_authors;
    
    
    SELECT DISTINCT COUNT(au_id)
             AS "DISTINCT COUNT(au_id)"
      FROM title_authors;
    
    
    SELECT COUNT(DISTINCT au_id)
             AS "COUNT(DISTINCT au_id)"
      FROM title_authors;

    Figure 6.7Figure 6.7 Results of Listing 6.7.


  • Mixing non-DISTINCT and DISTINCT aggregates in the same SELECT clause can produce misleading results.

    The four queries in Listing 6.8 (following page) show the four combinations of non-DISTINCT and DISTINCT sums and counts. Of the four results in Figure 6.8 (following page) , only the first result (no DISTINCTs) and final result (all DISTINCTs) are consistent mathematically, which you can verify with AVG(price) and AVG(DISTINCTprice). In the second and third queries (mixed non-DISTINCTs and DISTINCTs), you can’t calculate a valid average by dividing the sum by the count.

    Listing 6.8 Mixing non-DISTINCT and DISTINCT aggregates in the same SELECT clause can produce misleading results. See Figure 6.8 for the results.

    SELECT
        COUNT(price)
          AS "COUNT(price)",
        SUM(price)
          AS "SUM(price)"
      FROM titles;
    
    
    SELECT
        COUNT(price)
          AS "COUNT(price)",
        SUM(DISTINCT price)
          AS "SUM(DISTINCT price)"
      FROM titles;
    
    
    SELECT
        COUNT(DISTINCT price)
          AS "COUNT(DISTINCT price)",
        SUM(price)
          AS "SUM(price)"
      FROM titles;
    
    
    SELECT
        COUNT(DISTINCT price)
          AS "COUNT(DISTINCT price)",
        SUM(DISTINCT price)
          AS "SUM(DISTINCT price)"
      FROM titles;

    Figure 6.8Figure 6.8 Results of Listing 6.8.


  • Microsoft Access doesn’t support DISTINCT aggregate functions. This statement, for example, is illegal in Access:

    SELECT SUM(DISTINCT price)
      FROM titles;   --Illegal in Access

    But you can replicate it with this subquery (see the Tips in “Using Subqueries as Column Expressions” in Chapter 8):

    SELECT SUM(price)
      FROM (SELECT DISTINCT price 
              FROM titles);

    This Access workaround won’t let you mix non-DISTINCT and DISTINCT aggregates, however, as in the second and third queries in Listing 6.8.

    In Microsoft SQL Server, if you use DISTINCT, expr must be a column name only. It can’t include an arithmetic expression:

    SELECT COUNT(DISTINCT price * sales) 
      FROM titles; --Illegal in SQL Server

    MySQL supports COUNT(DISTINCT expr) but not SUM(DISTINCT expr) and AVG(DISTINCT expr). Listings 6.6 and 6.8 won’t run in MySQL.

Grouping Rows with GROUP BY

To this point, I’ve used aggregate functions to summarize all the values in a column or just those values that matched a WHERE search condition. You can use the GROUP BY clause to divide a table into logical groups (categories) and calculate aggregate statistics for each group.

An example will clarify the concept. Listing 6.9 uses GROUP BY to count the number of books that each author wrote (or co-wrote). In the SELECT clause, the column au_id identifies each author, and the derived column num_books counts each author’s books. The GROUP BY clause causes num_books to be calculated for every unique au_id instead of only once for the entire table. Figure 6.9 shows the result. In this example, au_id is called the grouping column.

Listing 6.9 List the number of books each author wrote (or co-wrote). See Figure 6.9 for the result.

SELECT
    au_id,
    COUNT(*) AS "num_books"
  FROM title_authors
  GROUP BY au_id;

Figure 6.9Figure 6.9 Results of Listing 6.9.


The GROUP BY clause’s important characteristics are:

  • The GROUP BY clause comes after the WHERE clause and before the ORDER BY clause.

  • Grouping columns can be column names or derived columns.

  • Every nonaggregate column in the SELECT clause must appear in the GROUP BY clause. This statement is illegal because pub_id isn’t in the GROUP BY clause:

    SELECT type,  pub_id , COUNT(*) 
      FROM titles
      GROUP BY type;             --Illegal

    Because the GROUP BY can return only one row for each value of type, there’s no way to return multiple values of pub_id that are associated with any particular value of type.

  • If the SELECT clause contains a complex nonaggregate expression (more than just a simple column name), the GROUP BY expression must match the SELECT expression exactly.

  • Specify multiple grouping columns in the GROUP BY clause to nest groups. Data is summarized at the last specified group.

  • If a grouping column contains a null, that row becomes a group in the result. If a grouping column contains more than one null, the nulls are put into a single group. A group that contains multiple nulls doesn’t imply that the nulls equal one another.

  • Use a WHERE clause in a query containing a GROUP BY clause to eliminate rows before grouping occurs.

  • You can’t use a column alias in the GROUP BY clause, though table aliases are allowed as qualifiers; see “Creating Table Aliases with AS” in Chapter 7.

  • Without an ORDER BY clause, groups returned by GROUP BY aren’t in any particular order. To sort the result of Listing 6.9 by the descending number of books, for example, add the clause ORDER BY “num_books” DESC.

To group rows:

  • Type:

    SELECT columns 
      FROM table
      [WHERE search_condition] 
      GROUP BY grouping_columns
      HAVING search_condition]
      [ORDER BY sort_columns];

    columns and grouping_columns are one or more comma-separated column names, and table is the name of the table that contains columns and grouping_columns. The nonaggregate columns that appear in columns also must appear in grouping_columns. The order of the column names in grouping_columns determines the grouping levels, from the highest to the lowest level of grouping.

    The GROUP BY clause restricts the rows of the result; only one row appears for each distinct value in the grouping column or columns. Each row in the result contains summary data related to the specific value in its grouping columns.

    If the statement includes a WHERE clause, the DBMS groups values after it applies search_condition to the rows in table. If the statement includes an ORDER BY clause, the columns in sort_columns must be drawn from those in columns. The WHERE and ORDER BY clauses are covered in “Filtering Rows with WHERE” and “Sorting Rows with ORDER BY” in Chapter 4. HAVING, which filters grouped rows, is covered in the next section.

Listing 6.10 and Figure 6.10 show the difference between COUNT(expr) and COUNT(*) in a query that contains GROUP BY. The table publishers contains one null in the column state (for publisher P03 in Germany). Recall from “Counting Rows with COUNT()” earlier in this chapter that COUNT(expr) counts non-null values and COUNT(*) counts all values, including nulls. In the result, GROUP BY recognizes the null and creates a null group for it. COUNT(*) finds (and counts) the one null in the column state. But COUNT(state) contains a zero for the null group because COUNT(state) finds only a null in the null group, which it excludes from the count—hence, the zero.

Listing 6.10 This query illustrates the difference between COUNT(expr) and COUNT(*) in a GROUP BY query. See Figure 6.10 for the result.

SELECT
    state,
    COUNT(state) AS "COUNT(state)",
    COUNT(*)     AS "COUNT(*)"
  FROM publishers
  GROUP BY state;

Figure 6.10Figure 6.10 Results of Listing 6.10.

If a nonaggregate column contains nulls, using COUNT(*) rather than COUNT(expr) can produce misleading results. Listing 6.11 and Figure 6.11 show summary sales statistics for each type of book. The sales value for one of the biographies is null, so COUNT(sales) and COUNT(*) differ by 1. The average calculation in the fifth column, SUM/COUNT(sales), is consistent mathematically, whereas the sixth-column average, SUM/COUNT(*), is not. I’ve verified the inconsistency with AVG(sales) in the final column. (Recall a similar situation in Listing 6.8 in “Aggregating Distinct Values with DISTINCT” earlier in this chapter.)

Listing 6.11 For mathematically consistent results, use COUNT(expr), rather than COUNT(*), if expr contains nulls. See Figure 6.11 for the result.

SELECT
    type,
    SUM(sales) AS "SUM(sales)",
    COUNT(sales) AS "COUNT(sales)",
    COUNT(*) AS "COUNT(*)",
    SUM(sales)/COUNT(sales)
      AS "SUM/COUNT(sales)",
    SUM(sales)/COUNT(*)
      AS "SUM/COUNT(*)",
    AVG(sales) AS "AVG(sales)"
  FROM titles
  GROUP BY type;

Figure 6.11Figure 6.11 Results of Listing 6.11.

Listing 6.12 and Figure 6.12 show a simple GROUP BY query that calculates the total sales, average sales, and number of titles for each type of book. In Listing 6.13 and Figure 6.13, I’ve added a WHERE clause to eliminate books priced under $13 before grouping. I’ve also added an ORDER BY clause to sort the result by descending total sales of each book type.

Listing 6.12 This simple GROUP BY query calculates a few summary statistics for each type of book. See Figure 6.12 for the result.

SELECT
    type,
    SUM(sales) AS "SUM(sales)",
    AVG(sales) AS "AVG(sales)",
    COUNT(sales) AS "COUNT(sales)"
  FROM titles
  GROUP BY type;

Figure 6.12Figure 6.12 Results of Listing 6.12.

Listing 6.13 Here, I’ve added WHERE and ORDER BY clauses to Listing 6.12 to cull books priced under $13 and sort the result by descending total sales. See Figure 6.13 for the result.

SELECT
    type,
    SUM(sales) AS "SUM(sales)",
    AVG(sales) AS "AVG(sales)",
    COUNT(sales) AS "COUNT(sales)"
  FROM titles
  WHERE price >= 13
  GROUP BY type
  ORDER BY "SUM(sales)" DESC;

Figure 6.13Figure 6.13 Results of Listing 6.13.

Listing 6.14 and Figure 6.14 use multiple grouping columns to count the number of titles of each type that each publisher publishes.

Listing 6.14 List the number of books of each type for each publisher, sorted by descending count within ascending publisher ID. See Figure 6.14 for the result.

SELECT
    pub_id,
    type,
    COUNT(*) AS "COUNT(*)"
  FROM titles
  GROUP BY pub_id, type
  ORDER BY pub_id ASC, "COUNT(*)" DESC;

Figure 6.14Figure 6.14 Results of Listing 6.14.


In Listing 6.15 and Figure 6.15, I revisit Listing 5.31 in “Evaluating Conditional Values with CASE” in Chapter 5. But instead of listing each book categorized by its sales range, I use GROUP BY to list the number of books in each sales range.

Listing 6.15 List the number of books in each calculated sales range, sorted by ascending sales. See Figure 6.15 for the result.

SELECT
    CASE
      WHEN sales IS NULL
        THEN 'Unknown'
      WHEN sales <= 1000
        THEN 'Not more than 1,000'
      WHEN sales <= 10000
        THEN 'Between 1,001 and 10,000'
      WHEN sales <= 100000
        THEN 'Between 10,001 and 100,000'
      WHEN sales <= 1000000
        THEN 'Between 100,001 and 1,000,000'
      ELSE 'Over 1,000,000'
    END
      AS "Sales category",
    COUNT(*) AS "Num titles"
  FROM titles
  GROUP BY
    CASE
      WHEN sales IS NULL
        THEN 'Unknown'
      WHEN sales <= 1000
        THEN 'Not more than 1,000'
      WHEN sales <= 10000
        THEN 'Between 1,001 and 10,000'
      WHEN sales <= 100000
        THEN 'Between 10,001 and 100,000'
      WHEN sales <= 1000000
        THEN 'Between 100,001 and 1,000,000'
      ELSE 'Over 1,000,000'
    END
  ORDER BY MIN(sales) ASC;

Figure 6.15Figure 6.15 Results of Listing 6.15.


TIPS

  • Use the WHERE clause to exclude rows that you don’t want grouped, and use the HAVING clause to filter rows after they have been grouped. For information about HAVING, see the next section.

  • If used without an aggregate function,GROUP BY acts like DISTINCT (Listing 6.16and Figure 6.16). For information about DISTINCT, see “Eliminating Duplicate Rows with DISTINCT” in Chapter 4.

    Listing 6.16 Both of these queries return the same result. See Figure 6.16 for the result.

    SELECT type
      FROM titles
      GROUP BY type;
    
    SELECT DISTINCT type
      FROM titles;

    Figure 6.16Figure 6.16 Results of Listing 6.16.


  • You can use GROUP BY to look for patterns in your data. In Listing 6.17 and Figure 6.17, I’m looking for a relationship between price categories and average sales.

    Listing 6.17 List the average sales for each price, sorted by ascending price. See Figure 6.17 for the result.

    SELECT price, AVG(sales) AS "AVG(sales)"
      FROM titles
      WHERE price IS NOT NULL
      GROUP BY price
      ORDER BY price ASC;

    Figure 6.17Figure 6.17 Results of Listing 6.17.


  • Don’t rely on GROUP BY to sort your result. I recommend that you include ORDER BY whenever you use GROUP BY (even though I’ve omitted ORDER BY in some examples). In some DBMSes, a GROUP BY implies anORDER BY.

  • The multiple values returned by an aggregate function in a GROUP BY query are called vector aggregates. In a query that lacks a GROUP BY clause, the single value returned by an aggregate function is a scalar aggregate.

  • In Microsoft Access, use the Switch() function instead of the CASE expression in Listing 6.15. See the DBMS Tip in “Evaluating Conditional Values with CASE” in Chapter 5.

    MySQL doesn’t allow CASE in a GROUP BY clause. Listing 6.15 won’t run in MySQL.

    Some DBMSes, such as MySQL and PostgreSQL, permit column aliases in the GROUP BY clause.

Filtering Groups with HAVING

The HAVING clause sets conditions on the GROUP BY clause similar to the way that WHERE interacts with SELECT. The HAVING clause’s important characteristics are:

  • The HAVING clause comes after the GROUP BY clause and before the ORDER BY clause.

  • Just as WHERE limits the number of rows displayed by SELECT, HAVING limits the number of groups displayed by GROUP BY.

  • The WHERE search condition is applied before grouping occurs, and the HAVING search condition is applied after.

  • HAVING syntax is similar to the WHERE syntax, except that HAVING can contain aggregate functions.

  • A HAVING clause can reference any of the items that appear in the SELECT list.

The sequence in which the WHERE, GROUP BY, and HAVING clauses are applied is:

  1. The WHERE clause filters the rows that result from the operations specified in the FROM and JOIN clauses.

  2. The GROUP BY clause groups the output of the WHERE clause.

  3. The HAVING clause filters rows from the grouped result.

To filter groups:

  • Following the GROUP BY clause, type:

    HAVING search_condition

    search_condition is a search condition used to filter groups. search_condition can contain aggregate functions but otherwise is identical to the WHERE search condition, described in “Filtering Rows with WHERE” and subsequent sections in Chapter 4. You may combine and negate multiple HAVING conditions with the logical operators AND, OR, and NOT.

    The HAVING search condition is applied to the rows in the output produced by grouping. Only the groups that meet the search condition appear in the result. You can apply a HAVING clause only to columns that appear in the GROUP BY clause or in an aggregate function.

In Listing 6.18 and Figure 6.18, I revisit Listing 6.9 earlier in this chapter, but instead of listing the number of books that each author wrote (or co-wrote), I use HAVING to list only the authors who have written three or more books.

Listing 6.18 List the number of books written (or co-written) by each author who has written three or more books. See Figure 6.18 for the result.

SELECT
    au_id,
    COUNT(*) AS "num_books"
  FROM title_authors
  GROUP BY au_id
  HAVING COUNT(*) >= 3;

Figure 6.18Figure 6.18 Results of Listing 6.18.

In Listing 6.19 and Figure 6.19, the HAVING condition also is an aggregate expression in the SELECT clause. This query still works if you remove the AVG() expression from the SELECT list (Listing 6.20 and Figure 6.20).

Listing 6.19 List the number of titles and average revenue for the types with average revenue over $1 million. See Figure 6.19 for the result.

SELECT
type,
COUNT(price) AS "COUNT(price)",
AVG(price * sales) AS "AVG revenue"
FROM titles
GROUP BY type
HAVING AVG(price * sales) > 1000000;

Figure 6.19Figure 6.19 Results of Listing 6.19.

Listing 6.20 Listing 6.19 still works without AVG(price * sales) in the SELECT list. See Figure 6.20 for the result.

SELECT
    type,
    COUNT(price) AS "COUNT(price)"
  FROM titles
  GROUP BY type
  HAVING AVG(price * sales) > 1000000;

Figure 6.20Figure 6.20 Results of Listing 6.20.

In Listing 6.21 and Figure 6.21, multiple grouping columns count the number of titles of each type that each publisher publishes. The HAVING condition removes groups in which the publisher has one or fewer titles of a particular type. This query retrieves a subset of the result of Listing 6.14 earlier in this chapter.

Listing 6.21 List the number of books of each type for each publisher, for publishers with more than one title of a type. See Figure 6.21 for the result.

SELECT
    pub_id,
    type,
    COUNT(*) AS "COUNT(*)"
  FROM titles
  GROUP BY pub_id, type
  HAVING COUNT(*) > 1
  ORDER BY pub_id ASC, "COUNT(*)" DESC;

Figure 6.21Figure 6.21 Results of Listing 6.21.

In Listing 6.22 and Figure 6.22, the WHERE clause first removes all rows except for books from publishers P03 and P04. Then the GROUP BY clause groups the output of the WHERE clause by type. Finally, the HAVING clause filters rows from the grouped result.

Listing 6.22 For books from publishers P03 and P04, list the total sales and average price by type, for types with more than $10,000 total sales and less than $20 average price. See Figure 6.22 for the result.

SELECT
    type,
    SUM(sales) AS "SUM(sales)",
    AVG(price) AS "AVG(price)"
  FROM titles
  WHERE pub_id IN ('P03', 'P04')
  GROUP BY type
  HAVING SUM(sales) > 10000
    AND AVG(price) < 20;

Figure 6.22Figure 6.22 Results of Listing 6.22.

TIP

  • Generally, HAVING clauses should involve only aggregates. The only conditions that should be specified in the HAVING clause are those conditions that must be applied after the grouping operation has been performed. It’s more efficient to specify conditions that can be applied before the grouping operation in the WHERE clause. The following statements, for example, are equivalent, but the first statement is preferable because it reduces the number of rows that have to be grouped:

    SELECT pub_id, SUM(sales) --Faster 
      FROM titles 
      WHERE pub_id IN (‘P03’, ‘P04’) 
      GROUP BY pub_id 
      HAVING SUM(sales) > 10000;
    
    SELECT pub_id, SUM(sales) --Slower 
      FROM titles 
      GROUP BY pub_id 
      HAVING SUM(sales) > 10000 
        AND pub_id IN (‘P03’, ‘P04’);

1301 Sansome Street, San Francisco, CA 94111