Summarizing and Grouping Data in SQL
Date: Jan 31, 2003
The preceding chapter described scalar functions, which operate on individual row values. This chapter introduces SQLs aggregate functions, or set functions, which operate on a group of values to produce a single, summarizing value. You apply an aggregate to a set of rows, which may be:
All the rows in a table
Only those rows specified by a WHERE clause
Those rows created by a GROUP BY clause
No matter how many rows the set contains, an aggregate function returns a single statistic: a sum, minimum, or average, for example.
In this chapter, Ill also cover the SELECT statements GROUP BY clause, which groups rows, and HAVING clause, which filters groups.
Using Aggregate Functions
Table 6.1 lists SQLs standard aggregate functions. The important characteristics of the aggregate functions are:
Table 6.1 Aggregate Functions
Function | Returns |
MIN(expr) | Minimum value in expr |
MAX(expr) | Maximum value in expr |
SUM(expr) | Sum of the values in expr |
AVG(expr) | Average (arithmetic mean) of the values in expr |
COUNT(expr) | The number of non-null values in expr |
COUNT(*) | The number of rows in a table or set |
In Table 6.1, expr often is a column name, but it also can be a literal, function, or any combination of column names, literals, and functions coupled by operators.
SUM() and AVG() work with only numeric data types. MIN() and MAX() work with characters, numeric, and datetime datatypes. COUNT(expr) and COUNT(*) workwith all data types.
All aggregate functions except COUNT(*) ignore nulls. (You can use COALESCE() in an aggregate function argument to substitute a value for a null; see Checking for Nulls with COALESCE() in Chapter 5.)
COUNT(expr) and COUNT(*) never returnnull but return either a positive integer or zero. The other aggregate functions return null if the set contains no rows or contains rows with only nulls.
Use the DISTINCT keyword to aggregate distinct values; see Aggregating Distinct Values with DISTINCT later in this chapter.
Aggregate functions often are used with the GROUP BY clause; see Grouping Rows with GROUP BY later in this chapter.
Use a WHERE clause to restrict the rows used in aggregate calculations; see Filtering Rows with WHERE in Chapter 4.
Default column headings for aggregate expressions vary by DBMS; use AS to name the result column. See Creating Column Aliases with AS in Chapter 4.
An aggregate expression cant appear in a WHERE clause. If you want to find the title of the book with the highest sales, you cant use:
SELECT title_id --Illegal FROM titles WHERE sales = MAX(sales);
You cant mix nonaggregate (row-by-row) and aggregate expressions in a SELECT clause. A SELECT clause must contain either all nonaggregate expressions or all aggregate expressions. If you want to find the title of the book with the highest sales, you cant use:
SELECT title_id, MAX(sales) FROM titles; --Illegal
The one exception to this rule is that you can mix nonaggregate and aggregate expressions for grouping columns (see Grouping Rows with GROUP BY later in this chapter):
SELECT type, SUM(sales) FROM titles GROUP BY type; --Legal
You can use more than one aggregate expression in a SELECT clause:
SELECT MIN(sales), MAX(sales) FROM titles; --Legal
You cant nest aggregate functions:
SELECT SUM(AVG(sales)) FROM titles; --Illegal
You can use aggregate expressions in sub-queries. This statement finds the title of the book with the highest sales:
SELECT title_id, price --Legal FROM titles WHERE sales = (SELECT MAX(sales) FROM titles);
You cant use subqueries (see Chapter 8) in aggregate expressions: AVG(SELECT price FROM titles) is illegal.
TIPS
Oracle lets you nest aggregate expressions in GROUP BY queries. The following example calculates the average of the maximum sales of all book types. Oracle evaluates the inner aggregate MAX(sales) for the grouping column type and then aggregates the results again:
SELECT AVG(MAX(sales)) FROM titles GROUP BY type; --Legal in Oracle
MySQL 4.0 and earlier versions dont support subqueries.
DBMSes provide additional aggregate functions to calculate other statistics, such as the standard deviation; search your DBMS documentation for aggregate functions or group functions.
Finding a Minimum with MIN()
Use the aggregate function MIN() to find the minimum of a set of values.
To find the minimum of a set of values:
Type:
MIN(expr)
expr is a column name, literal, or expression. The result has the same data type as expr.
Listing 6.1 and Figure 6.1 show some queries that involve MIN(). The first query returns the price of the lowest-priced book. The second query returns the earliest publication date. The third query returns the number of pages in the shortest history book.
Listing 6.1 Some MIN() queries. See Figure 6.1 for the results.
SELECT MIN(price) AS "Min price" FROM titles; SELECT MIN(pubdate) AS "Earliest pubdate" FROM titles; SELECT MIN(pages) AS "Min history pages" FROM titles WHERE type = 'history';
Figure 6.1 Results of Listing 6.1.
TIPS
MIN() works with character, numeric, and datetime data types.
With character data columns, MIN() finds the value that is lowest in the sort sequence; see Sorting Rows with ORDER BY in Chapter 4.
DISTINCT isnt meaningful with MIN(); see Aggregating Distinct Values with DISTINCT later in this chapter.
String comparisons may be case-insensitive or case-sensitive, depending on your DBMS; see the DBMS Tip in Filtering Rows with WHERE in Chapter 4.
When comparing two VARCHAR strings for equality, your DBMS may right-pad the shorter string with spaces and compare the strings position by position. In this case, the strings Jack and Jack are equal. Refer to your DBMS documentation (or experiment) to determine which string MIN() returns.
Finding a Maximum with MAX()
Use the aggregate function MAX() to find the maximum of a set of values.
To find the maximum of a set of values:
Type:
MAX(expr)
expr is a column name, literal, or expression. The result has the same data type as expr.
Listing 6.2 and Figure 6.2 show some queries that involve MAX(). The first query returns the authors last name that is last alphabetically. The second query returns the prices of the cheapest and most expensive books, and the price range. The third query returns the highest revenue (= price ´ sales) among the history books.
Listing 6.2 Some MAX() queries. See Figure 6.2 for the results.
SELECT MAX(au_lname) AS "Max last name" FROM authors; SELECT MIN(price) AS "Min price", MAX(price) AS "Max price", MAX(price) - MIN(price) AS "Range" FROM titles; SELECT MAX(price * sales) AS "Max history revenue" FROM titles WHERE type = 'history';
Figure 6.2 Results of Listing 6.2.
TIPS
MAX() works with character, numeric, and datetime data types.
With character data columns, MAX() finds the value that is highest in the sort sequence; see Sorting Rows with ORDER BY in Chapter 4.
DISTINCT isnt meaningful with MAX(); see Aggregating Distinct Values with DISTINCT later in this chapter.
String comparisons may be case-insensitive or case-sensitive, depending on your DBMS; see the DBMS Tip in Filtering Rows with WHERE in Chapter 4.
When comparing two VARCHAR strings for equality, your DBMS may right-pad the shorter string with spaces and compare the strings position-by-position. In this case, the strings Jack and Jack are equal. Refer to your DBMS documentation (or experiment) to determine which string MAX() returns.
Calculating a Sum with SUM()
Use the aggregate function SUM() to find the sum (total) of a set of values.
To calculate the sum of a set of values:
Type:
SUM(expr)
expr is a column name, literal, or numeric expression. The results data type is at least as precise as the most precise data type used in expr.
Listing 6.3 and Figure 6.3 show some queries that involve SUM(). The first query returns the total advances paid to all authors. The second query returns the total sales of books published in 2000. The third query returns the total price, sales, and revenue (= price ´ sales) of all books. Note a mathematical chestnut in action here: The sum of the products doesnt (necessarily) equal the product of the sums.
Listing 6.3 Some SUM() queries. See Figure 6.3 for the results.
SELECT SUM(advance) AS "Total advances" FROM royalties; SELECT SUM(sales) AS "Total book sales for 2000" FROM titles WHERE pubdate BETWEEN DATE '2000-01-01' AND DATE '2000-12-31'; SELECT SUM(price) AS "Total price", SUM(sales) AS "Total sales", SUM(price * sales) AS "Total revenue" FROM titles;
Figure 6.3 Results of Listing 6.3.
TIPS
SUM() works with only numeric data types.
The sum of no rows is nullnot zero as you might expect.
In Microsoft Access date literals, omit the DATE keyword, and surround the literal with # characters instead of quotes. To run Listing 6.3 in Access, change the date literals in the second query to #2000-01-01# and #2000-12-31#. In Microsoft SQL Server date literals, omit the DATE keyword. To run Listing 6.3 in SQL Server, change the date literals to 2000-01-01 and 2000-12-31.
Finding an Average with AVG()
Use the aggregate function AVG() to find the average, or arithmetic mean, of a set of values. The arithmetic mean is the sum of a set of quantities divided by the number of quantities in the set.
To calculate the average of a set of values:
Type:
AVG(expr)
expr is a column name, literal, or numeric expression. The results data type is at least as precise as the most precise data type used in expr.
Listing 6.4 and Figure 6.4 shows some queries that involve AVG(). The first query returns the average price of all books if prices were doubled. The second query returns the average and total sales for business books; both calculations are null (not zero), because the table contains no business books. The third query uses a subquery (see Chapter 8) to list the books with above-average sales.
Listing 6.4 Some AVG() queries. See Figure 6.4 for the results.
SELECT AVG(price * 2) AS "AVG(price * 2)" FROM titles; SELECT AVG(sales) AS "AVG(sales)", SUM(sales) AS "SUM(sales)" FROM titles WHERE type = 'business'; SELECT title_id, sales FROM titles WHERE sales > (SELECT AVG(sales) FROM titles) ORDER BY sales DESC;
Figure 6.4 Results of Listing 6.4.
TIPS
AVG() works with only numeric data types.
The average of no rows is nullnot zero, as you might expect.
If youve used, say, 0 or 1 instead of null to represent missing values, the inclusion of those numbers in AVG() calculations yields an incorrect result. Use NULLIF() to convert the missing-value numbers to nulls, so theyll be excluded from calculations; see Comparing Expressions with NULLIF() in Chapter 5.
MySQL 4.0 and earlier versions lack subquery support and wont run the third query in Listing 6.4.
Counting Rows with COUNT()
Use the aggregate function COUNT() to count the number of rows in a set of values. COUNT() has two forms:
COUNT(expr) returns the number of rows in which expr is not null.
COUNT(*) returns the count of all rows in a set, including nulls and duplicates.
To count non-null rows:
Type:
COUNT(expr)
expr is a column name, literal, or expression. The result is an integer greater than or equal to zero.
To count all rows, including nulls:
Type:
COUNT(*)
COUNT(*) returns an integer greater than or equal to zero.
Listing 6.5 and Figure 6.5 show some queries that involve COUNT(expr) and COUNT(*). The three queries count rows in the table titles and are identical except for the WHERE clause. The row counts in the first query differ because the column price contains a null. In the second query, the row counts are identical because the WHERE clause eliminates the row with the null price before the count. The third query shows the row-count differences between the results of first two queries.
Listing 6.5 Some COUNT() queries. See Figure 6.5 for the results.
SELECT COUNT(title_id) AS "COUNT(title_id)", COUNT(price) AS "COUNT(price)", COUNT(*) AS "COUNT(*)" FROM titles; SELECT COUNT(title_id) AS "COUNT(title_id)", COUNT(price) AS "COUNT(price)", COUNT(*) AS "COUNT(*)" FROM titles WHERE price IS NOT NULL; SELECT COUNT(title_id) AS "COUNT(title_id)", COUNT(price) AS "COUNT(price)", COUNT(*) AS ""COUNT(*)" FROM titles WHERE price IS NULL;
Figure 6.5 Results of Listing 6.5.
TIPS
COUNT(expr) and COUNT(*) work with alldata types.
COUNT(expr) and COUNT(*) neverreturn null.
DISTINCT isnt meaningful with COUNT(*); see Aggregating Distinct Values with DISTINCT later in this chapter.
Aggregating Distinct Values with DISTINCT
You can use DISTINCT to eliminate duplicate values in aggregate function calculations; see Eliminating Duplicate Rows with DISTINCT in Chapter 4. The general syntax of an aggregate function is:
agg_func([ALL | DISTINCT] expr)
agg_func is MIN, MAX, SUM, AVG, or COUNT. expr is a column name, literal, or expression. ALL applies the aggregate function to all values, and DISTINCT specifies that each unique value is considered. ALL is the default and rarely is seen in practice.
With SUM(), AVG(), and COUNT(expr), DISTINCT eliminates duplicate values before the sum, average, or count is calculated. DISTINCT isnt meaningful with MIN() and MAX(); you can use it, but it wont change the result. You cant use DISTINCT with COUNT(*).
To calculate the sum of a set of distinct values:
Type:
SUM(DISTINCT expr)
expr is a column name, literal, or numeric expression. The results data type is at least as precise as the most precise data type used in expr.
To calculate the average of a set of distinct values:
Type:
AVG(DISTINCT expr)
expr is a column name, literal, or numeric expression. The results data type is at least as precise as the most precise data type used in expr.
To count distinct non-null rows:
Type:
COUNT(DISTINCT expr)
expr is a column name, literal, or expression. The result is an integer greater than or equal to zero.
The queries in Listing 6.6 return the count, sum, and average of book prices. The non-DISTINCT and DISTINCT results in Figure 6.6 differ because the DISTINCT results eliminate the duplicates of prices $12.99 and $19.95 from calculations.
Listing 6.6 Some DISTINCT aggregate queries. See Figure 6.6 for the results.
SELECT COUNT(*) AS "COUNT(*)", COUNT(price) AS "COUNT(price)", SUM(price) AS "SUM(price)", AVG(price) AS "AVG(price)" FROM titles; SELECT COUNT(DISTINCT price) AS "COUNT(DISTINCT)", SUM(DISTINCT price) AS "SUM(DISTINCT)", AVG(DISTINCT price) AS "AVG(DISTINCT)" FROM titles;
Figure 6.6 Results of Listing 6.6.
TIPS
DISTINCT in a SELECT clause andDISTINCT in an aggregate function dont return the same result.
The three queries in Listing 6.7 count the author IDs in the tabletitle_authors. Figure 6.7 shows theresults. The first query counts all the author IDs in the table. The second query returns the same result as the first query because COUNT() already has done its work and returned a value in single row before DISTINCT is applied. In the third query, DISTINCT is applied to the author IDs before COUNT() starts counting.
Listing 6.7 DISTINCT in a SELECT clause and DISTINCT in an aggregate function differ in meaning. See Figure 6.7 for the results.
SELECT COUNT(au_id) AS "COUNT(au_id)" FROM title_authors; SELECT DISTINCT COUNT(au_id) AS "DISTINCT COUNT(au_id)" FROM title_authors; SELECT COUNT(DISTINCT au_id) AS "COUNT(DISTINCT au_id)" FROM title_authors;
Figure 6.7 Results of Listing 6.7.
Mixing non-DISTINCT and DISTINCT aggregates in the same SELECT clause can produce misleading results.
The four queries in Listing 6.8 (following page) show the four combinations of non-DISTINCT and DISTINCT sums and counts. Of the four results in Figure 6.8 (following page) , only the first result (no DISTINCTs) and final result (all DISTINCTs) are consistent mathematically, which you can verify with AVG(price) and AVG(DISTINCTprice). In the second and third queries (mixed non-DISTINCTs and DISTINCTs), you cant calculate a valid average by dividing the sum by the count.
Listing 6.8 Mixing non-DISTINCT and DISTINCT aggregates in the same SELECT clause can produce misleading results. See Figure 6.8 for the results.
SELECT COUNT(price) AS "COUNT(price)", SUM(price) AS "SUM(price)" FROM titles; SELECT COUNT(price) AS "COUNT(price)", SUM(DISTINCT price) AS "SUM(DISTINCT price)" FROM titles; SELECT COUNT(DISTINCT price) AS "COUNT(DISTINCT price)", SUM(price) AS "SUM(price)" FROM titles; SELECT COUNT(DISTINCT price) AS "COUNT(DISTINCT price)", SUM(DISTINCT price) AS "SUM(DISTINCT price)" FROM titles;
Figure 6.8 Results of Listing 6.8.
Microsoft Access doesnt support DISTINCT aggregate functions. This statement, for example, is illegal in Access:
SELECT SUM(DISTINCT price) FROM titles; --Illegal in Access
But you can replicate it with this subquery (see the Tips in Using Subqueries as Column Expressions in Chapter 8):
SELECT SUM(price) FROM (SELECT DISTINCT price FROM titles);
This Access workaround wont let you mix non-DISTINCT and DISTINCT aggregates, however, as in the second and third queries in Listing 6.8.
In Microsoft SQL Server, if you use DISTINCT, expr must be a column name only. It cant include an arithmetic expression:
SELECT COUNT(DISTINCT price * sales) FROM titles; --Illegal in SQL Server
MySQL supports COUNT(DISTINCT expr) but not SUM(DISTINCT expr) and AVG(DISTINCT expr). Listings 6.6 and 6.8 wont run in MySQL.
Grouping Rows with GROUP BY
To this point, Ive used aggregate functions to summarize all the values in a column or just those values that matched a WHERE search condition. You can use the GROUP BY clause to divide a table into logical groups (categories) and calculate aggregate statistics for each group.
An example will clarify the concept. Listing 6.9 uses GROUP BY to count the number of books that each author wrote (or co-wrote). In the SELECT clause, the column au_id identifies each author, and the derived column num_books counts each authors books. The GROUP BY clause causes num_books to be calculated for every unique au_id instead of only once for the entire table. Figure 6.9 shows the result. In this example, au_id is called the grouping column.
Listing 6.9 List the number of books each author wrote (or co-wrote). See Figure 6.9 for the result.
SELECT au_id, COUNT(*) AS "num_books" FROM title_authors GROUP BY au_id;
Figure 6.9 Results of Listing 6.9.
The GROUP BY clauses important characteristics are:
The GROUP BY clause comes after the WHERE clause and before the ORDER BY clause.
Grouping columns can be column names or derived columns.
Every nonaggregate column in the SELECT clause must appear in the GROUP BY clause. This statement is illegal because pub_id isnt in the GROUP BY clause:
SELECT type, pub_id , COUNT(*) FROM titles GROUP BY type; --Illegal
Because the GROUP BY can return only one row for each value of type, theres no way to return multiple values of pub_id that are associated with any particular value of type.
If the SELECT clause contains a complex nonaggregate expression (more than just a simple column name), the GROUP BY expression must match the SELECT expression exactly.
Specify multiple grouping columns in the GROUP BY clause to nest groups. Data is summarized at the last specified group.
If a grouping column contains a null, that row becomes a group in the result. If a grouping column contains more than one null, the nulls are put into a single group. A group that contains multiple nulls doesnt imply that the nulls equal one another.
Use a WHERE clause in a query containing a GROUP BY clause to eliminate rows before grouping occurs.
You cant use a column alias in the GROUP BY clause, though table aliases are allowed as qualifiers; see Creating Table Aliases with AS in Chapter 7.
Without an ORDER BY clause, groups returned by GROUP BY arent in any particular order. To sort the result of Listing 6.9 by the descending number of books, for example, add the clause ORDER BY num_books DESC.
To group rows:
Type:
SELECT columns FROM table [WHERE search_condition] GROUP BY grouping_columns HAVING search_condition] [ORDER BY sort_columns];
columns and grouping_columns are one or more comma-separated column names, and table is the name of the table that contains columns and grouping_columns. The nonaggregate columns that appear in columns also must appear in grouping_columns. The order of the column names in grouping_columns determines the grouping levels, from the highest to the lowest level of grouping.
The GROUP BY clause restricts the rows of the result; only one row appears for each distinct value in the grouping column or columns. Each row in the result contains summary data related to the specific value in its grouping columns.
If the statement includes a WHERE clause, the DBMS groups values after it applies search_condition to the rows in table. If the statement includes an ORDER BY clause, the columns in sort_columns must be drawn from those in columns. The WHERE and ORDER BY clauses are covered in Filtering Rows with WHERE and Sorting Rows with ORDER BY in Chapter 4. HAVING, which filters grouped rows, is covered in the next section.
Listing 6.10 and Figure 6.10 show the difference between COUNT(expr) and COUNT(*) in a query that contains GROUP BY. The table publishers contains one null in the column state (for publisher P03 in Germany). Recall from Counting Rows with COUNT() earlier in this chapter that COUNT(expr) counts non-null values and COUNT(*) counts all values, including nulls. In the result, GROUP BY recognizes the null and creates a null group for it. COUNT(*) finds (and counts) the one null in the column state. But COUNT(state) contains a zero for the null group because COUNT(state) finds only a null in the null group, which it excludes from the counthence, the zero.
Listing 6.10 This query illustrates the difference between COUNT(expr) and COUNT(*) in a GROUP BY query. See Figure 6.10 for the result.
SELECT state, COUNT(state) AS "COUNT(state)", COUNT(*) AS "COUNT(*)" FROM publishers GROUP BY state;
Figure 6.10 Results of Listing 6.10.
If a nonaggregate column contains nulls, using COUNT(*) rather than COUNT(expr) can produce misleading results. Listing 6.11 and Figure 6.11 show summary sales statistics for each type of book. The sales value for one of the biographies is null, so COUNT(sales) and COUNT(*) differ by 1. The average calculation in the fifth column, SUM/COUNT(sales), is consistent mathematically, whereas the sixth-column average, SUM/COUNT(*), is not. Ive verified the inconsistency with AVG(sales) in the final column. (Recall a similar situation in Listing 6.8 in Aggregating Distinct Values with DISTINCT earlier in this chapter.)
Listing 6.11 For mathematically consistent results, use COUNT(expr), rather than COUNT(*), if expr contains nulls. See Figure 6.11 for the result.
SELECT type, SUM(sales) AS "SUM(sales)", COUNT(sales) AS "COUNT(sales)", COUNT(*) AS "COUNT(*)", SUM(sales)/COUNT(sales) AS "SUM/COUNT(sales)", SUM(sales)/COUNT(*) AS "SUM/COUNT(*)", AVG(sales) AS "AVG(sales)" FROM titles GROUP BY type;
Figure 6.11 Results of Listing 6.11.
Listing 6.12 and Figure 6.12 show a simple GROUP BY query that calculates the total sales, average sales, and number of titles for each type of book. In Listing 6.13 and Figure 6.13, Ive added a WHERE clause to eliminate books priced under $13 before grouping. Ive also added an ORDER BY clause to sort the result by descending total sales of each book type.
Listing 6.12 This simple GROUP BY query calculates a few summary statistics for each type of book. See Figure 6.12 for the result.
SELECT type, SUM(sales) AS "SUM(sales)", AVG(sales) AS "AVG(sales)", COUNT(sales) AS "COUNT(sales)" FROM titles GROUP BY type;
Figure 6.12 Results of Listing 6.12.
Listing 6.13 Here, Ive added WHERE and ORDER BY clauses to Listing 6.12 to cull books priced under $13 and sort the result by descending total sales. See Figure 6.13 for the result.
SELECT type, SUM(sales) AS "SUM(sales)", AVG(sales) AS "AVG(sales)", COUNT(sales) AS "COUNT(sales)" FROM titles WHERE price >= 13 GROUP BY type ORDER BY "SUM(sales)" DESC;
Figure 6.13 Results of Listing 6.13.
Listing 6.14 and Figure 6.14 use multiple grouping columns to count the number of titles of each type that each publisher publishes.
Listing 6.14 List the number of books of each type for each publisher, sorted by descending count within ascending publisher ID. See Figure 6.14 for the result.
SELECT pub_id, type, COUNT(*) AS "COUNT(*)" FROM titles GROUP BY pub_id, type ORDER BY pub_id ASC, "COUNT(*)" DESC;
Figure 6.14 Results of Listing 6.14.
In Listing 6.15 and Figure 6.15, I revisit Listing 5.31 in Evaluating Conditional Values with CASE in Chapter 5. But instead of listing each book categorized by its sales range, I use GROUP BY to list the number of books in each sales range.
Listing 6.15 List the number of books in each calculated sales range, sorted by ascending sales. See Figure 6.15 for the result.
SELECT CASE WHEN sales IS NULL THEN 'Unknown' WHEN sales <= 1000 THEN 'Not more than 1,000' WHEN sales <= 10000 THEN 'Between 1,001 and 10,000' WHEN sales <= 100000 THEN 'Between 10,001 and 100,000' WHEN sales <= 1000000 THEN 'Between 100,001 and 1,000,000' ELSE 'Over 1,000,000' END AS "Sales category", COUNT(*) AS "Num titles" FROM titles GROUP BY CASE WHEN sales IS NULL THEN 'Unknown' WHEN sales <= 1000 THEN 'Not more than 1,000' WHEN sales <= 10000 THEN 'Between 1,001 and 10,000' WHEN sales <= 100000 THEN 'Between 10,001 and 100,000' WHEN sales <= 1000000 THEN 'Between 100,001 and 1,000,000' ELSE 'Over 1,000,000' END ORDER BY MIN(sales) ASC;
Figure 6.15 Results of Listing 6.15.
TIPS
Use the WHERE clause to exclude rows that you dont want grouped, and use the HAVING clause to filter rows after they have been grouped. For information about HAVING, see the next section.
If used without an aggregate function,GROUP BY acts like DISTINCT (Listing 6.16and Figure 6.16). For information about DISTINCT, see Eliminating Duplicate Rows with DISTINCT in Chapter 4.
Listing 6.16 Both of these queries return the same result. See Figure 6.16 for the result.
SELECT type FROM titles GROUP BY type; SELECT DISTINCT type FROM titles;
Figure 6.16 Results of Listing 6.16.
You can use GROUP BY to look for patterns in your data. In Listing 6.17 and Figure 6.17, Im looking for a relationship between price categories and average sales.
Listing 6.17 List the average sales for each price, sorted by ascending price. See Figure 6.17 for the result.
SELECT price, AVG(sales) AS "AVG(sales)" FROM titles WHERE price IS NOT NULL GROUP BY price ORDER BY price ASC;
Figure 6.17 Results of Listing 6.17.
Dont rely on GROUP BY to sort your result. I recommend that you include ORDER BY whenever you use GROUP BY (even though Ive omitted ORDER BY in some examples). In some DBMSes, a GROUP BY implies anORDER BY.
The multiple values returned by an aggregate function in a GROUP BY query are called vector aggregates. In a query that lacks a GROUP BY clause, the single value returned by an aggregate function is a scalar aggregate.
In Microsoft Access, use the Switch() function instead of the CASE expression in Listing 6.15. See the DBMS Tip in Evaluating Conditional Values with CASE in Chapter 5.
MySQL doesnt allow CASE in a GROUP BY clause. Listing 6.15 wont run in MySQL.
Some DBMSes, such as MySQL and PostgreSQL, permit column aliases in the GROUP BY clause.
Filtering Groups with HAVING
The HAVING clause sets conditions on the GROUP BY clause similar to the way that WHERE interacts with SELECT. The HAVING clauses important characteristics are:
The HAVING clause comes after the GROUP BY clause and before the ORDER BY clause.
Just as WHERE limits the number of rows displayed by SELECT, HAVING limits the number of groups displayed by GROUP BY.
The WHERE search condition is applied before grouping occurs, and the HAVING search condition is applied after.
HAVING syntax is similar to the WHERE syntax, except that HAVING can contain aggregate functions.
A HAVING clause can reference any of the items that appear in the SELECT list.
The sequence in which the WHERE, GROUP BY, and HAVING clauses are applied is:
The WHERE clause filters the rows that result from the operations specified in the FROM and JOIN clauses.
The GROUP BY clause groups the output of the WHERE clause.
The HAVING clause filters rows from the grouped result.
To filter groups:
Following the GROUP BY clause, type:
HAVING search_condition
search_condition is a search condition used to filter groups. search_condition can contain aggregate functions but otherwise is identical to the WHERE search condition, described in Filtering Rows with WHERE and subsequent sections in Chapter 4. You may combine and negate multiple HAVING conditions with the logical operators AND, OR, and NOT.
The HAVING search condition is applied to the rows in the output produced by grouping. Only the groups that meet the search condition appear in the result. You can apply a HAVING clause only to columns that appear in the GROUP BY clause or in an aggregate function.
In Listing 6.18 and Figure 6.18, I revisit Listing 6.9 earlier in this chapter, but instead of listing the number of books that each author wrote (or co-wrote), I use HAVING to list only the authors who have written three or more books.
Listing 6.18 List the number of books written (or co-written) by each author who has written three or more books. See Figure 6.18 for the result.
SELECT au_id, COUNT(*) AS "num_books" FROM title_authors GROUP BY au_id HAVING COUNT(*) >= 3;
Figure 6.18 Results of Listing 6.18.
In Listing 6.19 and Figure 6.19, the HAVING condition also is an aggregate expression in the SELECT clause. This query still works if you remove the AVG() expression from the SELECT list (Listing 6.20 and Figure 6.20).
Listing 6.19 List the number of titles and average revenue for the types with average revenue over $1 million. See Figure 6.19 for the result.
SELECT type, COUNT(price) AS "COUNT(price)", AVG(price * sales) AS "AVG revenue" FROM titles GROUP BY type HAVING AVG(price * sales) > 1000000;
Figure 6.19 Results of Listing 6.19.
Listing 6.20 Listing 6.19 still works without AVG(price * sales) in the SELECT list. See Figure 6.20 for the result.
SELECT type, COUNT(price) AS "COUNT(price)" FROM titles GROUP BY type HAVING AVG(price * sales) > 1000000;
Figure 6.20 Results of Listing 6.20.
In Listing 6.21 and Figure 6.21, multiple grouping columns count the number of titles of each type that each publisher publishes. The HAVING condition removes groups in which the publisher has one or fewer titles of a particular type. This query retrieves a subset of the result of Listing 6.14 earlier in this chapter.
Listing 6.21 List the number of books of each type for each publisher, for publishers with more than one title of a type. See Figure 6.21 for the result.
SELECT pub_id, type, COUNT(*) AS "COUNT(*)" FROM titles GROUP BY pub_id, type HAVING COUNT(*) > 1 ORDER BY pub_id ASC, "COUNT(*)" DESC;
Figure 6.21 Results of Listing 6.21.
In Listing 6.22 and Figure 6.22, the WHERE clause first removes all rows except for books from publishers P03 and P04. Then the GROUP BY clause groups the output of the WHERE clause by type. Finally, the HAVING clause filters rows from the grouped result.
Listing 6.22 For books from publishers P03 and P04, list the total sales and average price by type, for types with more than $10,000 total sales and less than $20 average price. See Figure 6.22 for the result.
SELECT type, SUM(sales) AS "SUM(sales)", AVG(price) AS "AVG(price)" FROM titles WHERE pub_id IN ('P03', 'P04') GROUP BY type HAVING SUM(sales) > 10000 AND AVG(price) < 20;
Figure 6.22 Results of Listing 6.22.
TIP
Generally, HAVING clauses should involve only aggregates. The only conditions that should be specified in the HAVING clause are those conditions that must be applied after the grouping operation has been performed. Its more efficient to specify conditions that can be applied before the grouping operation in the WHERE clause. The following statements, for example, are equivalent, but the first statement is preferable because it reduces the number of rows that have to be grouped:
SELECT pub_id, SUM(sales) --Faster FROM titles WHERE pub_id IN (P03, P04) GROUP BY pub_id HAVING SUM(sales) > 10000; SELECT pub_id, SUM(sales) --Slower FROM titles GROUP BY pub_id HAVING SUM(sales) > 10000 AND pub_id IN (P03, P04);