Tuesday, September 14, 2010

Group by (SQL)

The group by clause divides the output of a query into groups. You can group by one or more column names, or by the results of computed columns using numeric datatypes in an expression. When used with aggregates, group by retrieves the calculations in each subgroup, and may return multiple rows. The maximum number of columns or expressions you can use in a group by clause is 16.
You cannot group by columns of text or image datatypes.
While you can use group by without aggregates, such a construction has limited functionality and may produce confusing results. The following example groups the results by title type:
select type, advance
from titles
group by type
type             advance
------------ ---------
business 5,000.00
business 5,000.00
business 10,125.00
business 5,000.00
mod_cook 0.00
mod_cook 15,000.00
UNDECIDED NULL
popular_comp 7,000.00
popular_comp 8,000.00
popular_comp NULL
psychology 7,000.00
psychology 2,275.00
psychology 6,000.00
psychology 2,000.00
psychology 4,000.00
trad_cook 7,000.00
trad_cook 4,000.00
trad_cook 8,000.00

(18 rows affected)
With an aggregate for the advance column, the query returns the sum for each group:
select type, sum(advance)
from titles
group by type
type                                  
------------ ------------------------
UNDECIDED NULL
business 25,125.00
mod_cook 15,000.00
popular_comp 15,000.00
psychology 21,275.00
trad_cook 19,000.00

(6 rows affected)
The summary values in a group by clause using aggregates are called vector aggregates, as opposed to scalar aggregates, which result when only one row is returned (see "Using aggregate functions").

group by syntax

The complete syntax of the select statement is repeated here so that you can see the group by clause in context:
select [all | distinct] select_list 
[into [[database.]owner.]table_name]
[from [[database.]owner.]{view_name | table_name [(index {index_name | table_name}
[parallel [degree_of_parallelism]]
[prefetch size ] [lru | mru])]}
[holdlock | noholdlock] [shared]
[,[[database.]owner.]{view_name | table_name [(index {index_name | table_name}
[parallel [degree_of_parallelism]]
[prefetch size] [lru | mru])]}
[holdlock | noholdlock] [shared]]...]

[where search_conditions]

[group by [all] aggregate_free_expression [, aggregate_free_expression]...]
[having search_conditions]

[order by
{[[[database.]owner.] {table_name. | view_name.}]
column_name | select_list_number | expression} [asc | desc]
[, {[[[database.]owner.]{table_name|view_name.}] column_name | select_list_number | expression}
[asc | desc]]...]

[compute row_aggregate (column_name)
[, row_aggregate(column_name)]...
[by column_name [, column_name]...]]

[for {read only | update [of column_name_list]}]

[at isolation {read uncommitted | read committed | serializable}]

[for browse]
Remember that the order of the clauses in the select statement is significant. You can omit any of the optional clauses, but when you use them, they must appear in the order shown above.
group by and SQL standardsThe SQL standards for group by are more restrictive than Sybase's standard. The SQL standard requires that:
  • The columns in a select list must be in the group by expression or they must be arguments of aggregate functions.
  • A group by expression can only contain column names in the select list, but not those used only as arguments for vector aggregates.
Several Transact-SQL extensions (described in the following sections) relax these restrictions, however the more complex result sets may be more difficult to understand. If you set the fipsflagger option as follows, you will receive a warning message stating that Transact-SQL extensions are used:
set fipsflagger on
For more information about the fipsflagger option, see the set command in the Reference Manual.
Nesting groups with group byYou can list more than one column in the group by clause to nest groups. Once the sets are established with group by, the aggregates are applied. This statement finds the average price and the sum of book sales, grouped first by publisher identification number and then by type:
select pub_id, type, avg(price), sum(total_sales)
from titles
group by pub_id, type
pub_id  type
------ ------------ ------ -------
0736 business 2.99 18,722
0736 psychology 11.48 9,564
0877 UNDECIDED NULL NULL
0877 mod_cook 11.49 24,278
0877 psychology 21.59 375
0877 trad_cook 15.96 19,566
1389 business 17.31 12,066
1389 popular_comp 21.48 12,875

(8 rows affected)
You can nest many groups within groups, up to the maximum of 16 columns or expressions specified with group by.

Referencing other columns in queries using group by

SQL standards state that the group by must contain items from the select list. However, Transact-SQL allows you to specify any valid column name in either the group by or select list, whether they employ aggregates or not.
Through the following extensions, Sybase lifts restrictions on what you can include or omit in the select list of a query that includes group by.
  • The columns in the select list are not limited to the grouping columns and columns used with the vector aggregates.
  • The columns specified by group by are not limited to those non-aggregate columns in the select list.
A vector aggregate must be accompanied by a group by clause. The SQL standards require that the non-aggregate columns in the select list match the group by columns. However, the first bulleted item described above allows you to specify additional, extended columns in the select list of the query.
For example, many versions of SQL do not allow the inclusion of the extended title_id column in the select list, but it is legal in Transact-SQL:
select type, title_id, avg(price), avg(advance) 
from titles
group by type
type           title_id 
------------ -------- ----- -------
business BU1032 13.73 6,281.25
business BU1111 13.73 6,281.25
business BU2075 13.73 6,281.25
business BU7832 13.73 6,281.25
mod_cook MC2222 11.49 7,500.00
mod_cook MC3021 11.49 7,500.00
UNDECIDED MC3026 NULL NULL
popular_comp PC1035 21.48 7,500.00
popular_comp PC8888 21.48 7,500.00
popular_comp PC9999 21.48 7,500.00
psychology PS1372 13.50 4,255.00
psychology PS2091 13.50 4,255.00
psychology PS2106 13.50 4,255.00
psychology PS3333 13.50 4,255.00
psychology PS7777 13.50 4,255.00
trad_cook TC3218 15.96 6,333.33
trad_cook TC4203 15.96 6,333.33
trad_cook TC7777 15.96 6,333.33

(18 rows affected)
The above example still aggregates the price and advance columns based on the type column, but its results also display the title_id for the books included in each group.
The second extension described above allows you to group columns that are not specified as columns in the select list of the query. These columns do not appear in the results, but the vector aggregates still compute their summary values. For example:
select state, count(au_id)
from authors
group by state, city
state
----- --------
CA 2
CA 1
CA 5
CA 5
CA 2
CA 1
CA 1
CA 1
CA 1
IN 1
KS 1
MD 1
MI 1
OR 1
TN 1
UT 2

(16 rows affected)
This example groups the vector aggregate results by both state and city, even though it does not display which city belongs to each group. Therefore, results are potentially misleading.
You may think the following query should produce similar results to the previous query, since only the vector aggregate seems to tally the number of each city for each row:
select state, count(au_id)
from authors
group by city
However, its results are much different. By not using group by with both the state and city columns, the query tallies the number of each city, but it displays the tally for each row of that city in authors rather than grouping them into one result row per city.
state
----- -----------
CA              1 
CA 5
CA 2
CA 1
CA 5
KS 1
CA 2
CA 2
CA 1
CA 1
TN 1
OR 1
CA 1
MI 1
IN 1
CA 5
CA 5
CA 5
MD 1
CA 2
CA 1
UT 2
UT 2

(23 rows affected)
When you use the Transact-SQL extensions in complex queries that include the where clause or joins, the results may become even more difficult to understand. To avoid confusing or misleading results with group by, Sybase suggests that you use the fipsflagger option to identify queries that use Transact-SQL extensions. See "group by and SQL standards" for details.
For more information about Transact-SQL extensions to group by and how they work, see the Reference Manual.

Expressions and group by

Another Transact-SQL extension allows you to group by an expression that does not include aggregate functions. For example:
select avg(total_sales), total_sales * price 
from titles
group by total_sales * price
---------      -------------
NULL NULL
111 777.00
375 7,856.25
375 8,096.25
2045 22,392.75
3336 26,654.64
2032 40,619.68
3876 46,318.20
18722 55,978.78
4095 61,384.05
22246 66,515.54
4072 81,399.28
4095 81,859.05
4095 81,900.00
15096 180,397.20
8780 201,501.00

(16 rows affected)
The expression "total_sales * price" is allowed.
You cannot group by a column heading, also known as an alias, although you can still use one in your select list. This statement produces an error message:
select Category = type, title_id, avg(price), avg(advance) 
from titles
group by Category
The group by clause should be "group by type", not "group by Category".
select Category = type, title_id, avg(price), avg(advance) 
from titles
group by type

Nesting aggregates with group by

Another Transact-SQL extension allows you to nest a vector aggregate inside a scalar aggregate. For example, to find the average price of all types of books using a non-nested aggregate, enter:
select avg(price) 
from titles
group by type
--------------- 
NULL
13.73
11.49
21.48
13.50
15.96

(6 rows affected)
Nesting the average price inside the max function produces the highest average price of a group of books, grouped by type:
select max(avg(price)) 
from titles
group by type
------------- 
21.48

(1 row affected)
By definition, the group by clause applies to the innermost aggregate--in this case, avg.

Null values and group by

If the grouping column contains a null value, that row becomes its own group in the results. If the grouping column contains more than one null value, the null values form a single group. Here is an example that uses group by and the advance column, which contains some null values:
select advance, avg(price * 2) 
from titles
group by advance
advance                                            
------------------ -----------------
NULL NULL
0.00 39.98
2000.00 39.98
2275.00 21.90
4000.00 19.94
5000.00 34.62
6000.00 14.00
7000.00 43.66
8000.00 34.99
10125.00 5.98
15000.00 5.98

(11 rows affected)
If you are using the count(column_name) aggregate function, grouping by a column that contains null values will return a count of zero for the grouping row, since count(column_name) does not count null values. In most cases, you should use count(*) instead. This example groups and counts on the price column from the titles table, which contains null values, and shows count(*) for comparison:
select price, count(price), count(*) 
from titles
group by price
price   
------------- ----- -----
NULL 0 2
2.99 2 2
7.00 1 1
7.99 1 1
10.95 1 1
11.95 2 2
14.99 1 1
19.99 4 4
20.00 1 1
20.95 1 1
21.59 1 1
22.95 1 1

(12 rows affected)

where clause and group by

You can use a where clause in a statement with group by. Rows that do not satisfy the conditions in the where clause are eliminated before any grouping is done. Here is an example:
select type, avg(price)
from titles
where advance > 5000
group by type
type 
------------- --------
business 2.99
mod_cook 2.99
popular_comp 21.48
psychology 14.30
trad_cook 17.97

(5 rows affected)
Only the rows with advances of more than $5000 are included in the groups that are used to produce the query results.
However, the way that Adaptive Server handles extra columns in the select list and the where clause may seem contradictory. For example:
select type, advance, avg(price) 
from titles
where advance > 5000
group by type
type           advance
------------- --------- --------
business 5,000.00 2.99
business 5,000.00 2.99
business 10,125.00 2.99
business 5,000.00 2.99
mod_cook 0.00 2.99
mod_cook 15,000.00 2.99
popular_comp 7,000.00 21.48
popular_comp 8,000.00 21.48
popular_comp NULL 21.48
psychology 7,000.00 14.30
psychology 2,275.00 14.30
psychology 6,000.00 14.30
psychology 2,000.00 14.30
psychology 4,000.00 14.30
trad_cook 7,000.00 17.97
trad_cook 4,000.00 17.97
trad_cook 8,000.00 17.97

(17 rows affected)
It only seems as if the query is ignoring the where clause when you look at the results for the advance (extended) column. Adaptive Server still computes the vector aggregate using only those rows that satisfy the where clause, but it also displays all rows for any extended columns that you include in the select list. To further restrict these rows from the results, you must use a having clause (described later in this chapter).

group by and all

The keyword all in the group by clause is a Transact-SQL enhancement. It is meaningful only if the select statement in which it is used also includes a where clause.
If you use all, the query results include all the groups produced by the group by clause, even if some groups do not have any rows that meet the search conditions. Without all, a select statement that includes group by does not show groups for which no rows qualify.
Here is an example:
select type, avg(advance)  
from titles
where advance > 1000 and advance < 10000
group by type
type                                    
------------ ------------------------
business 5,000.00
popular_comp 7,500.00
psychology 4,255.00
trad_cook 6,333.33

(4 rows affected)
select type, avg(advance)  
from titles
where advance > 1000 and advance < 10000
group by all type
type                                    
------------ ------------------------
UNDECIDED NULL
business 5,000.00
mod_cook NULL
popular_comp 7,500.00
psychology 4,255.00
trad_cook 6,333.33

(6 rows affected)
The first statement produces groups only for those books that commanded advances of more than $1000 but less than $10,000. Since no modern cooking books have an advance within that range, there is no group in the results for the mod_cook type.
The second statement produces groups for all types, including modern cooking and "UNDECIDED," even though the modern cooking group does not include any rows that meet the qualification specified in the where clause. Adaptive Server returns a NULL result for all groups that lack qualifying rows.

Using aggregates without group by

By definition, scalar aggregates apply to all rows in a table, producing a single value for the whole table for each function. The Transact-SQL extension that allows you to include extended columns with vector aggregates also allows you to include extended columns with scalar aggregates. For example, look at the publishers table:
pub_id pub_name             city           state 
------ ------------------   -------------- ----- 
0736 New Age Books Boston MA
0877 Binnet & Hardley Washington DC
1389 Algodata Infosystems Berkeley CA
It contains three rows. The following query produces a three-row scalar aggregate based on each row of the table:
select pub_id, count(pub_id)
from publishers
pub_id 
---------- ---------
0736 3
0877 3
1389 3

(3 rows affected)
Because Adaptive Server treats publishers as a single group, the scalar aggregate applies to the (single-group) table. The results display every row of the table for each column you include in the select list, in addition to the scalar aggregate.
The where clause behaves the same way for scalar aggregates as with vector aggregates. The where clause restricts the columns included in the aggregate summary values, but it does not affect the rows that appear in the results for each extended column you specify in the select list. For example:
select pub_id, count(pub_id) 
from publishers
where pub_id < "1000"
pub_id 
-------------- -----------
0736 2
0877 2
1389 2

(3 rows affected)
Like the other Transact-SQL extensions to group by, this extension provides results that may be difficult to understand, especially for queries on large tables or queries with multitable joins.

No comments:

Post a Comment