DB diablo: Group by (SQL)

The group by clause divides the output of a query into groups. You can group by one or more column names, or by the results of computed columns using numeric datatypes in an expression. When used with aggregates, group by retrieves the calculations in each subgroup, and may return multiple rows. The maximum number of columns or expressions you can use in a group by clause is 16.
You cannot group by columns of text or image datatypes.
While you can use group by without aggregates, such a construction has limited functionality and may produce confusing results. The following example groups the results by title type:

select type, advance
from titles
group by type

type             advance
------------     ---------
business          5,000.00
business          5,000.00
business         10,125.00
business          5,000.00
mod_cook              0.00
mod_cook         15,000.00
UNDECIDED             NULL
popular_comp      7,000.00
popular_comp      8,000.00
popular_comp          NULL
psychology        7,000.00
psychology        2,275.00
psychology        6,000.00
psychology        2,000.00
psychology        4,000.00
trad_cook         7,000.00
trad_cook         4,000.00
trad_cook         8,000.00
 
(18 rows affected)

With an aggregate for the advance column, the query returns the sum for each group:

select type, sum(advance)
from titles
group by type

type                                  
------------ ------------------------ 
UNDECIDED                        NULL 
business                    25,125.00 
mod_cook                    15,000.00 
popular_comp                15,000.00 
psychology                  21,275.00 
trad_cook                   19,000.00 

(6 rows affected)

The summary values in a group by clause using aggregates are called vector aggregates, as opposed to scalar aggregates, which result when only one row is returned (see "Using aggregate functions").

group by syntax

The complete syntax of the select statement is repeated here so that you can see the group by clause in context:

select [all | distinct] select_list 
     [into [[database.]owner.]table_name] 
     [from [[database.]owner.]{view_name | table_name          [(index {index_name | table_name} 
               [parallel [degree_of_parallelism]]
               [prefetch size ] [lru | mru])]}
          [holdlock | noholdlock] [shared]
     [,[[database.]owner.]{view_name | table_name          [(index {index_name | table_name} 
               [parallel [degree_of_parallelism]]
               [prefetch size] [lru | mru])]}
          [holdlock | noholdlock] [shared]]...] 
 
     [where search_conditions] 
 
     [group by [all] aggregate_free_expression          [, aggregate_free_expression]...]
     [having search_conditions]
 
     [order by 
     {[[[database.]owner.] {table_name. | view_name.}]
          column_name | select_list_number | expression} [asc | desc]
     [, {[[[database.]owner.]{table_name|view_name.}]          column_name | select_list_number | expression} 
               [asc | desc]]...]
 
     [compute row_aggregate (column_name)
               [, row_aggregate(column_name)]...
          [by column_name [, column_name]...]] 
 
     [for {read only | update [of column_name_list]}]
 
     [at isolation {read uncommitted | read committed | serializable}]
 
     [for browse]

Remember that the order of the clauses in the select statement is significant. You can omit any of the optional clauses, but when you use them, they must appear in the order shown above.
group by and SQL standardsThe SQL standards for group by are more restrictive than Sybase's standard. The SQL standard requires that:

The columns in a select list must be in the group by expression or they must be arguments of aggregate functions.
A group by expression can only contain column names in the select list, but not those used only as arguments for vector aggregates.

Several Transact-SQL extensions (described in the following sections) relax these restrictions, however the more complex result sets may be more difficult to understand. If you set the fipsflagger option as follows, you will receive a warning message stating that Transact-SQL extensions are used:

set fipsflagger on

For more information about the fipsflagger option, see the set command in the Reference Manual.
Nesting groups with group byYou can list more than one column in the group by clause to nest groups. Once the sets are established with group by, the aggregates are applied. This statement finds the average price and the sum of book sales, grouped first by publisher identification number and then by type:

select pub_id, type, avg(price), sum(total_sales)
from titles
group by pub_id, type

pub_id  type
------  ------------  ------  ------- 
0736    business       2.99  18,722 
0736    psychology    11.48   9,564 
0877    UNDECIDED      NULL    NULL 
0877    mod_cook      11.49  24,278 
0877    psychology    21.59     375 
0877    trad_cook     15.96  19,566 
1389    business      17.31  12,066 
1389    popular_comp  21.48  12,875 
 
(8 rows affected)

You can nest many groups within groups, up to the maximum of 16 columns or expressions specified with group by.

Referencing other columns in queries using group by

SQL standards state that the group by must contain items from the select list. However, Transact-SQL allows you to specify any valid column name in either the group by or select list, whether they employ aggregates or not.
Through the following extensions, Sybase lifts restrictions on what you can include or omit in the select list of a query that includes group by.

The columns in the select list are not limited to the grouping columns and columns used with the vector aggregates.
The columns specified by group by are not limited to those non-aggregate columns in the select list.

A vector aggregate must be accompanied by a group by clause. The SQL standards require that the non-aggregate columns in the select list match the group by columns. However, the first bulleted item described above allows you to specify additional, extended columns in the select list of the query.
For example, many versions of SQL do not allow the inclusion of the extended title_id column in the select list, but it is legal in Transact-SQL:

select type, title_id, avg(price), avg(advance) 
from titles 
group by type

type           title_id 
------------   --------    -----   ------- 
business       BU1032      13.73  6,281.25 
business       BU1111      13.73  6,281.25 
business       BU2075      13.73  6,281.25 
business       BU7832      13.73  6,281.25 
mod_cook       MC2222      11.49  7,500.00 
mod_cook       MC3021      11.49  7,500.00 
UNDECIDED      MC3026      NULL       NULL 
popular_comp   PC1035      21.48  7,500.00 
popular_comp   PC8888      21.48  7,500.00 
popular_comp   PC9999      21.48  7,500.00 
psychology     PS1372      13.50  4,255.00 
psychology     PS2091      13.50  4,255.00 
psychology     PS2106      13.50  4,255.00 
psychology     PS3333      13.50  4,255.00 
psychology     PS7777      13.50  4,255.00 
trad_cook      TC3218      15.96  6,333.33 
trad_cook      TC4203      15.96  6,333.33 
trad_cook      TC7777      15.96  6,333.33 
 
(18 rows affected)

The above example still aggregates the price and advance columns based on the type column, but its results also display the title_id for the books included in each group.
The second extension described above allows you to group columns that are not specified as columns in the select list of the query. These columns do not appear in the results, but the vector aggregates still compute their summary values. For example:

select state, count(au_id)
from authors
group by state, city

state
-----  --------
CA            2
CA            1
CA            5
CA            5
CA            2
CA            1
CA            1
CA            1
CA            1
IN            1
KS            1
MD            1
MI            1
OR            1
TN            1
UT            2
 
(16 rows affected)

This example groups the vector aggregate results by both state and city, even though it does not display which city belongs to each group. Therefore, results are potentially misleading.
You may think the following query should produce similar results to the previous query, since only the vector aggregate seems to tally the number of each city for each row:

select state, count(au_id)
from authors
group by city

However, its results are much different. By not using group by with both the state and city columns, the query tallies the number of each city, but it displays the tally for each row of that city in authors rather than grouping them into one result row per city.

state
 ----- -----------

CA              1 
 CA              5 
 CA              2 
 CA              1 
 CA              5 
 KS              1 
 CA              2 
 CA              2 
 CA              1 
 CA              1 
 TN              1 
 OR              1 
 CA              1 
 MI              1 
 IN              1 
 CA              5 
 CA              5 
 CA              5 
 MD              1 
 CA              2 
 CA              1 
 UT              2 
 UT              2 

(23 rows affected)

When you use the Transact-SQL extensions in complex queries that include the where clause or joins, the results may become even more difficult to understand. To avoid confusing or misleading results with group by, Sybase suggests that you use the fipsflagger option to identify queries that use Transact-SQL extensions. See "group by and SQL standards" for details.
For more information about Transact-SQL extensions to group by and how they work, see the Reference Manual.

Expressions and group by

Another Transact-SQL extension allows you to group by an expression that does not include aggregate functions. For example:

select avg(total_sales), total_sales * price 
from titles 
group by total_sales * price

---------      -------------
     NULL               NULL
      111             777.00  
      375           7,856.25  
      375           8,096.25 
     2045          22,392.75  
     3336          26,654.64  
     2032          40,619.68  
     3876          46,318.20  
    18722          55,978.78  
     4095          61,384.05  
    22246          66,515.54  
     4072          81,399.28  
     4095          81,859.05  
     4095          81,900.00  
    15096         180,397.20  
     8780         201,501.00 
 
(16 rows affected)

The expression "total_sales * price" is allowed.
You cannot group by a column heading, also known as an alias, although you can still use one in your select list. This statement produces an error message:

select Category = type, title_id, avg(price), avg(advance) 
from titles 
group by Category

The group by clause should be "group by type", not "group by Category".

select Category = type, title_id, avg(price), avg(advance) 
from titles 
group by type

Nesting aggregates with group by

Another Transact-SQL extension allows you to nest a vector aggregate inside a scalar aggregate. For example, to find the average price of all types of books using a non-nested aggregate, enter:

select avg(price) 
from titles 
group by type

--------------- 
 NULL 
13.73 
11.49 
21.48 
13.50 
15.96 
 
(6 rows affected)

Nesting the average price inside the max function produces the highest average price of a group of books, grouped by type:

select max(avg(price)) 
from titles 
group by type

------------- 
        21.48 
 
(1 row affected)

By definition, the group by clause applies to the innermost aggregate--in this case, avg.

Null values and group by

If the grouping column contains a null value, that row becomes its own group in the results. If the grouping column contains more than one null value, the null values form a single group. Here is an example that uses group by and the advance column, which contains some null values:

select advance, avg(price * 2) 
from titles 
group by advance

advance                                            
------------------  -----------------  
             NULL               NULL  
             0.00              39.98  
          2000.00              39.98  
          2275.00              21.90  
          4000.00              19.94  
          5000.00              34.62  
          6000.00              14.00  
          7000.00              43.66  
          8000.00              34.99  
         10125.00               5.98  
         15000.00               5.98  
 
(11 rows affected)

If you are using the count(column_name) aggregate function, grouping by a column that contains null values will return a count of zero for the grouping row, since count(column_name) does not count null values. In most cases, you should use count(*) instead. This example groups and counts on the price column from the titles table, which contains null values, and shows count(*) for comparison:

select price, count(price), count(*) 
from titles 
group by price

price   
------------- ----- ----- 
        NULL     0     2 
        2.99     2     2 
        7.00     1     1 
        7.99     1     1 
       10.95     1     1 
       11.95     2     2 
       14.99     1     1 
       19.99     4     4 
       20.00     1     1 
       20.95     1     1 
       21.59     1     1 
       22.95     1     1 
 
(12 rows affected)

where clause and group by

You can use a where clause in a statement with group by. Rows that do not satisfy the conditions in the where clause are eliminated before any grouping is done. Here is an example:

select type, avg(price)
from titles
where advance > 5000
group by type

type 
-------------    -------- 
business            2.99 
mod_cook            2.99 
popular_comp       21.48 
psychology         14.30 
trad_cook          17.97 
 
(5 rows affected)

Only the rows with advances of more than $5000 are included in the groups that are used to produce the query results.
However, the way that Adaptive Server handles extra columns in the select list and the where clause may seem contradictory. For example:

select type, advance, avg(price) 
from titles 
where advance > 5000
group by type

type           advance
-------------  ---------  --------
business        5,000.00      2.99
business        5,000.00      2.99
business       10,125.00      2.99
business        5,000.00      2.99
mod_cook            0.00      2.99
mod_cook       15,000.00      2.99
popular_comp    7,000.00     21.48
popular_comp    8,000.00     21.48
popular_comp        NULL     21.48
psychology      7,000.00     14.30
psychology      2,275.00     14.30
psychology      6,000.00     14.30
psychology      2,000.00     14.30
psychology      4,000.00     14.30
trad_cook       7,000.00     17.97
trad_cook       4,000.00     17.97
trad_cook       8,000.00     17.97
 
(17 rows affected)

It only seems as if the query is ignoring the where clause when you look at the results for the advance (extended) column. Adaptive Server still computes the vector aggregate using only those rows that satisfy the where clause, but it also displays all rows for any extended columns that you include in the select list. To further restrict these rows from the results, you must use a having clause (described later in this chapter).

group by and all

The keyword all in the group by clause is a Transact-SQL enhancement. It is meaningful only if the select statement in which it is used also includes a where clause.
If you use all, the query results include all the groups produced by the group by clause, even if some groups do not have any rows that meet the search conditions. Without all, a select statement that includes group by does not show groups for which no rows qualify.
Here is an example:

select type, avg(advance)  
from titles  
where advance > 1000 and advance < 10000 
group by type

type                                    
------------  ------------------------   
business                      5,000.00   
popular_comp                  7,500.00   
psychology                    4,255.00   
trad_cook                     6,333.33   
 
(4 rows affected)

select type, avg(advance)  
from titles  
where advance > 1000 and advance < 10000
group by all type

type                                    
------------  ------------------------   
UNDECIDED                         NULL   
business                      5,000.00   
mod_cook                          NULL   
popular_comp                  7,500.00   
psychology                    4,255.00   
trad_cook                     6,333.33   
 
(6 rows affected)

The first statement produces groups only for those books that commanded advances of more than $1000 but less than $10,000. Since no modern cooking books have an advance within that range, there is no group in the results for the mod_cook type.
The second statement produces groups for all types, including modern cooking and "UNDECIDED," even though the modern cooking group does not include any rows that meet the qualification specified in the where clause. Adaptive Server returns a NULL result for all groups that lack qualifying rows.

Using aggregates without group by

By definition, scalar aggregates apply to all rows in a table, producing a single value for the whole table for each function. The Transact-SQL extension that allows you to include extended columns with vector aggregates also allows you to include extended columns with scalar aggregates. For example, look at the publishers table:

pub_id pub_name             city           state

------ ------------------   -------------- ----- 
0736   New Age Books        Boston         MA    
0877   Binnet & Hardley     Washington     DC    
1389   Algodata Infosystems Berkeley       CA

It contains three rows. The following query produces a three-row scalar aggregate based on each row of the table:

select pub_id, count(pub_id)
from publishers

pub_id 
---------- ---------
0736               3
0877               3
1389               3
 
(3 rows affected)

Because Adaptive Server treats publishers as a single group, the scalar aggregate applies to the (single-group) table. The results display every row of the table for each column you include in the select list, in addition to the scalar aggregate.
The where clause behaves the same way for scalar aggregates as with vector aggregates. The where clause restricts the columns included in the aggregate summary values, but it does not affect the rows that appear in the results for each extended column you specify in the select list. For example:

select pub_id, count(pub_id) 
from publishers 
where pub_id < "1000"

pub_id 
-------------- ----------- 
0736                    2 
0877                    2 
1389                    2 
 
(3 rows affected)

Like the other Transact-SQL extensions to group by, this extension provides results that may be difficult to understand, especially for queries on large tables or queries with multitable joins.

DB diablo

Tuesday, September 14, 2010

Group by (SQL)