Optimizer statistics are a collection of data that describe the database, and the objects in the database. These statistics are used by the Optimizer to choose the best execution plan for each SQL statement. Statistics are stored in the data dictionary, and can be accessed using data dictionary views such as USER_TAB_STATISTICS. Optimizer statistics are different from the performance statistics visible through V$ views. The information in the V$ views relates to the state of the system and the SQL workload executing on it.

Figure 1. Optimizer Statistics stored in the data dictionary used by the Optimizer to determine execution plans

Table statistics include information on the number of rows in the table, the number of data blocks used for the table, as well as the average row length in the table. The Optimizer uses this information, in conjunction with other statistics, to compute the cost of various operations in an execution plan, and to estimate the number of rows the operation will produce. For example, the cost of a table access is calculated using the number of data blocks combined with the value of the parameter DB_FILE_MULTIBLOCK_READ_COUNT. You can view table statistics in the dictionary view USER_TAB_STATISTICS.

Column statistics include information on the number of distinct values in a column (NDV) as well as the minimum and maximum value found in the column. You can view column statistics in the dictionary view USER_TAB_COL_STATISTICS. The Optimizer uses the column statistics information in conjunction with the table statistics (number of rows) to estimate the number of rows that will be Understanding Optimizer Statistics

returned by a SQL operation. For example, if a table has 100 records, and the table access evaluates an equality predicate on a column that has 10 distinct values, then the Optimizer, assuming uniform data distribution, estimates the cardinality to be the number of rows in the table divided by the number of distinct values for the column or 100/10 = 10.

Figure 2. Cardinality calculation using basic table and column statisticsBasic table and column statistics tell the optimizer a great deal but they don’t provide a mechanism to tell the Optimizer about the nature of the data in the table or column. For example, these statistics can’t tell the Optimizer if there is a data skew in a column, or if there is a correlation between columns in a table. Information on the nature of the data can be provided to the Optimizer by using extensions to basic statistics like, histograms, column groups, and expression statistics.

Histograms tell the Optimizer about the distribution of data within a column. By default (without a histogram), the Optimizer assumes a uniform distribution of rows across the distinct values in a column. As described above, the Optimizer calculates the cardinality for an equality predicate by dividing the total number of rows in the table by the number of distinct values in the column used in the equality predicate. If the data distribution in that column is not uniform (i.e., a data skew) then the cardinality estimate will be incorrect. In order to accurately reflect a non-uniform data distribution, a histogram is required on the column. The presence of a histogram changes the formula used by the Optimizer to estimate the cardinality, and allows it to generate a more accurate execution plan.

Oracle automatically determines the columns that need histograms based on the column usage information (SYS.COL_USAGE$), and the presence of a data skew. For example, Oracle will not automatically create a histogram on a unique column if it is only seen in equality predicates. There are two types of histograms, frequency or height-balanced. Oracle determines the type of histogram to be created based on the number of distinct values in the column.

Frequency histograms are created when the number of distinct values in the column is less than 254. Oracle uses the following steps to create a frequency histogram.

1. Let’s assume that Oracle is creating a frequency histogram on the PROMO_CATEGORY_ID column of the PROMOTIONS table. The first step is to select the PROMO_CATEGORY_ID from the PROMOTIONS table ordered by PROMO_CATEGORY_ID. 2. Each PROMO_CATEGORY_ID is then assigned to its own histogram bucket (Figure 3).

Figure 3. Step 2 in frequency histogram creation

3. At this stage we could have more than 254 histogram buckets, so the buckets that hold the same value are then compressed into the highest bucket with that value. In this case, buckets 2 through 115 are compressed into bucket 115, and buckets 484 through 503 are compressed into bucket 503, and so on until the total number of buckets remaining equals the number of distinct values in the column (Figure 4).

Note the above steps are for illustration purposes. The DBMS_STATS package has been optimized to build compressed histograms directly.

Figure 4. Step 3 in frequency histogram creation: duplicate buckets are compressed

4. The Optimizer now accurately determines the cardinality for predicates on the PROMO_CATEGORY_ID column using the frequency histogram. For example, for the predicate PROMO_CATEGORY_ID =10, the Optimizer would first need to determine how many buckets in the histogram have 10 as their end point. It does this by finding the bucket whose endpoint is 10, bucket 503, and then subtracts the previous bucket number, bucket 483, 503 - 483 = 20. Then the cardinality estimate would be calculated using the following formula (number of bucket endpoints / total number of bucket) X NUM_ROWS, 20/503 X 503, so the number of rows in the PROMOTOINS table where PROMO_CATEGORY_ID =10 is 20.

Height-balanced histograms are created when the number of distinct values in the column is greater than 254. In a height-balanced histogram, column values are divided into buckets so that each bucket contains approximately the same number of rows. Oracle uses the following steps to create a height-balanced histogram.

- Let’s assume that Oracle is creating a height-balanced histogram on the CUST_CITY_ID column of the CUSTOMERS table because the number of distinct values in the CUST_CITY_ID column is greater than 254. Just like with a frequency histogram, the first step is to select the CUST_CITY_ID from the CUSTOMERS table ordered by CUST_CITY_ID.
- There are 55,500 rows in the CUSTOMERS table and there is a maximum of 254 buckets in a histogram. In order to have an equal number of rows in each bucket, Oracle must put 219 rows in each bucket. The 219th CUST_CITY_ID from step one will become the endpoint for the first bucket. In this case that is 51043. The 438th CUST_CITY_ID from step one will become the endpoint for the second bucket, and so on until all 254 buckets are filled (Figure 5).
- Once the buckets have been created Oracle checks to see if the endpoint of the first bucket is the minimum value for the CUST_CITY_ID column. If it is not, a “zero” bucket is added to the histogram that has the minimum value for the CUST_CITY_ID column as its end point (Figure 6).
- Just as with a frequency histogram, the final step is to compress the height-balanced histogram, and remove the buckets with duplicate end points. The value 51166 is the end

Figure 5. Step 2 of height-balance histogram creation: put an equal number of rows in each bucket

Figure 6. Step 3 of height-balance histogram creation: add a zero bucket for the min value

point for bucket 24 and bucket 25 in our height-balanced histogram on the CUST_CITY_ID column. So, bucket 24 will be compressed in bucket 25 (Figure 7).

Figure 7. Step 4 of height-balance histogram creation

The Optimizer now computes a better cardinality estimate for predicates on the CUST_CITY_ID column by using the height-balanced histogram. For example, for the predicate CUST_CITY_ID =51806, the Optimizer would first check to see how many buckets in the histogram have 51806 as their end point. In this case, the endpoint for bucket 136,137,138 and 139 is 51806(info found in USER_HISTOGRAMS). The Optimizer then uses the following formula:

(Number of bucket endpoints / total number of buckets) X number of rows in the table In this case 4/254 X 55500 = 874

Figure 8. Height balanced histogram used for popular value cardinality estimate

However, if the predicate was CUST_CITY_ID =52500, which is not the endpoint for any bucket then the Optimizer uses a different formula. For values that are the endpoint for only one bucket or are not an endpoint at all, the Optimizer uses the following formula:

DENSITY X number of rows in the table

where DENSITY is calculated ‘on the fly’ during optimization using an internal formula based on information in the histogram. The value for DENSITY seen in the dictionary view USER_TAB_COL_STATISTICS is not the value used by the Optimizer from Oracle Database 10.2.0.4 onwards. This value is recorded for backward compatibility, as this is the value used in Oracle Database 9i and earlier releases of 10g. Furthermore, if the parameter OPTIMIZER_FEATURES_ENABLE is set to version release earlier than 10.2.0.4, the value for DENSITY in the dictionary view will be used.

Figure 9. Height balanced histogram used for non- popular value cardinality estimate

In Oracle Database 11g, extensions to column statistics were introduced. Extended statistics encompasses two additional types of statistics; column groups and expression statistics.

In real-world data, there is often a relationship (correlation) between the data stored in different columns of the same table. For example, in the CUSTOMERS table, the values in the CUST_STATE_PROVINCE column are influenced by the values in the COUNTRY_ID column, as the state of California is only going to be found in the United States. Using only basic column statistics, the Optimizer has no way of knowing about these real-world relationships, and could potentially miscalculate the cardinality if multiple columns from the same table are used in the where clause of a statement. The Optimizer can be made aware of these real-world relationships by having extended statistics on these columns as a group.

By creating statistics on a group of columns, the Optimizer can compute a better cardinality estimate when several the columns from the same table are used together in a where clause of a SQL statement. You can use the function DBMS_STATS.CREATE_EXTENDED_STATS to define a column group you want to have statistics gathered on as a group. Once a column group has been created, Oracle will automatically maintain the statistics on that column group when statistics are gathered on the table, just like it does for any ordinary column (Figure 10).

Figure 10. Creating a column group on the CUSTOMERS table

After creating the column group and re-gathering statistics, you will see an additional column, with a system-generated name, in the dictionary view USER_TAB_COL_STATISTICS. This new column represents the column group (Figure 11).

Figure 11. System generated column name for a column group in USER_TAB_COL_STATISTICS

To map the system-generated column name to the column group and to see what other extended statistics exist for a user schema, you can query the dictionary view USER_STAT_EXTENSIONS (Figure 12).

Figure 12. Information about column groups is stored in USER_STAT_EXTENSIONS

The Optimizer will now use the column group statistics, rather than the individual column statistics when these columns are used together in where clause predicates. Not all of the columns in the column group need to be present in the SQL statement for the Optimizer to use extended statistics; only a subset of the columns is necessary.

It is also possible to create extended statistics for an expression (including functions), to help the Optimizer to estimate the cardinality of a where clause predicate that has columns embedded inside expressions. For example, if it is common to have a where clause predicate that uses the UPPER function on a customer’s last name, UPPER(CUST_LAST_NAME)=:B1, then it would be beneficial to create extended statistics for the expression UPPER(CUST_LAST_NAME)(Figure 13).

Figure 13. Extended statistics can also be created on expressions

Just as with column groups, statistics need to be re-gathered on the table after the expression statistics have been defined. After the statistics have been gathered, an additional column with a system-generated name will appear in the dictionary view USER_TAB_COL_STATISTICS representing the expression statistics. Just like for column groups, the detailed information about expression statistics can be found in USER_STAT_EXTENSIONS.

Restrictions on Extended StatisticsExtended statistics can only be used when the where the clause predicates are equalities or in-lists. Extended statistics will not be used if there are histograms present on the underlying columns and there is no histogram present on the column group.

Index statistics provide information on the number of distinct values in the index (distinct keys), the depth of the index (blevel), the number of leaf blocks in the index (leaf_blocks), and the clustering factor1. The Optimizer uses this information in conjunction with other statistics to determine the cost of an index access. For example the Optimizer will use b-level, leaf_blocks and the table statistics num_rows to determine the cost of an index range scan (when all predicates are on the leading edge of the index).

Gathering Statistics >>