Estimating Statistics of Expression results in DBMS - Estimating Statistics of Expression results in SQL

In order to determine ideal plan for evaluating the query, it checks various details about the tables that are stored in the data dictionary. These informations about tables are collected when a table is created and when various DDL / DML operations are performed on it. The optimizer checks data dictionary for :

Total number of records in a table, nr. This will help to determine which table needs to be accessed first. Usually smaller tables are executed first to reduce the size of the intermediary tables. Hence it is one of the important factors to be checked.

Total number of records in each block, fr. This will be useful in determining blocking factor and is required to determine if the table fits in the memory or not.

Total number of blocks assigned to a table, br. This is also an important factor to calculate number of records that can be assigned to each block. Suppose we have 100 records in a table and total number of blocks are 20, then fr can be calculated as nr/b r = 100/20 = 5.

Total length of the records in the table, l r. This is an important factor when the size of the records varies significantly between any two tables in the query. If the record length is fixed, there is no significant affect. But when a variable length records are involved in the query, average length or actual length needs to be used depending upon the type of operations.

Number of unique values for a column, d Ar. This is useful when a query uses aggregation operation or projection. It will provide an estimate on distinct number of columns selected while projection. Number groups of records can be determined using this when Aggregation operation is used in the query. E.g.; SUM, MAX, MIN, COUNT etc.

Levels of index, x. This data provides the information like whether the single level of index like primary key index, secondary key indexes are used or multi-level indexes like B+ tree index, merge-sort index etc are used. These index levels will provide details about number of block access required to retrieve the data.

Selection cardinality of a column, s A. This is the number of records present with same column value as A. This is calculated as nr/d Ar. i.e.; total number of records with distinct value of A. For example, suppose EMP table has 500 records and DEPT_ID has 5 distinct values. Then the selection cardinality of DEPT_ID in EMP table is 500/ 5 = 100. That means, on an average 100 employees are distributed among each department. This is helpful in determining average number of records that would satisfy selection criteria.

There many other factors too like index type, data file type, sorting order, type of sorting etc.