Optimization in SQL

Table of Contents

1. What is Optimization?

Optimization is part of RDBMS. It analyzes the query and tells the parser which path is the best path to execute the query. This is very much essential because when best path is selected to execute the query, it saves lots of time and resource.

Consider a query to find the name of the employee whose phone number is 456-234-876. Here, the query optimizer will analyze the query first.

SELECT FIRST_NAME, LAST_NAME FROM EMPLOYEES
WHERE PHONE_NUMBER = '456.234.876';

The optimizer will see that there is a filter condition on phone number and for that phone number first and last name needs to be selected. If the optimizer chooses full table scan, it will have to traverse each and every record in the EMPLOYEES table till it get above phone number. It may get this number at the beginning records or may be at the last records. So time consumed to get the last record in a very huge table like EMPLOYEES is more. At the same time optimizer sees that there is an index on PHONE_NUMBER column. This index will directly points to the address where the record with this phone number is saved. Therefore it understands that selecting the record using index scan is more efficient than full table scan. Hence query optimizer suggests using index scan on PHONE_NUMBER than full table scan. This is how an optimization works.

There are two types of optimization – Cost based and Rule based.

2. What is Cost based Optimization and how does it work?

Like the name suggests, cos t based optimizer optimizes the query based on the cost of the query. It collects various statistics about the database and the table, and then decides which path to be taken to execute the query. The statistics include database type, type of transaction mode, number of records in a table, number of indexes, columns on which indexes are created, uniqueness of the column values, selectivity of the column etc. Based on all these data, query optimizer decides the best path to be executed.

In the above example, the employee’s first name and last name is retrieved based on the cost of the EMPLOYEES table. It is decided by the optimizer that index scan is better to execute the query. That means cost of the index scan is less than the full table scan.

3. What is Rule based Optimization and how does it work?

In Rule based optimization, it applies set of rules on the query to check best suited path to execute the query. It does not bother about the cost of the query. This is based on the equivalence rule on relational expressions; hence the number of combination of queries get reduces here. Hence the cost of the query too reduces.

This method creates relational tree for the given query based on the equivalence rules. These equivalence rules by providing an alternative way of writing and evaluating the query, gives the better path to evaluate the query. This rule need not be true in all cases. It needs to be examined after applying those rules.

Some of the rules applied while rule based optimization is carried out is:

Perform all the selection operation as early as possible in the query. That means selecting by applying all the filter conditions will reduce the number of records in the result set. Otherwise we need to perform the joins on whole table data. This should be first and foremost set of actions on the tables in the query.

For example, consider below query:

SELECT FIRST_NAME, LAST_NAME, PHONE_NUMBER, EMAIL, JOB_ID
FROM EMPLOYEES emp
WHERE DEPARTMENT_ID = 20;

Here in this query, we are going to select first name, last name, phone number, email and job id of the employees who are working in department with id 20. Here we can select all required 5 columns first from the EMPLOYEES table and then we can apply filter to department id = 20. But this will require lot of space to hold the columns for all the records. Instead, if we apply the filter it will reduce the number of records to two and then we can extract the columns from it. Here space required to hold all the columns for two records is much lesser than space required to hold 5 columns of whole table records.

Perform all the projection as early as possible in the query. This is similar to selection but will reduce the number of columns in the query.

For example, consider below query :

SELECT emp.FIRST_NAME, emp.LAST_NAME, emp.PHONE_NUMBER, 	
	emp. EMAIL, emp.JOB_ID, dept.DEPARTMENT_NAME
FROM EMPLOYEES emp, DEPARTMENTS dept
WHERE emp.DEPARTMENT_ID = dept.DEPARTMENT_ID
AND emp.DEPARTMENT_ID = 20;

Here two tables are joined and columns are selected from both the tables. Now, we can see that there is a filter condition on the DEPARTMENT_ID of EMPLOYEES table. As the first rule, we need to filter the records for department id = 20. It will reduce the number of records. Since there is two tables involved, we need to join them. But if we join the tables at this stage, we will have to store all the columns of both the tables. Rather than doing this, if we first extract required columns from both the tables and then join them, then it will reduce the space required to store the data. That means, in total it requires space to store the 5 columns of 2 records from EMPLOYEES table and 1 column for whole department names. If we first joined then selected, then we need space to store whole columns for 2 records of EMPLOYEES and whole columns for all the records of DEPARTMENTS.

Next step is to perform most restrictive joins and selection operations. When we say most restrictive joins and selection means, select those set of tables and views which will result in comparatively less number of records. Any query will have better performance when tables with few records are joined.

SELECT FIRST_NAME, LAST_NAME, PHONE_NUMBER, 
	EMAIL, JOB_ID, dept.DEPARTMENT_NAME, loc.LOCATION_NAME
FROM EMPLOYEES emp, DEPARTMENTS dept, LOCATIONS loc
WHERE emp.DEPARTMENT_ID = dept.DEPARTMENT_ID
AND dept.LOCATION_ID = loc.LOCATION_ID 
AND dept.DEPARTMENT_ID = 20;

Here three tables are joined to get the result. Here we can evaluate the query in two ways.

First join EMPLOYEES and DEPARTMENTS and then join this result with LOCATIONS.
Second method is to join DEPARTMENTS and LOCATIONS first and then join this result with EMPLOYEES.

If we observe both these methods, in the first method we have to join one huge table with another comparatively bigger table to get huge result set which then joined with small table to get the result. In the second method, two comparatively small tables are joined to get smaller result set, which is then joined with the huge table to get smaller set of result. That means the space consumed by second method is very less.

Sometimes we can combine above heuristic steps with cost based optimization technique to get better results.

4. What is the difference between Cost based Optimization Rule based Optimization?

5. How do you tune SQL queries to improve performance?

There are various methods to tune SQL queries. While tuning the query, there are various factors about query and table affect the performance. First and foremost step is to check its explain plan to understand its current execution path and the cost at each step.

Next step is to check, if any sub queries or filters can reduce the number of rows returned by it. Third step is to reduce the number of columns in the SELECT query, i.e.; keep only those columns that are requested in the query. Once these steps are performed, next step is to check what type of optimizer is being used. If it is cost based optimizer, then it will collect the statistics to evaluate the query. We need to see if the statistics are updated properly or not. Next, we need to check if the order of conditions used in the WHERE clause affects the performance. If yes, then re-arrange the conditions (third rule of rule based optimizer). Next, check if the order of table names makes any difference in the query and rearrange them too. It will sometimes affect the performance based on the selectivity of columns. Make sure indexes are used properly so that query is executed effectively.

6. In databases, what is a full table scan? Also, what are some of the causes of full table scans?

When a SELECT statement is fired, if the query scans all the rows of a table, then it is called full table scan. It will traverse each row of the table to get the requested table. It will affect the performance of the query if the table has lots of data. For example, consider our EMPLOYEES table where we need to find the FIRST_NAME and LAST name of each employee. Then the query has to traverse all the records in the table to get the requested result. That means, it performs full table scan. Imagine that there is lakhs of employees in a table. Then its performance to get all the employee details will slow down.

    SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME
    FROM EMPLOYEES;

There are different scenarios where FULL table scan occur:

WHERE Clause : Like we discussed above, to get the employee details of all the employees will not have any WHERE clause in the query. It will need all the records in the table to be read and its ID, first and last name to be extracted. When we use WHERE clause, it will filter out the results from the table and there will not be any full table scan.

When Statistics of the table is not updated : Statistics are the ones in which the details about the table will be stored. It contains table name, its columns and their size, number of records, active indexes etc. when a query is being executed, it first checks this statistics to get the execution path. if the statistic is not updated, it may use Full table scan even if there exists indexes.

Ignoring the Index : There might be some queries where it will have WHERE clause to filter out the results. But in the WHERE clause columns used to filter the data may not be used for creating index. Hence it will use index. For example, find out all the employees whose salary is equal to 5000. Here SALARY is the column on which index is not created but that column is used to filter the data. Since there is no index on SALARY, it will have to check each and every record to compare the SALARY and get the result.

If we use NOT EQUAL TO (<>) or any other comparison operator (<, >, <=, >=) or any wild card comparison using LIKE and %, then it will not use index scan, rather it will use full table scan.

7. How do you tune database performance?

In order to run the database perfectly, we need to make sure everything in the DB is perfect and intact. For example, in order to run a query on the table, it should have proper data, constraints and indexes. If any one of them is incorrect or disabled, then its performance will degrade.

In a database, tuning the performance is a continuous task. There will be some of the DB objects that will perform better when tuned. There are many factors that affect the performance of the DB. Some of them are:

Using Triggers : If the table has triggers, then executing query on the table data might take extra time to execute the triggers too. This is because, whenever we run the query which in turn calls the trigger, both query and trigger is considered as single transaction. Hence it will complete only when trigger is also executed and time taken to execute the query will have extra time for executing the query. Hence we need to take care of triggers in performance driven queries.

Same Datatypes for Primary and Foreign keys : Primary and foreign keys which are used to join the tables should have same datatype and size. Otherwise it will be an extra cost to convert them into same type. i.e.; it is not good to have primary key as VARCHAR and foreign key as CHAR. It is always better to have both of them same – either VARCHAR or CHAR.

Choosing the datatypes for numeric columns : Whenever numeric column is used in the table, try to use smallest possible numeric datatype. It will save the space to store the data. For example, if the column can store data in SMALLINT then use SMALLINT itself than BIGINT.

CHAR instead of VARCHAR : Whenever there is fixed length of strings in columns of the table, use CHAR instead of VARCHAR. If we use VARCHAR, then it will take 1-3 bytes extra for each column value, which is a waste of space.

Create indexes : In most of the cases indexes give better performance. When there is index, the query is forced to use indexes. But we need to be careful while creating indexes, since if index created on column with less uniqueness, then its performance will degrade. In such case Full table scan will be faster one.