So far we have seen sequential file organization and fetching of records based on the primary key or the column which is frequently used in the query. But in real life the query is not restricted to primary key or single column. It will have various needs to see the records in different combinations of columns. In such case indexes on primary/single column will degrade the performance of the query. Because the table will have huge number of records, and index on any one column will efficiently fetch all the records which satisfies the search criteria for indexed column. But the query is still not complete. It still has to filter the data with other search criteria on which we do not have index. What would be the result? This fetch becomes slow, in total reducing the cost of the query.
For example consider employee table with huge number of employees in different department. Imagine we have to select all the employees whose salary is 5000 and works for department ID 20. Also, we have index created only on DEPT_ID. The query to select the records is as follows:
SELECT * FROM EMPLOYEE WHERE DEPT_ID = 20 and SALARY = 5000;
Since there is index on DEPT_ID, fetching all the employees who are working for department 20 is quick. But it will return large number of employees. Out of those employees, it has to filter the employees with salary = 5000. But this is same as querying a very big table with one search criteria. Hence this will be time consuming part of the query.
In real life situations, no one likes waiting and so is the case with query. We need to think of some better options than having index on single column/ primary keys.
To handle above situation, multiple key accesses are introduced. Here, we use combination of two or more columns which are frequently queried to get index. In the above example, both DEPT_ID and SALARY are clubbed into one index and are stored in the files. Now what happens when we fire above query? It filters both DEPT_ID = 20 and SALARY = 5000 at a shot and returns the result.
This type of indexing works well when all the columns used in the index are involved in the query. In above example we have index on (DEPT_ID, SALARY) and both the columns are used in the query. Hence it resulted quickly. But what will happen when any one of the column is used in the query? This index will not be used to fetch the record. Hence query becomes slower.
Imagine another case where we have queried like below. Here both the columns are used differently- we have queried to find the salary less than 5000, not equal to 5000. What will happen in this case?
SELECT * FROM EMPLOYEE WHERE DEPT_ID = 20 and SALARY < 5000;
This query will use the index and will fetch the result quickly. Reason is it has to find the address of the record in the file where DEPT_ID = 20 and SALARY = 5000 and then it has to return all the records which are stored previous to this address. Hence the query becomes simple and faster.
Imagine another case for the same example, where DEPT_ID <20 but SALARY = 5000 is used to fetch the record. What will happen in this case?
SELECT * FROM EMPLOYEE WHERE DEPT_ID < 40 and SALARY = 5000;
This query will not use the index. Why? The order of the columns in the index is (DEPT_ID, SALARY). The multiple key indexes work as below.
The condition (DEPT_ID1, SALARY1) < (DEPT_ID2, SALARY2) is used only if
DEPT_ID1 < DEPT_ID2
DEPT_ID1 = DEPT_ID2 AND SALARY1 <SALARY2
In this example, we are checking for second condition, where we have DEPT_ID <40. But SALARY < 5000 should have been used for the index to be used. But it is not the case in the query. That means, it should have only one condition DEPT_ID <40 or both conditions like DEPT_ID = 20 AND SALARY<5000 to use the index. This method of ordering in index is called lexicographic ordering.
Have a look at below index file to understand these examples. We can see that, records are first grouped based on DEPT_ID and then on SALARY. If we have to select DEPT_ID <40 alone, then all the records less than DEPT_ID =40 address have to selected which easy and quick. We get all records in a sequence here. DEPT_ID1 <DEPT_ID2 condition holds well here.
If DEPT_ID =20 but SALARY < 5000 condition is used, pointer will go to DEPT_ID = 20 Section in the file and then fetch all the records less than 5000. The index is used in the query making it faster. Here, second conditions for lexicographic works.
What if we have to search DEPT_ID <40 and SALARY = 5000, it is same as traversing the whole table. We will not be able to find all requested records at one place.
Similarly, if only one column in the index is used, say SALARY = 5000, then also searching is a full table scan. Below diagram clearly shows SALARY = 5000 is scattered in the memory file and have to traverse whole file to get the result. Hence it is not efficient.
- One has to thoroughly examine all kinds of queries that are fired on the table. This will help us to decide the columns to be put in the index. Also, it will determine the feasibility of index to run faster queries. That means, if employees’ details are retrieved based on DEPT_ID and SALARY, and then it is feasible use index on both (DEPT_ID, SALARY). Else it is an overhead on table performance.
- In general case, it is expected to put columns with highest unique value first. That is if create (SALARY, DEPT_ID) index than (DEPT_ID, SALARY) is most efficient. This is because; it will trim down the result set. That means, employees with same salary will have smaller result set than employees in the same department, hence making the search faster. But again selecting the order of columns in the index is up to developer. In our example above, we can see the affect of selecting the index order as (DEPT_ID, SALARY). It gave us more meaningful distribution of records in the file than grouping them based on salary. Also, our query on EMPLOYEE table is such that grouping on DEPT_ID first seems to be better option.
- It works efficiently when all the columns of the index is used in the query, provided it satisfies lexicographic ordering.
- It works efficiently even on larger tables. This is because, records are sorted based on the index.
- It works only for lexicographic ordering. The second example above does not use the index and cost of the query is high.
- It does not work efficiently if only one or partial columns of index are involved in querying. It does not use the index to retrieve the data.