The main goal of creating a database is to store the related data at one place, access and manipulate them as and when it is required by the user. Accessing and manipulating the data should be done efficiently i.e.; it should be accessed easily and quickly.
But a database is a system and the users are either another system or application or a person. The user can request the data in a language that he understands. But DBMS has its own language (SQL) which it understands. Hence the users are asked to query the database in its language – SQL. This SQL is a high level language created to build a bridge between user and DBMS for their communication. But the underlying systems in the DBMS will not understand SQL. There has to be some low level language which these systems can understand. Usually any query written in SQL is converted into low level language using relational algebra which system can understand. But it will be difficult for any user to directly write relational algebra kind of queries. It requires thorough knowledge of it.
Hence what DBMS does is it asks its users to write query in SQL. It verifies the code written by the user and then converts them into low level languages. It then selects the best execution path and executes the query and gets the data from internal memory. All these processes are together known as query processing.
It is the step by step process of breaking the high level language into low level language which machine can understand and perform the requested action for user. Query processor in the DBMS performs this task.
Above diagram depicts how a query is processed in the database to show the result. When a query is submitted to the database, it is received by the query compiler. It then scans the query and divides it into individual tokens. Once the tokens are generated, they are verified for their correctness by the parser. Then the tokenized queries are transformed into different possible relational expressions, relational trees and relational graphs (Query Plans). Query optimizer then picks them to identify the best query plan to process. It checks in the system catalog for the constraints and indexes and decides the best query plan. It generates different execution plans for the query plan. The query execution plan then decides the best and optimized execution plan for execution. The command processor then uses this execution plan to retrieve the data from the database and returns the result. This is an overview of how a query processing works. Let us see in detail in below.
- Parsing and Translation
- Query Optimization
- Evaluation or query code generation
- Execution in DB’s runtime processor.
Parsing and Translation
This is the first step of any query processing. The user typically writes his requests in SQL language. In order to process and execute this request, DBMS has to convert it into low level – machine understandable language. Any query issued to the database is first picked by query processor. It scans and parses the query into individual tokens and examines for the correctness of query. It checks for the validity of tables / views used and the syntax of the query. Once it is passed, then it converts each tokens into relational expressions, trees and graphs. These are easily processed by the other parsers in the DBMS.
Let us try to understand these steps using an example. Suppose user wants to see the student details who are studying in DESIGN_01 class. If the users say ‘Retrieve Student details who are in DESIGN_01 class’, the DBMS will not understand. Hence DBMS provides a language – SQL which both user and DBMS can understand and communicate with each other. This SQL is written in simple English like form which both can understand. So the user would write his request in SQL as below:
SELECT STD_ID, STD_NAME, ADDRESS, DOB FROM STUDENT s, CLASS c WHERE s.CLASS_ID = c.CLASS_ID AND c.CLASS_NAME = ‘DESIGN_01’;
When he issues this query, the DBMS reads and converts it into the form which DBMS can use to further process and synthesis it. This phase of query processing is known as parsing and translation phase. The query processor scans the SQL query submitted and divides into individual meaningful tokens. In our example, ’SELECT * FROM’, ‘STUDENT s’, ‘CLASS c’, ‘WHERE’, ‘s.CLASS_ID = c.CLASS_ID’, ‘AND’ and ‘c.CLASS_NAME = ‘DESIGN_01’’ are the different tokens. These tokenized forms of query are easily used by the processor to further processing. It fires query on the data dictionary tables to verify if the tables and columns in these tokens exists or not. If they are not present in the data dictionary, then the submitted query will be failed at this stage itself. Else it proceeds to find if the syntax used in the query are correct. Please note that it does not validate if DESIGN_01 exists in the table or not, it verifies if ’SELECT * FROM’, ‘WHERE’, ‘s.CLASS_ID = c.CLASS_ID’, ‘AND’ etc have SQL defined syntaxes. Once it validates the syntaxes, it converts them into a relational algebra, relational tree and graph representations. These are easily understood and handled by the optimizer for further processing. Above query can be converted into any of the two forms of relation algebra as below. First query identifies the students in DESIGN_01 class first and then selects only the requested columns from it. Another query first selects requested columns from the STUDENT table and then filters it for DESIGN_01. Both of them results in same result.
∏ STD_ID, STD_NAME, ADDRESS, DOB (σ CLASS_NAME = ‘DESIGN_01’ (STUDENT ∞CLASS))
σ CLASS_NAME = ‘DESIGN_01’ (∏ STD_ID, STD_NAME, ADDRESS, DOB (STUDENT ∞CLASS))
Query processor then applies the rules and algorithms on these relational structures to represent more efficient and powerful structures which are used only by the DBMS. These structures are based on the mappings between the tables, joins used, cost of execution algorithm of these queries. It determines which structure – selecting and then projecting or projecting and then selecting – is the efficient way of processing, when to apply filters etc. In the third step of query processing, the best structure and plan selected by the optimizer is selected and executed. It digs into the database memory to retrieve the records based on the plan. Sometimes it process and compiles the query and keeps it in DB to use it in the runtime DB processor. The result is then returned to the user. This is the overall step processed by the DBMS when a simple to complex query is fired. The time taken by all these process will be in fraction of seconds. But ideal optimization and selection of execution path will make the query even faster
Cost of query is the time taken by the query to hit the database and return the result. It involves query processing time i.e.; time taken to parse and translate the query, optimize it, evaluate, execute and return the result to the user is called cost of the query. Though it is in fraction of seconds, it includes multiple sub tasks and time taken by each of them. Executing the optimized query involves hitting the primary and secondary memory based on the file organization method. Depending on file organization and the indexes used, time taken to retrieve the data may vary.
Majority of time is spent by the query in accessing the data from the memory. It too has several factors determining the cost of access time – disk I/O time, CPU time, network access time etc. Disk access time is the time taken by the processor to search and find the record in the secondary memory and return the result. This takes the majority of time while processing a query. Other times can be ignored compared to disk I/O time.
While calculating the disk I/O time, usually only two factors are considered – seek time and transfer time. The seek time is the time taken the processor to find a single record in the disk memory and is represented by tS. For example, in order to find the student ID of a student ‘John’, the processor will fetch in the memory based on the index and the file organization method. The time taken by the processor to hit the disk block and search for his ID is called the seek time. The time taken by the disk to return fetched result back to the processor / user is called transfer time and is represented by tT.
Suppose a query need to seek S times to fetch a record and there is B blocks needs to be returned to the user. Then the disk I/O cost is calculated as below
(S* tS)+ (B* tT)
That is, it is the sum of the total time taken for seek S times and the total time taken to transfer B blocks. Here other costs like CPU cost, RAM cost etc are ignored as they are comparatively small. Disk I/O alone is considered as cost of a query. But we have to calculate the worst case cost – the maximum time taken by the query when there is a worst case like buffer is full or no buffers etc. because the memory space / buffers depend on the number of queries executing in parallel. All queries would be using the buffers and determining the number of buffers / blocks available for our query is unpredictable. The processor might have to wait till it gets all the memory blocks.
We have already seen indexes and index based file organization system. Hence we can now imagine how it will affect the query processing and how it will fasten the time of retrieval. They drastically shorten the query execution time. Let see different indexes and their roles in reducing the processing time.
In this type of indexes, index is created for each of the search key values and are sorted based on the search key. For example, if STD_ID is the search key, then we will have entry for each STD_ID in the index table. In such case we need not traverse through the files from beginning till we get the desired value. This index helps to directly fetch the record with single seek. Even if we want to search range of records, searching the initial record would suffice. Hence only single seek will have to be performed. Hence the cost can be calculated as:
tS+ (B* tT)
It is the total of time taken to transfer B blocks and single seek time of the query.
In this type of index, the search key values are grouped into multiple blocks and only the starting key value of the block is indexed. i.e.; if STD_ID is the search key, then its values are grouped of, say 10 records each (i.e.; 100, 110, 120, 130…) and only 100, 110, 120, 130 etc will have the index stored in the index table. Hence when we have to search for any record, we can get the block where the data is stored in a single seek, but we have to traverse linearly in the block till we get the requested record. i.e.; if we need to search STD_ID = 119, in a single seek we will get into the block of 110, but inside that block we have to traverse / seek 9 times to get 119. Hence we need 10 seeks in this case.
Suppose we have to search for range of records from 119 – 125, then we have to hit the index block 110 to fetch 119 and we have to hit 120 too to fetch records from 120 to 125. Hence the number of blocks accessed and transferred is 2.
(S* tS)+ (B* tT)
Hence it is less efficient compared to dense index, but better than the linear search which searches from the beginning of a file (i.e.; number seeks are less compared to linear search). But it requires less space compared to dense index. Also we need not worry of storing the indexes for newly added records or updated records.
In this method primary key itself is the search key and indexes are stored for these primary key. It can be dense or sparse index. But since it is a primary key, records will be stored in a sorted form. Hence when we have to search for a key, we will get it in a single seek. Even if we have to search for a range of values, in single seek we would get the starting record of range. Then rest of the records is fetched sequentially from it. Hence the number of blocks to be searched is less.
Here the index is created on the keys other than search key. Hence we have to store each value of secondary index in the index table, making it dense. In this method searching a key value will be difficult and each record will be traversed to get the record.
in this method, search key values are grouped into two or more levels. i.e.; say initial level of index will contain search key values 100,200,300 etc , and each of which will point to secondary level where we will have search key value ranges like 100, 110,120…200,210,.. etc. these secondary level index can point to third level or directly to the records in the file. This will reduce the number of block reads as we saw in sparse indexes. It is usually helpful in the case of binary search.
Here two or more search key values are clustered together and it will point to the actual data in the file. It can be multi-leveled too. It will reduce the seek time as well as the number of blocks to be read.
B+ Tree Index
In this method all the search key values are kept at equidistance from the root and all the search key values will be at the child node. Intermediary node will have only pointer to the actual search key node. The height of B+ tree is same for all the nodes. In this method, maximum seek required to search any key value is logarithmic and is log (h). But this is more compared to other indexes where we can even search records in single seek. Advantage of B+ tree is, it is sorted and balanced. We need not rearrange the records when there is insertion or deletion since it automatically does it. Hence it gives better performance in high performance required systems.