DISTINCT statement is used for eliminating the duplicate rows from the result. When we fire a SELECT statement, it will select all the matching rows from the table. These matching rows may have duplicate entries. When we use DISTINCT it will eliminate all the duplicate rows and retain single entry for such entries.
Consider the DEPARTMENTS table. It has LOCATION_ID which indicates the location of each department.
If we write a simple query to find the locations where departments exist, then it will list all the location ids irrespective of its row and duplicate values.
SELECT LOCATION_ID FROM DEPARTMENTS;
We can notice the duplicate entries of location ids in above result set. But we need not have duplicate values in the result set to know various locations of the departments. We have to eliminate the duplicate values here. Therefore, we use DISTINCT in above query to get different locations.
SELECT DISTINCT LOCATION_ID FROM DEPARTMENTS;
Now we do not have any duplicate values in the result. We can even apply DISTINCT on combination of columns. It need not be on single column like above.
Consider below query to list various MANAGER_ID and DEPARTMENT_ID from the EMPLOYEES table like below.
SELECT MANAGER_ID, DEPARTMENT_ID FROM EMPLOYEES;
We can see duplicate result values in above query. If we want to eliminate the duplicate entries, we can use DISTINCT clause. It will remove the duplicate values on the combination of columns. We can see duplicate values for MANAGER_ID but combination of MANAGER_ID and DEPARTMENT_ID does not have any duplicate values.
SELECT DISTINCT MANAGER_ID, DEPARTMENT_ID FROM EMPLOYEES;
Since we cannot use MAX or MIN aggregate functions, we need to think of logic for getting the maximum and minimum values. The logic for getting these values can be anything but it should return right value. Let us first discuss the simplest way of getting minimum and maximum values by sorting the data.
Consider the DEPARTMENTS table from which we need to find the LOCATION_ID with minimum value and maximum value.
SELECT LOCATION_ID FROM DEPARTMENTS;
It has 27 records with different LOCATION_IDs and is displayed in the order they appear in the table. How can we get minimum value of the LOCATION_ID here without using MIN () function? If we sort the LOCATION_ID column in ascending order, then we will get the smallest number at the first row and the biggest number at the last row. Hence if we sort above query in ascending order, then we will get the result starting from smallest number.
SELECT LOCATION_ID FROM DEPARTMENTS ORDER BY LOCATION_ID;
But we do not want all the LOCATION_IDs. We need only the smallest ID alone. Hence we need to restrict the number of rows in the SELECT Statement above to one row by using ROWNUM. But ROWNUM condition will be given in the WHERE clause. If we restrict the number of rows in WHERE clause, then first row is selected first and then it will be sorted. This will not give us the required smallest number.
SELECT LOCATION_ID FROM DEPARTMENTS WHERE ROWNUM
We need to get 1400 as smallest number. Therefore, we need to sort the records first and then we need to select the smallest number by limiting the rows like below. This will give us correct result.
SELECT LOCATION_ID FROM (SELECT LOCATION_ID FROM DEPARTMENTS ORDER BY LOCATION_ID) WHERE ROWNUM <2;
This is how we got smallest number. What should we do to get largest number? We need to sort the column in descending order. Then largest number will come at the top and we can select it by restricting the number of rows to one.
SELECT LOCATION_ID FROM (SELECT LOCATION_ID FROM DEPARTMENTS ORDER BY LOCATION_ID DESC) WHERE ROWNUM <2;
Suppose we need to find minimum and maximum values without using any min, max function or order by clause. Then we have to think of complex logic to get the minimum and maximum values. In this case we need to think little bit more about the logic and query. We need to consider the same table twice and compare each value with rest of them to get max and min values. Let us start step by step to get the values.
As a first step, how to compare the values from same table? We need to have DEPARTMENTS table twice in our query with self-join. But self-join is not equijoin here (values are not compared with equal to, ‘=’). We have to use greater than or lesser than operator to get smaller and larger values. i.e.;
SELECT smaller.LOCATION_ID smaller_loc_id, larger.LOCATION_ID large_loc_id FROM DEPARTMENTS smaller, DEPARTMENTS larger WHERE smaller.LOCATION_ID < larger.LOCATION_ID;
Here each LOCATION_ID is compared against rest of the LOCATION_IDs by using ‘<’ comparison. Hence in the result set we will get all the smaller LOCATION_IDs and corresponding larger location ID combinations like below.
This is not the final. We need to extract the smaller and larger values from this list. But how? What could be the logic to get the values from this list? In above result set, the first row first column will have smallest number. We need to extract it by limiting the number of rows to one. In addition, we do not want combination of smaller and larger values. We need only smaller number. Hence we can eliminate the second column from above list.
SELECT smaller.LOCATION_ID smaller_loc_id FROM DEPARTMENTS smaller, DEPARTMENTS larger WHERE smaller.LOCATION_ID < larger.LOCATION_ID AND ROWNUM <2;
Now how we will get largest number? One method is to extract the last row second column. But we cannot use ORDER BY to sort and get the value. Another method is to use the same logic as above in little different way. In the first column above we will have all the smaller value compared against larger value. The largest value will not have any value to be compared and it will not be listed in the first column. Hence if we extract that number by writing a query on DEPARTMENTS to find which number is not listed in above query, we will get the largest number.
SELECT LOCATION_ID FROM DEPARTMENTS WHERE LOCATION_ID NOT IN (SELECT smaller.LOCATION_ID smaller_loc_id FROM DEPARTMENTS smaller, DEPARTMENTS larger WHERE smaller.LOCATION_ID < larger.LOCATION_ID);
These are the two ways of getting smaller and larger numbers from the table. There will be other methods too depending on your thinking about logic.
In SQL, any column in the table can take some valid value or UNKNOWN value. When the value of the column is UNKNOWN, we represent it by using NULL or some blank column which is not actually blank.
When we have to compare the column values we usually think that either it has some value or it does not have any value. But NULL does not indicate not having any value. It represents that the value in the column is not still known. Hence in above table, when we write query like below it returns nothing.
That means, we cannot compare NULL like above with ‘=’ operator. Since NULL (a UNKNOWN Value) is not equal to another NULL (UNKNOWN value), comparing NULL using ‘equal to’ or ‘not equal to’ or ‘greater than’ or ‘less than’ etc. are meaningless. Hence it does not return any value.
Suppose we want to list all the STATE_PROVINCE which are not NULL.
It does not return any rows that have STATE_PROVICE value. That means, it cannot compare this UNKNOWN value. Therefore, SQL provides special syntax to compare NULL and Non-NULL value using ‘IS NULL’ or IS NOT NULL’. Hence any valid values (which are not NULL) can be compared using ‘=’ or ‘<>’ values. When we have to compare NULL or NOT NULL values (a third value of the column), we have to use third operator ‘IS NULL’ or ‘IS NOT NULL’ operator.
SELECT * FROM LOCATIONS WHERE STATE_PROVINCE IS NULL;
SELECT * FROM LOCATIONS WHERE STATE_PROVINCE IS NOT NULL;
When we have to compare any valid values like whether it is equal or not (TRUE or FALSE – the two common values of the column), we can use any comparison operator.
Note below that, the ‘<>’ operator has not returned any records with NULL STATE_PROVICE. This is because it could not compare the third value ‘NULL’ using ‘<>’ operator.
Hence, any column in the table has three values to be compared – TRUE, FALSE or NULL. Therefore, it is known as three valued logic or ternary logic.
A query result can be sorted either in ascending or descending order. We need to specify whether we need to sort the table or query result in ascending or descending order by specifying ASC or DESC in the query. When ASC or DESC is not specified in the query, the database considers the default sort order. In SQL, default sort order is ascending. Hence if we do not specify the sorting order, then it will automatically consider sorting it in ascending order.
Consider the JOBS table below. When no ORDER BY clause is used, then it gives the result in the order the data are entered.
Suppose we need to sort them in ascending order of JOB_ID. Note below that we have not used any sorting order. In the result, we can see that it is sorted in ascending order of JOB_ID. There is no harm in specifying ASC below query. Then also it will sort in ascending order.
SELECT * FROM JOBS ORDER BY JOB_ID;
In case we need to sort it in descending order, then we need to explicitly mention DESC in the query.
SELECT * FROM JOBS ORDER BY JOB_ID DESC;
We know that DISTINCT keyword is used to get the distinct values out of a query (i.e.; without any duplicate values). The ORDER BY clause is used to sort the result in ascending or descending order.
Now consider a query where we find all distinct departments in the descending order of their ID. That means, we do not want to see any duplicate values in the result set. Also, we need them in descending order. This is straight forward. We need to select DISTINCT DEPARTMENT_IDs with ORDER BY DEPARTMENT_ID DESC.
SELECT DISTINCT DEPARTMENT_ID FROM EMPLOYEES ORDER BY DEPARTMENT_ID DESC;
The results are obvious and there is not any complex logic. Now let us twist the requirement little bit. Before twisting the query, first check how data appears in the EMPLOYEES table. That means each department and employee’s salary are listed in the order they are inserted into the table. They are not in the sorted order. When we say DISTINCT on these two columns, the query pulls the result in the same order as they are inserted. We can see the sample data for the same below.
SELECT DISTINCT DEPARTMENT_ID, SALARY FROM EMPLOYEES;
Now let us find all distinct departments and sort the result in the descending order of the SALARY of the employee. That means we need to list all the distinct departments. But listing the departments should be in such a way that the department of the employee with highest salary should be listed at the beginning, then department second highest salary of employee and so on. Before writing the query let us see whole EMPLOYEES data in descending order of their SALARY.
From above sample data of EMPLOYEES table, what we are expecting in our result set is that DEPARTMENT_ID should be in this order: 90, 80, 20, 100, 110, 30, 70, 60 and so on. Let us write the query now.
SELECT DISTINCT DEPARTMENT_ID FROM EMPLOYEES ORDER BY SALARY DESC;
But the result is:
Error! Why? What is wrong with our query? We have written the query for what is asked. But why we cannot execute this query? We could find the distinct departments when sorted based on DEPARTMENT_ID, but why can’t we sort using SALARY?
To answer this, we need to recall all those points discussed above and observe the query written by us carefully. How this query will be evaluated? Firstly, SELECT statement is executed to get the distinct DEPARTMENT_IDs. That means, it will select distinct DEPARTMENT_IDs in the order they appear or inserted into the table. This selection is not dependent on SALARY. Once the various department list is obtained, then the query is trying to sort the list based on SALARY which is not selected at all. Though we will have SALARY in EMPLOYEES table of the query, it will not again fire query to get the salaries. All the selections are over by the time sorting starts. Hence it does not get SALARY to sort. Therefore, it throws an error: “SALARY is NOT in the SELECT Expression”.
Therefore, if we need to select the DEPARTMENT_IDs based on the descending order of the SALARY, we need to select SALARY too.
SELECT DISTINCT DEPARTMENT_ID, SALARY FROM EMPLOYEES ORDER BY SALARY DESC;
But this will not give us distinct DEPARTMENT_IDs alone. It will give combination of distinct DEPARTMENT_ID and SALARY. Hence we have to modify this query even more.
There is another method to get distinct values by using GROUP BY clause. In this method it first selects all the distinct values from the table to create different groups. Hence if we group the records in EMPLOYEES table based on the DEPARTMENT_ID, we will get all the distinct values in the table. Next task is to sort the DEPARTMENT_IDs based on the descending order of the SALARY. In order to get the salary in descending order, we need to select the departments in such a way that highest salaried employee’s department should be listed first. Highest salary of the employee in the department is found by using MAX function on the same group by on DEPARTMENT_ID. That means each distinct department selected will have highest salary. i.e.;
SELECT DEPARTMENT_ID, MAX (SALARY) AS max_salary FROM EMPLOYEES GROUP BY DEPARTMENT_ID;
Now if we sort this result in descending order of salary we will get all the departments - salary in descending order of salary.
SELECT DEPARTMENT_ID, MAX (SALARY) AS max_salary FROM EMPLOYEES GROUP BY DEPARTMENT_ID ORDER BY max_salary DESC;
From this result we have to select DEPARTMENT_ID alone to get the requested result.
SELECT DEPARTMENT_ID FROM (SELECT DEPARTMENT_ID, MAX (SALARY) AS max_salary FROM EMPLOYEES GROUP BY DEPARTMENT_ID ORDER BY max_salary DESC);
This is how we select DISTINCT values from the table by sorting it on some other column of the table.
GROUP BY is the clause in SQL which is used to group the result based on one or more columns. Let us explain this question using example of EMPLOYEES table. Suppose we need to find the total salary of employees. Here the request is to find the total salary of all employees. No individual salary or no other condition is mentioned. That means we need to find the sum of SALARY column. It is not based on any column or condition. Hence it would be:
SELECT SUM (SALARY) FROM EMPLOYEES;
Here result has given sum of salary of all the employees. It does not give us much information about employee or department or their job ids.
Now, consider we have to see the total salary of employees in each department. This request is little different from above. We need to find total salary based on the departments. That means, first we need to list distinct departments for which employees exist in EMPLOYEES table. Then we need to find the SUM of their salaries. In other words, we have group the table data based on DEPARTMENT_ID and then find the sum of salaries.
SELECT DEPARTMENT_ID, SUM (SALARY) TOTAL_SALARY FROM EMPLOYEES GROUP BY DEPARTMENT_ID;
Here the sum of salary is obtained based one column –DEPARTMENT_ID. Let us change the requirement even more. Consider that we need to find the total salary of employees based on their department and job ids. That means, we need to show the DEPARTMENT_ID and then various JOB_IDs in its department. For each of this combination, we need to find the total salary.
SELECT DEPARTMENT_ID, JOB_ID, SUM (SALARY) TOTAL_SALARY FROM EMPLOYEES GROUP BY DEPARTMENT_ID, JOB_ID;
Here two columns are used to find the total salary. That means more than one column is used to get the aggregate value. This is how and why we use multiple columns in group by clause.
Usually DISTINCT keyword is used to eliminate all the duplicate values from the result set. But there are other techniques also to get the results without any Duplicate values. One of the methods is to use ‘GROUP BY’ clause. The GROUP BY clause is used to group the records based on one or more columns. Usually we use this to get the aggregate results like COUNT, SUM, MAX, MIN and AVG. But this is the common use of GROUP BY clause.
Apart from finding aggregate, if used on SELECT statement, it gives only distinct records. When we use GROUP BY in the query, we need to make sure that the columns that are selected in the SELECT statement should be included in the GROUP BY clause too. Hence if we are going to select all the columns of the table, then all those columns have to be included in group by too. If we are selecting only one or few columns of the table, then only that column/s have to be included in the group by clause.
Consider the EMPLOYEES table, where JOB_ID is the column which contains job ids of different employees.
SELECT JOB_ID FROM EMPLOYEES;
But result has duplicate values. If we write the same query using GROUP BY clause, it first find the distinct values of JOB_IDs from the table and list them. Then it will check what to be done with this grouped value. But we need only first step to get the distinct values from the table
SELECT JOB_ID FROM EMPLOYEES GROUP BY JOB_ID;
Now we do not have any duplicate values in the result.
Finding the highest salary in the table or finding any highest value is simple and can be achieved by writing GROUP BY clause and MAX function. But here we have to find second highest salary.
One way to get this value is by using ‘not equal to’ operator or NOT IN condition. That means, first we need to find the max salary from the EMPLOYEES table and then again we need to select highest salary that is not equal to first highest salary.
SELECT MAX (SALARY) FROM EMPLOYEES WHERE SALARY <> ( SELECT MAX (SALARY) AS max_salary FROM EMPLOYEES);
This is simple and is suitable to find second or third highest salaries. But as the value of N increases, this query gets more complex. In such case we need to use RANK function to get the values. The rank function will rank each data in the order of their values, either ascending or descending order. Hence, if we want to select Nth highest value, we can just select the row with Nth rank.
In above example, let us first give the rank to each record based on the SALARY in descending order.
SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, RANK () OVER (ORDER BY Salary DESC) RANK_SAL FROM EMPLOYEES;
In the result, we can see that the record with highest salary is ranked one; next highest salaried record is ranked two and so on. If we observe the query, rank function is used over the SALARY in such a way that SALARY is sorted in descending order first and then ranking is given to them. That is why we get the ranks in descending order of the salary. Here we have created a new column to have rank – RANK_SAL. Therefore, we can select this new RANK_SAL in the outer most query like below to get any rank record.
SELECT * FROM ( SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, RANK () OVER (ORDER BY Salary DESC) RANK_SAL FROM EMPLOYEES) WHERE RANK_SAL = 5;
It will select 5th highest salaried employee’s record from the table.
We can even use ROW_NUMBER function to get the row number for each record. It also works in the same way as RANK ().
SELECT * FROM ( SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, ROW_NUMBER () OVER (ORDER BY Salary DESC) ROWNUM_SAL FROM EMPLOYEES) WHERE ROWNUM_SAL = 10;
Sub query and derived tables looks similar in the query. But both of them are different. In simple words, like their name itself indicates, sub queries are queries within a query, and derived tables are tables derived from the existing database tables. Let us discuss them in detail to understand the difference between them.
Sub Query : It is a query inside another query. It is usually written using ‘IN’ or ‘NOT IN’ clause in the query. In the query if we need to compare set of value which is rather than hard coding, if derived from same or other table, we use IN or NOT IN condition. Such kind of query is called sub query. Let us illustrate the same using an example of EMPLOYEES and DEPARTMENTS.
Suppose we want to see all the employees who are working in Accounting, Benefits, IT Helpdesk and NOC. Here we need to show all the employee details. Hence we need to query EMPLOYEES table. But in employees table, we do not have any department names; rather we have only department ids. In order to get the DEPARTMENT_ID of the department names above, we have to query DEPARTMENTS table first. All these IDs then should be hard coded within the IN clause of EMPLOYEES query. If we do so, we cannot get the result in a single query. We need to write one query to get all the DEPARTMENT_IDs from DEPARTMENTS table, and one more query to get all the employee details from EMPLOYEES table for those DEPARTMENT_IDs.
If we have to do in single query, we should be able to get all the DEPARTMENT_IDs from DEPARTMENTS table within EMPLOYEES query itself. That means we should be able to merge these two queries into one. In SQL, it is possible by the technique called sub query. i.e.; a query to find all the DEPARTMENT_ID is written within the IN clause of the EMPLOYEES query.
SELECT * FROM EMPLOYEES WHERE DEPARTMENT_ID IN ( SELECT DEPARTMENT_ID FROM DEPARTMENTS WHERE DEPARTMENT_NAME IN ('Accounting', 'Benefits', 'IT Helpdesk’,'NOC'));
When this query is processed, it first evaluates inner query – sub query to get all the DEPARTMENT_IDs and then evaluated outer query to get all employee details.
Since a query is written within another query, it is called sub query.
Derived Table : These are also queries used in another query, but they are used as tables in the query. In sub queries, the queries are used for getting some values from them, but in derived tables, queries are written on existing database tables and are used as some temporary table for that particular query.
Suppose display department details and its employee details like his first name, last name, id and salary. This is the simple requirement and can be written using join as below
Here simple join is used between EMPLOYEES and DEPARTMENTS table. Same query can be re-written using derived tables. In order to get derived table, we can query EMPLOYEES table to get only his ID, department id, first name, last name and salary. Then it can be considered as another table and can be joined with DEPARTMENTS table to get its department details.
SELECT dept.DEPARTMENT_ID, dept.DEPARTMENT_NAME, emp.EMPLOYEE_ID, emp.FIRST_NAME, emp.LAST_NAME, emp.SALARY FROM DEPARTMENTS dept, (SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, DEPARTMENT_ID FROM EMPLOYEES) emp -- Derived table WHERE dept.DEPARTMENT_ID = emp.DEPARTMENT_ID;
Here we can notice that we have written a query in the FORM clause. Hence this query acts as another table. Even though this uses the database table/s, query as a table is temporary. That means, this derived table exists till the existence of the query.
We can have one or more tables in the derived table query. This will reduce the execution of complex query multiple times. That means, if we write a complex query by using normal join, it may sometimes take lots of time. If we make complex query part as derived table, then it will be executed once to get all the records for that query part and act as records of temporary table. It will be then joined with other tables to get the result.
Sub queries are of two types – Uncorrelated Sub query and correlated sub query. An uncorrelated sub query, as the name itself suggests, not related to any other query or the outer query. These types of sub queries can be executed independent of outer query. In other words, there will be two or more queries, which are independent of each other, but when executed as sub query it gives meaningful result.
Consider the query to get all the employees who works for department ‘%Account%’. This query needs to fetch all the department ids for ‘%Account%’ and then fire the query on EMPLOYEES table to get their details. That means even though the query to get the department id is related or required to execute query requested, the department id fetch query is independent of employees query.
SELECT * FROM EMPLOYEES WHERE DEPARTMENT_ID IN (SELECT DEPARTMENT_ID FROM DEPARTMENTS WHERE DEPARTMENT_NAME LIKE '%Account%'
Here even though we have sub queries, both of them are independent of each other. There is no dependency on inner query on outer query. When we execute above query as a whole, the compiler will first execute the inner query to get all the department ids with name ‘%Account%’, and then for the listed department id, it will execute the query on EMPLOYEES table to get their details. Hence the cost of execution is to execute the inner query plus outer query.