6. In SQL, how and when would you do a group by with multiple columns? Also provide an example.
GROUP BY is the clause in SQL which is used to group the result based on one or more columns. Let us explain this question using example of EMPLOYEES table. Suppose we need to find the total salary of employees. Here the request is to find the total salary of all employees. No individual salary or no other condition is mentioned. That means we need to find the sum of SALARY column. It is not based on any column or condition. Hence it would be:
SELECT SUM (SALARY) FROM EMPLOYEES;
Here result has given sum of salary of all the employees. It does not give us much information about employee or department or their job ids.
Now, consider we have to see the total salary of employees in each department. This request is little different from above. We need to find total salary based on the departments. That means, first we need to list distinct departments for which employees exist in EMPLOYEES table. Then we need to find the SUM of their salaries. In other words, we have group the table data based on DEPARTMENT_ID and then find the sum of salaries.
SELECT DEPARTMENT_ID, SUM (SALARY) TOTAL_SALARY FROM EMPLOYEES GROUP BY DEPARTMENT_ID;
Here the sum of salary is obtained based one column –DEPARTMENT_ID. Let us change the requirement even more. Consider that we need to find the total salary of employees based on their department and job ids. That means, we need to show the DEPARTMENT_ID and then various JOB_IDs in its department. For each of this combination, we need to find the total salary.
SELECT DEPARTMENT_ID, JOB_ID, SUM (SALARY) TOTAL_SALARY FROM EMPLOYEES GROUP BY DEPARTMENT_ID, JOB_ID;
Here two columns are used to find the total salary. That means more than one column is used to get the aggregate value. This is how and why we use multiple columns in group by clause.
7. In the table below, how would you retrieve the unique values for the EMPLOYEE_LOCATION without using the DISTINCT keyword ?
Usually DISTINCT keyword is used to eliminate all the duplicate values from the result set. But there are other techniques also to get the results without any Duplicate values. One of the methods is to use ‘GROUP BY’ clause. The GROUP BY clause is used to group the records based on one or more columns. Usually we use this to get the aggregate results like COUNT, SUM, MAX, MIN and AVG. But this is the common use of GROUP BY clause.
Apart from finding aggregate, if used on SELECT statement, it gives only distinct records. When we use GROUP BY in the query, we need to make sure that the columns that are selected in the SELECT statement should be included in the GROUP BY clause too. Hence if we are going to select all the columns of the table, then all those columns have to be included in group by too. If we are selecting only one or few columns of the table, then only that column/s have to be included in the group by clause.
Consider the EMPLOYEES table, where JOB_ID is the column which contains job ids of different employees.
SELECT JOB_ID FROM EMPLOYEES;
But result has duplicate values. If we write the same query using GROUP BY clause, it first find the distinct values of JOB_IDs from the table and list them. Then it will check what to be done with this grouped value. But we need only first step to get the distinct values from the table
SELECT JOB_ID FROM EMPLOYEES GROUP BY JOB_ID;
Now we do not have any duplicate values in the result.
8. Write a SQL query to get the second highest salary from EMPLOYEES table. Also write a query to find the nth highest salary in SQL, where n can be any number.
Finding the highest salary in the table or finding any highest value is simple and can be achieved by writing GROUP BY clause and MAX function. But here we have to find second highest salary.
One way to get this value is by using ‘not equal to’ operator or NOT IN condition. That means, first we need to find the max salary from the EMPLOYEES table and then again we need to select highest salary that is not equal to first highest salary.
SELECT MAX (SALARY) FROM EMPLOYEES WHERE SALARY <> ( SELECT MAX (SALARY) AS max_salary FROM EMPLOYEES);
This is simple and is suitable to find second or third highest salaries. But as the value of N increases, this query gets more complex. In such case we need to use RANK function to get the values. The rank function will rank each data in the order of their values, either ascending or descending order. Hence, if we want to select Nth highest value, we can just select the row with Nth rank.
In above example, let us first give the rank to each record based on the SALARY in descending order.
SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, RANK () OVER (ORDER BY Salary DESC) RANK_SAL FROM EMPLOYEES;
In the result, we can see that the record with highest salary is ranked one; next highest salaried record is ranked two and so on. If we observe the query, rank function is used over the SALARY in such a way that SALARY is sorted in descending order first and then ranking is given to them. That is why we get the ranks in descending order of the salary. Here we have created a new column to have rank – RANK_SAL. Therefore, we can select this new RANK_SAL in the outer most query like below to get any rank record.
SELECT * FROM ( SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, RANK () OVER (ORDER BY Salary DESC) RANK_SAL FROM EMPLOYEES) WHERE RANK_SAL = 5;
It will select 5th highest salaried employee’s record from the table.
We can even use ROW_NUMBER function to get the row number for each record. It also works in the same way as RANK ().
SELECT * FROM ( SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, ROW_NUMBER () OVER (ORDER BY Salary DESC) ROWNUM_SAL FROM EMPLOYEES) WHERE ROWNUM_SAL = 10;
9. What is the difference between a derived table and a subquery? Explain it with an example.
Sub query and derived tables looks similar in the query. But both of them are different. In simple words, like their name itself indicates, sub queries are queries within a query, and derived tables are tables derived from the existing database tables. Let us discuss them in detail to understand the difference between them.
Sub Query : It is a query inside another query. It is usually written using ‘IN’ or ‘NOT IN’ clause in the query. In the query if we need to compare set of value which is rather than hard coding, if derived from same or other table, we use IN or NOT IN condition. Such kind of query is called sub query. Let us illustrate the same using an example of EMPLOYEES and DEPARTMENTS.
Suppose we want to see all the employees who are working in Accounting, Benefits, IT Helpdesk and NOC. Here we need to show all the employee details. Hence we need to query EMPLOYEES table. But in employees table, we do not have any department names; rather we have only department ids. In order to get the DEPARTMENT_ID of the department names above, we have to query DEPARTMENTS table first. All these IDs then should be hard coded within the IN clause of EMPLOYEES query. If we do so, we cannot get the result in a single query. We need to write one query to get all the DEPARTMENT_IDs from DEPARTMENTS table, and one more query to get all the employee details from EMPLOYEES table for those DEPARTMENT_IDs.
If we have to do in single query, we should be able to get all the DEPARTMENT_IDs from DEPARTMENTS table within EMPLOYEES query itself. That means we should be able to merge these two queries into one. In SQL, it is possible by the technique called sub query. i.e.; a query to find all the DEPARTMENT_ID is written within the IN clause of the EMPLOYEES query.
SELECT * FROM EMPLOYEES WHERE DEPARTMENT_ID IN ( SELECT DEPARTMENT_ID FROM DEPARTMENTS WHERE DEPARTMENT_NAME IN ('Accounting', 'Benefits', 'IT Helpdesk’,'NOC'));
When this query is processed, it first evaluates inner query – sub query to get all the DEPARTMENT_IDs and then evaluated outer query to get all employee details.
Since a query is written within another query, it is called sub query.
Derived Table : These are also queries used in another query, but they are used as tables in the query. In sub queries, the queries are used for getting some values from them, but in derived tables, queries are written on existing database tables and are used as some temporary table for that particular query.
Suppose display department details and its employee details like his first name, last name, id and salary. This is the simple requirement and can be written using join as below
Here simple join is used between EMPLOYEES and DEPARTMENTS table. Same query can be re-written using derived tables. In order to get derived table, we can query EMPLOYEES table to get only his ID, department id, first name, last name and salary. Then it can be considered as another table and can be joined with DEPARTMENTS table to get its department details.
SELECT dept.DEPARTMENT_ID, dept.DEPARTMENT_NAME, emp.EMPLOYEE_ID, emp.FIRST_NAME, emp.LAST_NAME, emp.SALARY FROM DEPARTMENTS dept, (SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, DEPARTMENT_ID FROM EMPLOYEES) emp -- Derived table WHERE dept.DEPARTMENT_ID = emp.DEPARTMENT_ID;
Here we can notice that we have written a query in the FORM clause. Hence this query acts as another table. Even though this uses the database table/s, query as a table is temporary. That means, this derived table exists till the existence of the query.
We can have one or more tables in the derived table query. This will reduce the execution of complex query multiple times. That means, if we write a complex query by using normal join, it may sometimes take lots of time. If we make complex query part as derived table, then it will be executed once to get all the records for that query part and act as records of temporary table. It will be then joined with other tables to get the result.
10. What is the difference between a correlated subquery and non correlated subquery? Explain it with an example.
Sub queries are of two types – Uncorrelated Sub query and correlated sub query. An uncorrelated sub query, as the name itself suggests, not related to any other query or the outer query. These types of sub queries can be executed independent of outer query. In other words, there will be two or more queries, which are independent of each other, but when executed as sub query it gives meaningful result.
Consider the query to get all the employees who works for department ‘%Account%’. This query needs to fetch all the department ids for ‘%Account%’ and then fire the query on EMPLOYEES table to get their details. That means even though the query to get the department id is related or required to execute query requested, the department id fetch query is independent of employees query.
SELECT * FROM EMPLOYEES WHERE DEPARTMENT_ID IN (SELECT DEPARTMENT_ID FROM DEPARTMENTS WHERE DEPARTMENT_NAME LIKE '%Account%'
Here even though we have sub queries, both of them are independent of each other. There is no dependency on inner query on outer query. When we execute above query as a whole, the compiler will first execute the inner query to get all the department ids with name ‘%Account%’, and then for the listed department id, it will execute the query on EMPLOYEES table to get their details. Hence the cost of execution is to execute the inner query plus outer query.
A Correlated sub query is the one where inner query cannot be executed independently. Inner query will always be dependent on the outer query results/values. We can modify above example using corrleated sub query like below:
SELECT * FROM EMPLOYEES emp WHERE EXISTS (SELECT 1 FROM DEPARTMENTS dept WHERE dept.DEPARTMENT_ID = emp.DEPARTMENT_ID AND DEPARTMENT_NAME LIKE '%Account%');
Here we can notice that in the inner query we have a condition ‘dept.DEPARTMENT_ID = emp.DEPARTMENT_ID ‘, which indicates, for each department id in EMPLOYEES table find the match in the DEPARTMENTS table. It also indicates that we cannot execute inner query independently – it is always dependent on the value of outer query department id. The number of department ids to be compared is restricted by another condition within inner query – DEPARTMENT_NAME LIKE ‘%Account%‘. Here when query is executed, it first executes the outer query – EMPLOYEES table, then for each record from the outer query, it fires query on the inner query to find the match. That means, if the outer query returns 100 records, then inner query is executed 100 times for each record of the outer query. Hence cost of execution of above query would be cost to execute outer query once plus (cost to execute inner query * number of records returned by outer query).