Hash/Direct File Organization in DBMS

Hash/Direct File Organization

In this method of file organization, a hash function is used to calculate the address of the block in which a record is stored. The hash function can be any mathematical function, simple or complex. It is applied to one or more columns/attributes, which may be key or non-key columns, to get the block address. Records are therefore placed wherever the hash function sends them, regardless of the order in which they arrive, which is why this method is also known as Direct or Random file organization. If the hash function is applied to a key column, that column is called the hash key; if it is applied to a non-key column, the column is called the hash column.
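
As a rough illustration of the address calculation described above, here is a minimal Python sketch. The block count of 8, the function name block_address, and the choice of modulo and CRC32 as hash functions are assumptions made for this example; a real DBMS may use a different function.

```python
import zlib

NUM_BLOCKS = 8  # assumed number of blocks in the file (illustrative only)


def block_address(hash_key, num_blocks=NUM_BLOCKS):
    """Map a hash key value to a block number in the range 0..num_blocks-1."""
    if isinstance(hash_key, int):
        # Simple hash function for numeric keys: remainder of division.
        return hash_key % num_blocks
    # For other values, hash the text form with CRC32 so the address is the
    # same on every run (Python's built-in hash() of a str is salted per process).
    return zlib.crc32(str(hash_key).encode("utf-8")) % num_blocks


print(block_address(104))  # 104 % 8 -> 0
print(block_address(105))  # 105 % 8 -> 1
print(block_address("Antony") == block_address("Antony"))  # True: same value, same address
```

Consecutive keys such as 104 and 105 land in different blocks, which is why the stored order has nothing to do with the arrival order or the key order.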

When a record has to be retrieved, the address is generated from the hash key column and the whole record is read directly from that address. There is no need to traverse the whole file. Similarly, when a new record has to be inserted, the address is generated from the hash key and the record is written directly to that block. The same applies to update and delete. Neither a search of the entire file nor any sorting is required. The records end up scattered across the storage rather than arranged in key order.
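
The retrieve, insert, update and delete steps can be sketched as a toy in-memory model. The class name HashFile, the use of a Python dict to stand in for each block, and the modulo hash on an integer key are all assumptions for illustration; this is not a real storage engine.

```python
class HashFile:
    """Toy in-memory model of a hash-organized file (hypothetical, for illustration)."""

    def __init__(self, num_blocks=4):
        self.num_blocks = num_blocks
        # Each block is modelled as a dict from hash key to record.
        self.blocks = [{} for _ in range(num_blocks)]

    def _address(self, key):
        # Assumed hash function: integer key modulo the number of blocks.
        return key % self.num_blocks

    def insert(self, key, record):
        # Go straight to the computed block; no other block is touched.
        self.blocks[self._address(key)][key] = record

    def get(self, key):
        # Only the one block named by the hash function is read.
        return self.blocks[self._address(key)].get(key)

    def update(self, key, **changes):
        record = self.get(key)
        if record is None:
            return False
        record.update(changes)
        return True

    def delete(self, key):
        block = self.blocks[self._address(key)]
        if key in block:
            del block[key]
            return True
        return False


students = HashFile()
students.insert(101, {"STD_NAME": "Antony"})
students.insert(106, {"STD_NAME": "Bella"})
students.update(106, STD_NAME="Bella K")
print(students.get(106))  # {'STD_NAME': 'Bella K'}
students.delete(101)
print(students.get(101))  # None
```

Every operation starts by computing the address from the key, so none of them needs to scan or sort the file.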

This type of file organization is useful in online transaction systems, where retrieval, insertion and update need to be fast.

Advantages of Hash File Organization

  • Records need not be sorted after any transaction, so the effort of sorting is avoided in this method.
  • Since the block address is given by the hash function, accessing any record is very fast. Updating or deleting a record is equally quick.
  • This method can handle multiple transactions because each record is independent of the others, i.e., since the storage location of one record does not depend on any other record, multiple records can be accessed at the same time.
  • It is suitable for online transaction systems such as online banking and ticket booking systems.

Disadvantages of Hash File Organization

  • This method may accidentally lose data if records that hash to the same address are not handled. For example, in a Student table where the hash field is the STD_NAME column, two students with the same name, ‘Antony’, generate the same address. In that case the older record would be overwritten by the newer one, and data would be lost. Hash columns therefore need to be selected with utmost care, and a proper backup and recovery mechanism has to be in place.
  • Since the records are stored at addresses chosen by the hash function, they are scattered across the storage. Hence space is not used efficiently.
  • This method is not suitable for searching a range of data. Because each record is stored at an effectively random address, a range of values does not map to a range of addresses, and the search is inefficient. For example, searching for employees with a salary from 20K to 30K will be inefficient.
  • Only searches for an exact name or value are efficient. A search for students whose name starts with ‘B’ is not efficient, because it does not supply the exact name from which the address is calculated.
  • If a search is made on a column that is not the hash column, the search will not be efficient. This method is efficient only when the search is done on the hash column; otherwise the address of the data cannot be calculated.
  • If the address is generated from multiple hash columns, say the name and phone number of a person, then searching for a record by phone or name alone will not find the correct address.
  • If the hash columns are updated frequently, the block address of the data changes with them. Each update generates a new address and the record has to be moved, which is undesirable.
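
Two of the disadvantages above, the overwrite on a duplicate hash value and the inefficient non-exact search, can be sketched in Python. The toy hash (sum of byte values modulo 4), the sample rows, and the helper names are assumptions for this example. The chained layout, where each address keeps a list of records, is a common remedy that the text above does not describe; it is shown only for contrast.

```python
NUM_BLOCKS = 4  # assumed block count (illustrative only)


def address(value):
    # Deterministic toy hash: sum of the UTF-8 byte values modulo the block count.
    return sum(str(value).encode("utf-8")) % NUM_BLOCKS


students = [
    {"STD_ID": 1, "STD_NAME": "Antony"},
    {"STD_ID": 2, "STD_NAME": "Antony"},  # same name, different student
    {"STD_ID": 3, "STD_NAME": "Bella"},
]

naive = {}    # one record slot per address: a newer record overwrites the older one
chained = {}  # each address keeps a list, so records with equal hash values coexist
for row in students:
    addr = address(row["STD_NAME"])
    naive[addr] = row
    chained.setdefault(addr, []).append(row)


def ids_named(name):
    """Exact-match search: compute the address and read only that one block."""
    return [r["STD_ID"] for r in chained.get(address(name), [])
            if r["STD_NAME"] == name]


def ids_with_prefix(prefix):
    """Prefix search: no address can be computed from a partial value,
    so every block has to be scanned."""
    return [r["STD_ID"] for block in chained.values() for r in block
            if r["STD_NAME"].startswith(prefix)]


print(len(naive))                             # 2: one 'Antony' record was lost
print(sum(len(b) for b in chained.values()))  # 3: all records kept
print(ids_named("Antony"))                    # [1, 2]
print(ids_with_prefix("B"))                   # [3]
```

A range search such as a salary between 20K and 30K behaves like ids_with_prefix: the hash function gives no starting address, so the whole file is read.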

The hardware and software required for memory management are costlier in this case, and complex programs need to be written to make this method efficient.
