BISQL – Laymen to SQL Developer # 19 – Record Storage and Primary File Organization #4 – Hashing Techniques
Hi Folks,
This post is part of the Database Management Systems series.
The currently running topic for this series is listed below:
Series of Database Management Systems
>>Chapter 1 : DBMS [Database Management Systems]
>>Chapter 2 : Database Core Concepts and Applications
>>Chapter 3 : Record Storage and Primary File Organization <You are Here>
Continuing from my previous post in this series. If you have missed any part, please visit the link below.
We are going to cover the following points in this article:
- Hashing Techniques
Hashing Techniques
One disadvantage of sequential file organization is that we must use linear or binary search to locate the desired record, which results in more I/O operations and a number of unnecessary comparisons. In the hashing technique, or direct file organization, the key value is converted into an address by performing some arithmetic manipulation on it, which provides very fast access to records.
Let us consider a hash function h that maps the key value k to the value h(k). The value h(k) is used as an address.
The basic terms associated with the hashing techniques are:
1) Hash table: an array that holds the addresses of records.
2) Hash function: the transformation of a key into the corresponding location or address in the hash table (it can be defined as a function that takes a key as input and transforms it into a hash table index).
3) Hash key: the value into which a record's key hashes; if R is a record, its key hashes into a value called the hash key.
Internal Hashing
For internal files, the hash table is an array of records with indices ranging from 0 to M-1. Let us consider a hash function H(K) = K mod M, which produces a remainder between 0 and M-1 depending on the value of the key. This value is then used as the record address. The problem with most hash functions is that they do not guarantee that distinct key values will hash to distinct addresses; a collision occurs when two non-identical keys are hashed to the same location.
For example, let us assume that there are two non-identical keys k1 = 342 and k2 = 352, and that we have some mechanism to convert key values into addresses. Then a simple hash function is:
h(k) = k mod 10
Here h(k) produces a bucket address.
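As a quick illustration (a minimal Python sketch, not from the original post), the two keys above hash to the same bucket address:

```python
def h(k, m=10):
    """Division-method hash: map key k to a bucket address in 0..m-1."""
    return k % m

k1, k2 = 342, 352
print(h(k1))   # 2
print(h(k2))   # 2 -- same address, so inserting k2 after k1 collides
```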
To insert a record with key value k, we must compute its hash first. For example, h(k1) = k1 mod 10 gives 2 as the hash value, so the record with key value 342 is placed at location 2. The record with 352 as its key value produces the same hash address, i.e. h(k1) = h(k2). When we try to place this record at the location where the record with key k1 is already stored, a collision occurs. The process of finding another position is called collision resolution. There are numerous methods for collision resolution:
1) Open addressing: we resolve the hash clash by inserting the record in the next available free (empty) location in the table.
2) Chaining: various overflow locations are kept; a pointer field is added to each record, and the pointer is set to the address of the overflow location. A sketch of both strategies follows this list.
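Here is a minimal Python sketch of both strategies (the table size and probe order are assumptions for illustration, not part of the original post):

```python
M = 10                                   # table size (assumed for the demo)

def h(k):
    return k % M

# 1) Open addressing with linear probing: on a clash, scan forward
#    for the next free slot, wrapping around the end of the table.
table = [None] * M

def insert_open(key):
    for i in range(M):                   # probe at most M slots
        slot = (h(key) + i) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

# 2) Chaining: each slot heads a list of all keys that hash to it.
chains = [[] for _ in range(M)]

def insert_chained(key):
    chains[h(key)].append(key)

insert_open(342)
insert_open(352)                         # clashes at slot 2, lands in slot 3
insert_chained(342)
insert_chained(352)                      # both end up in chain 2
print(table[2], table[3])                # 342 352
print(chains[2])                         # [342, 352]
```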
External Hashing for Disk Files
Hashing for disk files is called external hashing. Disk storage is divided into buckets, each of which holds multiple records. A bucket is either one disk block or a cluster of contiguous blocks.
The hash function maps a key to a relative bucket number. A table maintained in the file header converts each bucket number into the corresponding disk block address.
Figure: Matching bucket numbers to disk block addresses
Figure: Handling overflow for buckets by chaining
The collision problem is less severe with buckets, because many records fit in the same bucket. A collision now occurs only when a bucket is filled to capacity and we try to insert a new record into it. To handle this, we can maintain a pointer in each bucket to the address of overflow records.
The hashing scheme described so far is called static hashing, because a fixed number of buckets M is allocated. This can be a serious drawback for dynamic files. If M is the number of buckets and m is the maximum number of records that fit in one bucket, then at most m*M records will fit in the allocated space. If the number of records grows substantially beyond m*M, collisions will occur and retrieval will be slowed down.
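A minimal Python sketch of static external hashing with overflow chaining (the bucket count M and capacity m are assumptions; a real system would translate bucket numbers into disk block addresses via the file-header table described above):

```python
M = 4                                    # fixed number of buckets (static)
m = 2                                    # records that fit in one bucket/block

# Each bucket has a fixed-size record area plus an overflow chain, standing
# in for the per-bucket pointer to overflow records described above.
buckets = [{"records": [], "overflow": []} for _ in range(M)]

def insert(key, record):
    b = buckets[key % M]                 # key -> relative bucket number
    if len(b["records"]) < m:
        b["records"].append((key, record))
    else:                                # bucket full: chain into overflow
        b["overflow"].append((key, record))

def lookup(key):
    b = buckets[key % M]
    for k, rec in b["records"] + b["overflow"]:
        if k == key:
            return rec
    return None

for key in (3, 7, 11, 15):               # all four keys hash to bucket 3
    insert(key, "rec%d" % key)
print(buckets[3]["records"])              # first m = 2 records fit in the block
print(buckets[3]["overflow"])             # the rest go to the overflow chain
print(lookup(11))                         # rec11
```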
Dynamic Hashing Technique
A major drawback of static hashing is that the address space is fixed. Hence it is difficult to expand or shrink the file dynamically.
In dynamic hashing, the access structure is built on the binary representation of the hash value. Here the number of buckets is not fixed [as in regular hashing] but grows or shrinks as needed. The file can start with a single bucket; once that bucket is full and a new record is inserted, the bucket overflows and is split into two buckets. The records are distributed between the two buckets based on the value of the first [leftmost] bit of their hash values: records whose hash values start with a 0 bit are stored in one bucket, and those whose hash values start with a 1 bit are stored in the other. At this point, a binary tree structure called a directory is built. The directory has two types of nodes:
1. Internal nodes: guide the search; each has a left pointer corresponding to a 0 bit and a right pointer corresponding to a 1 bit.
2. Leaf nodes: each holds a pointer to a bucket, i.e., a bucket address.
If a bucket overflows, for example when a new record is inserted into the bucket for records whose hash values start with 10 and causes an overflow, the bucket is split: all records whose hash value starts with 100 are placed in the first split bucket, and the second bucket contains those whose hash value starts with 101. The levels of the binary tree can be expanded dynamically. A minimal sketch of such a trie-shaped directory follows.
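This Python sketch is illustrative only: the 8-bit hash width and the bucket capacity of 2 are assumptions, and the hash values are borrowed from the account-file example later in this post.

```python
NBITS = 8                                # assumed width of the hash value
CAPACITY = 2                             # assumed records per bucket

class Leaf:
    def __init__(self):
        self.records = []                # (hash value, record) pairs

class Internal:
    def __init__(self, left, right):
        self.left, self.right = left, right    # 0-bit / 1-bit subtrees

def bit(h, depth):
    """Bit of the hash value at position `depth`, leftmost bit first."""
    return (h >> (NBITS - 1 - depth)) & 1

def insert(node, h, record, depth=0):
    """Insert into the trie rooted at `node`; return the (possibly new) root."""
    if isinstance(node, Internal):       # internal node: follow the next bit
        if bit(h, depth):
            node.right = insert(node.right, h, record, depth + 1)
        else:
            node.left = insert(node.left, h, record, depth + 1)
        return node
    node.records.append((h, record))     # leaf node: store the record here
    if len(node.records) <= CAPACITY:
        return node
    # Bucket overflow: replace the leaf with an internal node and
    # redistribute its records on the next hash bit (splitting again
    # if they all share that bit).
    new_node = Internal(Leaf(), Leaf())
    for rec_h, rec in node.records:
        insert(new_node, rec_h, rec, depth)
    return new_node

root = Leaf()
for h, name in [(0b00101101, "Bangalore"), (0b10100011, "Mysore"),
                (0b11000001, "Mangalore"), (0b11110001, "Hassan")]:
    root = insert(root, h, name)

print(root.left.records)          # hash values starting with 0 (Bangalore)
print(root.right.left.records)    # starting with 10 (Mysore)
print(root.right.right.records)   # starting with 11 (Mangalore, Hassan)
```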
Extendible Hashing: In extendible hashing the stored file has a directory [also called an index table or hash table] associated with it. The directory consists of a header containing a value d, called the global depth of the table, and a list of 2^d pointers [pointers to data blocks]. Here d is the number of leftmost bits currently being used to address the index table: the leftmost d bits of a key's hash value, interpreted as a number, give the directory entry for the bucket in which the desired records are stored.
Each bucket also has a header giving the local depth d1 of that bucket, which specifies the number of bits on which the bucket's contents are based. Suppose d = 3 and the first pointer in the table [the 000 pointer] points to a bucket whose local depth d1 is 2. A local depth of 2 means that, in this case, the bucket contains all the records whose hash values start with 000 or 001 [because the first two bits are 00].
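The directory lookup itself is just "interpret the leftmost d bits of the hash value as an index into the 2^d-entry table". A small Python sketch (the 8-bit hash width is an assumption matching the hash-value table later in this post):

```python
HASH_BITS = 8                       # assumed width of the hash value

def directory_index(hash_value, d):
    """Leftmost d bits of hash_value, used to index the 2^d-entry directory."""
    return hash_value >> (HASH_BITS - d)

h_bangalore = 0b00101101            # H(Bangalore) from the table below
print(directory_index(h_bangalore, 3))   # 0b001 = 1, i.e. the '001' entry
```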
To insert a record with search-key value k, we hash k and follow the directory to the target bucket. If there is room in the bucket, we insert the record there. If the bucket is full, we must split it and redistribute the current records plus the new one.
For example, to illustrate the operation of insertion, consider the following account file. We assume that a bucket can hold only two records.
Branch-name | Account
Bangalore   | 100
Mysore      | 200
Mysore      | 300
Mangalore   | 400
Hassan      | 500
Hassan      | 600
Hassan      | 700
Sample account file
Hash function for branch name
Branch-name | H(branch-name)
Bangalore   | 0010 1101
Mysore      | 1010 0011
Mangalore   | 1100 0001
Hassan      | 1111 0001
Let us insert the record (Bangalore, 100). The hash table (address table) contains a single pointer to one bucket, and the record is inserted there. The second record, (Mysore, 200), is also placed in the same bucket (the bucket size is 2).
When we attempt to insert the next record, (Mysore, 300), the bucket is full. We need to increase the number of bits that we use from the hash value, i.e., d = 1, giving 2^1 = 2 buckets. This increases the number of entries in the hash address table: it now contains two entries pointing to two buckets. The first bucket contains the records whose search key has a hash value beginning with 0, and the second bucket contains those whose hash value begins with 1. The local depth of each bucket is now 1.
Next we insert (Mangalore, 400). Since the first bit of H(Mangalore) is 1, the new record should be placed in the second bucket, but we find that the bucket is full. We increase the number of bits that we use from the hash value to 2 (d = 2), which increases the number of entries in the hash table to 2^2 = 4. The records are redistributed between two buckets: the Mysore records (hash prefix 10) stay together, and (Mangalore, 400) goes into the bucket for prefix 11. Since the bucket with prefix 0 was not split, hash prefixes 00 and 01 both point to it.
Figure: Extendible hash structure
Next, the record (Hassan, 500) is inserted into the same bucket as Mangalore. The next insertion, (Hassan, 600), results in a bucket overflow, which causes an increase in the number of bits (the global depth d increases by 1, i.e., d = 3) and thus increases the number of hash table entries (the table now has 2^3 = 8 entries).
Figure: Extendible hash structure
The records of the overflowing bucket are distributed between two buckets: the first contains all records whose hash values start with 110 (Mangalore), and the second all those whose hash values start with 111 (the Hassan records).
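Putting the whole walk-through together, here is a compact extendible-hashing sketch in Python (illustrative only: the bucket capacity of 2 and the 8-bit hash values from the table above are assumptions; a third record with an identical hash value would additionally need an overflow chain, which this sketch omits):

```python
HASH_BITS = 8
BUCKET_CAPACITY = 2

# 8-bit hash values from the table above.
H = {"Bangalore": 0b00101101, "Mysore": 0b10100011,
     "Mangalore": 0b11000001, "Hassan": 0b11110001}

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.records = []                # (branch, account) pairs

d = 0                                    # global depth
directory = [Bucket(0)]                  # 2^0 = 1 directory entry

def prefix(h, bits):
    """Leftmost `bits` bits of the hash value h."""
    return h >> (HASH_BITS - bits)

def insert(branch, account):
    global d, directory
    while True:
        bucket = directory[prefix(H[branch], d)]
        if len(bucket.records) < BUCKET_CAPACITY:
            bucket.records.append((branch, account))
            return
        # Overflow: if the full bucket is already at maximum depth,
        # double the directory; then split the bucket on one more bit.
        if bucket.local_depth == d:
            directory = [directory[i // 2] for i in range(2 * len(directory))]
            d += 1
        b0, b1 = Bucket(bucket.local_depth + 1), Bucket(bucket.local_depth + 1)
        for rec in bucket.records:       # redistribute on the next hash bit
            target = b1 if prefix(H[rec[0]], b0.local_depth) & 1 else b0
            target.records.append(rec)
        for i in range(len(directory)):  # repoint entries of the old bucket
            if directory[i] is bucket:
                directory[i] = b1 if (i >> (d - b0.local_depth)) & 1 else b0

for branch, account in [("Bangalore", 100), ("Mysore", 200), ("Mysore", 300),
                        ("Mangalore", 400), ("Hassan", 500), ("Hassan", 600)]:
    insert(branch, account)

print("global depth:", d)                # 3, as in the walk-through above
for i, b in enumerate(directory):
    print(format(i, "03b"), "->", "depth", b.local_depth, b.records)
```

Running it reproduces the final state from the figures: four directory entries share the depth-1 Bangalore bucket, two share the Mysore bucket, and the 110 and 111 entries point to the split Mangalore and Hassan buckets.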
Advantages of dynamic hashing:
1. A split causes only a minor reorganization, since only the records in one bucket are redistributed to the two new buckets.
2. The space overhead of the directory table is negligible.
3. The main advantage of extendible hashing is that performance does not degrade as the file grows. In addition, no buckets need to be reserved for future growth; buckets can be allocated dynamically.
Disadvantages:
1. The index table grows rapidly and can become too large to fit in main memory. When part of the index table is stored on secondary storage, it requires an extra access.
2. The directory must be searched before the bucket itself is accessed, resulting in two block accesses instead of one as in static hashing.
3. Extendible hashing thus involves an additional level of indirection.
Hope you will like the Database Management Systems series!
If you have not yet subscribed to this blog, please subscribe from the "follow me" tab, so that you are updated in real time and receive all the latest knowledge in your mail daily, for free, without any RSS subscription or news reader!
Happy Learning and Sharing !!