C++ Tutorial: Intro to Hash Tables

June 21st, 2014

In this tutorial, we will be taking an introductory look at hash tables in C++. Hash tables are useful for storing large amounts of data in an efficient manner. They are also an essential aspect of most undergraduate computer science courses, so study up! Near the bottom of this post, you can find everything you need to create a C++ program that implements a hash table. (You can also download the source files you need from the Github link here).The example program will create a simple table for storing words.

Note: This tutorial assumes that you are familiar with arrays, pointers, modular arithmetic, algorithm efficiency, and linked lists especially. To learn more about linked lists, please see the previous tutorial on linked lists. With that in mind, let’s begin!

Contents:

  1. What is a Hash Table?
  2. What is a Hash Function?
  3. What are Hash Collisions?
  4. Designing the Hash Table
  5. Coding the Hash Table
    • LinkedList.h
    • LinkedList.cpp
    • HashTable.h
    • HashTable.cpp
    • main.cpp
  6. References

What is a Hash Table?

A hash table is a data structure for storing key-value pairs. Unlike a basic array, which uses index numbers for accessing elements, a hash table uses keys to look up table entries. This makes data management more manageable for the human user since it’s easier to catalog data entries by their attributes rather than their count in a giant list.

Hash_Tab_2

In C++, we implement a hash table as an array of linked lists. It’s sort of like a multidimensional array. In a two-dimensional array, for instance, the elements consist of rows of a fixed length. In a hash table, however, the elements (a.k.a. buckets) can expand or shrink to accommodate a virtually infinite number of table entries.

In terms of efficiency, a hash table is a compromise between an array and a linked list. It uses both indexing and list traversal to store and retrieve data elements.

Looking up elements by index makes an array very efficient. No matter where an item is stored in the array, it always takes the same amount of time to retrieve it. In technical terms, getting an item from an array is an O(1) or “constant time” operation.

Looking up elements in a linked list is a lot less efficient. You can’t just directly access any node in the list. Instead, you have to traverse down the list until you find the target item. If the item you are looking for happens to be at the front of the list, the retrieval is an O(1) operation since you only traversed down one node. If the item is at the end of the list, retrieving it would be an O(n) operation, where n is the total number of nodes in the list.

To summarize, as the number of elements increases in an array, the runtime to access an element by its index remains constant. With a linked list, the time it takes to access a particular element increases linearly with the number of elements.

What is a Hash Function?

A hash function decides where to store and retrieve items in a hash table. It takes in an item key as its parameter and returns an index location for that particular item. Generally, a hash function uses modulo arithmetic. In particular, a key value is divided by the table length to generate an index number in the table. This index number refers to a location, or bucket, in the hash table.

HashFunction_7

Here’s an example: Suppose the hash function takes in a string as its parameter, and then it adds up the ASCII values of all of the characters in that string to get an integer. If the key is “pumpkin,” then the sum of the ASCII values would be 772. After that, the hash function takes the modulus of this number by the table length to get an index number. If the table length is 13, then 772 modulo 13 is 5. So the item with the key “pumpkin,” would go into bucket # 5 in the hash table.

int hash( string key )
{
    int value = 0;
    for ( int i = 0; i < key.length(); i++ )
        value += key[i];
    return value % tableLength;
}

What are Hash Collisions?

Occasionally, different words will generate the same hash function result. For example, the keys “tar” and “rat” will both hash to the index 2 if there are 13 buckets total in the hash table. When two keys hash to the same index, we get a hash collision.

To maximize the efficiency of storing and retrieving items in the hash table, we need to minimize the number of collisions. Each collision is another item in the bucket; another node in the linked list. We know that storing and retrieving items in a linked list grows more inefficient as the list grows longer. So it makes sense to keep these linked lists as short as possible.

There are several ways to write a good hash function. Making the table size a prime number is a good way to minimize collisions. In general, a good hash function fills out a hash table fairly uniformly. The printHistogram method of the HashTable class in the code example below lets you visualize the hash table. Ideally, the distribution of items should be fairly balanced.

Hash Table Contains 25 Items total
1:	 X X X X
2:	 X
3:	 X
4:	 X
5:	 X
6:	 X
7:	 X X X
8:	 X X X
9:	 X X X
10:	 X
11:	 X X X
12:	
13:	 X X X

Designing the Hash Table

Before we begin coding, it’s helpful to design the classes we will need for the program and how they will all work together. In the image below, we have UML (Universal Modeling Language) diagrams for the objects we will need in the program. First we start with a simple struct for representing items, called Item. From there, we can build a LinkedList from Items. Finally, we build a HashTable from a dynamic array of LinkedList objects. UML_HashTable

The Code

Copy and paste this code into your project:

**(June 1st 2015 Update: Thank you, David, for detecting my errors and helping me improve the code!)***

References:

  1. Carrano, Frank. Timothy, Henry. Data Abstraction & Problem Solving with C : Walls and Mirrors. Boston: Pearson, 2013. Print.
  2. “Hash Functions.” SparkNotes. 2014. Web. http://www.sparknotes.com/cs/searching/hashtables/section2.rhtml
  3. Suh, Eric. “Hash Tables.” Cprogramming.com. 2011. Web. http://www.cprogramming.com/tutorial/computersciencetheory/hash-table.html
  4. Walker, Julienne. “The Art of Hashing.” Eternally Confused. 2014. Web. http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

Categories: mathtechnologytutorials