Programming #3: Maps and Hashing

So far, we have talked about data structures that use numerical indices to find values in an array. What if we want more freedom in how we look up a value. Consider the Oxford dictionary, if we wanted to find the definition (value) for a certain word (key), we would just look up the word.

Dictionaries are structured as maps. The main idea behind a map is that we can look something up by its key and find whatever is stored by its value. This key-value structure lends itself to many uses that an index based array could not. In Python, the dictionary data-structure is a map.

Maps as Sets

Before we dive deeper into maps, we need to talk about sets. Sets are like the list-based data structures we talked about in previous posts, with two additional rules: 1) order doesn’t matter and 2) we can’t repeat elements.

Since keys in a map can’t be repeated, it makes sense to build maps as sets. For example, think of a student ID system as a set based map. No two students can have the same ID and the order doesn’t really matter. Student 199 isn’t 99 students better than student 100, or anything like that.

Let’s do a quick example. Consider the following code that builds a dictionary in Python which stores cities by country and continent. The first key will be continent, which will store another dictionary with country and city values:

```locations = {'North America': {'USA': ['Mountain View']}}
locations['North America']['USA'].append('Atlanta')
locations['Asia'] = {'India': ['Bangalore']}
locations['Asia']['China'] = ['Shanghai']
locations['Africa'] = {'Egypt': ['Cairo']}
```

Now let’s print all of the cities in the USA sorted alphabetically:

```usa_sorted = sorted(locations['North America']['USA'])
for city in usa_sorted:
print city

# Atlanta
# Mountain View
```

Now let’s print all the cities in Asia next to the name of their country:

```asia_cities = []
for countries, cities in locations['Asia'].iteritems():
city_country = cities[0] + " - " + countries
asia_cities.append(city_country)
asia_sorted = sorted(asia_cities)
for city in asia_sorted:
print city

# Bangalore - India
# Shanghai - China
```

Hashing

With everything we’ve looked at so far (list, stacks, queues, sets), searching for an element has required linear time in the worst cases. In other words, we need to look through every element to find the one we are looking for. Adding a hash function to a data-structure allows us to speed this up and do look-ups in constant time.

A hash function takes in a value, runs it through the function, and returns a hash value. The hash value is the index in an array where the original value can be stored.

For example, imagine we have student IDs with 6 digits. We want to quickly see if an ID (012345) exists in our system. Searching through all 6 digits IDs would take a while, especially if there is no order as is the case with sets. Instead, we can speed this up with a hash function. A common hash function is to take the last 2 digits (45), divide them by a constant number (10), and return the remainder (45/10 = remainder 5). So we can just look up index 5 in our hash table and see if the value stored there is 012345.

What happens if the hash function spits out the same hash value for different inputs? This is called a collision and there are two main ways to deal with it. First, we can change the hash function so that it gives unique results for all the inputs involved. Second, we can store multiple inputs in a single hash value. For example, hash value 5 could point to 012345 and 012355. These buckets could then be searched to see if our value exists.

The first approach is faster but takes more space. The second is slower but uses less space. There is no right way to implement a hash function.

When dealing with hash tables, we can define a load factor which gives us a sense of how full a hash table is. For example, if we’re trying to store 100 values in a hash table that has 1000 buckets, our load factor would be 10%.

Load Factor = Number of Entries / Number of Buckets

In this case, we would be using more memory than we need since 90% of the buckets are empty. This means we should reduce the amount of buckets in our hash function. If the load factor was closer to or past 100%, we would want to add buckets. The goal is to efficiently use our space.

Hash Maps

Now that we have a sense of the concept of hashing and the concept of maps, we can combine them. Since each key in a map is unique, we can use a hash function to give them their own unique hash values. Each key-value pair in the map can then be stored in the bucket at its hash value.

Why do this? Say we want to find the definition of a word in our dictionary ‘apple’. If we have 100000 words, we would have to look through each one until we find ‘apple’. Remember, sets don’t have an order so we can’t really speed this up.

But with a hash map, we could convert ‘apple’ to a number and use that number as the input to a hash function. When the hash value is returned, we just have to do a constant time lookup to find the value stored at the hash value. This takes more space but the speed gains are often worth it.

Let’s write our own hash map that takes string keys as input. The hash function will be:

Hash Value = (ASCII Value of First Letter * 100) + ASCII Value of Second Letter

The Python function ord() returns an ASCII value of a letter, and chr() returns the letter of the ASCII value:

```class HashTable(object):
def __init__(self):
self.table = [None]*10000

def store(self, string):
"""
Input a string that's stored in
the table.
"""
hv = self.calculate_hash_value(string)
if hv != -1:
if self.table[hv] != None:
self.table[hv].append(string)
else:
self.table[hv] = [string]

def lookup(self, string):
"""
Return the hash value if the
string is already in the table.
Return -1 otherwise.
"""
hv = self.calculate_hash_value(string)
if hv != -1:
if self.table[hv] != None:
if string in self.table[hv]:
return hv
return -1

def calculate_hash_value(self, string):
"""
Helper function to calulate a
hash value from a string.
"""
value = ord(string[0])*100 + ord(string[1])
return value
```