Introduction to Hash Tables

Fred Agbo

2026-04-01

Announcements

  • Welcome back!
  • If you have not done so, please PM me your PS4 code file arrayqueue.py on Discord.
  • Short homework: PS5 is due today at 10 pm.
    • Submission is by uploading a PDF file on Canvas.
  • Today’s Class:
    • Introducing Hash Tables
    • Sam (TA) will co-teach this topic

Hash Tables: A Fast Data Structure

Why Hash Tables?

Speed Comparison:

Data Structure           Search Time
Array (unsorted)         O(N)
Binary Tree (balanced)   O(log N)
Hash Table               O(1) on average

Hash tables provide constant-time insertion and searching on average!

Real-World Hash Table Applications

Common Use Cases

  • Dictionaries: English-language dictionary with 50,000 words
  • Compiler Symbol Tables: Storing variable and function names
  • Database Indexing: Bank account numbers, customer IDs
  • Caching: Web browsers, DNS lookups
  • Password Storage: Secure password verification

The Challenge: Storing a Dictionary

The Dictionary Problem

Challenge: Store 50,000 English words in memory with fast access.

  • Want each word in its own array cell
  • Need fast lookup: Word → Definition
  • Key question: How do we convert “ambiguous” to an array index?

Converting Words to Numbers

Simple Character Encoding

Create a custom code for lowercase letters:

Character   Code     Character   Code
(space)     0        n           14
a           1        o           15
b           2        ...         ...
c           3        z           26

Total: 27 characters (26 letters + space)

Approach 1: Adding the Digits

Convert each letter to its code and sum them.

Example: “elf”

e = 5
l = 12
f = 6
---------
Sum = 23

Store “elf” at array index 23.
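The digit-adding approach above can be sketched in a few lines of Python. The function name additive_hash is a placeholder chosen for this illustration, not part of the course code; the letter codes follow the table above (a = 1 through z = 26, everything else 0).

```python
def encode_letter(letter):
    """Map a-z to 1-26; anything else (including space) maps to 0."""
    letter = letter.lower()
    if 'a' <= letter <= 'z':
        return ord(letter) - ord('a') + 1
    return 0

def additive_hash(word):
    """Approach 1: add up the codes of every letter in the word."""
    return sum(encode_letter(ch) for ch in word)

print(additive_hash("elf"))  # 5 + 12 + 6 = 23
```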

Adding Digits: The Problem

For words of up to 10 letters (shorter words padded with spaces, code 0):

  • Minimum: “a” → 0+0+0+0+0+0+0+0+0+1 = 1
  • Maximum: “zzzzzzzzzz” → 26×10 = 260

Range: 1 to 260

Problem: Only 260 possible indices for 50,000 words!

  • Average: 50,000 ÷ 260 ≈ 192 words per cell
  • Too many collisions!

Adding Digits: Collision Example

All these words hash to 23:

acne, ago, aim, baked, cable, elf, hack, ...
  • Any anagram has the same sum
  • Hundreds of words can map to the same index
  • Not discriminating enough!
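To see the collisions listed above, one option is to group words by their additive code. This is only a sketch: the extra words "cat" and "act" are added here to illustrate the anagram point, and the dictionary named cells stands in for the array.

```python
from collections import defaultdict

def additive_hash(word):
    """Sum letter codes (a=1 ... z=26); other characters count as 0."""
    return sum(ord(ch) - ord('a') + 1 for ch in word.lower() if 'a' <= ch <= 'z')

words = ["acne", "ago", "aim", "baked", "cable", "elf", "hack", "cat", "act"]

# Group every word under the index it hashes to.
cells = defaultdict(list)
for w in words:
    cells[additive_hash(w)].append(w)

for index, bucket in sorted(cells.items()):
    print(index, bucket)
```

The first seven words all land in cell 23, and the anagrams "cat" and "act" share cell 24.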

Approach 2: Multiplying by Powers

Make each position contribute uniquely to the final number.

Like decimal numbers:

7,546 = 7×10³ + 5×10² + 4×10¹ + 6×10⁰

For words (base 27):

"elf" = e×27² + l×27¹ + f×27⁰
      = 5×729 + 12×27 + 6×1
      = 3,645 + 324 + 6
      = 3,975

Multiplying by Powers: Uniqueness

Guarantee: Every word gets a unique number!

Example: “zzzzzzzzzz” (10 z’s)

26×27⁹ + 26×27⁸ + ... + 26×27⁰

Just 27⁹ alone = 7,625,597,484,987

Problem: An array can’t have 7+ trillion cells!

  • Most cells would be empty (for non-words)
  • Huge waste of memory
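A quick calculation shows how large the range gets. The sketch below sums the full series for ten z's; the variable names are illustrative only.

```python
BASE = 27

# Largest code for a 10-letter word: ten z's, each worth 26.
max_code = sum(26 * BASE ** k for k in range(10))

print(f"{BASE ** 9:,}")   # 7,625,597,484,987
print(f"{max_code:,}")    # 205,891,132,094,648
print(max_code == BASE ** 10 - 1)  # True
```

So the highest term alone accounts for more than 7 trillion, and the full sum is 27¹⁰ − 1.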

The Two Extremes

Method           Range               Problem
Adding digits    1 to 260            Too small: massive collisions
Powers of 27     1 to 7+ trillion    Too large: wasted memory

We need something in between!

The Hash Function Solution

Hashing: Compressing the Range

Use the modulo operator (%) to squeeze large numbers into a smaller range:

arrayIndex = hugeNumber % arraySize

Example: Range 0-199 → Range 0-9

13 % 10 = 3
157 % 10 = 7

The modulo operation gives us the remainder, effectively “wrapping” large numbers into a smaller range.
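The compression step can be tried directly. In this small sketch the helper name compress is an assumption made for illustration.

```python
def compress(huge_number, array_size):
    """Squeeze any non-negative integer into the range 0 .. array_size - 1."""
    return huge_number % array_size

for n in (13, 157, 199):
    print(n, "->", compress(n, 10))
# 13 -> 3
# 157 -> 7
# 199 -> 9
```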

Complete Hash Function for Words

def encode_letter(letter):
    """Encode letters a-z as 1-26, space as 0"""
    letter = letter.lower()
    if 'a' <= letter <= 'z':
        return ord(letter) - ord('a') + 1
    return 0

def unique_encode_word(word):
    """Encode word uniquely using powers of 27"""
    return sum(encode_letter(word[i]) * 27 ** (len(word) - 1 - i)
               for i in range(len(word)))

# Hash to array index
word = "elf"        # example key; any lowercase word works
arraySize = 100000  # 2× the 50,000 words
arrayIndex = unique_encode_word(word) % arraySize
print(arrayIndex)   # 3975
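One possible variation, not shown in the slides, applies the modulo at every step instead of once at the end (Horner's rule). Python integers never overflow, so both versions give the same answer here, but in languages with fixed-width integers this form keeps intermediate values small. The function name hash_word is hypothetical.

```python
def encode_letter(letter):
    """Encode letters a-z as 1-26, anything else as 0."""
    letter = letter.lower()
    if 'a' <= letter <= 'z':
        return ord(letter) - ord('a') + 1
    return 0

def hash_word(word, array_size):
    """Base-27 encoding reduced modulo array_size after every letter."""
    index = 0
    for ch in word:
        index = (index * 27 + encode_letter(ch)) % array_size
    return index

print(hash_word("elf", 100000))    # 3975
print(hash_word("bring", 100000))  # 24122
```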

Choosing Array Size

Rule of thumb: Array should be 2× the number of items

  • For 50,000 words:
    • Array size = 100,000
    • Load factor = 0.5 (50% full)
    • Balances space vs. collision frequency

Trade-off:

  • Larger array → fewer collisions, more memory
  • Smaller array → more collisions, less memory
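The rule of thumb above is simple arithmetic. The helper names below are assumptions made for this sketch.

```python
import math

def load_factor(num_items, array_size):
    """Fraction of the array that is occupied."""
    return num_items / array_size

def suggested_array_size(num_items, target_load=0.5):
    """Smallest array size that keeps the load factor at or below target_load."""
    return math.ceil(num_items / target_load)

print(load_factor(50_000, 100_000))   # 0.5
print(suggested_array_size(50_000))   # 100000
```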

Hash Function Terminology

  • Hash Function: Converts a key (a number in a large range) → an array index (a number in a smaller range)
  • Hash Table: Array that stores the hashed data
  • Hash Address: The array index for a specific key
  • Collision: When two keys hash to same index

Understanding Collisions

What Are Collisions?

Collision: Two different keys hash to the same array index

Example: Three words hash to index 24,122:

Word        Unique Code          Hash (mod 100,000)
bring       1,424,122            24,122
abductor    11,303,824,122       24,122
missable    139,754,124,122      24,122

Collisions are unavoidable when compressing a large range into a smaller one!
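The three-word example can be checked with the base-27 encoding from earlier in the lecture. The sketch below redefines the two helper functions so it runs on its own.

```python
def encode_letter(letter):
    """Encode letters a-z as 1-26, anything else as 0."""
    letter = letter.lower()
    if 'a' <= letter <= 'z':
        return ord(letter) - ord('a') + 1
    return 0

def unique_encode_word(word):
    """Encode a word uniquely using powers of 27."""
    return sum(encode_letter(ch) * 27 ** (len(word) - 1 - i)
               for i, ch in enumerate(word))

array_size = 100_000
for word in ("bring", "abductor", "missable"):
    code = unique_encode_word(word)
    print(f"{word:10} {code:>18,} {code % array_size:>8,}")
```

All three codes end in the same five digits, so all three words land at index 24,122.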

The Birthday Paradox

Question: How many people needed before two likely share a birthday?

  • Intuition says: ~183 people (half of 365 days)
  • Reality: Only 23 people for >50% chance!
  • This explains why collisions are more common than expected

Birthday Paradox: The Math

Think about pairs, not just people:

People    Pairs    Collision Probability
1         0        0%
2         1        0.27%
3         3        0.82%
10        45       11.7%
23        253      50.7%
30        435      70.6%
50        1,225    97.0%
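The probabilities in the table can be reproduced by multiplying the chances that each new person misses every birthday already taken. This is a sketch that assumes 365 equally likely birthdays; the function name is illustrative.

```python
def birthday_collision_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for k in range(people):
        # Person k+1 must avoid the k birthdays already used.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

for n in (1, 2, 3, 10, 23, 30, 50):
    print(n, f"{birthday_collision_probability(n):.1%}")
```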

The Pairs Formula

Number of comparisons (pairs) for n people:

\[\text{Pairs} = \binom{n}{2} = \frac{n(n-1)}{2}\]

Why this matters:

  • Each pair is a chance for collision
  • Grows quadratically (O(n²))
  • 23 people → 253 pairs competing for only 365 days
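A short loop makes the quadratic growth of the pairs formula visible. The function name pairs is chosen only for this sketch.

```python
def pairs(n):
    """Number of distinct pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2

for n in (10, 23, 46, 92):
    print(n, pairs(n))
# 10 45
# 23 253
# 46 1035
# 92 4186
```

Doubling the number of people roughly quadruples the number of pairs.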

Birthday Paradox → Hash Tables

The connection:

Birthday Analogy    Hash Table
365 days            Array cells
People              Items inserted
Shared birthday     Collision

Key Insight: Even below 10% capacity, collisions are highly likely!

  • 23 items in 365 cells = 6.3% full → 50% collision chance
  • With 50,000 words in 100,000 cells → collisions are inevitable
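A small simulation gives a feel for how many collisions to expect at load factor 0.5. It is only a sketch: it assumes an idealized hash function that spreads keys uniformly at random, which real hash functions only approximate, and the function name and seed are arbitrary.

```python
import random

def count_collisions(num_items, array_size, seed=0):
    """Insert num_items random indices; count those landing on an occupied cell."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    occupied = set()
    collisions = 0
    for _ in range(num_items):
        index = rng.randrange(array_size)
        if index in occupied:
            collisions += 1
        else:
            occupied.add(index)
    return collisions

print(count_collisions(50_000, 100_000))
```

Under this assumption, roughly one insertion in five hits a cell that is already taken.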

Collision Resolution Strategies

Two Main Approaches

When a collision occurs, we must have a strategy:

  1. Open Addressing
    Find another empty cell in the array

  2. Separate Chaining
    Store multiple items at the same index using linked lists
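As a rough illustration of separate chaining, the sketch below keeps a small list of (key, value) pairs in every cell. It makes two simplifying assumptions: Python lists stand in for linked lists, and Python's built-in hash() stands in for the word-encoding function. The class name ChainedHashTable is hypothetical.

```python
class ChainedHashTable:
    """Minimal separate-chaining table: each cell holds a list of (key, value)."""

    def __init__(self, array_size=11):
        self.buckets = [[] for _ in range(array_size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # new key: extend the chain

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

table = ChainedHashTable(array_size=1)  # one cell, so every key collides
table.put("elf", "a small creature")
table.put("ago", "in the past")
print(table.get("elf"), "|", table.get("ago"))
```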

Key Takeaways

What We Learned

Hash Tables provide O(1) average-case operations by:

  1. Using hash functions to map keys → array indices

  2. Compressing large ranges with modulo operator

  3. Handling inevitable collisions with resolution strategies

Collision Resolution:

  • Open addressing: Linear probing, quadratic probing, double hashing

  • Separate chaining: Linked lists at each index
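For comparison, here is a minimal sketch of open addressing with linear probing. It assumes Python's built-in hash() as the hash function, a fixed-size array, and no deletion support; the class name LinearProbingTable is hypothetical.

```python
class LinearProbingTable:
    """Minimal open-addressing table: on collision, try the next cell."""

    def __init__(self, array_size=11):
        self.slots = [None] * array_size

    def put(self, key, value):
        size = len(self.slots)
        start = hash(key) % size
        for step in range(size):
            i = (start + step) % size          # wrap around at the end
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        size = len(self.slots)
        start = hash(key) % size
        for step in range(size):
            i = (start + step) % size
            if self.slots[i] is None:          # empty cell: key is absent
                return default
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return default

table = LinearProbingTable(array_size=3)
for key in (0, 3, 6):                          # all three hash to index 0
    table.put(key, f"value{key}")
print(table.slots)
```

Quadratic probing and double hashing would change only how the next index i is computed.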

The Birthday Paradox Lesson

Collisions are MORE common than intuition suggests!

  • Formula: \(\frac{n(n-1)}{2}\) pairs for n items
  • 23 people in 365 days → 50% collision chance
  • Grows quadratically, not linearly

Design implication: Always plan for collisions from the start

Practical Design Guidelines

  • Array size: ≈ 2× expected items (load factor = 0.5)
  • Hash function: Good distribution is critical
  • Collision strategy: Choose based on access patterns
  • Monitor: Load factor and performance metrics
  • Trade-offs: Speed vs. memory, simplicity vs. efficiency

Next Steps

  • Implement hash table with linear probing
  • Analyze performance with different load factors
  • Compare open addressing vs. separate chaining
  • Explore advanced hash functions
  • Practice with real-world applications