Introduction to Hash Tables

Fred Agbo

2026-04-01

Announcements

Welcome back!
If you have not done so, please PM me your PS4 code file arrayqueue.py on Discord.
Short homework: PS5 is due today at 10 pm.
- Submission is by uploading a PDF file on Canvas.
Today’s Class:
- Introducing Hash Tables
- Sam (TA) will co-teach this topic

Hash Tables: A Fast Data Structure

Why Hash Tables?

Speed Comparison:

Data Structure	Search Time
Unsorted Array	O(N)
Balanced Binary Search Tree	O(log N)
Hash Table	O(1) on average

Hash tables provide constant-time insertion and searching on average!

Real-World Hash Table Applications

Common Use Cases

Dictionaries: English-language dictionary with 50,000 words
Compiler Symbol Tables: Storing variable and function names
Database Indexing: Bank account numbers, customer IDs
Caching: Web browsers, DNS lookups
Password Storage: Secure password verification

The Challenge: Storing a Dictionary

The Dictionary Problem

Challenge: Store 50,000 English words in memory with fast access.

Want each word in its own array cell
Need fast lookup: Word → Definition
Key question: How do we convert “ambiguous” to an array index?

Converting Words to Numbers

Simple Character Encoding

Create a custom code for lowercase letters:

Character	Code	Character	Code
(space)	0	n	14
a	1	o	15
b	2	…	…
c	3	z	26

Total: 27 characters (26 letters + space)

Approach 1: Adding the Digits

Convert each letter to its code and sum them.

Example: “elf”

e = 5
l = 12
f = 6
---------
Sum = 23

Store “elf” at array index 23.
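A minimal Python sketch of this approach, assuming the letter codes from the table above. The names encode_letter and additive_hash are illustrative only:

```python
def encode_letter(letter):
    """Encode letters a-z as 1-26; anything else (e.g. space) as 0."""
    letter = letter.lower()
    if 'a' <= letter <= 'z':
        return ord(letter) - ord('a') + 1
    return 0

def additive_hash(word):
    """Approach 1: sum the letter codes (illustrative name)."""
    return sum(encode_letter(ch) for ch in word)

print(additive_hash("elf"))  # 5 + 12 + 6 = 23
```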

Adding Digits: The Problem

For words of up to 10 letters (shorter words padded with spaces, code 0):
- Minimum: “a” → 0+0+0+0+0+0+0+0+0+1 = 1
- Maximum: “zzzzzzzzzz” → 26×10 = 260

Range: 1 to 260

Problem: Only 260 possible indices for 50,000 words!
- Average: ~192 words per cell (50,000 ÷ 260)
- Too many collisions!

Adding Digits: Collision Example

All these words hash to 23:

acne, ago, aim, baked, cable, elf, hack, ...

Any anagram has the same sum
Hundreds of words can map to the same index
Not discriminating enough!
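The collisions can be checked directly. The sketch below groups a few words by their letter-code sum. It assumes lowercase a-z input only, and the anagram pair "flea"/"leaf" is added here purely for illustration:

```python
from collections import defaultdict

def additive_hash(word):
    """Sum of letter codes, a=1 ... z=26 (assumes lowercase a-z only)."""
    return sum(ord(ch) - ord('a') + 1 for ch in word)

words = ["acne", "ago", "aim", "baked", "cable", "elf", "hack",
         "flea", "leaf"]  # last two: an anagram pair added for illustration

cells = defaultdict(list)  # index -> all words that landed there
for w in words:
    cells[additive_hash(w)].append(w)

for index, bucket in sorted(cells.items()):
    print(index, bucket)
```

Only two cells are used for nine words: everything from the slide lands at index 23, and the anagram pair shares index 24.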

Approach 2: Multiplying by Powers

Make each position contribute uniquely to the final number.

Like decimal numbers:

7,546 = 7×10³ + 5×10² + 4×10¹ + 6×10⁰

For words (base 27):

"elf" = e×27² + l×27¹ + f×27⁰
      = 5×729 + 12×27 + 6×1
      = 3,645 + 324 + 6
      = 3,975

Multiplying by Powers: Uniqueness

Guarantee: Every word gets a unique number!

Example: “zzzzzzzzzz” (10 z’s)

26×27⁹ + 26×27⁸ + ... + 26×27⁰

Just 27⁹ alone = 7,625,597,484,987

Problem: Array can’t have 7+ trillion cells!
- Most cells would be empty (for non-words)
- Huge waste of memory
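To see how large the range really gets, the arithmetic can be run with Python's arbitrary-precision integers. This is only a sketch of the calculation. The variable names are illustrative:

```python
BASE = 27      # 26 letters + space
MAX_LEN = 10   # longest word considered on the slide

largest_power = BASE ** 9
# Code for "zzzzzzzzzz": 26 in every one of the 10 positions
max_code = sum(26 * BASE ** p for p in range(MAX_LEN))

print(f"{largest_power:,}")  # 7,625,597,484,987
print(f"{max_code:,}")       # 205,891,132,094,648
```

The full code for ten z's equals 27¹⁰ − 1, so the array would need over 200 trillion cells, not just the 7+ trillion contributed by 27⁹ alone.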

The Two Extremes

Method	Range	Problem
Adding digits	1-260	Too small - massive collisions
Powers of 27	1 to over 7 trillion	Too large - wasted memory

We need something in between!

The Hash Function Solution

Hashing: Compressing the Range

Use the modulo operator (%) to squeeze large numbers into a smaller range:

arrayIndex = hugeNumber % arraySize

Example: Range 0-199 → Range 0-9

13 % 10 = 3
157 % 10 = 7

The modulo operation gives us the remainder, effectively “wrapping” large numbers into a smaller range.
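A quick check of the wrapping behaviour, as a sketch. The helper name compress is hypothetical:

```python
def compress(huge_number, array_size):
    """Squeeze any non-negative integer into the range 0 .. array_size-1."""
    return huge_number % array_size

print(compress(13, 10))   # 3
print(compress(157, 10))  # 7

# Every value in 0-199 lands somewhere in 0-9
indices = {compress(n, 10) for n in range(200)}
print(sorted(indices))    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```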

Complete Hash Function for Words

def encode_letter(letter):
    """Encode letters a-z as 1-26, space as 0"""
    letter = letter.lower()
    if 'a' <= letter <= 'z':
        return ord(letter) - ord('a') + 1
    return 0

def unique_encode_word(word):
    """Encode word uniquely using powers of 27"""
    return sum(encode_letter(word[i]) * 27 ** (len(word) - 1 - i)
               for i in range(len(word)))

# Hash to array index
word = "elf"  # example key; any word works here
arraySize = 100000  # 2× the 50,000 words
arrayIndex = unique_encode_word(word) % arraySize
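One possible variation, not taken from the slides: the same index can be computed without ever building the huge intermediate number, by processing letters left to right (Horner's rule) and applying % at each step. The name hash_word is an assumption for this sketch:

```python
def encode_letter(letter):
    """Encode letters a-z as 1-26, anything else as 0."""
    letter = letter.lower()
    if 'a' <= letter <= 'z':
        return ord(letter) - ord('a') + 1
    return 0

def hash_word(word, array_size):
    """Base-27 encoding reduced modulo array_size after every letter."""
    index = 0
    for letter in word:
        # Shift previous digits one base-27 place, add the new digit,
        # then wrap so the running value never exceeds array_size.
        index = (index * 27 + encode_letter(letter)) % array_size
    return index

print(hash_word("elf", 100000))  # 3975
```

Because (a × 27 + b) % m gives the same remainder whether or not a was already reduced, this should match unique-encode-then-modulo for every word. In Python the difference is mostly about speed, but in languages with fixed-width integers it also avoids overflow.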

Choosing Array Size

Rule of thumb: Array should be 2× the number of items

For 50,000 words:
- Array size = 100,000
- Load factor = 0.5 (50% full)
- Balances space vs. collision frequency

Trade-off:

Larger array → fewer collisions, more memory
Smaller array → more collisions, less memory
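The load factor is simply items divided by cells. A small sketch (the function name load_factor is illustrative) shows how it changes with array size for the 50,000-word dictionary:

```python
def load_factor(num_items, array_size):
    """Fraction of the array that is occupied."""
    return num_items / array_size

for array_size in (50_000, 100_000, 200_000):
    print(array_size, load_factor(50_000, array_size))
# 50000 1.0
# 100000 0.5
# 200000 0.25
```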

Hash Function Terminology

Hash Function: Converts a key → array index (a number in a large range compressed into the array's range)
Hash Table: Array that stores the hashed data
Hash Address: The array index for a specific key
Collision: When two keys hash to same index

Understanding Collisions

What Are Collisions?

Collision: Two different keys hash to the same array index

Example: Three words hash to index 24,122:

Word	Unique Code	Hash (mod 100,000)
bring	1,424,122	24,122
abductor	11,303,824,122	24,122
missable	139,754,124,122	24,122

Collisions are unavoidable when compressing a large range into a smaller one!
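The table above can be reproduced with a few lines of Python. This sketch assumes lowercase a-z words and repeats the base-27 encoding so that it runs on its own:

```python
def unique_encode_word(word):
    """Base-27 code: a=1 ... z=26, leftmost letter most significant."""
    code = 0
    for ch in word:
        code = code * 27 + (ord(ch) - ord('a') + 1)
    return code

ARRAY_SIZE = 100_000
for word in ("bring", "abductor", "missable"):
    code = unique_encode_word(word)
    print(f"{word:10} {code:>18,} {code % ARRAY_SIZE:>8,}")
# bring               1,424,122   24,122
# abductor       11,303,824,122   24,122
# missable      139,754,124,122   24,122
```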

The Birthday Paradox

Question: How many people needed before two likely share a birthday?

Intuition says: ~183 people (half of 365 days)
Reality: Only 23 people for >50% chance!
This explains why collisions are more common than expected

Birthday Paradox: The Math

Think about pairs, not just people:

People	Pairs	Collision Probability
1	0	0%
2	1	0.27%
3	3	0.82%
10	45	11.7%
23	253	50.7%
30	435	70.6%
50	1,225	97.0%
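The probabilities in the table can be recomputed exactly. The sketch below assumes 365 equally likely birthdays and no leap years. The function names are illustrative:

```python
def pairs(n):
    """Number of distinct pairs among n people."""
    return n * (n - 1) // 2

def collision_probability(n, days=365):
    """P(at least two of n people share a birthday)."""
    p_all_distinct = 1.0
    for i in range(n):
        # Person i must avoid the i birthdays already taken
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

for n in (1, 2, 3, 10, 23, 30, 50):
    print(n, pairs(n), f"{collision_probability(n):.1%}")
```

The trick is to compute the chance that everyone is distinct and subtract it from 1, which is much easier than counting shared birthdays directly.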

The Pairs Formula

Number of comparisons (pairs) for n people:

\[\text{Pairs} = \binom{n}{2} = \frac{n(n-1)}{2}\]

Why this matters:
- Each pair is a chance for a collision
- The number of pairs grows quadratically (O(n²))
- 23 people → 253 pairs sharing only 365 possible birthdays

Birthday Paradox → Hash Tables

The connection:

Birthday Analogy	Hash Table
365 days	Array cells
People	Items inserted
Shared birthday	Collision

Key Insight: Even below 10% capacity, collisions are likely!
- 23 items in 365 cells = 6.3% full → ~50% collision chance
- With 50,000 words in 100,000 cells → collisions are inevitable
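To put a rough number on "inevitable", one can estimate how many pairs of items are expected to share a cell. This is a back-of-the-envelope sketch that assumes an idealized hash function placing each item in a uniformly random cell, which a real hash function only approximates:

```python
def expected_colliding_pairs(num_items, num_cells):
    """Expected number of item pairs that land in the same cell,
    assuming every item is placed uniformly at random."""
    pairs = num_items * (num_items - 1) / 2
    return pairs / num_cells  # each pair collides with probability 1/num_cells

print(expected_colliding_pairs(23, 365))          # about 0.69
print(expected_colliding_pairs(50_000, 100_000))  # 12499.75
```

Under that assumption, a half-full table of 50,000 words would be expected to contain roughly 12,500 colliding pairs.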

Collision Resolution Strategies

Two Main Approaches

When a collision occurs, we must have a strategy:

Open Addressing: Find another empty cell in the array
Separate Chaining: Store multiple items at the same index using linked lists
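Both strategies can be sketched in a few lines. This is a simplified illustration, not a full implementation: the class names are hypothetical, Python lists stand in for linked lists, the open-addressing version uses linear probing, and neither class supports deletion or resizing. Small integer keys are used because in CPython a small non-negative int hashes to itself, which makes the collisions easy to see.

```python
class ChainedTable:
    """Separate chaining: each cell holds a list of (key, value) pairs."""
    def __init__(self, size=10):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision or empty: just append

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)


class ProbingTable:
    """Open addressing (linear probing): on collision, try the next cell."""
    def __init__(self, size=10):
        self.cells = [None] * size

    def _probe(self, key):
        start = hash(key) % len(self.cells)
        for step in range(len(self.cells)):
            yield (start + step) % len(self.cells)  # wrap around the end

    def put(self, key, value):
        for i in self._probe(key):
            if self.cells[i] is None or self.cells[i][0] == key:
                self.cells[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.cells[i] is None:    # empty cell: key was never stored
                break
            if self.cells[i][0] == key:
                return self.cells[i][1]
        raise KeyError(key)


chained, probing = ChainedTable(), ProbingTable()
for key, value in [(13, "a"), (23, "b"), (33, "c")]:  # all hash to index 3
    chained.put(key, value)
    probing.put(key, value)

print(chained.buckets[3])  # [(13, 'a'), (23, 'b'), (33, 'c')]
print(probing.cells[3:6])  # [(13, 'a'), (23, 'b'), (33, 'c')]
```

With chaining, all three items stay at index 3. With probing, they spill into indices 4 and 5.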

Key Takeaways

What We Learned

Hash tables provide O(1) average-case operations by:

Using hash functions to map keys → array indices
Compressing large ranges with modulo operator
Handling inevitable collisions with resolution strategies

Collision Resolution:

Open addressing: Linear probing, quadratic probing, double hashing
Separate chaining: Linked lists at each index

The Birthday Paradox Lesson

Collisions are MORE common than intuition suggests!

Formula: \(\frac{n(n-1)}{2}\) pairs for n items
23 people in 365 days → 50% collision chance
Grows quadratically, not linearly

Design implication: Always plan for collisions from the start

Practical Design Guidelines

Array size: ≈ 2× expected items (load factor = 0.5)
Hash function: Good distribution is critical
Collision strategy: Choose based on access patterns
Monitor: Load factor and performance metrics
Trade-offs: Speed vs. memory, simplicity vs. efficiency

Next Steps

Implement hash table with linear probing
Analyze performance with different load factors
Compare open addressing vs. separate chaining
Explore advanced hash functions
Practice with real-world applications