Beyond the Python List

Fred Agbo

2025-09-17

Announcements

Welcome back!
First Mini-project (MP1) is due today at 10pm

Introducing Python NumPy for Data Structures

What’s the Problem?

You’re a data scientist analyzing a massive dataset.
Your data contains millions of numerical values.
You need to perform a simple operation on all of them, for example, squaring each number.
How would you do this in standard Python?

Python Lists: The Swiss Army Knife

A standard Python list is a highly flexible, general-purpose container.
It can store different data types (integers, floats, strings, even other lists) within the same structure.
It’s a “list of pointers” to objects stored in various memory locations.
This flexibility comes at a cost…
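
A minimal sketch of the "list of pointers" idea, assuming the standard CPython interpreter. The variable names (mixed, inner, container) are illustrative only.

```python
# A list can hold objects of different types side by side
mixed = [1, 2.5, "three", [4, 5]]
element_types = [type(item).__name__ for item in mixed]
print(element_types)  # ['int', 'float', 'str', 'list']

# The list stores references, not copies:
# both slots below point to the very same inner list object
inner = [4, 5]
container = [inner, inner]
inner.append(6)
print(container)                     # [[4, 5, 6], [4, 5, 6]]
print(container[0] is container[1])  # True
```

Because every slot is only a reference, Python must follow it to find the actual object before doing any work on it.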

Code Example: The List Approach

Let’s see the standard way to handle this.

import time

# Create a large list of 10 million integers
my_list = list(range(10_000_000))

# Time the operation (perf_counter is a high-resolution clock meant for timing)
start_time = time.perf_counter()

# Square each element using a list comprehension
squared_list = [x**2 for x in my_list]

list_time = time.perf_counter() - start_time
print(f"List comprehension time: {list_time:.4f} seconds")

Why is it so slow?

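Python has to dereference each pointer one by one to reach the underlying object.
The for loop runs in the interpreter, so every iteration pays interpreter overhead.
Python has to check the type of each element before it can apply the operation.
Each calculation is done separately and produces a new Python object for its result.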
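
One way to see the per-element type checking in action is a small sketch like the following. The list contents are made up for illustration.

```python
# The same expression works for int, float, and complex elements,
# because Python decides how to apply ** separately for each element at run time
values = [2, 2.5, 3 + 0j]
squares = [x**2 for x in values]
result_types = [type(s).__name__ for s in squares]

print(squares)       # [4, 6.25, (9+0j)]
print(result_types)  # ['int', 'float', 'complex']
```

That per-element decision is convenient, but it is repeated on every iteration, even when all the elements happen to share one type.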

Enter NumPy: The Specialized Toolbox

NumPy (Numerical Python) is the foundational library for scientific computing in Python.
Its core object is the ndarray (n-dimensional array).
Unlike a list, a NumPy array is:
- Homogeneous: All elements must be of the same data type.
- Contiguous: Stored in one single, continuous block of memory.
- Optimized: Operations are performed by fast C/Fortran code “under the hood.”
This design makes it perfect for numerical operations.
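
A short sketch of how these properties can be inspected on an array. The array contents are arbitrary.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0])

print(arr.dtype)                  # float64: one data type for every element
print(arr.itemsize)               # 8 bytes per element
print(arr.nbytes)                 # 32 bytes of raw data in total
print(arr.flags["C_CONTIGUOUS"])  # True: stored as one contiguous block
```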

Code Example: Creating a NumPy Array

import numpy as np

# Create an array from a Python list
my_np_array = np.array([1, 2, 3, 4, 5])
print(f"Array: {my_np_array}")
print(f"Type: {type(my_np_array)}")
print(f"Data type: {my_np_array.dtype}") # Notice the single data type

# NumPy forces homogeneity - it will upcast types
hetero_list = [1, 2, 3.5, 4]
hetero_array = np.array(hetero_list)
print(f"Upcast array: {hetero_array}")
print(f"Data type: {hetero_array.dtype}")

The Power of Vectorization

Vectorization is the ability to perform operations on entire arrays at once, without explicit Python loops.
NumPy functions and operators are “vectorized” by default.
This is where the massive speed gains come from.
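
A small sketch of what "operations on entire arrays at once" looks like in practice. The arrays prices and quantities are invented for illustration.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

totals = prices * quantities   # element-wise multiplication, no Python loop
discounted = totals * 0.9      # a scalar is applied to every element
roots = np.sqrt(prices)        # NumPy functions also act on the whole array

# The equivalent explicit loop, for comparison
loop_totals = [p * q for p, q in zip(prices, quantities)]

print(totals)  # [10. 40. 90.]
```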

Code Example: NumPy Performance Test

import numpy as np
import time

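# Create a large NumPy array of 10 million integers
# (int64 is requested explicitly so the squares cannot overflow
# on platforms where the default integer is 32-bit)
large_np_array = np.arange(10_000_000, dtype=np.int64)

# Time the NumPy operation
start_time = time.perf_counter()

# Perform the vectorized operation
squared_np_array = large_np_array**2

np_time = time.perf_counter() - start_time
print(f"NumPy vectorization time: {np_time:.4f} seconds")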

# Compare with our previous result
# List time was ~ 2.0 seconds (exact timings depend on your machine)
print("NumPy is typically many times faster!")
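
For a more repeatable comparison, here is a sketch using the standard library's timeit module. The array size n is an assumption, chosen smaller than in the slide so that repeated runs stay quick.

```python
import timeit
import numpy as np

n = 1_000_000  # assumed size for this sketch
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Run each version three times and keep the best (least disturbed) time
list_time = min(timeit.repeat(lambda: [x**2 for x in py_list], number=1, repeat=3))
np_time = min(timeit.repeat(lambda: np_array**2, number=1, repeat=3))

print(f"List comprehension: {list_time:.4f} s")
print(f"NumPy vectorized:   {np_time:.4f} s")
print(f"Speedup: {list_time / np_time:.1f}x")

# Sanity check: both approaches produce the same values
same = np.array_equal(np_array**2, np.array([x**2 for x in py_list], dtype=np.int64))
print(f"Same results: {same}")
```

The measured speedup varies with hardware, Python version, and NumPy version, so treat any single number as a rough indication.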

Benefits of NumPy

Performance: Vectorized operations are implemented in highly optimized C and Fortran code, making them significantly faster than Python loops.
Memory Efficiency: Storing homogeneous data in contiguous memory blocks uses far less memory than a Python list. This is crucial for large datasets.
Rich Functionality: NumPy includes a vast library of built-in mathematical, statistical, and linear algebra functions.
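
A rough sketch of the memory difference, assuming CPython and 64-bit integers. sys.getsizeof reports interpreter-specific sizes, so the exact list figure will vary.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# List: the container of pointers plus every integer object it points to
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# Array: raw data only (the small fixed array header is not included in nbytes)
array_bytes = np_array.nbytes

print(f"Python list: about {list_bytes} bytes")
print(f"NumPy array: {array_bytes} bytes of data")
```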

Code Example: Built-in Functions

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(f"Original Array: {data}")
print(f"Sum of elements: {np.sum(data)}")
print(f"Mean of elements: {np.mean(data)}")
print(f"Standard deviation: {np.std(data)}")
print(f"Maximum value: {np.max(data)}")

# Or for a 2D array (matrix)
matrix = np.array([[1, 2], [3, 4]])
print(f"Sum of each column: {np.sum(matrix, axis=0)}")
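
The axis argument is easy to mix up, so here is a short sketch contrasting the options on the same 2x2 matrix.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

column_sums = np.sum(matrix, axis=0)  # collapse the rows: one sum per column
row_sums = np.sum(matrix, axis=1)     # collapse the columns: one sum per row
total = np.sum(matrix)                # no axis: sum over every element
row_means = np.mean(matrix, axis=1)   # the same axis rule applies to other functions

print(column_sums)  # [4 6]
print(row_sums)     # [3 7]
print(total)        # 10
print(row_means)    # [1.5 3.5]
```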

Pitfalls and Limitations

Homogeneity: While a benefit for performance, it means NumPy is not suitable for data that must contain mixed types. For that, use a Python list or a pandas DataFrame.
Fixed Size: Resizing a NumPy array is costly because it generally allocates a new array and copies the existing data. If your data size changes frequently, a Python list’s append() and pop() methods are usually more efficient for those specific tasks.
Overhead for Small Data: For small arrays (e.g., a few hundred elements), the overhead of creating a NumPy array might not be worth the speed gain.
Requires Installation: It is a third-party library and must be installed separately.
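
A brief sketch of the first two pitfalls in code. The values are arbitrary and only meant to make the behavior visible.

```python
import numpy as np

# Homogeneity: mixing numbers and text converts every element to a string
mixed = np.array([1, "two", 3.0])
print(mixed.dtype.kind)  # U (a Unicode string dtype)

# Fixed size: np.append does not grow the array in place,
# it builds and returns a new array
original = np.array([1, 2, 3])
extended = np.append(original, 4)
shares = np.shares_memory(original, extended)
print(original)  # [1 2 3]
print(extended)  # [1 2 3 4]
print(shares)    # False

# A Python list, by contrast, grows in place
items = [1, 2, 3]
items.append(4)
print(items)     # [1, 2, 3, 4]
```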

Summary: When to Use What

Feature	Python List	NumPy Array
Data Type	Heterogeneous (mixed)	Homogeneous (uniform)
Size	Dynamic (can grow/shrink)	Fixed (expensive to resize)
Memory	High (stores pointers)	Low (stores raw data)
Speed	Slower (interpreted loop)	Faster (vectorized, compiled code)
Best For	General-purpose data, collections of mixed types	Large-scale numerical computation, linear algebra

In-Class Challenge

Let’s put your new knowledge to the test.

Write a short Python program that:

Imports the NumPy library.
Creates a 2D NumPy array (a 3x4 matrix) of all zeros. Hint: np.zeros().
Calculates the average of the elements in the array and prints it.
Calculates the sum of the elements along each column and prints it.

Give students a few minutes to work on this independently or in pairs.
Walk around and provide hints as needed.
Solutions can be presented on the board afterward.

Solution to Challenge

import numpy as np

# 1. Create a 3x4 matrix of zeros
my_matrix = np.zeros((3, 4))
print(f"Matrix:\n{my_matrix}")

# 2. Calculate the average of all elements
matrix_mean = np.mean(my_matrix)
print(f"Average of all elements: {matrix_mean}")

# 3. Calculate the sum of elements along each column (axis=0)
column_sums = np.sum(my_matrix, axis=0)
print(f"Sum of each column: {column_sums}")
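
With all zeros, every result is 0, which makes it hard to tell whether the axis was chosen correctly. As an optional variation (not part of the original challenge), the same steps can be tried on a 3x4 matrix filled with 0 to 11:

```python
import numpy as np

# Same shape as the challenge, but with distinct values
sample = np.arange(12).reshape(3, 4)
sample_mean = np.mean(sample)
sample_column_sums = np.sum(sample, axis=0)

print(f"Matrix:\n{sample}")
print(f"Average of all elements: {sample_mean}")    # 5.5
print(f"Sum of each column: {sample_column_sums}")  # [12 15 18 21]
```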