Beyond the Python List

Fred Agbo

2026-02-04

Announcements

  • Welcome back!
  • First Mini-project (MP1) is due today at 10pm
    • Special Research Opportunity

Introducing Python NumPy for Data Structures

What’s the Problem?

  • You’re a data scientist analyzing a massive dataset.
  • Your data contains a million numerical values.
  • You need to perform a simple operation on all of them,
    • for example, squaring each number.
  • How would you do this in standard Python?

Python Lists: The Swiss Army Knife

  • A standard Python list is a highly flexible, general-purpose container.
  • It can store different data types (integers, floats, strings, even other lists) within the same structure.
  • It’s a “list of pointers” to objects stored in various memory locations.
  • This flexibility comes at a cost…
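The "list of pointers" idea can be seen directly in a small sketch. The variable names here (mixed, shared, container) are made up for illustration and are not part of the lecture code.

```python
# A single list can hold values of completely different types.
mixed = [42, 3.14, "hello", [1, 2, 3]]
element_types = [type(item).__name__ for item in mixed]
print(element_types)  # ['int', 'float', 'str', 'list']

# The list stores references, so two slots can refer to the very same object.
shared = [1, 2, 3]
container = [shared, shared]
print(container[0] is container[1])  # True

# Changing the object through one slot is visible through the other,
# because both slots point at one object elsewhere in memory.
container[0].append(4)
print(container[1])  # [1, 2, 3, 4]
```

Each slot holds only a reference, which is why reading an element always costs an extra hop to wherever the object actually lives.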

Code Example: The List Approach

Let’s see the standard way to handle this.

import time

# Create a large list of integers
my_list = list(range(10_000_000))

# Time the operation (perf_counter is the clock intended for benchmarking)
start_time = time.perf_counter()

# Square each element using a list comprehension
squared_list = [x**2 for x in my_list]

list_time = time.perf_counter() - start_time
print(f"List comprehension time: {list_time:.4f} seconds")

Why is it so slow?

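  • Python has to “de-reference” each pointer one by one to reach the actual object.
  • The loop (here, a list comprehension) is executed by the interpreter, one element at a time.
  • Python has to check the type of each element at run time before it knows how to square it.
  • Each calculation is done separately and produces a new Python object for its result.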

Enter NumPy: The Specialized Toolbox

  • NumPy (Numerical Python) is the foundational library for scientific computing in Python.
  • Its core object is the ndarray (n-dimensional array).
  • Unlike a list, a NumPy array is:
    • Homogeneous: All elements must be of the same data type.
    • Contiguous: Stored in one single, continuous block of memory.
    • Optimized: Operations are performed by fast C/Fortran code “under the hood.”
  • This design makes it perfect for numerical operations.
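The three properties above can be inspected on a real array. This is a minimal sketch; the dtype is fixed to int64 explicitly so the numbers in the comments hold on any platform.

```python
import numpy as np

arr = np.array([10, 20, 30, 40], dtype=np.int64)

# Homogeneous: one dtype describes every element.
print(arr.dtype)     # int64

# Each element occupies the same number of bytes.
print(arr.itemsize)  # 8
print(arr.nbytes)    # 32  (4 elements * 8 bytes)

# Contiguous: stepping to the next element means moving 8 bytes forward.
print(arr.strides)   # (8,)
print(arr.flags["C_CONTIGUOUS"])  # True
```

Because every element has the same size and sits right after the previous one, compiled code can walk the whole block without following any pointers.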

Code Example: Creating a NumPy Array

import numpy as np

# Create an array from a Python list
my_np_array = np.array([1, 2, 3, 4, 5])
print(f"Array: {my_np_array}")
print(f"Type: {type(my_np_array)}")
print(f"Data type: {my_np_array.dtype}") # Notice the single data type

# NumPy forces homogeneity - it will upcast types
hetero_list = [1, 2, 3.5, 4]
hetero_array = np.array(hetero_list)
print(f"Upcast array: {hetero_array}")
print(f"Data type: {hetero_array.dtype}")

The Power of Vectorization

  • Vectorization is the ability to perform operations on entire arrays at once, without explicit Python loops.
  • NumPy functions and operators are “vectorized” by default.
  • This is where the massive speed gains come from.
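To make "without explicit Python loops" concrete, here is a small sketch that computes the same result twice, once with a loop and once with a vectorized expression. The array is kept tiny so the output is easy to check by hand.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5], dtype=np.int64)

# Explicit Python loop: the interpreter handles one element per iteration.
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] ** 2

# Vectorized: one expression, the loop runs inside NumPy's compiled code.
vectorized = values ** 2
print(vectorized.tolist())                 # [1, 4, 9, 16, 25]
print(np.array_equal(looped, vectorized))  # True

# Operators and NumPy functions combine element-wise in the same way.
combined = np.sqrt(vectorized) + 2 * values
print(combined.tolist())  # [3.0, 6.0, 9.0, 12.0, 15.0]
```

Both versions give identical results; the difference is only where the looping happens.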

Code Example: NumPy Performance Test

import numpy as np
import time

# Create a large NumPy array (int64 so squaring cannot overflow
# on platforms where the default integer is 32-bit)
large_np_array = np.arange(10_000_000, dtype=np.int64)

# Time the NumPy operation
start_time = time.perf_counter()

# Perform the vectorized operation
squared_np_array = large_np_array**2

np_time = time.perf_counter() - start_time
print(f"NumPy vectorization time: {np_time:.4f} seconds")

# Compare with our previous result
# List time was ~ 2.0 seconds (your numbers will vary by machine)
print(f"Speedup over a ~2.0 s list run: about {2.0 / np_time:.0f}x")

Benefits of NumPy

  • Performance: Vectorized operations are implemented in highly optimized C and Fortran code, making them significantly faster than Python loops.
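  • Memory Efficiency: Storing homogeneous data in contiguous memory blocks uses far less memory than a Python list. An int64 array needs 8 bytes per element, while a list needs an 8-byte pointer per element plus a separate int object (about 28 bytes each on 64-bit CPython). This is crucial for large datasets.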
  • Rich Functionality: NumPy includes a vast library of built-in mathematical, statistical, and linear algebra functions.
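The memory claim can be checked with a rough measurement. This sketch assumes CPython, where sys.getsizeof reports per-object sizes; the exact list figure depends on the Python version and platform, so no specific totals are promised here.

```python
import sys
import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# sys.getsizeof(py_list) covers only the list's own pointer table,
# so the size of every int object it points to is added on top.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# nbytes is the size of the array's raw data buffer: n * 8 bytes for int64.
array_bytes = np_array.nbytes

print(f"List:  {list_bytes:,} bytes")
print(f"Array: {array_bytes:,} bytes")
print(f"Ratio: {list_bytes / array_bytes:.1f}x")
```

The list total is an estimate (it slightly overcounts small integers that CPython caches and shares), but it is close enough to show the gap.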

Code Example: Built-in Functions

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(f"Original Array: {data}")
print(f"Sum of elements: {np.sum(data)}")
print(f"Mean of elements: {np.mean(data)}")
print(f"Standard deviation: {np.std(data)}")
print(f"Maximum value: {np.max(data)}")

# Or for a 2D array (matrix)
matrix = np.array([[1, 2], [3, 4]])
print(f"Sum of each column: {np.sum(matrix, axis=0)}")

Pitfalls and Limitations

  • Homogeneity: While a benefit for performance, it means NumPy is not suitable for data that must contain mixed types. For that, use a Python list or a pandas DataFrame.
  • Fixed Size: Resizing a NumPy array is a costly operation. If your data size changes frequently, a Python list’s append() and pop() methods might be more efficient for those specific tasks.
  • Overhead for Small Data: For small arrays (e.g., a few hundred elements), the overhead of creating a NumPy array might not be worth the speed gain.
  • Requires Installation: It is a third-party library and must be installed separately.
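The first two pitfalls are easy to trip over in practice. The sketch below illustrates them with made-up variable names; the final part shows one common workaround (collect in a list, convert once), which is a suggestion rather than the only option.

```python
import numpy as np

# Fixed size: np.append does not grow the array in place.
# It allocates a new array and copies every element over.
original = np.array([1, 2, 3])
grown = np.append(original, 4)
print(original.tolist())                  # [1, 2, 3]
print(grown.tolist())                     # [1, 2, 3, 4]
print(np.shares_memory(original, grown))  # False

# Homogeneity: mixing numbers with a string turns everything into strings.
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype.kind)  # U  (a Unicode string dtype)

# Workaround when the size is not known up front:
# append to a Python list, then convert to an array once at the end.
collected = []
for i in range(5):
    collected.append(i * i)
result = np.array(collected)
print(result.tolist())  # [0, 1, 4, 9, 16]
```

Calling np.append inside a loop repeats the full copy on every iteration, which is why the list-then-convert pattern tends to scale better.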

Summary: When to Use What

Feature   | Python List                                      | NumPy Array
Data Type | Heterogeneous (mixed)                            | Homogeneous (uniform)
Size      | Dynamic (can grow/shrink)                        | Fixed (expensive to resize)
Memory    | High (stores pointers)                           | Low (stores raw data)
Speed     | Slower (interpreted loop)                        | Faster (vectorized, compiled code)
Best For  | General-purpose data, collections of mixed types | Large-scale numerical computation, linear algebra

In-Class Challenge

Let’s put your new knowledge to the test.

Write a short Python program that:

  1. Imports the NumPy library.
  2. Creates a 2D NumPy array (a 3x4 matrix) of all zeros. Hint: np.zeros().
    • Remember to add some values to the array!
  3. Calculates the average of the elements in the array and prints it.
  4. Calculates the sum of the elements along each column and prints it.