Beyond the Python List

Fred Agbo

2025-09-17

Announcements

  • Welcome back!
  • First Mini-project (MP1) is due today at 10pm

Introducing Python NumPy for Data Structures

What’s the Problem?

  • You’re a data scientist analyzing a massive dataset.
  • Your data contains ten million numerical values.
  • You need to perform a simple operation on all of them,
    • for example, squaring each number.
  • How would you do this in standard Python?

Python Lists: The Swiss Army Knife

  • A standard Python list is a highly flexible, general-purpose container.
  • It can store different data types (integers, floats, strings, even other lists) within the same structure.
  • It’s a “list of pointers” to objects stored in various memory locations.
  • This flexibility comes at a cost…
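
A minimal sketch of what the "list of pointers" idea looks like in practice, using only the standard library. The variable names are illustrative, and the byte counts it prints depend on the Python build, so treat them as rough indicators rather than fixed values.

```python
import sys

# One list can hold objects of several types at once.
mixed = [42, 3.14, "hello", [1, 2, 3]]
element_types = [type(item).__name__ for item in mixed]
print(element_types)

# The list container stores only references; each element is a separate
# object with its own size, living elsewhere in memory.
container_bytes = sys.getsizeof(mixed)
element_bytes = [sys.getsizeof(item) for item in mixed]
print(f"List container: {container_bytes} bytes")
print(f"Individual elements: {element_bytes} bytes")

# Because the list holds a reference, changing the inner list through
# another name is visible through the outer list as well.
inner = mixed[3]
inner.append(4)
print(mixed[3])
```

The last step is the visible consequence of storing references: the outer list never held a copy of the inner list, only a pointer to it.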

Code Example: The List Approach

Let’s see the standard way to handle this.

import time

# Create a large list of integers (10 million elements)
my_list = list(range(10_000_000))

# Time the operation; perf_counter() is a monotonic, high-resolution clock
start_time = time.perf_counter()

# Square each element using a list comprehension
squared_list = [x**2 for x in my_list]

list_time = time.perf_counter() - start_time
print(f"List comprehension time: {list_time:.4f} seconds")

Why is it so slow?

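  • Python has to dereference each pointer one by one to reach the underlying object.
  • The loop runs in the interpreter, so every iteration pays bytecode overhead.
  • Python has to check each element’s type at run time to decide how to apply the operation.
  • Each calculation is done separately and produces a new Python object.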
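
The per-element type check can be made visible with a small sketch. It assumes nothing beyond built-in Python; the list contents are made up for illustration. Because a list may hold any mix of types, the interpreter cannot know in advance which kind of multiplication to run, so it has to decide again for every element.

```python
# A list may mix numeric types, so the meaning of ** is resolved per element.
values = [3, 2.5, 2 + 1j, True]

squares = [x ** 2 for x in values]
result_types = [type(s).__name__ for s in squares]

# Each result has the type that its own element's arithmetic produced.
for value, square, name in zip(values, squares, result_types):
    print(f"{value!r} ** 2 -> {square!r} ({name})")
```

Even when every element happens to be an integer, Python still performs this lookup on each iteration, which is part of the cost the bullets above describe.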

Enter NumPy: The Specialized Toolbox

  • NumPy (Numerical Python) is the foundational library for scientific computing in Python.
  • Its core object is the ndarray (n-dimensional array).
  • Unlike a list, a NumPy array is:
    • Homogeneous: All elements must be of the same data type.
    • Contiguous: Stored in one single, continuous block of memory.
    • Optimized: Operations are performed by fast C/Fortran code “under the hood.”
  • This design makes it perfect for numerical operations.
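
The homogeneous and contiguous properties can be inspected directly on an array. This is a short sketch using standard ndarray attributes; the array contents and the name arr are arbitrary choices for illustration.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0])

# Homogeneous: a single dtype describes every element.
print(f"dtype: {arr.dtype}")

# Each element occupies the same number of bytes, so the data buffer
# size is simply itemsize * number of elements.
print(f"itemsize: {arr.itemsize} bytes")
print(f"nbytes: {arr.nbytes} bytes")

# Contiguous: the elements sit next to each other in one block of memory.
print(f"C-contiguous: {arr.flags['C_CONTIGUOUS']}")
```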

Code Example: Creating a NumPy Array

import numpy as np

# Create an array from a Python list
my_np_array = np.array([1, 2, 3, 4, 5])
print(f"Array: {my_np_array}")
print(f"Type: {type(my_np_array)}")
print(f"Data type: {my_np_array.dtype}") # Notice the single data type

# NumPy forces homogeneity - it will upcast types
hetero_list = [1, 2, 3.5, 4]
hetero_array = np.array(hetero_list)
print(f"Upcast array: {hetero_array}")
print(f"Data type: {hetero_array.dtype}")

The Power of Vectorization

  • Vectorization is the ability to perform operations on entire arrays at once, without explicit Python loops.
  • NumPy’s arithmetic operators and most of its mathematical functions are “vectorized”: they apply element-wise to whole arrays.
  • This is where the massive speed gains come from.
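
As a small illustration of the idea, the sketch below computes the same element-wise product twice: once with an explicit Python loop and once with a vectorized expression. The arrays and names (prices, quantities, with_shipping) are invented for this example.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Explicit Python loop: one multiplication per iteration, in the interpreter.
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized: a single expression multiplies the arrays element by element.
totals_vec = prices * quantities
print(totals_vec)

# A scalar is applied to every element without writing a loop.
with_shipping = totals_vec + 5.0
print(with_shipping)
```

Both approaches give the same numbers; the vectorized form moves the loop out of the interpreter and into compiled code.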

Code Example: NumPy Performance Test

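import numpy as np
import time

# Create a large NumPy array (10 million elements).
# int64 is requested explicitly so the squares cannot overflow on
# platforms where the default integer type is 32-bit.
large_np_array = np.arange(10_000_000, dtype=np.int64)

# Time the NumPy operation
start_time = time.perf_counter()

# Perform the vectorized operation
squared_np_array = large_np_array**2

np_time = time.perf_counter() - start_time
print(f"NumPy vectorization time: {np_time:.4f} seconds")

# Compare with our previous result
# List time was ~2.0 seconds in the earlier run (exact numbers vary by machine)
print("NumPy is typically one to two orders of magnitude faster here!")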

Benefits of NumPy

  • Performance: Vectorized operations are implemented in highly optimized C and Fortran code, making them significantly faster than Python loops.
  • Memory Efficiency: An array stores raw values in one contiguous block (8 bytes per element for int64 or float64), whereas a list stores a pointer per element plus a separate Python object for each value. This is crucial for large datasets.
  • Rich Functionality: NumPy includes a vast library of built-in mathematical, statistical, and linear algebra functions.
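
The memory difference can be estimated with a rough sketch like the one below. It assumes CPython, where sys.getsizeof reports per-object sizes. The list figure is an approximation, since Python shares a small set of cached integer objects, and the exact byte counts depend on the interpreter build.

```python
import sys
import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# List: the container of references plus one int object per element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# Array: the raw data buffer, 8 bytes per int64 element.
array_bytes = np_array.nbytes

print(f"Python list (approx.): {list_bytes:,} bytes")
print(f"NumPy array data:      {array_bytes:,} bytes")
print(f"Ratio (approx.):       {list_bytes / array_bytes:.1f}x")
```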

Code Example: Built-in Functions

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(f"Original Array: {data}")
print(f"Sum of elements: {np.sum(data)}")
print(f"Mean of elements: {np.mean(data)}")
print(f"Standard deviation: {np.std(data)}")
print(f"Maximum value: {np.max(data)}")

# Or for a 2D array (matrix)
matrix = np.array([[1, 2], [3, 4]])
print(f"Sum of each column: {np.sum(matrix, axis=0)}")

Pitfalls and Limitations

  • Homogeneity: While a benefit for performance, it means NumPy is a poor fit for data that must contain mixed types. For that, use a Python list or a pandas DataFrame.
  • Fixed Size: Resizing a NumPy array is a costly operation because the data is copied into a new array. If your data size changes frequently, a Python list’s append() and pop() methods might be more efficient for those specific tasks.
  • Overhead for Small Data: For small arrays (e.g., a few hundred elements), the overhead of creating a NumPy array can outweigh the speed gain.
  • Requires Installation: It is a third-party library and must be installed separately (for example, with pip install numpy).
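
The first two pitfalls can be seen in a few lines. This is a sketch with made-up values: it shows that np.append returns a new array instead of growing the original in place, and that mixed input is coerced to a single common type.

```python
import numpy as np

# Fixed size: np.append builds and returns a new array.
original = np.array([1, 2, 3])
extended = np.append(original, 4)

print(original)   # the original array is unchanged
print(extended)
print(np.shares_memory(original, extended))

# Homogeneity: mixing numbers and text coerces everything to strings.
mixed = np.array([1, "two", 3.0])
print(mixed.dtype)
print(mixed)
```

Calling np.append inside a loop therefore copies the data on every iteration. A common alternative is to collect values in a Python list and convert to an array once at the end.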

Summary: When to Use What

Feature      Python List                                        NumPy Array
---------    -----------------------------------------------    -------------------------------------------------
Data Type    Heterogeneous (mixed)                              Homogeneous (uniform)
Size         Dynamic (can grow/shrink)                          Fixed (expensive to resize)
Memory       High (stores pointers)                             Low (stores raw data)
Speed        Slower (interpreted loop)                          Faster (vectorized, compiled code)
Best For     General-purpose data, collections of mixed types   Large-scale numerical computation, linear algebra

In-Class Challenge

Let’s put your new knowledge to the test.

Write a short Python program that:

  1. Imports the NumPy library.
  2. Creates a 2D NumPy array (a 3x4 matrix) of all zeros. Hint: np.zeros().
  3. Calculates the average of the elements in the array and prints it.
  4. Calculates the sum of the elements along each column and prints it.