| title | NumPy: Numerical Python | ||||||
|---|---|---|---|---|---|---|---|
| sidebar_label | NumPy | ||||||
| description | Mastering N-dimensional arrays, vectorization, and broadcasting: the foundational tools for numerical computing in ML. | ||||||
| tags |
|
If Python is the skeleton of Machine Learning, NumPy is the muscle. It is a library for scientific computing that introduces the ndarray (N-dimensional array), which is significantly faster and more memory-efficient than standard Python lists.
Python lists are flexible but slow because they store pointers to objects scattered in memory. NumPy arrays store data in contiguous memory blocks, allowing the CPU to process them using SIMD (Single Instruction, Multiple Data).
graph LR
List[Python List] --> L_Ptr["Scattered Memory (Slow)"]
Array[NumPy Array] --> A_Cont["Contiguous Block (Fast)"]
style Array fill:#e1f5fe,stroke:#01579b,color:#333
In ML, we describe data by its Rank (number of dimensions) and Shape.
- Scalar (Rank 0): A single number.
- Vector (Rank 1): A line of numbers (e.g., a single sample's features).
- Matrix (Rank 2): A table of numbers (e.g., a whole dataset).
- Tensor (Rank 3+): Higher dimensional arrays (e.g., a batch of color images).
import numpy as np
# Creating a 2D Matrix
data = np.array([[1, 2, 3], [4, 5, 6]])
print(data.shape) # Output: (2, 3) -> 2 rows, 3 columnsVectorization is the practice of replacing explicit for loops with array expressions. This is how we achieve high performance in Python.
Instead of this:
# Slow: Element-wise addition with a loop
result = []
for i in range(len(a)):
result.append(a[i] + b[i])Do this:
# Fast: NumPy handles the loop in C
result = a + bBroadcasting allows NumPy to perform arithmetic operations on arrays with different shapes, provided they meet certain compatibility rules.
graph TD
A["Matrix: (3, 3)"]
B["Scalar: (1,)"]
A -->|Add| B
B -->|Broadcast| B_Stretch["Stretched to (3, 3)"]
B_Stretch --> Result["Element-wise Sum"]
Example: Adding a constant bias to every row in a dataset.
features = np.array([[10, 20], [30, 40]]) # Shape (2, 2)
bias = np.array([5, 5]) # Shape (2,)
result = features + bias # [[15, 25], [35, 45]]| Operation | NumPy Function | ML Use Case |
|---|---|---|
| Dot Product | np.dot(a, b) |
Calculating weighted sums in a neuron. |
| Reshaping | arr.reshape(1, -1) |
Changing an image from 2D to a 1D feature vector. |
| Transposing | arr.T |
Aligning dimensions for matrix multiplication. |
| Aggregations | np.mean(), np.std() |
Normalizing data (Standard Scaling). |
| Slicing | arr[:, 0] |
Extracting a single column (feature) from a dataset. |
NumPy allows for "Boolean Indexing," which is incredibly powerful for filtering data.
# Select all values in the array greater than 0.5
weights = np.array([0.1, 0.8, -0.2, 0.9])
positive_weights = weights[weights > 0]
# Result: [0.1, 0.8, 0.9]While NumPy handles the raw numbers, we need a way to manage data with column names, different data types, and missing values. For that, we turn to the most popular data manipulation library.