| title | Mastering APIs for Data Collection |
|---|---|
| sidebar_label | APIs |
| description | A deep dive into REST and GraphQL APIs: how to fetch, authenticate, and process external data for machine learning. |
| tags | |

In the Data Engineering lifecycle, APIs are the "clean" way to collect data. Unlike web scraping, which is brittle and unstructured, APIs provide a contract-based method to access data that is versioned, documented, and usually delivered in machine-readable formats like JSON.
An API acts as a middleman between your ML pipeline and a remote server. You send a Request (a specific question) and receive a Response (the data answer).
```mermaid
sequenceDiagram
    participant Pipeline as ML Data Pipeline
    participant API as API Gateway
    participant Server as Data Server
    Pipeline->>API: HTTP Request (GET /data)
    Note right of Pipeline: Includes Headers & API Key
    API->>Server: Validate & Route
    Server-->>API: Data Payload
    API-->>Pipeline: HTTP Response (200 OK + JSON)
```
- Endpoint (URL): The address where the data lives (e.g., `api.twitter.com/v2/tweets`).
- Method: What you want to do (`GET` to fetch, `POST` to send).
- Headers: Metadata such as your API key or the data format (`Content-Type: application/json` for what you send, `Accept: application/json` for what you want back).
- Parameters: Filters for the data (e.g., `?start_date=2023-01-01`).
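
The four components above can be assembled into a single request object. The sketch below uses Python's standard-library `urllib` so that it can build and inspect the request without actually sending it. The endpoint `api.example.com/v1/records` and the token are placeholders chosen for illustration, not a real service.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder values for illustration only.
endpoint = "https://api.example.com/v1/records"   # Endpoint (URL)
method = "GET"                                    # Method
headers = {                                       # Headers
    "Authorization": "Bearer YOUR_TOKEN",
    "Accept": "application/json",
}
params = {"start_date": "2023-01-01"}             # Parameters

# Parameters are URL-encoded and appended to the endpoint as a query string.
request = Request(f"{endpoint}?{urlencode(params)}", headers=headers, method=method)

# Inspect the request; nothing is sent over the network here.
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Building the request separately from sending it makes it easy to log or unit-test exactly what a pipeline is about to ask for.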
REST is the most common architecture. It treats every piece of data as a "Resource."
- Best for: Standardized data fetching.
- Format: Almost exclusively JSON.
GraphQL, developed by Meta, lets the client define the structure of the data it needs.
- Advantage in ML: If a user profile has 100 fields but you only need 3 features for your model, GraphQL prevents "Over-fetching," saving bandwidth and memory.
[Image comparing REST vs GraphQL data fetching efficiency]
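
To make the over-fetching point concrete, here is a minimal sketch that builds a GraphQL request body asking for only three fields and compares payload sizes against a full profile. The `user` query, the field names, and the 100-field profile are hypothetical; a real schema will differ.

```python
import json

# Hypothetical REST-style response: one user profile with 100 fields.
full_profile = {f"field_{i}": i for i in range(97)}
full_profile.update({"age": 31, "country": "IN", "signup_date": "2023-01-01"})

FEATURES = ["age", "country", "signup_date"]  # the 3 features the model needs


def build_graphql_payload(user_id: str, fields: list) -> dict:
    """Return a GraphQL request body that selects only the given fields."""
    selection = "\n    ".join(fields)
    query = (
        "query($id: ID!) {\n"
        "  user(id: $id) {\n"
        f"    {selection}\n"
        "  }\n"
        "}"
    )
    return {"query": query, "variables": {"id": user_id}}


payload = build_graphql_payload("42", FEATURES)
print(payload["query"])

# Compare payload sizes: everything (REST-style) vs. only the requested fields.
rest_bytes = len(json.dumps(full_profile))
graphql_bytes = len(json.dumps({name: full_profile[name] for name in FEATURES}))
print(f"Full profile: {rest_bytes} bytes, selected fields: {graphql_bytes} bytes")
```

The request body would typically be sent as JSON in a `POST` to the provider's GraphQL endpoint; the exact URL and schema come from the provider's documentation.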
Streaming APIs are used when data needs to be delivered in real time.
- ML Use Case: Algorithmic trading or live social media sentiment monitoring.
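
The sketch below illustrates the consuming side of a stream: each message is processed as it arrives instead of waiting for a complete batch. The generator `fake_sentiment_stream` is a stand-in invented for this example; a real client would yield messages from a live connection, and the `sentiment` field is an assumed message shape.

```python
from collections import deque
from typing import Iterable, Iterator


def fake_sentiment_stream() -> Iterator[dict]:
    """Stand-in for a live feed; a real client would read from a socket."""
    for score in [0.2, 0.8, -0.4, 0.6, 1.0]:
        yield {"sentiment": score}


def rolling_mean(messages: Iterable[dict], window: int = 3) -> Iterator[float]:
    """Yield the mean sentiment of the last `window` messages after each arrival."""
    recent = deque(maxlen=window)  # old scores fall off automatically
    for message in messages:
        recent.append(message["sentiment"])
        yield sum(recent) / len(recent)


averages = [round(mean, 2) for mean in rolling_mean(fake_sentiment_stream())]
print(averages)
```

Because `rolling_mean` is itself a generator, it keeps only the last few messages in memory, which suits feeds that never end.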
In Python, the `requests` library is the de facto standard tool for interacting with HTTP APIs.
```python
import requests

url = "https://api.example.com/v1/weather"
headers = {
    "Authorization": "Bearer YOUR_TOKEN"
}
params = {
    "city": "Mandsaur",
    "country": "IN",
    "units": "metric"
}

# A timeout keeps the pipeline from hanging on an unresponsive server.
response = requests.get(url, headers=headers, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()
    temperature = data["main"]["temp"]    # Extracting temperature
    humidity = data["main"]["humidity"]   # Extracting humidity
    print(f"Temperature in Mandsaur: {temperature}°C")
    print(f"Humidity: {humidity}%")
else:
    print(f"Failed to fetch weather data (status {response.status_code})")
```

APIs are not infinite resources. Providers implement Rate Limiting to prevent abuse.
| Status Code | Meaning | Action for ML Pipeline |
|---|---|---|
| 200 | OK | Process the data. |
| 401 | Unauthorized | Check your API Key/Token. |
| 404 | Not Found | Check your Endpoint URL. |
| 429 | Too Many Requests | Exponential Backoff: Wait and try again later. |
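
The table can be expressed as a small lookup that a pipeline consults after every response. This is a sketch only: the action names are labels made up for this example, and the fallback for unlisted codes is an assumption rather than something the table specifies.

```python
def action_for_status(status_code: int) -> str:
    """Map an HTTP status code to the pipeline action listed in the table."""
    actions = {
        200: "process_data",        # OK
        401: "check_credentials",   # Unauthorized: API key or token problem
        404: "check_endpoint",      # Not Found: wrong URL
        429: "retry_with_backoff",  # Too Many Requests: wait, then retry
    }
    # Assumption: any code not in the table is logged for manual review.
    return actions.get(status_code, "log_and_investigate")


for code in (200, 401, 404, 429, 500):
    print(code, "->", action_for_status(code))
```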
```mermaid
flowchart TD
    Req[Send API Request] --> Res{Status Code?}
    Res -- 200 --> Save[Ingest to Database]
    Res -- 429 --> Wait[Wait/Sleep] --> Req
    Res -- 401 --> Fail[Alert Developer]
    style Wait fill:#fff3e0,stroke:#ef6c00,color:#333
```
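
The retry loop in the flowchart can be sketched as follows. To keep the example runnable without a network, the HTTP call is passed in as a `send` function that returns a `(status_code, payload)` pair; that interface, the retry limit, and the one-second base delay are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it returns 200, doubling the wait after each 429."""
    for attempt in range(max_retries + 1):
        status, payload = send()
        if status == 200:
            return payload  # ready to ingest into the database
        if status == 401:
            raise PermissionError("401 Unauthorized: alert the developer")
        if status != 429:
            raise RuntimeError(f"Unexpected status code: {status}")
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s, ...
    raise RuntimeError("Rate limit still in effect after all retries")


# Simulated server: rate-limited twice, then successful.
responses = iter([(429, {}), (429, {}), (200, {"rows": 3})])
waits = []  # record the delays instead of actually sleeping
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

In a real pipeline, `send` would wrap the HTTP call and `sleep` would be left at its default. Many implementations also add random jitter to each delay so that parallel workers do not retry in lockstep.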
- API Keys: A simple string passed in the header.
- OAuth 2.0: A more secure, token-based system used by Google, Meta, and Twitter.
- JWT (JSON Web Tokens): Often used in internal microservices.
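
As a rough illustration of how these schemes look on the wire, the sketch below builds the request headers for an API key and for a bearer token, and shows the three-part layout of a JWT. The `X-API-Key` header name is a common convention rather than a standard, and the demo token is made up and unsigned; real tokens come from the provider and must have their signature verified before being trusted.

```python
import base64
import json


def api_key_headers(api_key: str) -> dict:
    # "X-API-Key" is an assumed header name; check the provider's docs.
    return {"X-API-Key": api_key}


def bearer_headers(access_token: str) -> dict:
    # OAuth 2.0 access tokens are usually sent as a Bearer token.
    return {"Authorization": f"Bearer {access_token}"}


def _b64url(data: dict) -> str:
    """Base64url-encode a dict as JSON, without padding, as JWTs do."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def decode_jwt_payload(token: str) -> dict:
    """Read a JWT's claims WITHOUT verifying the signature (inspection only)."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore padding
    return json.loads(base64.urlsafe_b64decode(padded))


# A made-up token showing the header.payload.signature layout.
demo_token = ".".join([
    _b64url({"alg": "none", "typ": "JWT"}),
    _b64url({"sub": "ml-pipeline", "scope": "read:data"}),
    "signature-placeholder",
])

print(api_key_headers("YOUR_API_KEY"))
print(bearer_headers(demo_token)["Authorization"][:20] + "...")
print(decode_jwt_payload(demo_token))
```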
- REST API Tutorial: Understanding the principles of RESTful design.
- Python Requests Guide: Mastering HTTP requests for data collection.
APIs give us structured data, but sometimes the "front door" is locked. When there is no API, we must use the more aggressive "side window" approach: web scraping.