---
title: "JSON: The Semi-Structured Standard"
sidebar_label: JSON
description: "Mastering JSON for Machine Learning: handling nested data, converting dictionaries, and efficient parsing for NLP pipelines."
tags:
---
JSON (JavaScript Object Notation) is a lightweight, text-based format for storing and transporting data. While CSVs are perfect for simple tables, JSON excels at representing hierarchical or nested data—where one observation might contain lists or other sub-observations.
JSON structure is almost identical to a Python dictionary. It uses key-value pairs and supports several data types:
- **Objects**: Enclosed in `{}` (maps to Python `dict`).
- **Arrays**: Enclosed in `[]` (maps to Python `list`).
- **Values**: Strings, Numbers, Booleans (`true`/`false`), and `null`.
```json
{
  "user_id": 101,
  "metadata": {
    "login_count": 5,
    "tags": ["premium", "active"]
  },
  "is_active": true
}
```
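Because the mapping to Python is so direct, the standard `json` module can parse a record like the one above into nested `dict` and `list` objects. A minimal sketch:

```python
import json

# Parse a JSON string into the equivalent Python objects
raw = '{"user_id": 101, "metadata": {"login_count": 5, "tags": ["premium", "active"]}, "is_active": true}'
record = json.loads(raw)

print(type(record))                 # <class 'dict'>
print(record["metadata"]["tags"])   # ['premium', 'active']
print(record["is_active"])          # True (JSON true -> Python True)
```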
Text data often comes with complex metadata (author, timestamp, geolocation, and nested entity tags). JSON allows all this info to stay bundled with the raw text.
Most ML frameworks use JSON (or its cousin, YAML) to store hyperparameters:
```json
{
  "model": "ResNet-50",
  "learning_rate": 0.001,
  "optimizer": "Adam"
}
```
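As a sketch, a training script can load such a config at startup with the standard `json` module (the `config.json` filename is illustrative, and the file is written first so the example is self-contained):

```python
import json

# Hypothetical config file, written here so the example runs on its own
config_text = '{"model": "ResNet-50", "learning_rate": 0.001, "optimizer": "Adam"}'
with open("config.json", "w") as f:
    f.write(config_text)

# Load the hyperparameters back into a Python dict
with open("config.json") as f:
    config = json.load(f)

print(config["learning_rate"])  # 0.001
```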
As discussed in the APIs section, almost every web service returns data in JSON format.
Machine Learning models (like Linear Regression or XGBoost) require flat 2D arrays (rows and columns). They cannot "see" inside a nested JSON object, so data engineers must **flatten** (normalize) the data.
```mermaid
graph LR
    Nested[Nested JSON] --> Normalize["pd.json_normalize()"]
    Normalize --> Flat[Flat DataFrame]
    style Normalize fill:#f3e5f5,stroke:#7b1fa2,color:#333
```
Example in Python:
```python
import pandas as pd

raw_json = [
    {"name": "Alice", "info": {"age": 25, "city": "NY"}},
    {"name": "Bob", "info": {"age": 30, "city": "SF"}}
]

# Flattens 'info' into 'info.age' and 'info.city' columns
df = pd.json_normalize(raw_json)
```

| Feature | JSON | CSV | Parquet |
|---|---|---|---|
| Flexibility | Very High (Schema-less) | Low (Fixed Columns) | Medium (Evolving Schema) |
| Parsing Speed | Slow (Heavy string parsing) | Medium | Very Fast |
| File Size | Large (Repeated Keys) | Medium | Small (Binary) |
:::note
In a JSON file, the key (e.g., "user_id") is repeated for every single record, which wastes a lot of disk space compared to CSV.
:::
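To see that overhead concretely, here is a small sketch comparing the serialized size of the same synthetic records in JSON and CSV, using only the standard library:

```python
import csv
import io
import json

# Synthetic sample data: 1,000 identical-shaped records
rows = [{"user_id": i, "is_active": True} for i in range(1000)]

# JSON repeats every key in every record
json_size = len(json.dumps(rows).encode("utf-8"))

# CSV writes the column names only once, in the header
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "is_active"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(buf.getvalue().encode("utf-8"))

print(json_size > csv_size)  # True
```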
Standard JSON files require you to load the entire file into memory to parse it. For datasets with millions of records, we use JSONL (JSON Lines).
- Each line in the file is a separate, valid JSON object.
- Benefit: You can stream the file line-by-line without crashing your RAM.
```json
{"id": 1, "text": "Hello world"}
{"id": 2, "text": "Machine Learning is fun"}
```
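A minimal sketch of streaming a JSONL file with the standard library (the file is written first so the example is self-contained):

```python
import json

# Write a small JSONL file: one JSON object per line
with open("data.jsonl", "w") as f:
    f.write('{"id": 1, "text": "Hello world"}\n')
    f.write('{"id": 2, "text": "Machine Learning is fun"}\n')

# Stream it back one record at a time -- only one line is in memory at once
records = []
with open("data.jsonl") as f:
    for line in f:
        records.append(json.loads(line))

print(len(records))  # 2
```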
- **Validation**: Use JSON Schema to ensure the data you're ingesting hasn't changed structure.
- **Encoding**: Always use `UTF-8` to avoid character corruption in text data.
- **Compression**: Since JSON is text-heavy, compress raw JSON files with `.gz` or `.zip` to save up to 90% of space.
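The compression step can be done entirely with the standard library. A sketch, using synthetic records:

```python
import gzip
import json
import os

# A chunk of text-heavy, repetitive JSON (synthetic sample data)
payload = json.dumps(
    [{"id": i, "text": "sample record"} for i in range(1000)]
).encode("utf-8")

# gzip shrinks repetitive JSON dramatically
with gzip.open("data.json.gz", "wb") as f:
    f.write(payload)

print(os.path.getsize("data.json.gz") < len(payload))  # True
```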
- **Python `json` Module**: Learning `json.loads()` and `json.dumps()`.
- **Pandas `json_normalize` Guide**: Mastering complex flattening of API data.
JSON is the king of flexibility, but for "Big Data" production environments where speed and storage are everything, we move to binary formats.