---
title: Data Sources in ML
sidebar_label: Data Sources
description: "Identifying and integrating various data sources: from relational databases and APIs to unstructured web data and IoT streams."
tags:
  - data-engineering
  - data-sources
  - sql
  - nosql
  - apis
  - web-scraping
---

Data is the "fuel" for Machine Learning. However, this fuel is rarely found in one place. As a data engineer, your job is to identify where the raw data lives and how to transport it safely into your environment for processing.

## 1. The Data Source Landscape

We generally categorize data sources based on their **structure** and their **storage method**.

```mermaid
graph TD
    Root[Data Sources] --> Structured[Structured]
    Root --> Semi[Semi-Structured]
    Root --> Unstructured[Unstructured]

    Structured --> SQL[Relational DBs: MySQL, Postgres]
    Semi --> Files[JSON, XML, Parquet]
    Unstructured --> Media[Images, Video, Audio, PDF]
```

## 2. Common Data Sources

### A. Relational Databases (SQL)

The most common source for tabular data (customer records, transactions).

- **Protocol:** SQL (Structured Query Language).
- **Pros:** Highly reliable (ACID-compliant); easy to join tables.
- **Cons:** Hard to scale horizontally; requires a fixed schema.
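A minimal sketch of the extraction pattern, using Python's built-in `sqlite3` as a stand-in for a production database such as Postgres (the `transactions` table and its columns are illustrative):

```python
import sqlite3

# In-memory SQLite database standing in for a production RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 42.50), ("bob", 17.25), ("alice", 8.00)],
)

# A typical extraction query: aggregate per customer before exporting
# the result for feature engineering.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM transactions GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 50.5), ('bob', 17.25)]
conn.close()
```

Against a real server you would swap the connection line for the appropriate driver (e.g. `psycopg2` for Postgres); the SQL itself stays largely the same.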

### B. NoSQL Databases

Used for high-volume, high-velocity, or non-tabular data.

- **Key-Value Stores:** Redis.
- **Document Stores:** MongoDB (stores data as JSON/BSON).
- **ML Use Case:** Storing user profiles or real-time feature stores.
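The document-store idea can be sketched without a server: schemaless JSON documents keyed by an id. Here a plain Python dict stands in for MongoDB or a Redis hash; the profile fields are invented for illustration.

```python
import json

# Toy document store: user profiles as JSON documents keyed by user id.
# In production this would be MongoDB (BSON) or Redis, not a dict.
store = {}

def put_profile(user_id, profile):
    # Serialize as JSON, mirroring the document model.
    store[user_id] = json.dumps(profile)

def get_profile(user_id):
    return json.loads(store[user_id])

# Two documents with different fields -- no fixed schema required,
# which is exactly what a relational table would forbid.
put_profile("u1", {"name": "Alice", "interests": ["ml", "sql"]})
put_profile("u2", {"name": "Bob", "last_login": "2024-01-01"})

print(get_profile("u1")["interests"])  # ['ml', 'sql']
```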

### C. APIs (Application Programming Interfaces)

Used to pull data from external services such as Twitter, Google Maps, or financial market providers.

- **Format:** Usually REST endpoints returning JSON.
- **Challenges:** Rate limiting (you can only pull so much data per hour) and authentication.
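Rate limits are usually handled client-side by spacing out requests. A hedged sketch: `fetch_page` below is a stand-in for a real HTTP call (e.g. `requests.get` against a REST endpoint), and the quota figure is made up.

```python
import time

def fetch_page(page):
    # Stand-in for an HTTP request; returns a fake JSON-like payload.
    return {"page": page, "items": [f"item-{page}-{i}" for i in range(2)]}

def fetch_all(pages, max_calls_per_second=50):
    """Fetch pages while never exceeding the per-second quota."""
    min_interval = 1.0 / max_calls_per_second
    results = []
    last_call = 0.0
    for page in pages:
        # Sleep just long enough to respect the rate limit.
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        results.append(fetch_page(page))
    return results

data = fetch_all(range(3))
print(len(data))  # 3
```

Real clients also retry with exponential backoff when the server answers HTTP 429 (Too Many Requests), but the spacing logic above is the core idea.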

### D. Cloud Object Storage (The Data Lake)

Services like AWS S3 or Google Cloud Storage act as a dumping ground for raw files before they are processed.

- **ML Use Case:** Storing millions of images for a Computer Vision model.
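Object stores are essentially key/value stores for files, where "directories" are just key prefixes. The sketch below mimics a bucket layout on the local filesystem (with S3 and `boto3`, the same keys would live under a bucket); the `raw/<source>/<date>/` prefix scheme is an illustrative convention, not a standard.

```python
from pathlib import Path
import tempfile

# Local directory standing in for a cloud bucket.
lake = Path(tempfile.mkdtemp())
for key in [
    "raw/images/2024-01-01/cat.jpg",
    "raw/images/2024-01-01/dog.jpg",
    "raw/logs/2024-01-01/app.log",
]:
    path = lake / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b"placeholder")

# "Listing a prefix" -- the core object-store access pattern:
# grab every object under raw/images/ for a training run.
image_keys = sorted(
    p.relative_to(lake).as_posix()
    for p in lake.glob("raw/images/**/*")
    if p.is_file()
)
print(image_keys)
```

Partitioning keys by source and date like this makes it cheap to reprocess a single day's drop without scanning the whole lake.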

## 3. Batch vs. Streaming Sources

How the data arrives at your model is just as important as where it comes from.

| Feature   | Batch Processing                     | Stream Processing                     |
| --------- | ------------------------------------ | ------------------------------------- |
| Source    | Databases, CSV files, Data Lakes     | Kafka, Kinesis, IoT Sensors           |
| Frequency | Hourly, Daily, Weekly                | Real-time (milliseconds)              |
| Use Case  | Training a model on historical sales | Predicting fraud during a transaction |
```mermaid
flowchart LR
    S1[(Database)] -->|Batch| B[ETL Process]
    S2{{IoT Sensor}} -->|Stream| P[Real-time Pipeline]
    B --> DL[Data Lake]
    P --> DL
    style P fill:#fff3e0,stroke:#ef6c00,color:#333
```

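The contrast can be sketched in a few lines: the same events consumed all at once (batch) versus one at a time as they arrive (stream). The event list and the fraud threshold are invented for illustration.

```python
# `events` stands in for rows in a database (batch view) or messages
# on a Kafka topic (stream view).
events = [{"amount": a} for a in (10, 250, 30, 999, 5)]

def batch_total(rows):
    # Batch: the full dataset is available at once, e.g. a nightly
    # ETL job computing historical aggregates for training.
    return sum(r["amount"] for r in rows)

def stream_flag_fraud(source, threshold=500):
    # Stream: process each event as it arrives and react immediately,
    # e.g. flagging a suspicious transaction mid-flight.
    for event in source:
        if event["amount"] > threshold:
            yield event

print(batch_total(events))                    # 1294
print(list(stream_flag_fraud(iter(events))))  # [{'amount': 999}]
```

The generator never needs the whole dataset in memory, which is exactly the property a real-time pipeline relies on.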

## 4. Web Scraping & Crawling

When data isn't available via API or DB, we use scrapers (like BeautifulSoup or Scrapy) to extract information from HTML.

- **Ethics Check:** Always check a site's `robots.txt` before scraping to ensure you are legally and ethically allowed to take the data.
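That ethics check can be automated with the standard library's `urllib.robotparser`: parse a site's `robots.txt` and ask whether your crawler may fetch a given path. The rules below are a made-up example fed in as lines; against a live site you would call `rp.set_url(...)` and `rp.read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt (normally fetched from the site itself).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
])

# Ask before fetching: the crawler name here is hypothetical.
print(rp.can_fetch("my-ml-crawler", "https://example.com/data.html"))  # True
print(rp.can_fetch("my-ml-crawler", "https://example.com/private/x"))  # False
```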

## 5. Identifying High-Quality Sources

Not all data sources are equal. When evaluating a source for an ML project, ask:

1. **Freshness:** How often is this data updated?
2. **Reliability:** Does the source go down often?
3. **Completeness:** Does it have missing values?
4. **Granularity:** Is the data at the level we need (e.g., individual transactions vs. daily totals)?
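Parts of this checklist can be scored mechanically. A sketch for the completeness question, computing the share of non-missing values for a field (the records and field names are invented):

```python
# Sample records from a candidate source; None marks a missing value.
records = [
    {"user": "a", "amount": 10.0},
    {"user": "b", "amount": None},
    {"user": "c", "amount": 7.5},
]

def completeness(rows, field):
    """Fraction of rows where `field` is present and non-null."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)

print(round(completeness(records, "amount"), 2))  # 0.67
```

Running this per field over a sample of a source gives a quick, comparable quality score before you commit to integrating it.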

Finding the data is only the first step. Once we have access, we need to move it into our systems without losing information or causing bottlenecks.