
```mermaid
graph LR
    BaseData["BaseData"]
    CSVData["CSVData"]
    JSONData["JSONData"]
    ParquetData["ParquetData"]
    TextData["TextData"]
    GraphData["GraphData"]
    StructuredDataMixin["StructuredDataMixin"]
    filepath_or_buffer["filepath_or_buffer"]
    CSVData -- "implements" --> BaseData
    CSVData -- "uses" --> filepath_or_buffer
    JSONData -- "implements" --> BaseData
    JSONData -- "inherits from" --> StructuredDataMixin
    JSONData -- "uses" --> filepath_or_buffer
    ParquetData -- "implements" --> BaseData
    ParquetData -- "inherits from" --> StructuredDataMixin
    ParquetData -- "uses" --> filepath_or_buffer
    TextData -- "implements" --> BaseData
    TextData -- "uses" --> filepath_or_buffer
    GraphData -- "implements" --> BaseData
    GraphData -- "uses" --> filepath_or_buffer
```

Details

The Data Ingestion & Preprocessing subsystem of the DataProfiler project reads raw data in several formats (CSV, JSON, Parquet, plain text, and graph data) and standardizes it into a structured form for subsequent analysis.

BaseData

Abstract base class that defines the standard interface (data(), get_batch_generator(), reload()) shared by all concrete data readers, so every reader honors the same output contract at the first stage of the data processing pipeline. Embodies the 'Pipeline/Workflow' pattern by defining the entry point and common interface for data ingestion.

Related Classes/Methods:

BaseData:17-245
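
The interface described above can be pictured with a minimal sketch. Only the three method names data(), get_batch_generator(), and reload() come from this document; the class names `SketchBaseData` and `InMemoryLines`, the `_load_data` hook, the constructor arguments, and the caching behavior are assumptions made for illustration and are not taken from the DataProfiler source.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator, List, Optional


class SketchBaseData(ABC):
    """Hypothetical stand-in for BaseData; not the DataProfiler source."""

    def __init__(self, input_file_path: Optional[str] = None, data: Any = None):
        self.input_file_path = input_file_path
        self._data = data  # stays None until first access

    @abstractmethod
    def _load_data(self) -> List[Any]:
        """Format-specific parsing, supplied by each concrete reader (assumed hook)."""

    def data(self) -> List[Any]:
        # Load once, then serve the cached result.
        if self._data is None:
            self._data = self._load_data()
        return self._data

    def get_batch_generator(self, batch_size: int) -> Iterator[List[Any]]:
        # Yield consecutive slices of at most batch_size records.
        records = self.data()
        for start in range(0, len(records), batch_size):
            yield records[start:start + batch_size]

    def reload(self, input_file_path: Optional[str] = None, data: Any = None) -> None:
        # Point the reader at a new source and replace the cached data.
        self.input_file_path = input_file_path
        self._data = data


class InMemoryLines(SketchBaseData):
    """Toy concrete reader that splits a string into lines."""

    def __init__(self, text: str):
        super().__init__()
        self._text = text

    def _load_data(self) -> List[str]:
        return self._text.splitlines()
```

Because callers only touch the three shared methods, a new format can be added by writing another subclass without changing downstream code.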

CSVData

Concrete implementation of BaseData that reads, parses, and performs initial preprocessing of CSV data. Handles format-specific complexities such as delimiter detection. Aligns with the 'Extensible Architecture' and 'Modular Architecture' patterns.

Related Classes/Methods:

CSVData:20-770
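
To make "delimiter detection" concrete, here is a small sketch built on the standard library's `csv.Sniffer`. This is a swapped-in technique for illustration; this document does not say how CSVData detects delimiters, and the function name `sniff_and_read`, the candidate delimiter set, and the 4096-character sample size are assumptions.

```python
import csv
import io
from typing import List


def sniff_and_read(text: str, candidates: str = ",;\t|") -> List[List[str]]:
    """Guess the delimiter from a leading sample, then parse every row.

    Uses csv.Sniffer from the standard library; this is not CSVData's
    own detection logic. csv.Sniffer raises csv.Error if it cannot decide.
    """
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    return list(csv.reader(io.StringIO(text), dialect))


if __name__ == "__main__":
    sample = "name;age\nalice;30\nbob;25\n"
    for row in sniff_and_read(sample):
        print(row)
```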

JSONData

Concrete implementation of BaseData that reads, parses, and performs initial preprocessing of JSON data. Handles format-specific complexities such as flattening nested structures. Aligns with the 'Extensible Architecture' and 'Modular Architecture' patterns.

Related Classes/Methods:

JSONData:19-446
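
One common way to flatten nested JSON is to join the keys along each path into a single column name. The sketch below shows that idea only; the function name `flatten`, the `.` separator, and the use of list indices as path segments are assumptions, and JSONData's actual flattening rules may differ.

```python
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Collapse nested dicts and lists into one level of path-like keys.

    Illustrative only. Empty dicts and lists contribute no keys.
    """
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(value, path, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            path = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten(value, path, sep))
    else:
        # Leaf value: store it under the accumulated path.
        flat[prefix] = obj
    return flat


if __name__ == "__main__":
    record = json.loads('{"user": {"name": "alice", "tags": ["a", "b"]}, "age": 30}')
    print(flatten(record))
```

Each flattened record is a flat mapping, which lines up naturally with a row in a table.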

ParquetData

Concrete implementation of BaseData that reads, parses, and performs initial preprocessing of Parquet data. Aligns with the 'Extensible Architecture' and 'Modular Architecture' patterns.

Related Classes/Methods:

ParquetData:13-184

TextData

Concrete implementation of BaseData that reads, parses, and performs initial preprocessing of plain text data. Aligns with the 'Extensible Architecture' and 'Modular Architecture' patterns.

Related Classes/Methods:

TextData:10-148

GraphData

Concrete implementation of BaseData that reads, parses, and performs initial preprocessing of graph data. Aligns with the 'Extensible Architecture' and 'Modular Architecture' patterns.

Related Classes/Methods:

GraphData:13-308

StructuredDataMixin

Provides logic shared by structured data readers (e.g., CSV, JSON, Parquet), so similar data types reuse the same code and behave consistently. Reinforces the 'Modular Architecture' pattern by moving common functionality into a reusable mixin.

Related Classes/Methods:

StructuredDataMixin
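
The mixin pattern itself can be sketched as follows. Everything here is hypothetical: the class names, the `rows()` hook, and the `column_names()` helper are invented to show how a mixin contributes shared behavior through multiple inheritance, and none of them describe what StructuredDataMixin actually contains.

```python
from typing import Any, Dict, List


class SketchStructuredMixin:
    """Hypothetical mixin holding a helper shared by row-oriented readers."""

    def column_names(self) -> List[str]:
        # Relies on the host class to provide rows(); collects keys in
        # first-seen order across all rows.
        names: List[str] = []
        for row in self.rows():
            for key in row:
                if key not in names:
                    names.append(key)
        return names


class SketchReader:
    """Hypothetical base reader that just stores rows it is given."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows

    def rows(self) -> List[Dict[str, Any]]:
        return self._rows


class SketchJSONReader(SketchStructuredMixin, SketchReader):
    """Gains column_names() from the mixin and rows() from the base reader."""


if __name__ == "__main__":
    reader = SketchJSONReader([{"a": 1}, {"b": 2, "a": 3}])
    print(reader.column_names())
```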

filepath_or_buffer

A context manager that lets every data reader treat its input the same way, whether the source is a file path or an in-memory buffer. Contributes to the 'Modular Architecture' and 'Pipeline/Workflow' patterns by providing one consistent mechanism for accessing input data.

Related Classes/Methods:

filepath_or_buffer
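
A context manager of this kind could look roughly like the sketch below. The name `open_path_or_buffer`, the text-mode UTF-8 default, and the choice to rewind but not close caller-owned buffers are assumptions for illustration; the real filepath_or_buffer may take different arguments and handle more cases.

```python
import io
from contextlib import contextmanager
from typing import IO, Iterator, Union


@contextmanager
def open_path_or_buffer(source: Union[str, IO[str]]) -> Iterator[IO[str]]:
    """Yield a readable text handle for either a path or an open buffer.

    Illustrative only; not the DataProfiler implementation.
    """
    if isinstance(source, str):
        # We opened the file, so we are responsible for closing it.
        handle = open(source, "r", encoding="utf-8")
        try:
            yield handle
        finally:
            handle.close()
    else:
        # The caller owns the buffer: rewind it, but leave it open.
        source.seek(0)
        yield source


if __name__ == "__main__":
    buffer = io.StringIO("hello")
    with open_path_or_buffer(buffer) as fh:
        print(fh.read())
```

Reader code written against the yielded handle does not need to know which kind of source it was given.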
