Databricks Unity Catalog Serverless

Databricks Python Data Sources

Introduced in Spark 4.x, the Python Data Source API lets you create PySpark data sources that leverage long-standing Python libraries for handling unique file types or specialized interfaces, exposed through the Spark read, readStream, write, and writeStream APIs. This package bundles the following data sources:

| Data Source Name | Purpose |
| --- | --- |
| mcap | Read MCAP (ROS 2 bag) files |
| mqtt | Stream data from MQTT brokers |
| zipdcm | Read DICOM files from Zip file archives |
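
For background, here is a minimal sketch of what the underlying Spark 4.x Python Data Source API looks like. The GreetingDataSource class below is illustrative only and is not part of this package; it would be registered and read exactly like the bundled sources shown in the Quick Start below.

from pyspark.sql.datasource import DataSource, DataSourceReader

class GreetingReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as plain tuples that match the declared schema
        yield ("hello",)
        yield ("world",)

class GreetingDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "greeting"

    def schema(self):
        # DDL-formatted schema string
        return "value string"

    def reader(self, schema):
        return GreetingReader()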

Installation

Install the base package:

pip install python-data-sources

Install with specific data source support:

# Install with MCAP support
pip install python-data-sources[mcap]

# Install with MQTT support
pip install python-data-sources[mqtt]

# Install with ZipDCM support
pip install python-data-sources[zipdcm]

# Install with all data sources
pip install python-data-sources[all]
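
Note: some shells (for example zsh) treat the square brackets specially, so the extras may need to be quoted, e.g. pip install 'python-data-sources[all]'.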

Quick Start

MCAP Data Source

from pyspark.sql import SparkSession
from python_data_sources.mcap import MCAPDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MCAPDataSource)

df = spark.read.format("mcap")
    .option("path", "/path/to/data.mcap")
    .load()
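
The result is an ordinary Spark DataFrame; a quick way to inspect what the mcap source returned (the exact columns depend on the file and the data source implementation):

# Inspect the inferred schema and a few rows
df.printSchema()
df.show(5, truncate=False)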

MQTT Streaming Data Source

from pyspark.sql import SparkSession
from python_data_sources.mqtt import MqttDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MqttDataSource)

df = spark.readStream.format("mqtt_pub_sub") \
    .option("broker_address", "mqtt.example.com") \
    .option("topic", "sensors/#") \
    .load()
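
readStream only defines the source; no data flows until a streaming query is started against a sink. A minimal sketch using the console sink (the sink here is purely illustrative):

# Start the streaming query; replace the console sink with a real target in practice
query = (df.writeStream
    .format("console")
    .option("truncate", "false")
    .start())

query.awaitTermination()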

ZipDCM Data Source

from pyspark.sql import SparkSession
from python_data_sources.zipdcm import ZipDCMDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ZipDCMDataSource)

df = spark.read.format("zipdcm") \
    .load("/path/to/dicom_files.zip")

Development

This project uses Hatch for build and environment management.

Setup

# Install hatch
pip install hatch

# Create development environment
hatch env create
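
Once created, hatch shell spawns a shell with the development environment activated.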

Running Tests

# Run tests for a specific submodule
hatch run test-mcap:test
hatch run test-mqtt:test
hatch run test-zipdcm:test

# Run tests with coverage
hatch run test-mcap:cov
hatch run test-mqtt:cov
hatch run test-zipdcm:cov

Project Structure

python-data-sources/
├── pyproject.toml              # Unified build configuration
├── src/
│   └── python_data_sources/    # Main package
│       ├── mcap/               # MCAP data source
│       ├── mqtt/               # MQTT streaming data source
│       └── zipdcm/             # ZipDCM data source
├── tests/
│   └── unit/
│       ├── common/             # Common module tests
│       ├── mcap/               # MCAP tests
│       ├── mqtt/               # MQTT tests
│       └── zipdcm/             # ZipDCM tests
└── .github/workflows/
    └── test.yml                # CI/CD workflow

Documentation

Refer to the python-data-sources documentation for detailed information on how to use the supplied Python data sources, their features, and their configuration options.

Contributing

See CONTRIBUTING.md for detailed information about contributing to the Python Data Sources library.

📄 Third-Party Package Licenses

© 2025 Databricks, Inc. All rights reserved. The source in this project is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.

| Datasource | Package | Purpose | License | Source |
| --- | --- | --- | --- | --- |
| mcap | mcap | Python API for MCAP files | MIT | https://github.com/foxglove/mcap |
| mcap | mcap-protobuf-support | Protobuf schema support | MIT | https://github.com/foxglove/mcap |
| mqtt | paho-mqtt | MQTT client library | EPL-2.0 / EDL-1.0 (BSD-3) | https://github.com/eclipse/paho.mqtt.python |
| zipdcm | pydicom | Python API for DICOM files | MIT | https://github.com/pydicom/pydicom |
| zipdcm | pylibjpeg | Decoding / encoding pixel formats | MIT | https://github.com/pydicom/pylibjpeg |