Introduced in Spark 4.x, the Python Data Source API lets you create PySpark data sources that leverage long-standing Python libraries for handling unique file types or specialized interfaces, exposed through Spark's read, readStream, write, and writeStream APIs.
| Data Source Name | Purpose |
|---|---|
| mcap | Read MCAP (ROS 2 bag) files |
| mqtt | Stream data from MQTT brokers |
| zipdcm | Read DICOM files from Zip file archives |
Install the base package:

```shell
pip install python-data-sources
```

Install with specific data source support:

```shell
# Install with MCAP support
pip install python-data-sources[mcap]

# Install with MQTT support
pip install python-data-sources[mqtt]

# Install with ZipDCM support
pip install python-data-sources[zipdcm]

# Install with all data sources
pip install python-data-sources[all]
```

```python
from pyspark.sql import SparkSession
from python_data_sources.mcap import MCAPDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MCAPDataSource)

df = spark.read.format("mcap") \
    .option("path", "/path/to/data.mcap") \
    .load()
```
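The reader does the actual parsing; if you want to pre-filter candidate files before handing paths to Spark, the MCAP container format makes that cheap. Per the MCAP specification, every valid file begins and ends with the same eight magic bytes. The following stdlib-only sketch is illustrative and not part of this package:

```python
# Per the MCAP spec, files begin and end with these eight magic bytes.
MCAP_MAGIC = b"\x89MCAP0\r\n"

def looks_like_mcap(data: bytes) -> bool:
    """Cheaply check whether raw bytes carry the MCAP magic at both ends."""
    return (len(data) >= 2 * len(MCAP_MAGIC)
            and data.startswith(MCAP_MAGIC)
            and data.endswith(MCAP_MAGIC))

print(looks_like_mcap(MCAP_MAGIC + b"...records..." + MCAP_MAGIC))  # True
print(looks_like_mcap(b"PK\x03\x04 not an mcap file"))              # False
```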
```python
from pyspark.sql import SparkSession
from python_data_sources.mqtt import MqttDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MqttDataSource)

df = spark.readStream.format("mqtt_pub_sub") \
    .option("broker_address", "mqtt.example.com") \
    .option("topic", "sensors/#") \
    .load()
```
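The `topic` option takes an MQTT subscription filter, where `#` matches any number of trailing topic levels and `+` matches exactly one level. As a quick illustration of the standard matching rules (this is a stdlib sketch, not this package's or paho-mqtt's implementation):

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a subscription filter.

    '+' matches exactly one topic level; '#' matches any number of
    trailing levels and must be the last level in the filter.
    """
    flevels = filter_.split("/")
    tlevels = topic.split("/")
    for i, f in enumerate(flevels):
        if f == "#":
            return True  # multi-level wildcard: matches the remainder
        if i >= len(tlevels):
            return False
        if f != "+" and f != tlevels[i]:
            return False
    return len(flevels) == len(tlevels)

print(topic_matches("sensors/#", "sensors/room1/temperature"))          # True
print(topic_matches("sensors/+/temperature", "sensors/room1/temperature"))  # True
print(topic_matches("sensors/+", "sensors/room1/temperature"))          # False
```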
```python
from pyspark.sql import SparkSession
from python_data_sources.zipdcm import ZipDCMDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ZipDCMDataSource)

df = spark.read.format("zipdcm") \
    .load("/path/to/dicom_files.zip")
```
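Conceptually, reading DICOM files out of a zip archive means enumerating the archive's members and parsing each one with pydicom. A stdlib-only sketch of the enumeration step (illustrative; the reader's actual internals may differ):

```python
import io
import zipfile

def dicom_members(zip_bytes: bytes) -> list[str]:
    """List the .dcm entries inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist() if name.lower().endswith(".dcm")]

# Build a tiny archive in memory to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("study1/image1.dcm", b"\x00" * 16)
    zf.writestr("study1/notes.txt", b"not dicom")

print(dicom_members(buf.getvalue()))  # ['study1/image1.dcm']
```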
This project uses Hatch for build and environment management.

```shell
# Install hatch
pip install hatch

# Create development environment
hatch env create
```
```shell
# Run tests for a specific submodule
hatch run test-mcap:test
hatch run test-mqtt:test
hatch run test-zipdcm:test

# Run tests with coverage
hatch run test-mcap:cov
hatch run test-mqtt:cov
hatch run test-zipdcm:cov
```
```
python-data-sources/
├── pyproject.toml               # Unified build configuration
├── src/
│   └── python_data_sources/     # Main package
│       ├── mcap/                # MCAP data source
│       ├── mqtt/                # MQTT streaming data source
│       └── zipdcm/              # ZipDCM data source
├── tests/
│   └── unit/
│       ├── common/              # Common module tests
│       ├── mcap/                # MCAP tests
│       ├── mqtt/                # MQTT tests
│       └── zipdcm/              # ZipDCM tests
└── .github/workflows/
    └── test.yml                 # CI/CD workflow
```
Refer to the python-data-sources documentation for detailed information on how to use the supplied Python data sources, their features, and their configuration options.
See CONTRIBUTING.md for detailed information about contributing to the Python Data Sources library.
© 2025 Databricks, Inc. All rights reserved. The source in this project is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.
| Datasource | Package | Purpose | License | Source |
|---|---|---|---|---|
| mcap | mcap | Python API for MCAP files | MIT | https://github.com/foxglove/mcap |
| mcap | mcap-protobuf-support | Protobuf schema support | MIT | https://github.com/foxglove/mcap |
| mqtt | paho-mqtt | MQTT client library | EPL-2.0 / EDL-1.0 (BSD-3) | https://github.com/eclipse/paho.mqtt.python |
| zipdcm | pydicom | Python API for DICOM files | MIT | https://github.com/pydicom/pydicom |
| zipdcm | pylibjpeg | Decoding / Encoding pixel formats | MIT | https://github.com/pydicom/pylibjpeg |