Install the package with the data sources you need:
```bash
# Install with MCAP support
pip install python-data-sources[mcap]

# Install with MQTT support
pip install python-data-sources[mqtt]

# Install with ZipDCM support
pip install python-data-sources[zipdcm]

# Install with all data sources
pip install python-data-sources[all]
```

You can install the package directly in a Databricks notebook:
```python
%pip install python-data-sources[all]
```

Or add it to your cluster's library configuration.
- Clone the project you'd like to run into your Databricks Workspace
- Open the Asset Bundle Editor in the Databricks UI
- Click "Deploy"
- Navigate to the Deployments tab in the Asset Bundle UI (🚀 icon) and click "Run" on the available job. This runs the project's notebooks sequentially. (A command-line alternative is sketched below.)
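If you prefer to deploy and run from a terminal, the Databricks CLI offers equivalent bundle commands. A minimal sketch, assuming a recent Databricks CLI is installed and authenticated against your workspace; the job key `main_job` is a placeholder for whatever key the project defines under `resources.jobs` in its `databricks.yml`:

```bash
# Validate the bundle configuration before deploying
databricks bundle validate

# Deploy the bundle (equivalent to clicking "Deploy" in the UI)
databricks bundle deploy

# Run the job by its resource key ("main_job" is a placeholder;
# use the key defined in the project's databricks.yml)
databricks bundle run main_job
```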
After installation, register and use the data sources:
To read MCAP files as a batch source:

```python
from pyspark.sql import SparkSession

from python_data_sources.mcap import MCAPDataSource

# Register the MCAP data source under the "mcap" format name
spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MCAPDataSource)

# Read an MCAP file as a batch DataFrame, split across 4 partitions
df = (
    spark.read.format("mcap")
    .option("path", "/path/to/data.mcap")
    .option("numPartitions", "4")
    .load()
)
```
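The result is an ordinary Spark DataFrame, so the usual inspection calls apply. A quick sanity check using standard PySpark methods (nothing specific to this package):

```python
# Inspect the schema produced by the data source
df.printSchema()

# Preview a few records without truncating long values
df.show(5, truncate=False)
```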
To subscribe to an MQTT broker as a streaming source:

```python
from pyspark.sql import SparkSession

from python_data_sources.mqtt import MqttDataSource

# Register the MQTT data source under the "mqtt_pub_sub" format name
spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(MqttDataSource)

# Subscribe to all topics under sensors/ as a streaming DataFrame
df = (
    spark.readStream.format("mqtt_pub_sub")
    .option("broker_address", "mqtt.example.com")
    .option("topic", "sensors/#")
    .option("username", "user")
    .option("password", "pass")
    .load()
)
```
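A streaming read does nothing until a sink is attached and the query is started. A minimal sketch using standard Structured Streaming calls with a console sink; the checkpoint path is a placeholder:

```python
# Attach a console sink and start the streaming query;
# replace the checkpoint location with a durable path in practice
query = (
    df.writeStream.format("console")
    .option("checkpointLocation", "/tmp/mqtt_checkpoint")
    .start()
)

# Block until the query is stopped or fails
query.awaitTermination()
```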
To read DICOM files packaged in a ZIP archive:

```python
from pyspark.sql import SparkSession

from python_data_sources.zipdcm import ZipDCMDataSource

# Register the ZipDCM data source under the "zipdcm" format name
spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(ZipDCMDataSource)

# Read DICOM files from a ZIP archive, split across 2 partitions
df = (
    spark.read.format("zipdcm")
    .option("numPartitions", "2")
    .load("/path/to/dicom_files.zip")
)
```
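From here the data can be persisted like any other DataFrame. A sketch using standard Spark write APIs, assuming Delta Lake is available (as it is on Databricks); the table name is a hypothetical placeholder:

```python
# Persist the parsed DICOM records to a Delta table
# ("main.radiology.dicom_metadata" is a hypothetical table name)
df.write.format("delta").mode("append").saveAsTable("main.radiology.dicom_metadata")
```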