By submitting a contribution to this repository, you certify that:
- **You have the right to submit the contribution.** You created the code/content yourself, or you have the right to submit it under the project's license.
- **You grant us a license to use your contribution.** You agree that your contribution will be licensed under the same terms as the rest of this project, and you grant the project maintainers the right to use, modify, and distribute your contribution as part of the project.
- **You are not submitting confidential or proprietary information.** Your contribution does not include anything you don’t have permission to share publicly.

If you are contributing on behalf of an organization, you confirm that you have the authority to do so. You agree to confirm these terms in your pull request. Any pull request that does not explicitly accept the terms will be assumed to have accepted them.

- Each data source should function independently of other data sources.
- Each data source should be implemented in a subfolder of `/src`, e.g. `src/python_data_sources/zipdcm`. The folder name should be the shortname of your data source.
- Each data source must implement tests in a subfolder of `/tests`, e.g. `/tests/unit/zipdcm`. The folder name should be the shortname of the data source.
- Each data source must list dependencies in its own section of `project.optional-dependencies` in `pyproject.toml`.
- Each data source must specify its own development and test environment in the `tool.hatch.envs` section of `pyproject.toml`. This includes any dependencies and pytest commands necessary to run tests.
- Each data source must include a `README.md` which describes the data source and shows example usage.
- Each data source must include a `<data source name>-demo.py` demo notebook which details example usage.
- Each data source must include a `LICENSE.md` file approved by Databricks' legal team. Use open source subcomponents whenever possible. If proprietary components (e.g. external libraries) are required, provide a downloader method. Do not package proprietary components into data sources.
- Each data source must provide BYOL ("Bring Your Own Lineage"). This should distinguish the data sources from sources for other platforms.
- Each data source's capabilities should be summarized and added to the main `README.md`. Add check marks for specific capabilities (e.g. :check:Read :check:Write :check:Readstream :check:Writestream).
- Each data source's compute requirements, environment requirements, and any limitations should be documented in its `README.md` and demo notebook.
- All public methods should have Python docstrings. Format docstrings using the standards detailed in the Google Python style guide.
- Error & Exception handling is critical. Exceptions must include a helpful message but must mask sensitive data (e.g. connection strings or credentials).
- All code must pass formatting and linting before it can be merged into the main repository. Run `make fmt` locally to validate code formatting.
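
The docstring and exception-handling requirements above can be illustrated together. The following is a minimal sketch, not code from this repository: `mask_credentials`, `connect`, and `DataSourceConnectionError` are hypothetical names, and the regular expression assumes URL-style connection strings of the form `scheme://user:password@host` (passwords containing `@` or whitespace are not handled).

```python
import re

# Hypothetical helpers for illustration only; not part of this repository.
# Matches "scheme://user:password@" in URL-style connection strings.
_CREDENTIAL_PATTERN = re.compile(
    r"(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*://)(?P<user>[^:/@\s]+):(?P<password>[^@\s]+)@"
)


class DataSourceConnectionError(Exception):
    """Raised when a data source cannot reach its backing system."""


def mask_credentials(text: str) -> str:
    """Replace the password in URL-style connection strings with asterisks.

    Args:
        text: Text that may contain connection strings such as
            ``scheme://user:password@host``.

    Returns:
        The text with each matched password replaced by ``***``.
    """
    return _CREDENTIAL_PATTERN.sub(r"\g<scheme>\g<user>:***@", text)


def connect(url: str) -> None:
    """Simulate a failed connection and raise a masked, descriptive error.

    Args:
        url: Connection string for the (hypothetical) backing system.

    Raises:
        DataSourceConnectionError: Always, in this sketch.
    """
    # The message names the target and suggests a fix, but never echoes the password.
    raise DataSourceConnectionError(
        f"Could not connect to {mask_credentials(url)}; check the host and credentials."
    )
```

Masking at the point where the message is built, rather than where it is logged, keeps the secret out of tracebacks as well as log files.
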
To add a new data source (shortname `<source>`):
- Create `src/python_data_sources/<source>/` and add the data source implementation, along with `__init__.py`, `README.md`, and `LICENSE.md`.
- Create `tests/unit/<source>/` and add unit tests covering the implementation.
- Create `examples/<source>/` with a `<source>-demo` notebook, and `tests/e2e/<source>/` with an end-to-end notebook test that runs the demo in a Databricks workspace.
- In `pyproject.toml`, add a `<source>` entry to `[project.optional-dependencies]`, a `[tool.hatch.envs.test-<source>]` section declaring the module's dependencies, and a matching `[tool.hatch.envs.test-<source>.scripts]` section defining at least `test = "pytest tests/unit/<source> -v {args}"`.
- Add `<source>` to `ALLOWED_SUBMODULES` in `.github/scripts/detect_changed_submodules.sh` so the CI test matrix picks it up.
- Update `README.md` (capabilities table and data source summary) and `INSTALL.md` (install instructions for the new optional dependency group).
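
Before opening a pull request, it can help to confirm that the files and folders named in the steps above exist. The following is a small sketch of such a check; `missing_paths` is a hypothetical helper, not a script shipped with this repository, and it only tests for the presence of paths, not their contents or the `pyproject.toml` and CI entries.

```python
from pathlib import Path


def missing_paths(repo_root: Path, source: str) -> list[str]:
    """Return the expected paths that do not yet exist for a data source.

    Args:
        repo_root: Root directory of the repository checkout.
        source: Shortname of the data source.

    Returns:
        Repository-relative paths that are missing, in checklist order.
    """
    # Paths taken from the steps for adding a new data source.
    expected = [
        f"src/python_data_sources/{source}/__init__.py",
        f"src/python_data_sources/{source}/README.md",
        f"src/python_data_sources/{source}/LICENSE.md",
        f"tests/unit/{source}",
        f"examples/{source}",
        f"tests/e2e/{source}",
    ]
    return [path for path in expected if not (repo_root / path).exists()]
```

For example, `missing_paths(Path("."), "zipdcm")` run from the repository root returns an empty list once every expected path is in place.
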
If you'd like to contribute to python-data-sources, please create a pull request or open an issue on the repository. To submit a pull request:
- Fork the `python-data-sources` repository
- Clone your forked repository locally (`git clone <Your repository URL>`)
- Update from the main branch (`git checkout main && git pull`)
- Create a branch for your changes (`git checkout -b <Your feature name>`)
- Once your changes are finished, run `make fmt` in your IDE terminal and fix any reported issues
- Commit and push your changes (`git commit -S -a -m "<Description of the changes>" && git push origin <Your feature name>`)
- Open your PR using the GitHub web UI or CLI