Thank you for considering contributing to GSP-Py, a Python implementation of the Generalized Sequential Pattern algorithm! Contributions, whether they're reporting bugs, suggesting improvements, or submitting code, are greatly appreciated. Please review this document to understand how you can contribute effectively.
Here are some ways to contribute to the project:
- Report Bugs: If you encounter any issues, please let us know by submitting a bug report.
- Suggest Enhancements: Suggest new features or improvements.
- Fix Bugs or Add Features: Help fix issues or implement features based on the project's roadmap or your own ideas.
- Improve Documentation: Correct typos, add usage examples, or expand explanations in the documentation.
- Write Tests: Testing is an essential component to ensure a reliable library.
If you'd like to contribute code or documentation, please follow these steps:
- Visit the GSP-Py repository and click "Fork."
Clone the forked repository to your local machine:
git clone https://github.com/<your-username>/gsp-py.git
cd gsp-pyCreate a new branch for your work:
git checkout -b feature/my-featureUse a descriptive branch name, such as fix/issue-123 or feature/custom-filter-support.
- Edit and test your code locally.
- Ensure any new features include unit tests.
Before submitting your changes, ensure the code passes all tests:
pytestIf you're adding a new feature, include tests for it in the tests/ directory.
Push your changes to your forked repository:
git add .
git commit -m "Add a brief but descriptive commit message"
git push origin feature/my-feature- Navigate to the main repository's GitHub page and select Pull Requests.
- Click New Pull Request and provide details about your changes, including:
- A clear summary of your contribution.
- Any relevant references (e.g., a related issue or feature request).
- Be sure to reference the associated issue (if applicable) in your PR description (e.g., "Closes #12").
To maintain consistency and code quality, please follow these coding guidelines:
-
Formatting:
- Follow PEP 8 for Python code style conventions.
- Use tools like
pylintorflake8to ensure your code is formatted correctly:pylint path/to/file.py
-
Commit Messages:
- Follow the Conventional Commits specification.
- Use the format:
<type>(<scope>): <description> - Common types:
feat,fix,docs,style,refactor,perf,test,build,ci,chore - Examples:
feat(cli): add export format optionfix: correct off-by-one error in pattern matchingdocs: update installation instructions
- For breaking changes, add
!after type/scope or includeBREAKING CHANGE:in the footer - See Release Management Guide for details
-
Tests:
- Write tests for new features or bug fixes.
- Use
pytestas the testing framework. - Place tests in the
tests/directory using descriptive test names. - Consider property-based tests for robust validation (see Property-Based Testing section below).
-
Documentation:
- Update documentation (
README.md,CHANGELOG.md, comments in code) related to your changes. - If a new feature is added, provide usage examples and explanation.
- Update documentation (
To get familiar with the existing code, follow these steps:
-
Setup Environment: This project uses uv for managing dependencies and the virtual environment. Follow these instructions to set it up:
-
Install uv (if not already installed):
curl -Ls https://astral.sh/uv/install.sh | bashMake sure uv's binary directory is added to your
PATH:export PATH="$HOME/.local/bin:$PATH"
-
Create the virtual environment and install dependencies from uv.lock:
uv venv .venv uv sync --frozen --extra dev uv pip install -e .This will create a local
.venvand install project dependencies.
-
-
Run Tests: Use uv to run tests and verify the baseline state:
uv run pytest -n auto
This will execute tests in parallel using pytest-xdist if available.
Note: This project integrates the "tox-uv" plugin. When running
toxlocally (ormake tox), missing Python interpreters can be provisioned automatically via uv, so you don't need to have all versions installed ahead of time. -
Optional: Rust Acceleration
Some hot loops can be accelerated with Rust via PyO3. This is entirely optional: the library will fall back to pure Python if the extension is not present.
-
Install Rust and build the extension:
make rust-build
-
Choose backend at runtime (defaults to auto):
export GSPPY_BACKEND=rust # or python, or unset for auto
-
Run a benchmark (small):
make bench-small
-
Run a larger benchmark (adjust to your machine):
make bench-big # or customize: GSPPY_BACKEND=auto uv run --python .venv/bin/python --no-project \ python benchmarks/bench_support.py --n_tx 1000000 --tx_len 8 --vocab 50000 --min_support 0.2 --warmup
- Explore the Code:
The main entry point for the GSP algorithm is in the
gsppymodule. The libraries for support counting, candidate generation, and additional utility functions are also within this module.
- No need to manage a global virtualenv manually; uv will create and manage
.venvas desired. - If you’re unfamiliar with uv, refer to its documentation.
- Useful Makefile targets:
make setup,make install,make test,make lint,make format,make typecheckmake pre-commit-installto install the pre-commit hookmake pre-commit-runto run on all files
- To set up pre-commit manually:
uv run pre-commit install
GSP-Py uses Hypothesis for property-based testing (fuzzing), which automatically generates test cases to discover edge cases and ensure robustness.
Rather than writing individual test cases with specific inputs, property-based testing defines properties (invariants) that should hold for all inputs. Hypothesis then generates hundreds of random test cases to verify these properties.
All property-based tests use the @given decorator from Hypothesis:
# Run all fast fuzzing tests (excludes slow integration tests, ~2 minutes)
pytest tests/test_gsp_fuzzing.py tests/test_gsp_edge_cases.py tests/test_cli_fuzzing.py -v
# Run specific fuzzing test
pytest tests/test_gsp_edge_cases.py::test_gsp_handles_large_transactions -v
# Run with a fixed Hypothesis seed for reproducible fuzzing
pytest tests/test_gsp_fuzzing.py --hypothesis-seed=42
# Run slow integration tests (marked for CI/master branch only)
pytest -m integration -v
# Skip integration tests explicitly
pytest -m "not integration" -vIntegration Tests:
Some tests are marked as @pytest.mark.integration due to long runtime (10+ minutes). These are:
test_gsp_stress_large_transactions- Stress test with large individual transactions (~15 minutes)
These tests are automatically excluded from regular test runs but can be run explicitly with pytest -m integration.
GSP-Py provides reusable strategies in tests/hypothesis_strategies.py that you can compose for new tests:
from hypothesis import given
from tests.hypothesis_strategies import (
transaction_lists, # Standard transaction data
extreme_transaction_lists, # Extreme sizes (large/many/minimal)
sparse_transaction_lists, # Low pattern overlap
noisy_transaction_lists, # Mixed signal and noise
timestamped_transaction_lists, # Temporal data
valid_support_thresholds, # Valid support values
)
@given(transactions=transaction_lists())
def test_my_property(transactions):
gsp = GSP(transactions)
result = gsp.search(min_support=0.2)
# Assert your property here
assert isinstance(result, list)Basic Strategies:
transaction_lists()- Standard transaction data (2-50 transactions)item_pool()- Generate unique items for transactionsitem_strings()- Generate individual item strings
Edge Case Strategies:
extreme_transaction_lists(size_type="large")- Few transactions with many itemsextreme_transaction_lists(size_type="many")- Many transactions (100-500)extreme_transaction_lists(size_type="minimal")- Minimal valid input (2 transactions)sparse_transaction_lists()- Low pattern overlap/sparse patternsnoisy_transaction_lists()- High noise ratio with some signalvariable_length_transaction_lists()- Highly variable transaction sizes
Malformed Input Strategies:
transactions_with_duplicates()- Duplicate items within transactionstransactions_with_special_chars()- Unicode, special chars, whitespace
Temporal Strategies:
timestamped_transaction_lists()- Transactions with timestampspathological_timestamped_transactions()- Edge cases (reversed, identical, large gaps)
Support Threshold Strategies:
valid_support_thresholds()- Valid range (0.01-1.0)edge_case_support_thresholds()- Boundary values
-
Identify the property/invariant you want to test (e.g., "support counts should never exceed total transactions")
-
Choose or create appropriate strategies from
tests/hypothesis_strategies.py -
Write the test using
@given:
from hypothesis import given, settings, HealthCheck
from tests.hypothesis_strategies import transaction_lists
@given(transactions=transaction_lists())
@settings(max_examples=5, deadline=None, suppress_health_check=[HealthCheck.too_slow])
def test_gsp_my_property(transactions):
"""
Property: Describe what invariant this test validates.
Explain what the test is checking and why it matters.
"""
gsp = GSP(transactions)
result = gsp.search(min_support=0.2)
# Assert your property
for level_patterns in result:
for pattern, support in level_patterns.items():
assert support <= len(transactions), "Support cannot exceed transaction count"- Configure test settings as needed:
max_examples: Number of random test cases to generate (use small values like 3-10 for fast tests)deadline: Maximum time per test case (useNonefor slow operations)suppress_health_check: Disable specific health checks likeHealthCheck.too_slow
Note: Keep max_examples low (3-10) to ensure tests run quickly. The fuzzing suite should complete in under a minute.
To create a new reusable strategy in tests/hypothesis_strategies.py:
from hypothesis import strategies as st
@st.composite
def my_custom_strategy(draw: st.DrawFn, **kwargs) -> MyType:
"""
Generate custom test data.
Args:
draw: Hypothesis draw function
**kwargs: Custom parameters
Returns:
Generated test data
"""
# Use draw() to generate random values
n = draw(st.integers(min_value=1, max_value=10))
items = draw(st.lists(st.text(), min_size=n, max_size=n))
return items-
Test properties, not specific outputs: Focus on invariants that should always hold
- ✅ "All patterns should meet minimum support threshold"
- ❌ "Transaction X should produce patterns Y"
-
Use appropriate strategies: Match the strategy to what you're testing
- Testing edge cases? Use
extreme_transaction_lists() - Testing noise resistance? Use
noisy_transaction_lists()
- Testing edge cases? Use
-
Handle expected failures gracefully: Use
assume()to skip invalid inputs orpytest.raises()for expected errors -
Keep tests focused: Each test should validate one clear property
-
Document properties clearly: Explain what invariant is being tested in the docstring
-
Compose strategies: Combine existing strategies rather than creating everything from scratch
If you're adding a new feature (e.g., new pruning strategies), create tests that:
- Validate basic functionality with standard strategies
- Test edge cases with extreme strategies
- Verify invariants (e.g., parameters don't violate constraints)
- Check integration with existing features
@given(
transactions=transaction_lists(),
)
@settings(max_examples=5, deadline=None)
def test_gsp_patterns_property(transactions):
"""Property: Returned pattern supports should stay within valid bounds."""
gsp = GSP(transactions)
result = gsp.search(min_support=0.2)
# Verify support calculations are within expected bounds
max_possible_support = len(transactions)
for level_patterns in result:
for pattern, support in level_patterns.items():
assert 0 <= support <= max_possible_supportWhen a property-based test fails, Hypothesis provides:
- The specific input that caused the failure
- A simplified version of that input (shrinking)
- The random seed to reproduce the failure
To reproduce a failure:
pytest tests/test_gsp_edge_cases.py::test_name --hypothesis-seed=12345For more information, see the Hypothesis documentation.
The project's CI configuration automatically runs fuzzing tests. The tests use settings configured in each test file via the @settings decorator.
Test Execution Strategy:
- Regular CI (PRs, feature branches): Runs all tests except those marked with
@pytest.mark.integration. This completes in ~2 minutes. - Integration CI (master branch): Runs all tests including integration tests with
pytest -m integration. This may take 15+ minutes.
To run all tests locally (excluding slow integration tests):
pytest tests/To run including integration tests:
pytest tests/ -m "integration or not integration"To report a bug or suggest an enhancement, open an issue on GitHub:
- Go to the Issues page.
- Select New Issue and choose the appropriate issue template:
- Bug report
- Feature request
- Include as much detail as possible:
- Steps to reproduce (for bugs).
- Clear description of the feature or enhancement (for feature requests).
We welcome all suggestions or feedback for improving this project! You can reach out by opening an issue or submitting a pull request.
Let’s build a great tool together! 😊
Please note that this project adheres to the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to the project maintainer.
Thank you for contributing! 🙌