Thank you for your interest in contributing to codoff! This document provides guidelines for contributing to the project, with a focus on the testing structure and development workflow.
- Python 3.10 or higher
- Git

1. Fork the repository on GitHub:
   - Go to https://github.com/Kalan-Lab/codoff
   - Click the "Fork" button in the top-right corner
   - This creates your own copy of the repository

2. Clone your fork and set up the upstream remote:

   ```bash
   git clone https://github.com/YOUR_USERNAME/codoff.git
   cd codoff
   git remote add upstream https://github.com/Kalan-Lab/codoff.git
   ```

3. Create a new branch for your development work:

   ```bash
   git checkout -b feature/your-feature-name
   # or
   git checkout -b bugfix/issue-description
   ```

   Branch naming conventions:
   - `feature/description` for new features (e.g., `feature/codon-caching`)
   - `bugfix/description` for bug fixes (e.g., `bugfix/warning-messages`)
   - `hotfix/description` for urgent fixes (e.g., `hotfix/critical-error`)
   - `docs/description` for documentation updates (e.g., `docs/contributing-guide`)

4. Create and activate a conda environment:

   ```bash
   conda env create -f codoff_env.yml -n codoff_env
   conda activate codoff_env
   ```

5. Install the package in development mode:

   ```bash
   pip install -e .
   ```

6. Keep your fork up to date:

   ```bash
   git fetch upstream
   git checkout main
   git merge upstream/main
   ```
The project uses a comprehensive test suite organized into specialized test modules. All tests are located in the tests/ directory and follow a structured approach to ensure code quality and reliability.
The test suite is organized into the following categories:
**`tests/test_codoff.py`**

- Purpose: Tests the main codoff functionality and core algorithms
- Focus: Codon caching, simulation logic, and basic statistical calculations
- Key Test Classes:
  - `TestCodonCaching`: Tests codon count caching functionality
  - `TestProportionalSampling`: Tests proportional sampling logic
  - `TestWarningMessages`: Tests warning message generation
  - `TestIntegration`: Tests simulation consistency and integration
  - `TestEdgeCases`: Tests edge cases like empty gene lists and single codon types
**`tests/test_integration.py`**

- Purpose: Tests complete workflows with realistic data
- Focus: End-to-end functionality with real-world scenarios
- Key Test Classes:
  - `TestIntegrationWorkflow`: Tests complete workflow with realistic genome data and edge cases
**`tests/test_antismash_codoff_integration.py`**

- Purpose: Tests integration with antiSMASH and caching functionality
- Focus: Data structure integrity, background calculations, and parameter handling
- Key Test Classes:
  - `TestAntismashCodoffIntegration`: Tests caching data structure, background calculation methods, and function signatures
**`tests/test_warnings.py`**

- Purpose: Tests warning messages and error handling
- Focus: User feedback and error reporting
- Key Test Classes:
  - `TestWarningMessages`: Tests warning messages for missing locus tags, coordinate warnings, and warning conditions
  - `TestWarningMessageContent`: Tests warning message format and output capture
**`tests/test_seed_reproducibility.py`**

- Purpose: Tests that the random seed parameter produces reproducible results
- Focus: Reproducibility and deterministic behavior
- Key Test Classes:
  - `TestSeedReproducibility`: Tests that identical seeds produce identical results and different seeds can produce different results
**`tests/test_caching.py`**

- Purpose: Tests genome data caching for performance optimization
- Focus: Data structure integrity and caching consistency
- Key Test Classes:
  - `TestCachingFunctionality`: Tests `extract_genome_codon_data()` structure, cached vs non-cached consistency, and simulation count parameter
The project includes a dedicated test runner script:
```bash
python run_tests.py
```

This will:

- Discover all test files in the `tests/` directory
- Run all tests with verbose output
- Return appropriate exit codes (0 for success, 1 for failure)
You can also use pytest directly:

```bash
# Run all tests
python -m pytest tests/ -v

# Run with coverage
python -m pytest tests/ --cov=src/codoff --cov-report=html

# Run specific test file
python -m pytest tests/test_codoff.py -v
```

Note: The project is configured to work with both unittest and pytest. No additional configuration files are needed.
You can also run specific test modules:
```bash
# Run core functionality tests
python -m unittest tests.test_codoff

# Run integration tests
python -m unittest tests.test_integration

# Run antiSMASH integration tests
python -m unittest tests.test_antismash_codoff_integration

# Run warning system tests
python -m unittest tests.test_warnings

# Run seed reproducibility tests
python -m unittest tests.test_seed_reproducibility

# Run caching functionality tests
python -m unittest tests.test_caching

# Run utility function tests
python -m unittest tests.test_utils

# Run default behavior tests
python -m unittest tests.test_new_default

# Run a specific test class
python -m unittest tests.test_codoff.TestCodonCaching

# Run a specific test method
python -m unittest tests.test_codoff.TestCodonCaching.test_gene_codons_structure
```

When adding new tests, follow these guidelines:
- Test Organization: Place tests in the appropriate module based on functionality:
  - Core algorithm tests → `test_codoff.py`
  - End-to-end workflow tests → `test_integration.py`
  - antiSMASH result processing tests → `test_antismash_codoff_integration.py`
  - Warning/error tests → `test_warnings.py`
  - Reproducibility tests → `test_seed_reproducibility.py`
  - Caching tests → `test_caching.py`
  - Utility function tests → `test_utils.py`
  - Default behavior tests → `test_new_default.py`
- Test Naming: Use descriptive test method names that explain what is being tested:

  ```python
  def test_codon_frequency_calculation_with_empty_gene_list(self):
      """Test that codon frequency calculation handles empty gene lists correctly."""
  ```
- Test Documentation: Include docstrings that explain the test purpose:

  ```python
  def test_realistic_genome_simulation(self):
      """Test simulation with realistic genome data to ensure proper handling of real-world scenarios."""
  ```
- Test Data: Use realistic test data that reflects actual biological scenarios when possible.
- Assertions: Use specific assertions that test the exact behavior expected:

  ```python
  self.assertEqual(total_sampled, expected_count)
  self.assertAlmostEqual(proportion, expected_proportion, places=5)
  ```
- All test classes must inherit from `unittest.TestCase`
- Test methods must start with `test_`
- Use descriptive class and method names
- Include comprehensive docstrings
- Follow the existing code style and PEP 8 guidelines
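A new test module that follows these conventions might look like the sketch below. The `codon_frequencies` helper is defined inline purely for illustration; it is not a codoff function:

```python
import unittest
from collections import Counter
from typing import Dict, List


def codon_frequencies(genes: List[str]) -> Dict[str, float]:
    """Illustrative stand-in: relative frequency of each codon across gene sequences."""
    counts: Counter = Counter()
    for seq in genes:
        # Walk the sequence in steps of 3 to collect codons.
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


class TestCodonFrequencies(unittest.TestCase):
    """Tests for the illustrative codon frequency helper."""

    def test_codon_frequency_calculation_with_empty_gene_list(self):
        """Test that codon frequency calculation handles empty gene lists correctly."""
        self.assertEqual(codon_frequencies([]), {})

    def test_codon_frequencies_sum_to_one(self):
        """Test that frequencies computed over two short genes sum to 1."""
        freqs = codon_frequencies(["ATGAAA", "ATGTGA"])
        self.assertAlmostEqual(sum(freqs.values()), 1.0, places=5)
```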
- Follow PEP 8 style guidelines
- Use type hints for function parameters and return values
- Keep functions brief (preferably under 14 lines)
- Use meaningful variable and function names
- Add comprehensive docstrings for all public functions using NumPy-style format:

  ```python
  def function_name(param1: str, param2: int = 10) -> Dict[str, Any]:
      """
      Brief one-line description.

      Parameters
      ----------
      param1 : str
          Description of parameter
      param2 : int, optional
          Description with default value, by default 10

      Returns
      -------
      Dict[str, Any]
          Description of return value

      Notes
      -----
      Additional notes if needed.
      """
  ```
When working with simulation code:
- Sequential Contiguous-Window Sampling: The tool uses sequential contiguous-window sampling by default (as of v1.2.3): it randomly selects genomic windows of the same size as the focal region, yielding more biologically realistic null distributions
- Random Seeding: Simulations use fixed random seeds (default: 42) for reproducible results. Users can specify custom seeds via the `--seed`/`-x` parameter
- Coordinate Information: Genomic coordinates (gene positions and scaffold lengths) are always extracted and cached for efficient sequential sampling
- Discordance Percentile: Results report a discordance percentile (not p-value) indicating how unusual the focal region's codon usage is compared to similarly sized genomic windows
- Testing: Always test simulation consistency and reproducibility when modifying simulation code
- Performance: Sequential processing with coordinate-based sampling provides optimal performance
- Compatibility: Ensure changes work with both the `codoff` and `antismash_codoff` scripts
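The sampling strategy described above can be illustrated schematically. The following is a simplified sketch of the idea (drawing a random contiguous window of the focal region's size and collecting the genes inside it), not codoff's actual implementation:

```python
import random
from typing import List, Tuple


def sample_window_genes(
    gene_coords: List[Tuple[str, int]],  # (gene_id, start coordinate), sorted by start
    scaffold_length: int,
    focal_span: int,                     # focal region length in bp
    rng: random.Random,
) -> List[str]:
    """Draw one random contiguous window of focal_span bp on the scaffold and
    return the IDs of genes whose start coordinate falls inside it."""
    window_start = rng.randint(0, max(0, scaffold_length - focal_span))
    window_end = window_start + focal_span
    return [gid for gid, pos in gene_coords if window_start <= pos < window_end]


# Toy genome: 100 genes spaced 1 kb apart on a 100 kb scaffold.
genes = [("gene_%03d" % i, i * 1000) for i in range(100)]
rng = random.Random(42)  # codoff's default seed is 42
window = sample_window_genes(genes, scaffold_length=100_000, focal_span=10_000, rng=rng)
```

Because each draw goes through a seeded `random.Random` instance, repeated runs with the same seed select identical windows, which is the property the seed reproducibility tests verify.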
- Place new functionality in appropriate modules within `src/codoff/`
- Update tests when adding new features
- Ensure all new code is covered by tests
- Maintain backward compatibility when possible
When adding new CLI parameters:
- Consistency: Ensure parameters work in both the `codoff` and `antismash_codoff` scripts
- Defaults: Use sensible defaults (e.g., 10000 simulations, seed=42, sequential sampling enabled)
- Documentation: Update help text, README.md, and docstrings
- Testing: Add tests for new parameters in the appropriate test modules
- Backward Compatibility: New parameters should be optional with sensible defaults
- Coordinate Requirements: Sequential sampling requires genomic coordinates, which are automatically extracted from GenBank files or generated via pyrodigal for FASTA files
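For illustration, a minimal argparse sketch of such parameters. The `--seed`/`-x` spelling and the defaults come from this document; the `--simulations` flag name is a hypothetical stand-in, and this is not codoff's actual CLI code:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser: optional parameters with sensible defaults."""
    parser = argparse.ArgumentParser(description="Illustrative codoff-style options")
    parser.add_argument("-x", "--seed", type=int, default=42,
                        help="Random seed for reproducible simulations (default: 42)")
    parser.add_argument("--simulations", type=int, default=10000,
                        help="Number of simulated windows (default: 10000)")
    return parser


defaults = build_parser().parse_args([])             # no flags: defaults apply
custom = build_parser().parse_args(["--seed", "7"])  # overriding stays optional
```

Making every new flag optional with a default, as here, is what keeps existing command lines working (the backward compatibility point above).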
- Follow the development setup above to fork and create a feature branch
- Make your changes on your feature branch
- Write tests for any new functionality
- Ensure all tests pass using `python run_tests.py`
- Update documentation if needed
- Commit your changes with clear commit messages:

  ```bash
  git add .
  git commit -m "Add feature: brief description of changes"
  ```
- Push your branch to your fork:

  ```bash
  git push origin feature/your-feature-name
  ```
- Submit a pull request on GitHub with:
- A clear title describing the changes
- A detailed description of what was changed and why
- Reference to any related issues
- Screenshots or examples if applicable
The test suite provides comprehensive coverage across multiple specialized modules:
- Core algorithmic functionality and codon caching
- Edge cases and error conditions
- Integration scenarios with realistic data
- User-facing features and warnings
- Statistical accuracy and precision
- antiSMASH integration and genome data caching
- Seed-based reproducibility
- Sequential contiguous-window sampling
- Coordinate extraction and usage
The project uses GitHub Actions for continuous integration, which automatically runs the test suite on:
- Pull requests
- Pushes to main branch
- Release tags
All tests must pass before code can be merged.
As of version 1.2.3, codoff reports a discordance percentile rather than an empirical P-value. This is an important distinction:
- Discordance Percentile: The proportion of simulations that show codon usage as or more discordant than the observed focal region, expressed as a percentage (0-100)
- Interpretation: Lower percentiles indicate more unusual/discordant codon usage
- Example: A percentile of 5.0 means only 5% of simulated windows were as or more discordant than the focal region, i.e., the focal region is among the top 5% most discordant regions
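The definition above can be sketched in a few lines. This is a schematic of the calculation, not codoff's internal code:

```python
from typing import Sequence


def discordance_percentile(observed: float, simulated: Sequence[float]) -> float:
    """Percentage of simulated windows at least as discordant as the observed
    focal region (0-100). Lower values indicate more unusual codon usage."""
    as_or_more = sum(1 for d in simulated if d >= observed)
    empirical_freq = as_or_more / len(simulated)  # internal proportion, 0-1
    return empirical_freq * 100.0                 # displayed as a percentile


# Toy distances from 10,000 simulated windows: 0.0000, 0.0001, ..., 0.9999
simulated = [i / 10000 for i in range(10000)]
print(discordance_percentile(0.95, simulated))  # → 5.0
```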
When contributing code or documentation:
- Use "discordance percentile" (not "p-value" or "empirical p-value")
- The internal variable is `empirical_freq` (a proportion from 0-1)
- Output displays it as "Discordance Percentile" (multiplied by 100)
If you have questions about contributing or need help with the codebase:
- Check the existing issues on GitHub
- Review the README.md and wiki for examples of how functionality is used
- Open a new issue with specific questions
By contributing to codoff, you agree that your contributions will be licensed under the same BSD 3-Clause License that covers the project.