Commit 082552a

feat: Initialize comprehensive LLM Fine-Tuning Lab for SynthoraAI

Setup complete fine-tuning infrastructure for training, evaluating, and deploying LLMs for the SynthoraAI AI-Gov-Content-Curator project.

Features:
- Complete training pipeline (FineTuner, LoRAFineTuner, DistributedTrainer)
- Data processing and loading from SynthoraAI API
- Model architectures (Summarization, Classification, Bias Detection)
- Comprehensive evaluation and benchmarking
- Production-ready export utilities (ONNX, TorchScript, quantization)
- Docker and docker-compose setup
- CI/CD with GitHub Actions
- Extensive documentation and examples
- Utility scripts for all common tasks
- Configuration management with YAML files

Components:
- src/training: Fine-tuning implementations
- src/data: Data loading and processing
- src/models: Model wrappers and export utilities
- src/evaluation: Evaluation metrics and benchmarking
- src/utils: Utility functions
- configs/: YAML configuration files
- scripts/: Training and utility scripts
- docs/: Comprehensive documentation
- examples/: Usage examples
- tests/: Unit tests

Ready for:
- Training summarization models
- Training classification models
- Training bias detection models
- Integration with SynthoraAI backend
- Production deployment
0 parents  commit 082552a

47 files changed

Lines changed: 5189 additions & 0 deletions

.env.example

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Environment variables for LLM Fine-Tuning Lab

# API Keys
GOOGLE_AI_API_KEY=your_google_ai_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_TOKEN=your_huggingface_token_here

# SynthoraAI Backend
SYNTHORAAI_API_URL=https://ai-content-curator-backend.vercel.app
MONGODB_URI=mongodb://localhost:27017/synthoraai

# Weights & Biases
WANDB_API_KEY=your_wandb_api_key_here
WANDB_PROJECT=synthoraai-finetuning
WANDB_ENTITY=your_wandb_entity

# Model Paths
MODEL_CACHE_DIR=./models
CHECKPOINT_DIR=./checkpoints
DATA_DIR=./datasets

# Training Configuration
USE_GPU=true
GPU_DEVICE=0
BATCH_SIZE=8
LEARNING_RATE=5e-5
NUM_EPOCHS=3

# Logging
LOG_LEVEL=INFO
LOG_FILE=logs/training.log

# Redis (optional)
REDIS_HOST=localhost
REDIS_PORT=6379

# Pinecone (for vector search)
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_ENV=your_pinecone_environment_here
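
Reviewer note: the training keys in `.env.example` are all strings once they reach the process environment, so whatever reads them has to convert types. The sketch below is one way that could look using only the standard library. `TrainingSettings` and `_as_bool` are hypothetical names, not taken from the commit, and the project may well use a different loader (for example a dotenv package); the defaults simply repeat the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Treat common truthy strings ("true", "1", "yes", "on") as True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TrainingSettings:
    # Hypothetical container; field names mirror the keys in .env.example.
    use_gpu: bool
    gpu_device: int
    batch_size: int
    learning_rate: float
    num_epochs: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "TrainingSettings":
        # Fallback values repeat the examples shown in .env.example.
        return cls(
            use_gpu=_as_bool(env.get("USE_GPU", "true")),
            gpu_device=int(env.get("GPU_DEVICE", "0")),
            batch_size=int(env.get("BATCH_SIZE", "8")),
            learning_rate=float(env.get("LEARNING_RATE", "5e-5")),
            num_epochs=int(env.get("NUM_EPOCHS", "3")),
        )


# Passing a plain dict keeps the example independent of the real environment.
settings = TrainingSettings.from_env({"BATCH_SIZE": "16", "USE_GPU": "false"})
print(settings)
```

Accepting any mapping rather than reading `os.environ` directly makes the conversion easy to exercise in unit tests without touching real environment variables.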

.github/workflows/ci.yml

Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
name: CI/CD Pipeline
2+
3+
on:
4+
push:
5+
branches: [ main, develop ]
6+
pull_request:
7+
branches: [ main, develop ]
8+
9+
jobs:
10+
lint:
11+
name: Code Quality
12+
runs-on: ubuntu-latest
13+
steps:
14+
- uses: actions/checkout@v4
15+
16+
- name: Set up Python
17+
uses: actions/setup-python@v5
18+
with:
19+
python-version: '3.11'
20+
21+
- name: Install dependencies
22+
run: |
23+
python -m pip install --upgrade pip
24+
pip install black flake8 isort mypy
25+
26+
- name: Run Black
27+
run: black --check src/ scripts/
28+
29+
- name: Run Flake8
30+
run: flake8 src/ scripts/ --max-line-length=120
31+
32+
- name: Run isort
33+
run: isort --check-only src/ scripts/
34+
35+
test:
36+
name: Run Tests
37+
runs-on: ubuntu-latest
38+
strategy:
39+
matrix:
40+
python-version: ['3.9', '3.10', '3.11']
41+
steps:
42+
- uses: actions/checkout@v4
43+
44+
- name: Set up Python ${{ matrix.python-version }}
45+
uses: actions/setup-python@v5
46+
with:
47+
python-version: ${{ matrix.python-version }}
48+
49+
- name: Install dependencies
50+
run: |
51+
python -m pip install --upgrade pip
52+
pip install -r requirements.txt
53+
pip install pytest pytest-cov
54+
55+
- name: Run tests
56+
run: |
57+
pytest tests/ --cov=src --cov-report=xml --cov-report=term
58+
59+
- name: Upload coverage
60+
uses: codecov/codecov-action@v4
61+
with:
62+
file: ./coverage.xml
63+
fail_ci_if_error: false
64+
65+
build-docker:
66+
name: Build Docker Image
67+
runs-on: ubuntu-latest
68+
needs: [lint, test]
69+
steps:
70+
- uses: actions/checkout@v4
71+
72+
- name: Set up Docker Buildx
73+
uses: docker/setup-buildx-action@v3
74+
75+
- name: Build Docker image
76+
uses: docker/build-push-action@v5
77+
with:
78+
context: .
79+
push: false
80+
tags: synthoraai/llm-finetuning-lab:latest
81+
cache-from: type=gha
82+
cache-to: type=gha,mode=max
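
Reviewer note: the `lint` job above runs three checkers in sequence (mypy is installed but no step invokes it). To reproduce the same checks before pushing, a small runner along these lines might help. It is a sketch only: `run_lint` and its `runner` parameter are hypothetical and not part of the commit, and the commands are copied from the workflow on the assumption that black, flake8, and isort are installed locally.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands copied from the lint job in ci.yml.
LINT_COMMANDS: List[List[str]] = [
    ["black", "--check", "src/", "scripts/"],
    ["flake8", "src/", "scripts/", "--max-line-length=120"],
    ["isort", "--check-only", "src/", "scripts/"],
]


def _run(cmd: Sequence[str]) -> int:
    """Execute one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_lint(
    commands: Sequence[Sequence[str]] = LINT_COMMANDS,
    runner: Callable[[Sequence[str]], int] = _run,
) -> int:
    """Run every command (no early exit) and return how many failed."""
    failed = 0
    for cmd in commands:
        code = runner(cmd)
        status = "ok" if code == 0 else f"failed (exit {code})"
        print(f"{' '.join(cmd)}: {status}")
        if code != 0:
            failed += 1
    return failed


# Demo with a stub runner so the example runs without the tools installed.
# For real use, call run_lint() with no arguments and exit non-zero on failures.
print(run_lint(runner=lambda cmd: 0))
```

Unlike the workflow, which stops at the first failing step, this runner continues through all three commands so a single local pass reports every failure.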

.gitignore

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
# Python
2+
__pycache__/
3+
*.py[cod]
4+
*$py.class
5+
*.so
6+
.Python
7+
build/
8+
develop-eggs/
9+
dist/
10+
downloads/
11+
eggs/
12+
.eggs/
13+
lib/
14+
lib64/
15+
parts/
16+
sdist/
17+
var/
18+
wheels/
19+
*.egg-info/
20+
.installed.cfg
21+
*.egg
22+
23+
# Virtual environments
24+
venv/
25+
env/
26+
ENV/
27+
.venv
28+
29+
# IDE
30+
.vscode/
31+
.idea/
32+
*.swp
33+
*.swo
34+
*~
35+
.DS_Store
36+
37+
# Jupyter
38+
.ipynb_checkpoints
39+
*.ipynb
40+
41+
# Model checkpoints
42+
checkpoints/
43+
/models/
44+
*.pth
45+
*.pt
46+
*.bin
47+
*.onnx
48+
*.pb
49+
50+
# Data
51+
datasets/
52+
/data/
53+
*.csv
54+
*.json
55+
*.jsonl
56+
*.parquet
57+
58+
# Allow source code files
59+
!src/**/*.py
60+
!configs/*.yaml
61+
!*.json
62+
63+
# Logs
64+
logs/
65+
*.log
66+
runs/
67+
tensorboard/
68+
69+
# Outputs
70+
outputs/
71+
exports/
72+
benchmarks/
73+
74+
# Environment variables
75+
.env
76+
.env.local
77+
.env.*.local
78+
79+
# Testing
80+
.coverage
81+
.pytest_cache/
82+
htmlcov/
83+
.tox/
84+
85+
# Documentation
86+
docs/_build/
87+
site/
88+
89+
# Cache
90+
.cache/
91+
*.cache
92+
93+
# Wandb
94+
wandb/
95+
96+
# Misc
97+
*.bak
98+
*.tmp
99+
.benchmarks
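
Reviewer note: in a `.gitignore`, the last matching pattern decides the outcome, so the later `!*.json` rule re-includes every JSON file that `*.json` excluded a few lines earlier (except files inside a directory that is itself ignored, such as `datasets/`, which Git does not descend into). The sketch below illustrates that precedence with a deliberately simplified matcher. `is_ignored` is a hypothetical helper that only compares basenames against slash-free patterns; it is not a full implementation of Git's matching rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Sequence

# Subset of the rules above, in file order.
RULES = ["*.csv", "*.json", "*.jsonl", "*.parquet", "!*.json"]


def is_ignored(path: str, rules: Sequence[str] = RULES) -> bool:
    """Simplified check: the last rule whose pattern matches the basename wins."""
    name = PurePosixPath(path).name
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(name, pattern):
            # A "!" rule re-includes the file; a plain rule ignores it.
            ignored = not negated
    return ignored


for candidate in ["exports_meta/report.json", "notes/articles.csv", "raw/items.jsonl"]:
    print(candidate, "ignored" if is_ignored(candidate) else "tracked")
```

If the intent was to keep only specific JSON files under version control, a narrower negation would be needed; as written, the `*.json` line has no effect outside ignored directories.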

CONTRIBUTING.md

Lines changed: 137 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,137 @@
1+
# Contributing to LLM Fine-Tuning Lab
2+
3+
Thank you for your interest in contributing to the LLM Fine-Tuning Lab! This document provides guidelines and instructions for contributing.
4+
5+
## Getting Started
6+
7+
1. **Fork the repository**
8+
2. **Clone your fork**
9+
```bash
10+
git clone https://github.com/YOUR_USERNAME/LLM-Finetuning-Lab.git
11+
cd LLM-Finetuning-Lab
12+
```
13+
14+
3. **Set up development environment**
15+
```bash
16+
make dev-install
17+
```
18+
19+
4. **Create a feature branch**
20+
```bash
21+
git checkout -b feature/your-feature-name
22+
```
23+
24+
## Development Guidelines
25+
26+
### Code Style
27+
28+
We follow PEP 8 style guidelines with some modifications:
29+
30+
- Maximum line length: 120 characters
31+
- Use type hints for function signatures
32+
- Write docstrings for all public functions and classes
33+
34+
Format your code before committing:
35+
36+
```bash
37+
make format
38+
```
39+
40+
### Testing
41+
42+
All new features must include tests:
43+
44+
```bash
45+
# Run tests
46+
make test
47+
48+
# Run specific test file
49+
pytest tests/test_training.py
50+
```
51+
52+
### Commit Messages
53+
54+
Follow conventional commit format:
55+
56+
```
57+
<type>(<scope>): <subject>
58+
59+
<body>
60+
61+
<footer>
62+
```
63+
64+
Types:
65+
- `feat`: New feature
66+
- `fix`: Bug fix
67+
- `docs`: Documentation changes
68+
- `style`: Code style changes
69+
- `refactor`: Code refactoring
70+
- `test`: Adding tests
71+
- `chore`: Maintenance tasks
72+
73+
Example:
74+
```
75+
feat(training): add LoRA fine-tuning support
76+
77+
Implemented LoRA-based fine-tuning for efficient parameter
78+
adaptation. Includes configuration options and example scripts.
79+
80+
Closes #123
81+
```
82+
83+
## Pull Request Process
84+
85+
1. **Update documentation** if you're adding new features
86+
2. **Add tests** for new functionality
87+
3. **Run linters** and ensure all tests pass
88+
```bash
89+
make lint
90+
make test
91+
```
92+
4. **Update CHANGELOG.md** with your changes
93+
5. **Submit PR** with a clear description
94+
95+
### PR Checklist
96+
97+
- [ ] Code follows style guidelines
98+
- [ ] Tests added and passing
99+
- [ ] Documentation updated
100+
- [ ] CHANGELOG.md updated
101+
- [ ] Commit messages follow convention
102+
103+
## Areas for Contribution
104+
105+
### High Priority
106+
107+
- [ ] Implement RLHF training pipeline
108+
- [ ] Add support for multimodal models
109+
- [ ] Optimize distributed training
110+
- [ ] Improve documentation
111+
112+
### Medium Priority
113+
114+
- [ ] Add more evaluation metrics
115+
- [ ] Create tutorial notebooks
116+
- [ ] Implement model distillation
117+
- [ ] Add CI/CD improvements
118+
119+
### Good First Issues
120+
121+
Look for issues labeled `good-first-issue` in the issue tracker.
122+
123+
## Code Review Process
124+
125+
1. At least one maintainer review required
126+
2. All CI checks must pass
127+
3. Documentation must be updated
128+
4. Changes must be tested
129+
130+
## Questions?
131+
132+
Feel free to:
133+
- Open an issue for bugs or feature requests
134+
- Join our Discord community
135+
- Email: hoangson091104@gmail.com
136+
137+
Thank you for contributing to SynthoraAI!
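
Reviewer note: the commit message convention in CONTRIBUTING.md (`<type>(<scope>): <subject>` with seven listed types) lends itself to an automated check. The following is a minimal sketch of such a check, not something included in this commit. `parse_subject` is a hypothetical helper, and treating the scope as optional is an assumption made so that this commit's own subject line (`feat: Initialize ...`, which has no scope) would pass.

```python
import re
from typing import Dict, Optional

# Types listed in CONTRIBUTING.md.
TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# <type>, an optional (<scope>), then ": " and a non-empty subject.
SUBJECT_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<subject>\S.*)$"
)


def parse_subject(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the type, scope, and subject of a commit subject line, or None if invalid."""
    match = SUBJECT_RE.match(line)
    return match.groupdict() if match else None


print(parse_subject("feat(training): add LoRA fine-tuning support"))
print(parse_subject("update stuff"))
```

A check like this could run in a `commit-msg` hook or a CI step; only the first line of the message is validated here, leaving the body and footer free-form.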
