Commit 70a27c9

Update README.md to reflect changes in versioning and enhance documentation for Version 4 features
1 parent: 54894eb · commit: 70a27c9

1 file changed: +36 -3 lines changed

README.md

Lines changed: 36 additions & 3 deletions
@@ -14,7 +14,7 @@ This document outlines the system's naming conventions, lifecycle, and model con
 > [!IMPORTANT]
 > Old documentation is available in the `Archived Models` directory of this [repository](https://github.com/DefinetlyNotAI/VulnScan_Data)
 >
-> This documentation is covers test data, metrics and niche features.
+> This documentation covers test data, metrics and niche features.
 
 ---
 

@@ -87,7 +87,10 @@ This document outlines the system's naming conventions, lifecycle, and model con
 
 ---
 
-### Version 3 (Current)
+### Version 3 (Superseded)
+- **Superseded by Version 4**
+- Retained for reference and backward compatibility.
+
 1. **Read Config**: Load model and training parameters.
 2. **Load Data**: Collect and preprocess sensitive data.
 3. **Split Data**: Separate into training and validation sets.
@@ -100,6 +103,36 @@ This document outlines the system's naming conventions, lifecycle, and model con
 
 ---
 
+### Version 4 (Current)
+- **Current Release**: Major improvements in scalability, modularity, and embedding-based training.
+- **Key Features**:
+- **Dynamic Dataset Generation**: Uses GPT-Neo for synthetic sensitive data generation, scaling from small to large datasets.
+- **Embedding-Based Training**: Employs MiniLM sentence embeddings for all text samples, improving feature representation.
+- **Multi-Round Training**: Supports multiple training rounds per dataset size for robust model evaluation.
+- **Automated Caching**: Datasets and embeddings are cached for reuse, reducing redundant computation.
+- **Configurable Model Naming**: Model names reflect dataset size, type, version, and training round.
+- **Progress Tracking**: Training history and metrics are saved per round for analysis.
+- **Extensible Framework**: Easily integrates new models, datasets, and training strategies.
+
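The "Automated Caching" feature above could be approximated as follows. This is a minimal sketch, not the project's implementation: the `cached` helper, the JSON file layout, and the `fake_generate` stand-in for the GPT-Neo step are all hypothetical names introduced for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached(cache_dir: Path, key_parts: dict, build):
    """Return a cached JSON artifact if present, otherwise build and store it."""
    # Hash the parameters so each dataset/size combination gets its own file.
    digest = hashlib.sha256(json.dumps(key_parts, sort_keys=True).encode("utf-8"))
    path = cache_dir / f"{digest.hexdigest()[:16]}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    value = build()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(value), encoding="utf-8")
    return value


calls = []


def fake_generate():
    # Hypothetical stand-in for the generation step; returns placeholder samples.
    calls.append(1)
    return ["sample text 0", "sample text 1"]


with tempfile.TemporaryDirectory() as tmp:
    params = {"dataset": "Sense", "size": 2}
    first = cached(Path(tmp), params, fake_generate)   # builds and writes the file
    second = cached(Path(tmp), params, fake_generate)  # served from the cache
```

Keying the cache on the generation parameters means a change in dataset size produces a new file instead of silently reusing stale data.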
+#### Version 4 Workflow
+1. **Initialize Resources**: Load GPT-Neo and MiniLM models for generation and embedding.
+2. **Dataset Generation**: Create or load datasets of varying sizes, using cached data when available.
+3. **Embedding Generation**: Compute sentence embeddings for train, validation, and test splits.
+4. **Split Data**: Partition data into train, validation, and test sets based on configurable ratios.
+5. **Model Training**: Train a neural network using embeddings, with support for early stopping and learning rate scheduling.
+6. **Multi-Round Evaluation**: Repeat training for each dataset size and round, saving metrics and model states.
+7. **Progress Logging**: Save training history, plots, and logs for each round and model.
+8. **Extensibility**: Easily add new dataset sizes, model types, or embedding strategies.
+
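The ratio-based split and the per-size, per-round loop in the workflow above can be sketched roughly as below. Everything here is an assumption made for illustration: the function names, the default 80/10/10 ratios, and the `dummy_train` placeholder (which stands in for embedding and neural-network training) do not come from the repository.

```python
import random


def split_by_ratio(samples, train=0.8, val=0.1, seed=0):
    """Shuffle and partition samples into train/val/test; test gets the remainder."""
    if not 0 < train < 1 or not 0 <= val < 1 or train + val >= 1:
        raise ValueError("ratios must leave a non-empty share for the test split")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def run_rounds(dataset_sizes, rounds, train_fn):
    """Repeat training for every dataset size and round, collecting metrics."""
    history = []
    for size in dataset_sizes:
        data = list(range(size))  # placeholder for generated or cached samples
        for round_no in range(1, rounds + 1):
            train_set, val_set, test_set = split_by_ratio(data, seed=round_no)
            metrics = train_fn(train_set, val_set, test_set)
            history.append({"size": size, "round": round_no, **metrics})
    return history


def dummy_train(train_set, val_set, test_set):
    # Hypothetical placeholder: reports split sizes instead of training a model.
    return {"n_train": len(train_set), "n_val": len(val_set), "n_test": len(test_set)}


history = run_rounds(dataset_sizes=[10, 100], rounds=2, train_fn=dummy_train)
```

Seeding the shuffle with the round number gives each round a different but reproducible partition, which is one way to make multi-round metrics comparable across runs.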
+#### Example Model Name
+`Model_Sense.4n1`:
+- Dataset: `Sense` (50k to 100k files).
+- Version: 4 (current major version).
+- Model: NeuralNetwork (`n`).
+- Training Round: 1.
+
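A name such as `Model_Sense.4n1` can be taken apart and rebuilt mechanically. The sketch below infers the pattern from this single example, so it assumes a numeric version, a one-letter lowercase model code, and a numeric round; `ModelName` and `parse_model_name` are hypothetical helpers, not part of the project.

```python
import re
from dataclasses import dataclass

# Assumed layout: Model_<dataset>.<version><model letter><training round>
NAME_PATTERN = re.compile(
    r"^Model_(?P<dataset>[A-Za-z]+)\.(?P<version>\d+)(?P<model>[a-z])(?P<round>\d+)$"
)


@dataclass(frozen=True)
class ModelName:
    dataset: str
    version: int
    model: str
    training_round: int

    def __str__(self) -> str:
        # Rebuild the name in the same layout the pattern accepts.
        return f"Model_{self.dataset}.{self.version}{self.model}{self.training_round}"


def parse_model_name(name: str) -> ModelName:
    """Split a model name into dataset, version, model code, and round."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    return ModelName(
        dataset=match["dataset"],
        version=int(match["version"]),
        model=match["model"],
        training_round=int(match["round"]),
    )


parsed = parse_model_name("Model_Sense.4n1")
```

Here `parsed` carries the dataset `Sense`, version 4, model code `n`, and round 1, and `str(parsed)` reproduces the original name.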
+---
+
 ## Preferred Model
 **NeuralNetwork (`n`)**
 - Proven to be the most effective for detecting sensitive data in the project.
@@ -108,7 +141,7 @@ This document outlines the system's naming conventions, lifecycle, and model con
 
 ## Notes
 - **Naming System**: Helps track model versions, datasets, and training iterations for transparency and reproducibility.
-- **Current Focus**: Transition to `v3` for improved accuracy, flexibility, and robust performance.
+- **Current Focus**: Version 4 for improved scalability, embedding-based training, and robust performance.
 
 ---
 
