@@ -14,7 +14,7 @@ This document outlines the system's naming conventions, lifecycle, and model con
 > [!IMPORTANT]
 > Old documentation is available in the `Archived Models` directory of this [repository](https://github.com/DefinetlyNotAI/VulnScan_Data)
 >
-> This documentation is covers test data, metrics and niche features.
+> This documentation covers test data, metrics and niche features.
 
 ---
 
@@ -87,7 +87,10 @@ This document outlines the system's naming conventions, lifecycle, and model con
 
 ---
 
-### Version 3 (Current)
+### Version 3 (Superseded)
+- **Superseded by Version 4**
+- Retained for reference and backward compatibility.
+
 1. **Read Config**: Load model and training parameters.
 2. **Load Data**: Collect and preprocess sensitive data.
 3. **Split Data**: Separate into training and validation sets.
@@ -100,6 +103,36 @@ This document outlines the system's naming conventions, lifecycle, and model con
 
 ---
 
+### Version 4 (Current)
+- **Current Release**: Major improvements in scalability, modularity, and embedding-based training.
+- **Key Features**:
+  - **Dynamic Dataset Generation**: Uses GPT-Neo for synthetic sensitive data generation, scaling from small to large datasets.
+  - **Embedding-Based Training**: Employs MiniLM sentence embeddings for all text samples, improving feature representation.
+  - **Multi-Round Training**: Supports multiple training rounds per dataset size for robust model evaluation.
+  - **Automated Caching**: Datasets and embeddings are cached for reuse, reducing redundant computation.
+  - **Configurable Model Naming**: Model names reflect dataset size, type, version, and training round.
+  - **Progress Tracking**: Training history and metrics are saved per round for analysis.
+  - **Extensible Framework**: Easily integrates new models, datasets, and training strategies.
+
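Reviewer note (not part of the commit): the "Automated Caching" feature could look roughly like the sketch below. The function name `load_or_generate`, the `dataset_<size>.json` file layout, and the `generate` callback are all assumptions made for illustration. The diff does not show how the project stores its cache.

```python
import json
from pathlib import Path
from typing import Callable, List


def load_or_generate(
    cache_dir: Path,
    size: int,
    generate: Callable[[int], List[str]],
) -> List[str]:
    """Return a dataset of `size` samples, reusing a cached copy if one exists.

    Hypothetical helper: the cache file name and JSON format are assumptions,
    not the project's actual layout.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"dataset_{size}.json"

    # Cache hit: skip the expensive generation step entirely.
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # Cache miss: generate once, then persist for later runs.
    samples = generate(size)
    cache_file.write_text(json.dumps(samples), encoding="utf-8")
    return samples
```

The same pattern would apply to embeddings, with a binary format in place of JSON.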
+#### Version 4 Workflow
+1. **Initialize Resources**: Load GPT-Neo and MiniLM models for generation and embedding.
+2. **Dataset Generation**: Create or load datasets of varying sizes, using cached data when available.
+3. **Embedding Generation**: Compute sentence embeddings for train, validation, and test splits.
+4. **Split Data**: Partition data into train, validation, and test sets based on configurable ratios.
+5. **Model Training**: Train a neural network using embeddings, with support for early stopping and learning rate scheduling.
+6. **Multi-Round Evaluation**: Repeat training for each dataset size and round, saving metrics and model states.
+7. **Progress Logging**: Save training history, plots, and logs for each round and model.
+8. **Extensibility**: Easily add new dataset sizes, model types, or embedding strategies.
+
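Reviewer note (not part of the commit): steps 4 and 5 of the workflow above can be sketched in plain Python as follows. The default ratios, the seed, the `patience` value, and the names `split_dataset` and `EarlyStopping` are illustrative assumptions. The project's real training loop is not shown in this diff and may well differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T],
    train_ratio: float = 0.8,  # assumed default, not taken from the project
    val_ratio: float = 0.1,    # assumed default; the remainder becomes the test set
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically, then cut into train / validation / test."""
    if not (0 < train_ratio < 1 and 0 <= val_ratio < 1 and train_ratio + val_ratio <= 1):
        raise ValueError("ratios must be in [0, 1) and sum to at most 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch with the validation loss, and the loop would break when it returns `True`.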
+#### Example Model Name
+`Model_Sense.4n1`:
+- Dataset: `Sense` (50k to 100k files).
+- Version: 4 (current major version).
+- Model: NeuralNetwork (`n`).
+- Training Round: 1.
+
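Reviewer note (not part of the commit): the naming scheme in the example above can be checked mechanically. The sketch below assumes the pattern `Model_<Dataset>.<version><model code><round>`, inferred from the single example `Model_Sense.4n1`. Only the `n` code is documented here, so other codes are reported as unknown. `ModelName` and `parse_model_name` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Only "n" (NeuralNetwork) is documented in this diff; other codes are not mapped.
MODEL_CODES = {"n": "NeuralNetwork"}

# Assumed pattern, inferred from the one example "Model_Sense.4n1".
_NAME_RE = re.compile(
    r"^Model_(?P<dataset>[A-Za-z]+)\.(?P<version>\d+)(?P<code>[a-z])(?P<round>\d+)$"
)


@dataclass(frozen=True)
class ModelName:
    dataset: str
    version: int
    model_code: str
    training_round: int

    @property
    def model_type(self) -> str:
        """Human-readable model type, or "unknown" for undocumented codes."""
        return MODEL_CODES.get(self.model_code, "unknown")


def parse_model_name(name: str) -> ModelName:
    """Split a model name into dataset, version, model code, and training round."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised model name: {name!r}")
    return ModelName(
        dataset=match["dataset"],
        version=int(match["version"]),
        model_code=match["code"],
        training_round=int(match["round"]),
    )
```

For example, `parse_model_name("Model_Sense.4n1")` yields dataset `Sense`, version 4, model code `n`, and training round 1.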
+---
+
 ## Preferred Model
 **NeuralNetwork (`n`)**
 - Proven to be the most effective for detecting sensitive data in the project.
@@ -108,7 +141,7 @@ This document outlines the system's naming conventions, lifecycle, and model con
 
 ## Notes
 - **Naming System**: Helps track model versions, datasets, and training iterations for transparency and reproducibility.
-- **Current Focus**: Transition to `v3` for improved accuracy, flexibility, and robust performance.
+- **Current Focus**: Version 4 for improved scalability, embedding-based training, and robust performance.
 
 ---
 