This example demonstrates how to use Codegen to generate training data for large-scale LLM pre-training by extracting function implementations along with their dependencies and usages. The approach is inspired by node2vec: rather than treating each function in isolation, it uses the code graph to pair a function with its neighbors (the code it depends on and the code that uses it).
The script analyzes your codebase and generates training data by:
- Finding All Functions
  - Scans the entire codebase to identify function definitions
  - Filters out trivial functions (fewer than 2 lines)
- Capturing Implementation Context
  {"implementation": {"source": "def process_data():\n ...", "filepath": "src/process.py"}}
- Extracting Dependencies
  {"dependencies": [{"source": "def helper_function():\n ...", "filepath": "src/helpers.py"}]}
- Recording Usages
  {"usages": [{"source": "result = process_data()", "filepath": "src/main.py"}]}
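
The first two steps (finding functions and dropping trivial ones) can be illustrated without Codegen. The sketch below is not the script's implementation: it swaps in Python's standard `ast` module as a stand-in parser, and the name `find_nontrivial_functions`, the `min_lines` parameter, and the choice to count the `def` line toward the length are all assumptions made for illustration.

```python
import ast


def find_nontrivial_functions(source: str, filepath: str, min_lines: int = 2) -> list[dict]:
    """Return one record per function definition spanning at least `min_lines` lines.

    Hypothetical helper: uses the stdlib `ast` module, not Codegen.
    """
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Assumption: length is measured from the `def` line to the last body line.
        n_lines = node.end_lineno - node.lineno + 1
        if n_lines < min_lines:
            continue  # skip trivial one-line functions
        records.append(
            {
                "name": node.name,
                "implementation": {
                    "source": ast.get_source_segment(source, node),
                    "filepath": filepath,
                },
            }
        )
    return records


# Illustrative input: one trivial function and one that passes the filter.
SAMPLE = "def one(): return 1\n\ndef two():\n    x = 1\n    return x\n"
kept = find_nontrivial_functions(SAMPLE, "src/sample.py")
```

With this sample, only `two` survives the filter, and its record has the same `implementation` shape as the JSON fragment in the list above.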
# Install Codegen
pip install codegen
# Run the data generation
python run.py

The script will analyze your codebase and output a training_data.json file containing the structured training data.
run.py - The main script that generates the training data
- Uses get_function_context() to extract implementation, dependencies, and usages
- Processes each function and builds a comprehensive context graph
- Outputs structured JSON data with metadata about the processing
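
To make the shape of a per-function context concrete, here is a minimal sketch of assembling one record. It does not reproduce `get_function_context()` or any Codegen API: `Snippet`, `FunctionInfo`, and `build_context` are hypothetical stand-ins, and only the three output keys (`implementation`, `dependencies`, `usages`) come from this README.

```python
from dataclasses import dataclass, field


@dataclass
class Snippet:
    """Hypothetical stand-in for any code element with source text and a file path."""
    source: str
    filepath: str


@dataclass
class FunctionInfo:
    """Hypothetical stand-in for a function plus its neighbors in the code graph."""
    source: str
    filepath: str
    dependencies: list[Snippet] = field(default_factory=list)
    usages: list[Snippet] = field(default_factory=list)


def _entry(snippet: Snippet) -> dict:
    # Every entry in the output carries the same two fields.
    return {"source": snippet.source, "filepath": snippet.filepath}


def build_context(fn: FunctionInfo) -> dict:
    """Assemble one training record in the documented output shape."""
    return {
        "implementation": {"source": fn.source, "filepath": fn.filepath},
        "dependencies": [_entry(d) for d in fn.dependencies],
        "usages": [_entry(u) for u in fn.usages],
    }


# Illustrative data mirroring the examples in this README.
fn = FunctionInfo(
    source="def process_data():\n    ...",
    filepath="src/process.py",
    dependencies=[Snippet("def helper_function():\n    ...", "src/helpers.py")],
    usages=[Snippet("result = process_data()", "src/main.py")],
)
record = build_context(fn)
```

Keeping dependencies and usages as plain lists of `source`/`filepath` pairs means a record can be serialized with `json.dumps` without any custom encoder.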
The generated training_data.json follows this structure:
{
  "functions": [
    {
      "implementation": {
        "source": "...",
        "filepath": "..."
      },
      "dependencies": [
        {
          "source": "...",
          "filepath": "..."
        }
      ],
      "usages": [
        {
          "source": "...",
          "filepath": "..."
        }
      ]
    }
  ],
  "metadata": {
    "total_functions": 100,
    "total_processed": 85,
    "avg_dependencies": 2.5,
    "avg_usages": 3.2
  }
}
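
The `metadata` block can be derived from the `functions` list. The sketch below shows one plausible way to compute it. This README does not say how the script calculates the averages, so the helper `summarize` and its definition of each average (mean count per processed function, rounded to two decimals) are assumptions.

```python
import json


def summarize(functions: list[dict], total_functions: int) -> dict:
    """Build a metadata block for a list of function records.

    Assumption: averages are taken over the processed functions only.
    """
    processed = len(functions)

    def average(key: str) -> float:
        if processed == 0:
            return 0.0  # avoid dividing by zero on an empty run
        return round(sum(len(f[key]) for f in functions) / processed, 2)

    return {
        "total_functions": total_functions,
        "total_processed": processed,
        "avg_dependencies": average("dependencies"),
        "avg_usages": average("usages"),
    }


# Illustrative records: only the list lengths matter for the summary.
functions = [
    {"implementation": {}, "dependencies": [{}], "usages": []},
    {"implementation": {}, "dependencies": [{}, {}], "usages": [{}, {}, {}]},
]
payload = {"functions": functions, "metadata": summarize(functions, total_functions=3)}
serialized = json.dumps(payload, indent=2)
```

The same helper can be run against a loaded training_data.json to sanity-check that the reported counts match the records in the file.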