Generate Codebase Pre-Training Data

This example shows how to use Codegen to generate training data for large-scale LLM pre-training by extracting each function's implementation together with its dependencies and usages. The approach is inspired by node2vec: it treats the codebase as a graph and uses each function's neighbors in that graph as learning context.

What This Example Does

The script analyzes your codebase and generates training data by:

Finding All Functions
- Scans the entire codebase to identify function definitions
- Filters out trivial functions (fewer than 2 lines)
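
The scan-and-filter step can be illustrated without Codegen. The following is a minimal sketch that uses Python's standard `ast` module as a stand-in for Codegen's parser. The helper name `find_nontrivial_functions` and the `min_lines` parameter are hypothetical, and the actual script may count lines differently.

```python
import ast


def find_nontrivial_functions(source: str, min_lines: int = 2) -> list[str]:
    """Return names of functions whose definition spans at least min_lines lines."""
    tree = ast.parse(source)
    names = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno and end_lineno are 1-based and inclusive (Python 3.8+)
            length = node.end_lineno - node.lineno + 1
            if length >= min_lines:
                names.append(node.name)
    return names


# Hypothetical input: one trivial one-liner and one multi-line function
SAMPLE = '''
def one_liner(): return 1

def process_data(items):
    cleaned = [i for i in items if i]
    return cleaned
'''

print(find_nontrivial_functions(SAMPLE))
```

Only the multi-line function survives the filter; the one-liner is dropped as trivial.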

Capturing Implementation Context

{"implementation": {"source": "def process_data():\n    ...", "filepath": "src/process.py"}}

Extracting Dependencies

{"dependencies": [{"source": "def helper_function():\n    ...", "filepath": "src/helpers.py"}]}

Recording Usages

{"usages": [{"source": "result = process_data()", "filepath": "src/main.py"}]}

Running the Example

# Install Codegen
pip install codegen

# Run the data generation
python run.py

The script analyzes your codebase and writes the structured training data to training_data.json.

Understanding the Code

run.py - The main script that generates the training data
- Uses get_function_context() to extract implementation, dependencies, and usages
- Processes each function and builds a comprehensive context graph
- Outputs structured JSON data with metadata about the processing
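
The overall flow can be sketched as a pass over per-function records. This is an illustration under assumptions, not the contents of run.py: `generate_training_data` is a hypothetical helper, the records are assumed to be pre-built dicts, and the metadata definitions (functions found versus functions kept, averages taken over kept functions) are a guess at what the fields mean.

```python
import json


def generate_training_data(records: list[dict], min_lines: int = 2) -> dict:
    """Drop trivial functions and attach summary metadata (assumed definitions)."""
    kept = [
        r for r in records
        if len(r["implementation"]["source"].splitlines()) >= min_lines
    ]
    n = len(kept)
    return {
        "functions": kept,
        "metadata": {
            "total_functions": len(records),
            "total_processed": n,
            "avg_dependencies": sum(len(r["dependencies"]) for r in kept) / n if n else 0.0,
            "avg_usages": sum(len(r["usages"]) for r in kept) / n if n else 0.0,
        },
    }


# Hypothetical records: one two-line function and one trivial one-liner
records = [
    {
        "implementation": {"source": "def a():\n    return b()", "filepath": "a.py"},
        "dependencies": [{"source": "def b(): return 1", "filepath": "b.py"}],
        "usages": [],
    },
    {
        "implementation": {"source": "def b(): return 1", "filepath": "b.py"},
        "dependencies": [],
        "usages": [{"source": "return b()", "filepath": "a.py"}],
    },
]

data = generate_training_data(records)
# A real script would write this string to training_data.json
serialized = json.dumps(data, indent=2)
print(serialized)
```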

Output Format

The generated training_data.json follows this structure:

{
  "functions": [
    {
      "implementation": {
        "source": "...",
        "filepath": "..."
      },
      "dependencies": [
        {
          "source": "...",
          "filepath": "..."
        }
      ],
      "usages": [
        {
          "source": "...",
          "filepath": "..."
        }
      ]
    }
  ],
  "metadata": {
    "total_functions": 100,
    "total_processed": 85,
    "avg_dependencies": 2.5,
    "avg_usages": 3.2
  }
}
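
Before feeding the file to a training pipeline, it can help to check that each record has the shape shown above. The following sketch does that with plain Python; `validate_training_data` is a hypothetical helper, not part of Codegen or run.py.

```python
def validate_training_data(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape is as expected."""
    functions = data.get("functions")
    if not isinstance(functions, list):
        return ["'functions' must be a list"]

    problems = []

    def check_entry(entry, where: str) -> None:
        # Each entry must be a dict with string 'source' and 'filepath' fields
        for key in ("source", "filepath"):
            if not isinstance(entry, dict) or not isinstance(entry.get(key), str):
                problems.append(f"{where}: missing string field '{key}'")

    for i, record in enumerate(functions):
        check_entry(record.get("implementation"), f"functions[{i}].implementation")
        for section in ("dependencies", "usages"):
            for j, entry in enumerate(record.get(section, [])):
                check_entry(entry, f"functions[{i}].{section}[{j}]")
    return problems


# Hypothetical data: the implementation entry lacks its 'filepath'
bad = {
    "functions": [
        {
            "implementation": {"source": "def a():\n    return 1"},
            "dependencies": [],
            "usages": [],
        }
    ]
}
print(validate_training_data(bad))
```

In practice the argument would come from `json.load` on training_data.json.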
