
deeptutor-rs


AI learning assistant with RAG + knowledge graph. A Rust reimplementation of DeepTutor.

Why This Exists

DeepTutor is a powerful AI tutoring platform, but its Python implementation carries a massive dependency tree (raganything, docling, llama-index, PyMuPDF, etc.) totaling 500MB+ and requires a complex environment setup. Its knowledge graph is a thin wrapper around LightRAG, with no direct control over entity extraction.

deeptutor-rs reimplements the core learning pipeline in Rust with:

  • Single binary (~10MB) vs Python's 500MB+ dependency tree
  • Custom knowledge graph with entity extraction (not a LightRAG wrapper)
  • Multi-strategy retrieval: BM25 keyword search, TF-IDF semantic similarity, graph-based traversal
  • Zero-install deployment: download and run, no Python/pip/conda needed
  • Memory safety guaranteed at compile time
  • True async I/O with tokio (no Python GIL)

Features

  • Document Ingestion: Parse PDF, Markdown, and text files into semantic chunks
  • Sentence-Aware Chunking: Fixed-size and semantic chunkers with configurable overlap
  • Knowledge Graph: Extract entities (concepts, acronyms, tech terms) and relationships, stored in petgraph
  • RAG Pipeline: Query -> multi-strategy retrieval -> context assembly -> LLM prompt
  • BM25 Keyword Search: Term frequency index with IDF weighting
  • TF-IDF Semantic Search: Cosine similarity over TF-IDF vectors
  • Graph-Based Retrieval: Traverse entity relationships to find related content
  • Hybrid Search: Weighted combination of all three strategies
  • Interactive Tutoring: Socratic questioning, question decomposition, follow-up generation
  • Session Management: Persistent conversation history with knowledge state tracking
  • LLM Provider Abstraction: OpenAI and Anthropic APIs via feature flags
  • CLI Interface: Full-featured CLI with subcommands for all operations
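As an illustration of the chunking feature, a fixed-size chunker with overlap can be sketched in a few lines. This is a std-only sketch, not the crate's actual chunk module, and `chunk_fixed` is a hypothetical name:

```rust
/// Split `text` into chunks of `size` characters, with `overlap`
/// characters shared between consecutive chunks so that context is
/// not lost at chunk boundaries.
fn chunk_fixed(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start = end - overlap; // step back to create the overlap
    }
    chunks
}
```

The semantic chunker extends the same idea by snapping chunk boundaries to sentence ends instead of fixed character offsets.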

Quick Start

# Initialize configuration
deeptutor init

# Set your API key
export OPENAI_API_KEY=your-key

# Ingest documents
deeptutor ingest lecture-notes.pdf textbook.md -n my-course

# Ask questions
deeptutor ask "What is machine learning?" -n my-course

# Deep questioning (auto-decomposes complex questions)
deeptutor ask "How does backpropagation work in neural networks?" -n my-course --deep

# View knowledge graph
deeptutor graph stats -n my-course
deeptutor graph top -n 10 --name my-course
deeptutor graph search "neural" -n my-course
deeptutor graph neighbors "machine learning" -n my-course
deeptutor graph path "neural networks" "optimization" -n my-course

# Session management
deeptutor session list
deeptutor session show <session-id>
deeptutor session summary <session-id>

# Knowledge base info
deeptutor info -n my-course

CLI Reference

deeptutor ingest

Ingest documents into a knowledge base.

deeptutor ingest <files...> [-n <name>]

Supported formats: .pdf, .md, .markdown, .txt, .text, .rst

deeptutor ask

Ask a question using the knowledge base.

deeptutor ask <question> [-n <name>] [--deep] [--session <id>]
  • --deep: Decompose complex questions into sub-questions
  • --session: Continue an existing session

deeptutor graph

View and manage the knowledge graph.

| Subcommand | Description |
| --- | --- |
| stats | Show entity/relationship counts |
| search &lt;query&gt; | Find entities by name |
| neighbors &lt;entity&gt; | Show related concepts |
| path &lt;from&gt; &lt;to&gt; | Find shortest path between entities |
| top [-n &lt;count&gt;] | Show top entities by frequency |

deeptutor session

Manage learning sessions.

| Subcommand | Description |
| --- | --- |
| list | List all sessions |
| show &lt;id&gt; | Show conversation history |
| summary &lt;id&gt; | Learning progress summary |
| delete &lt;id&gt; | Delete a session |

deeptutor info

Show knowledge base statistics.

deeptutor init

Create default configuration file.

Architecture

                    +------------------+
                    |   CLI (clap)     |
                    +--------+---------+
                             |
                    +--------v---------+
                    |     Tutor        |
                    | - Socratic mode  |
                    | - Q decomposition|
                    | - Follow-ups     |
                    +--------+---------+
                             |
              +--------------+--------------+
              |                             |
     +--------v---------+         +--------v---------+
     |   RAG Pipeline   |         |  Session Manager |
     | - Ingest         |         | - History        |
     | - Search         |         | - Knowledge state|
     | - Context format |         | - Persistence    |
     +--------+---------+         +------------------+
              |
     +---------+---------+-----------+
     |         |         |           |
+----v---+ +---v----+ +--v-----+ +---v----+
| Keyword| |Semantic| | Graph  | | Hybrid |
|  BM25  | | TF-IDF | |Traverse| | Merge  |
+--------+ +--------+ +---+----+ +--------+
                          |
                     +----v----+
                     |Knowledge|
                     |  Graph  |
                     |petgraph |
                     +---------+

Module Structure

| Module | Purpose |
| --- | --- |
| config | Application configuration with serialization |
| document | Document parsing (PDF, Markdown, Text) |
| chunk | Fixed and semantic text chunking |
| graph | Knowledge graph with entity/relationship extraction |
| retrieval | BM25, TF-IDF, graph-based, and hybrid search |
| rag | Knowledge base management and RAG pipeline |
| llm | LLM provider abstraction (OpenAI, Anthropic, Mock) |
| session | Session management with knowledge state tracking |
| tutor | Interactive tutoring with Socratic mode |

Configuration

Create deeptutor.json with deeptutor init:

{
  "chunk": {
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "strategy": "semantic"
  },
  "graph": {
    "min_entity_length": 2,
    "max_entities_per_chunk": 20,
    "min_confidence": 0.3
  },
  "retrieval": {
    "top_k": 5,
    "strategy": "hybrid",
    "keyword_weight": 0.3,
    "semantic_weight": 0.4,
    "graph_weight": 0.3
  },
  "llm": {
    "provider": "openai",
    "model": "gpt-4o",
    "temperature": 0.7,
    "max_tokens": 4096
  },
  "tutor": {
    "max_history_tokens": 4000,
    "socratic_mode": true,
    "generate_followups": true,
    "followup_count": 3
  }
}

LLM Providers

| Provider | Environment Variable | Models |
| --- | --- | --- |
| OpenAI | OPENAI_API_KEY | gpt-4o, gpt-4o-mini, etc. |
| Anthropic | ANTHROPIC_API_KEY | claude-3-opus, claude-3-sonnet, etc. |
| Mock | (none) | For testing without API keys |

Override via CLI: --provider anthropic --model claude-3-sonnet-20240229
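One plausible shape for the provider abstraction is a single trait that every backend implements, with the concrete provider selected from config or the --provider flag. This is a simplified, synchronous sketch; the crate presumably uses an async trait over tokio, and `LlmProvider`/`MockProvider` are hypothetical names:

```rust
/// Common interface every LLM backend implements.
trait LlmProvider {
    fn name(&self) -> &'static str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// The mock backend: deterministic output, no network, no API key.
struct MockProvider;

impl LlmProvider for MockProvider {
    fn name(&self) -> &'static str {
        "mock"
    }
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[mock reply to a {}-char prompt]", prompt.len()))
    }
}
```

A mock backend like this is what makes the test suite runnable without API keys.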

Retrieval Strategies

BM25 Keyword Search

Classic term-frequency-based ranking with inverse document frequency weighting. Best for exact term matching.
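The standard BM25 scoring function (with the common defaults k1 = 1.2, b = 0.75) can be written directly from the description above. A self-contained sketch, not the crate's actual retrieval code:

```rust
use std::collections::HashMap;

/// BM25 score of one document for a query. `tf` maps a term to its
/// frequency in this document, `df` to the number of documents in the
/// corpus containing it; `n_docs` and `avg_len` describe the corpus.
fn bm25_score(
    query: &[&str],
    tf: &HashMap<&str, f64>,
    doc_len: f64,
    avg_len: f64,
    df: &HashMap<&str, f64>,
    n_docs: f64,
) -> f64 {
    let (k1, b) = (1.2, 0.75);
    query
        .iter()
        .map(|term| {
            let f = *tf.get(term).unwrap_or(&0.0);
            let n = *df.get(term).unwrap_or(&0.0);
            // Smoothed inverse document frequency: rare terms weigh more.
            let idf = ((n_docs - n + 0.5) / (n + 0.5) + 1.0).ln();
            // Term frequency saturates (k1) and is length-normalized (b).
            idf * f * (k1 + 1.0) / (f + k1 * (1.0 - b + b * doc_len / avg_len))
        })
        .sum()
}
```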

TF-IDF Semantic Search

Term frequency-inverse document frequency vectors with cosine similarity. Captures semantic similarity through term co-occurrence patterns.
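With documents represented as sparse term-weight maps, the cosine similarity step reduces to a dot product over shared terms divided by the vector norms. An illustrative std-only sketch (the crate's actual code may differ):

```rust
use std::collections::HashMap;

/// Cosine similarity between two sparse TF-IDF vectors keyed by term.
fn cosine(a: &HashMap<&str, f64>, b: &HashMap<&str, f64>) -> f64 {
    // Dot product over terms present in `a`; missing terms contribute 0.
    let dot: f64 = a.iter().map(|(t, w)| w * b.get(t).unwrap_or(&0.0)).sum();
    let na: f64 = a.values().map(|w| w * w).sum::<f64>().sqrt();
    let nb: f64 = b.values().map(|w| w * w).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}
```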

Graph-Based Retrieval

Traverses the knowledge graph starting from entities matching query terms, collecting chunks associated with related entities up to configurable depth.
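The traversal is a breadth-first search with a depth cap. The crate stores the graph in petgraph; the sketch below uses a plain adjacency map instead so it stays self-contained, and `related_entities` is a hypothetical name:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Collect entities reachable from `start` within `max_depth` hops.
fn related_entities(
    adj: &HashMap<&str, Vec<&str>>,
    start: &str,
    max_depth: usize,
) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([(start, 0)]);
    while let Some((node, depth)) = queue.pop_front() {
        // Skip already-visited nodes; stop expanding at the depth cap.
        if !seen.insert(node.to_string()) || depth == max_depth {
            continue;
        }
        for &next in adj.get(node).into_iter().flatten() {
            queue.push_back((next, depth + 1));
        }
    }
    seen.remove(start);
    seen
}
```

The retriever would then gather the chunks associated with each returned entity.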

Hybrid Search (Default)

Weighted combination of all three strategies:

  • Keyword weight: 0.3
  • Semantic weight: 0.4
  • Graph weight: 0.3

Results are merged, deduplicated, and re-ranked by combined score.
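The merge step amounts to a weighted sum per chunk followed by a re-sort. A sketch under the default weights (illustrative; the real pipeline likely also normalizes per-strategy scores before combining):

```rust
use std::collections::HashMap;

/// Merge per-strategy (chunk_id, score) hits into one ranked list
/// using the default 0.3 / 0.4 / 0.3 weights.
fn hybrid_merge(
    keyword: &[(u32, f64)],
    semantic: &[(u32, f64)],
    graph: &[(u32, f64)],
) -> Vec<(u32, f64)> {
    let mut combined: HashMap<u32, f64> = HashMap::new();
    for (hits, weight) in [(keyword, 0.3), (semantic, 0.4), (graph, 0.3)] {
        for &(chunk_id, score) in hits {
            // Summing into one map deduplicates chunks across strategies.
            *combined.entry(chunk_id).or_default() += weight * score;
        }
    }
    let mut ranked: Vec<_> = combined.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}
```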

Knowledge Graph

The knowledge graph extracts four types of entities:

| Type | Example | Detection |
| --- | --- | --- |
| Concept | "Machine Learning" | Capitalized phrases |
| Acronym | "API", "REST" | All-caps words (2-6 chars) |
| Technology | "petgraph-0.7" | Hyphenated/dotted terms |
| Term | "knowledge graph" | Quoted phrases |

Relationships are detected through:

  • Pattern matching: "X is a Y", "X uses Y", "X part of Y"
  • Co-occurrence: Entities in the same chunk are related
  • Proximity weighting: Closer entities get higher relationship weight
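As a concrete example of the pattern-based detection, the acronym rule (2-6 all-caps characters) reduces to a token filter. A std-only sketch with hypothetical names, not the crate's actual extractor:

```rust
/// A token counts as an acronym if it is 2-6 uppercase ASCII letters.
fn is_acronym(token: &str) -> bool {
    (2..=6).contains(&token.len())
        && token.chars().all(|c| c.is_ascii_uppercase())
}

/// Split text on non-alphanumeric characters and keep acronym tokens.
fn extract_acronyms(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|&t| is_acronym(t))
        .collect()
}
```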

Testing

# Run all tests
cargo test

# Run specific module tests
cargo test chunk::tests
cargo test graph::tests
cargo test retrieval::tests
cargo test rag::tests
cargo test session::tests
cargo test tutor::tests
cargo test llm::tests

# Run with output
cargo test -- --nocapture

220 tests covering all modules.

Comparison: deeptutor-rs vs DeepTutor

| Metric | deeptutor-rs (Rust) | DeepTutor (Python) |
| --- | --- | --- |
| Binary size | ~10 MB | 500+ MB (with deps) |
| Startup time | &lt;50 ms | 2-5 s (Python import) |
| Dependencies | 15 direct | 30+ (raganything, docling, llama-index...) |
| Knowledge graph | Custom (petgraph) | LightRAG wrapper |
| Entity extraction | Built-in pattern-based | Delegated to LightRAG |
| Retrieval strategies | 4 (keyword, semantic, graph, hybrid) | 4 (via LightRAG modes) |
| Memory safety | Compile-time guaranteed | Runtime (Python GC) |
| Cross-platform | Single binary | Requires Python 3.10+ setup |
| Session tracking | With knowledge state | Conversation history only |
| Tests | 220 | ~20 |
| Async | Native tokio | asyncio + nest_asyncio hack |
| Web UI | CLI only | React + Next.js |
| LLM providers | OpenAI, Anthropic | OpenAI, Azure, Anthropic, DashScope |

What deeptutor-rs Does Better

  • No mega-dependencies: No raganything, docling, or llama-index
  • Custom knowledge graph: Full control over entity extraction and graph traversal
  • Knowledge state tracking: Sessions track what the student understands vs. is confused about
  • Single binary distribution: Download and run, no environment setup
  • Type-safe configuration: Compile-time checked config with serde

What DeepTutor Does Better

  • Web UI: Full React/Next.js frontend
  • More LLM providers: Azure, DashScope, local models
  • Web search integration: 6 search providers for real-time info
  • Richer agents: Co-writer, IdeaGen, guided learning modules
  • Embedding-based search: Real embedding models (vs our TF-IDF approximation)
  • Docling integration: Advanced PDF/document parsing

License

MIT
