93 changes: 93 additions & 0 deletions apps/git-second-brain/README.md
@@ -0,0 +1,93 @@
# Git Second Brain

A RAG (Retrieval-Augmented Generation) application that lets you ask
natural-language questions about **any Git repository** by analysing its
commit history. The included example uses the **FastAPI** open-source project.

Commits are embedded as vectors and stored in **Oracle AI Database 26ai**.
At query time the most relevant commits are retrieved via `VECTOR_DISTANCE`
and passed as context to an OpenAI model through **LangChain**, producing
grounded answers with commit citations.
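
To make the retrieval step concrete, here is a minimal pure-Python sketch of top-k search by cosine distance. It only illustrates the idea behind `VECTOR_DISTANCE`; in the application the database performs this computation. The names `cosine_distance` and `top_k_commits` and the toy 3-dimensional vectors are invented for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_commits(query_vec, commits, k=3):
    """Rank (sha, vector) pairs by distance to the query, smallest first."""
    ranked = sorted(commits, key=lambda item: cosine_distance(query_vec, item[1]))
    return [sha for sha, _ in ranked[:k]]


# Toy data: the embeddings in this project have 384 dimensions.
commits = [
    ("a1b2c3", [1.0, 0.0, 0.0]),
    ("d4e5f6", [0.0, 1.0, 0.0]),
    ("0718aa", [0.9, 0.1, 0.0]),
]
nearest = top_k_commits([1.0, 0.0, 0.0], commits, k=2)
```

A smaller distance means a closer match, so the commits returned first are the ones passed to the model as context.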

## Project structure

```
git-second-brain/
├── database/ # SQL scripts: user creation + schema setup
├── data-loader/ # One-time ETL: parse commits, embed, load into Oracle 26ai
├── app/ # Streamlit chat UI + LangChain RAG chain
├── diffs/ # Pre-extracted per-commit diff files
└── fastapi_commits.txt # Delimited commit metadata
```

| Folder | Purpose | Details |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| **database/** | SQL scripts to create the Oracle user, table, indexes, and (optionally) the vector index. | [database/README.md](database/README.md) |
| **data-loader/** | Reads the extracted commit metadata and diff files, generates 384-dim vector embeddings with `sentence-transformers`, and bulk-inserts everything into Oracle 26ai. | [data-loader/README.md](data-loader/README.md) |
| **app/** | Streamlit chat interface where users ask questions. A custom LangChain retriever queries Oracle 26ai vector search, and the retrieved commits are sent to OpenAI to generate a cited answer. | [app/README.md](app/README.md) |
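
As a rough illustration of the last step in the table, the sketch below joins retrieved commits into a single context string for the prompt. The separator and the 10-character short SHA mirror what `app.py` does; the `Commit` record and `format_context` helper are hypothetical stand-ins, since the real retriever returns LangChain documents whose exact text layout is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical record shape, used only for this illustration."""
    sha: str
    date: str
    author: str
    subject: str


def format_context(commits):
    """Join commits into one context block, labelling each with a short SHA."""
    blocks = [
        f"[{c.sha[:10]}] {c.date} {c.author}\n{c.subject}"
        for c in commits
    ]
    return "\n\n---\n\n".join(blocks)


context = format_context([
    Commit("0123456789abcdef", "2024-01-02", "Ada", "Fix typo in docs"),
    Commit("fedcba9876543210", "2024-01-03", "Lin", "Add lifespan example"),
])
```

Keeping the short SHA and date next to each subject gives the model something concrete to cite.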

## Extracting repo data

The examples below use **FastAPI**, but this works with **any Git repository**.

```bash
# Clone the target repo
git clone https://github.com/tiangolo/fastapi.git
mkdir diffs
cd fastapi

# Extract commit metadata with safe delimiters
git log --all --no-merges \
--pretty=format:"<<<COMMIT>>>%n%H%n%an%n%aI%n%s%n<<<BODY>>>%n%b%n<<<END>>>%n" \
> ../fastapi_commits.txt

# Extract diff stats as a single file
git log --all --no-merges \
--pretty=format:"===SHA:%H===" --stat \
> ../diffs/all_diffs.txt

cd ..
```
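
The second command writes every commit's `--stat` output into one file, separated by `===SHA:<hash>===` marker lines. A minimal sketch of splitting that file back into per-commit chunks might look like this. The function name `split_diff_stats` and the sample text are assumptions for illustration, and the actual data loader may handle this differently.

```python
def split_diff_stats(text):
    """Map each commit SHA to the --stat text that follows its marker line."""
    stats = {}
    current_sha = None
    for line in text.splitlines():
        if line.startswith("===SHA:") and line.endswith("==="):
            # Strip the marker's prefix and suffix to recover the SHA.
            current_sha = line[len("===SHA:"):-len("===")]
            stats[current_sha] = []
        elif current_sha is not None:
            stats[current_sha].append(line)
    return {sha: "\n".join(lines).strip() for sha, lines in stats.items()}


# Shortened, made-up sample in the same layout as all_diffs.txt.
sample = (
    "===SHA:abc123===\n"
    " README.md | 2 +-\n"
    " 1 file changed, 1 insertion(+), 1 deletion(-)\n"
    "\n"
    "===SHA:def456===\n"
    " app.py | 10 ++++++++++\n"
    " 1 file changed, 10 insertions(+)\n"
)
per_commit = split_diff_stats(sample)
```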

> **Tip:** The data loader caps at 3 000 commits by default, which keeps
> indexing time under 10 minutes and covers a large share of FastAPI's
> history, which begins in late 2018.
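
The delimiters in the first `git log` command make the metadata file easy to parse. The sketch below is one possible parser, not the loader's actual code: `parse_commits` and its `limit` parameter (standing in for the commit cap) are hypothetical names, and the sample commits are made up.

```python
def parse_commits(text, limit=None):
    """Parse <<<COMMIT>>> ... <<<BODY>>> ... <<<END>>> records into dicts."""
    commits = []
    # Everything before the first marker is discarded by the [1:] slice.
    for chunk in text.split("<<<COMMIT>>>\n")[1:]:
        header, _, rest = chunk.partition("\n<<<BODY>>>\n")
        body = rest.split("\n<<<END>>>", 1)[0]
        # Header lines follow the format string: %H, %an, %aI, %s.
        sha, author, date, subject = header.split("\n", 3)
        commits.append({
            "sha": sha,
            "author": author,
            "date": date,
            "subject": subject,
            "body": body.strip(),
        })
        if limit is not None and len(commits) >= limit:
            break
    return commits


sample = (
    "<<<COMMIT>>>\n"
    "1111111111111111111111111111111111111111\n"
    "Ada Lovelace\n"
    "2024-01-02T03:04:05+00:00\n"
    "Fix typo in docs\n"
    "<<<BODY>>>\n"
    "Longer explanation.\n"
    "<<<END>>>\n"
    "\n"
    "<<<COMMIT>>>\n"
    "2222222222222222222222222222222222222222\n"
    "Lin\n"
    "2024-01-03T00:00:00+00:00\n"
    "Add lifespan example\n"
    "<<<BODY>>>\n"
    "\n"
    "<<<END>>>\n"
)
commits = parse_commits(sample)
```

Splitting on the markers rather than on blank lines keeps multi-paragraph commit bodies intact.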

## Prerequisites

- Python 3.10+
- Oracle AI Database 26ai (running and accessible)
- OpenAI API key (for the chat app)

## Quick start

> **Important:** Load the environment variables from each folder's `.env` file
> before running Python scripts. See each folder's README for details.

```bash
# 0. Set up the database
cd database
sqlplus system/Welcome_123@//localhost:1521/FREEPDB1 @01_create_user.sql
sqlplus system/Welcome_123@//localhost:1521/FREEPDB1 @02_create_schema.sql
cd ..

# 1. Extract repo data (see "Extracting repo data" above)

# 2. Load data into Oracle 26ai
cd data-loader
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env # fill in your Oracle credentials
# load env vars, then:
python load_data.py
cd ..

# 3. Run the app
cd app
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env # fill in Oracle + OpenAI credentials
# load env vars, then:
streamlit run app.py
```

See each folder's README for full setup and configuration details.
7 changes: 7 additions & 0 deletions apps/git-second-brain/app/.env.example
@@ -0,0 +1,7 @@
# Oracle AI Database 26ai connection
ORACLE_USER=GITHUB_SECOND_BRAIN
ORACLE_PASSWORD=<your-password>
ORACLE_DSN=localhost:1521/FREEPDB1

# OpenAI (can also be entered in the Streamlit sidebar)
OPENAI_API_KEY=sk-...
101 changes: 101 additions & 0 deletions apps/git-second-brain/app/README.md
@@ -0,0 +1,101 @@
# Git Second Brain — App

Streamlit chat UI that lets you ask natural-language questions about a
repository's commit history, powered by **Oracle AI Database 26ai Vector Search**,
LangChain, and OpenAI.

## Architecture

```
User question
         │
         ▼
┌────────────────────┐      ┌──────────────────────────┐
│ Streamlit (app.py) │─────▶│ OracleCommitRetriever    │
│ Chat interface     │      │ sentence-transformers    │
└────────┬───────────┘      │ + Oracle 26ai vector     │
         │                  │ VECTOR_DISTANCE search   │
         │ context docs     └──────────────────────────┘
         ▼
┌────────────────────┐
│ LangChain RAG      │
│ ChatOpenAI (GPT)   │
└────────────────────┘
```
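
The retrieval box in the diagram could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `retriever.py`: the column names `SHA`, `SUBJECT`, and `EMBEDDING` are guesses (the real schema lives in the `database/` scripts), and `build_top_k_sql` and `search` are hypothetical helpers. The query vector is expected to come from the same embedding model that the data loader used.

```python
import array
import os


def build_top_k_sql(table="FASTAPI_COMMITS", vector_column="EMBEDDING"):
    """Build a top-k query ordered by cosine distance to the bind variable :qv."""
    # Identifiers are interpolated, so they must be trusted constants, never user input.
    return (
        f"SELECT sha, subject, "
        f"VECTOR_DISTANCE({vector_column}, :qv, COSINE) AS dist "
        f"FROM {table} "
        f"ORDER BY dist "
        f"FETCH FIRST :k ROWS ONLY"
    )


def search(query_vector, k=8):
    """Run the query against Oracle; needs a reachable database and the env vars set."""
    import oracledb  # imported lazily so the SQL helper works without the driver

    with oracledb.connect(
        user=os.environ["ORACLE_USER"],
        password=os.environ["ORACLE_PASSWORD"],
        dsn=os.environ["ORACLE_DSN"],
    ) as connection:
        with connection.cursor() as cursor:
            # A float32 array is bound as a VECTOR value by python-oracledb.
            cursor.execute(build_top_k_sql(), qv=array.array("f", query_vector), k=k)
            return cursor.fetchall()
```

Passing the vector and `k` as bind variables keeps the SQL text constant between requests.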

## Prerequisites

| Requirement | Version |
| ----------------------- | ----------------------------- |
| Python | 3.10+ |
| Oracle AI Database 26ai | Running and accessible |
| OpenAI API key | Key with access to `gpt-4o-mini` |

The `data-loader/` must have been run first so the `FASTAPI_COMMITS` table is
populated with embeddings.

## Setup

```bash
cd app
python -m venv .venv

# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate

pip install -r requirements.txt
```

Copy `.env.example` to `.env` and fill in your credentials:

```bash
cp .env.example .env
```

## Running

The app reads Oracle credentials from environment variables. Load them before
starting Streamlit:

```bash
# Load env vars from .env (use your preferred method)
# Windows PowerShell:
Get-Content .env | ForEach-Object { if ($_ -match '^([^#].+?)=(.*)$') { [Environment]::SetEnvironmentVariable($Matches[1], $Matches[2]) } }

# Linux / macOS:
# export $(grep -v '^#' .env | xargs)

streamlit run app.py
```

The app opens at <http://localhost:8501>.
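
As an alternative to the shell one-liners, the variables can also be loaded from Python before the rest of a script reads them. The helper below is a minimal standard-library sketch; `load_env_file` is a hypothetical name and is not part of this project. It handles only plain `KEY=VALUE` lines, with no quoting or variable expansion.

```python
import os


def load_env_file(path=".env", environ=os.environ):
    """Copy KEY=VALUE pairs from a dotenv-style file into `environ`."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
    environ.update(loaded)
    return loaded
```

Calling `load_env_file()` at the top of a script, before any code reads `os.environ`, has the same effect as exporting the variables in the shell.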

## Smoke test

`smoke_test.py` is a standalone script that verifies the vector-search round
trip without Streamlit or OpenAI. It requires the same environment variables:

```bash
python smoke_test.py
```

## Files

| File | Purpose |
| ------------------ | ------------------------------------------------------------- |
| `app.py` | Streamlit chat UI + LangChain RAG chain |
| `retriever.py` | LangChain `BaseRetriever` backed by Oracle 26ai vector search |
| `smoke_test.py` | Minimal end-to-end connectivity & vector-search test |
| `requirements.txt` | Python dependencies (bounded version ranges) |
| `.env.example` | Template for required environment variables |

## Environment variables

| Variable | Required | Default | Description |
| ----------------- | -------- | ------- | ---------------------------------------------- |
| `ORACLE_USER` | Yes | — | Database username |
| `ORACLE_PASSWORD` | Yes | — | Database password |
| `ORACLE_DSN` | Yes | — | Connect string, e.g. `localhost:1521/FREEPDB1` |
| `OPENAI_API_KEY` | No | — | Can also be entered in the Streamlit sidebar |
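
A quick pre-flight check for the required variables in the table can save a confusing connection error later. This is a small sketch, and `missing_env_vars` is a hypothetical helper rather than something the app ships with.

```python
import os

# The three variables marked "Required" in the table above.
REQUIRED_VARS = ("ORACLE_USER", "ORACLE_PASSWORD", "ORACLE_DSN")


def missing_env_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_env_vars({"ORACLE_USER": "demo", "ORACLE_DSN": ""})
```

`OPENAI_API_KEY` is left out on purpose because it can also be entered in the Streamlit sidebar.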
171 changes: 171 additions & 0 deletions apps/git-second-brain/app/app.py
@@ -0,0 +1,171 @@
"""
Git Second Brain - Streamlit Chat UI
Ask natural-language questions about FastAPI's commit history,
powered by Oracle AI Database 26ai Vector Search + LangChain + OpenAI.

Run:
streamlit run app.py
"""

import os

import streamlit as st
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from retriever import OracleCommitRetriever

# ========================= Page config =========================
st.set_page_config(
    page_title="Git Second Brain",
    page_icon="🧠",
    layout="wide",
)

# ========================= Sidebar =============================
with st.sidebar:
    st.title("Git Second Brain")
    st.caption("Oracle AI Database 26ai + LangChain + OpenAI")

    openai_key = st.text_input(
        "OpenAI API Key",
        type="password",
        value=os.getenv("OPENAI_API_KEY", ""),
        help="Stored only in this session, never persisted.",
    )

    model_name = st.selectbox(
        "Model",
        ["gpt-4o-mini", "gpt-4o", "gpt-4.1-mini", "gpt-4.1-nano"],
        index=0,
    )

    top_k = st.slider("Commits to retrieve", min_value=3, max_value=15, value=8)

    temperature = st.slider("Temperature", min_value=0.0, max_value=1.0, value=0.2, step=0.05)

    st.divider()
    st.markdown(
        "**How it works**\n\n"
        "1. Your question is embedded with sentence-transformers\n"
        "2. Oracle 26ai runs `VECTOR_DISTANCE` to find the most relevant commits\n"
        "3. LangChain passes those commits as context to OpenAI\n"
        "4. You get a grounded answer with commit citations"
    )

    st.divider()
    st.markdown("**Sample questions**")
    sample_questions = [
        "Why did FastAPI switch to Pydantic v2?",
        "How has dependency injection evolved?",
        "What were the biggest breaking changes in the last 2 years?",
        "When did lifespan replace startup/shutdown events?",
        "What security fixes were applied recently?",
    ]
    for q in sample_questions:
        if st.button(q, use_container_width=True):
            st.session_state["prefill"] = q

# ========================= System prompt =======================
SYSTEM_PROMPT = """\
You are Git Second Brain, an AI assistant that answers questions about the
FastAPI open-source project by analyzing its Git commit history.

You will receive a set of relevant commits retrieved from Oracle AI Database 26ai
via vector similarity search. Use ONLY these commits to answer the question.
If the commits do not contain enough information, say so honestly.

Rules:
- Cite specific commits by their short SHA and date when supporting a claim.
- Summarize the narrative arc when multiple commits tell a story.
- Keep answers concise but thorough (3-6 paragraphs max).
- If you are unsure, say "Based on the commits I found..." to hedge.
- Never invent commit SHAs or dates.
"""

RAG_TEMPLATE = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", "Retrieved commits:\n\n{context}\n\n---\nQuestion: {question}"),
    ]
)

# ========================= Init state ==========================
if "messages" not in st.session_state:
    st.session_state.messages = []

if "retriever" not in st.session_state:
    with st.spinner("Connecting to Oracle AI Database 26ai ..."):
        st.session_state.retriever = OracleCommitRetriever(top_k=top_k)

# ========================= Chat display ========================
st.header("Ask your repo anything")

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])
        if msg.get("sources"):
            with st.expander(f"Retrieved commits ({len(msg['sources'])})"):
                for doc in msg["sources"]:
                    meta = doc.metadata
                    st.markdown(
                        f"**`{meta['sha'][:10]}`** | {meta['date']} | "
                        f"*{meta['author']}*\n\n"
                        f"> {meta['subject']}"
                    )
                    st.divider()

# ========================= Chat input ==========================
prefill = st.session_state.pop("prefill", None)
user_input = st.chat_input("Ask about FastAPI's history ...") or prefill

if user_input:
    if not openai_key:
        st.error("Please enter your OpenAI API key in the sidebar.")
        st.stop()

    # Show user message
    st.session_state.messages.append({"role": "user", "content": user_input})
    with st.chat_message("user"):
        st.markdown(user_input)

    # Retrieve from Oracle 26ai
    with st.chat_message("assistant"):
        with st.spinner("Searching Oracle 26ai Vector Search ..."):
            retriever = st.session_state.retriever
            retriever.top_k = top_k
            docs = retriever.invoke(user_input)

        context = "\n\n---\n\n".join(doc.page_content for doc in docs)

        # LangChain RAG chain
        llm = ChatOpenAI(
            model=model_name,
            temperature=temperature,
            api_key=openai_key,
        )
        chain = RAG_TEMPLATE | llm | StrOutputParser()

        with st.spinner("Generating answer ..."):
            answer = chain.invoke({"context": context, "question": user_input})

        st.markdown(answer)

        # Show retrieved commits
        with st.expander(f"Retrieved commits ({len(docs)})"):
            for doc in docs:
                meta = doc.metadata
                st.markdown(
                    f"**`{meta['sha'][:10]}`** | {meta['date']} | "
                    f"*{meta['author']}*\n\n"
                    f"> {meta['subject']}"
                )
                st.divider()

    st.session_state.messages.append(
        {
            "role": "assistant",
            "content": answer,
            "sources": docs,
        }
    )
6 changes: 6 additions & 0 deletions apps/git-second-brain/app/requirements.txt
@@ -0,0 +1,6 @@
oracledb>=2.2.0,<4
sentence-transformers>=5.0,<6
langchain>=1.2,<2
langchain-core>=1.2,<2
langchain-openai>=1.1,<2
streamlit>=1.38,<2