Commit cb63ac2

DE Zoomcamp docs wip
1 parent dba234d commit cb63ac2

173 files changed

+1723
-384
lines changed


.claude/commands/article.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
---
description: Create an article from a transcript with illustration placeholders
arguments:
  - name: transcript
    description: Path to the transcript file
    required: true
---

# Create Article from Transcript

## Process Overview

Read the transcript file and create a well-structured article based on its content. Include illustration placeholders with estimated timestamps for later extraction from the video.

## Formatting Rules

### Header Structure
- Use `#` for the title only (first line)
- Use `##` for major sections
- Use `###` for subsections

### Dashes
- Always have spaces around dashes: ` – ` (en-dash)

### Section Dividers
- Never use `---` to divide sections

### Bold Formatting
- Keep bold to a minimum
- Only use for essential emphasis (key warnings, critical points)
- Don't bold names, list items, or regular emphasis

### Illustration Placeholders

Format: `**[Illustration placeholder: description - timestamp ~MM:SS]**`
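
The placeholder format above is regular enough to be located mechanically later on. The following is a minimal sketch, assuming the article is plain Markdown text; the regular expression and the `find_placeholders` helper are illustrative names, not part of the command.

```python
import re

# Matches: **[Illustration placeholder: description - timestamp ~MM:SS]**
# Assumption: the timestamp is MM:SS or HH:MM:SS and the "~" is optional.
PLACEHOLDER_RE = re.compile(
    r"\*\*\[Illustration placeholder:\s*(?P<description>.+?)\s*-\s*"
    r"timestamp\s*~?(?P<timestamp>\d{1,2}(?::\d{2}){1,2})\]\*\*"
)


def find_placeholders(article_text):
    """Return (description, timestamp) pairs in order of appearance."""
    return [
        (m.group("description"), m.group("timestamp"))
        for m in PLACEHOLDER_RE.finditer(article_text)
    ]


sample = (
    "Intro paragraph.\n\n"
    "**[Illustration placeholder: course GitHub repo - timestamp ~03:15]**\n"
)
print(find_placeholders(sample))  # [('course GitHub repo', '03:15')]
```

Descriptions that contain hyphens (for example "sign-up page") are still captured whole, because the pattern only splits at the hyphen that precedes the word `timestamp`.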

#### When Illustrations Are Needed

Only add illustrations when text alone is insufficient to understand what's happening:

- **UI screenshots**: Interfaces students need to navigate (GitHub repo, course platform, Slack, etc.)
- **Diagrams**: Architecture diagrams, flowcharts, system layouts described but not shown
- **Data visualizations**: Charts, graphs, plots containing key information
- **Code/terminal**: Actual code snippets or terminal output shown on screen
- **Technical concepts**: Docker containers, workflows, pipelines explained visually
- **Illustrative examples**: Sample outputs, posts, or results that demonstrate format
- **Complex relationships**: Mind maps or diagrams showing connections between topics

#### When Illustrations Are NOT Needed

Skip illustrations for:
- Simple lists or bullet points (even if presented as infographic in video)
- People photos or headshots (names are sufficient)
- Text-only explanations clearly described in words
- Decorative elements (logos, title cards, motivational graphics)
- Quotes or spoken content
- Things already visible in nearby illustrations
- Checklists when text already has numbered steps

#### Placement

- Place immediately before or after the relevant content
- Don't add redundant illustrations for the same thing
- Estimate timestamps based on content flow (rough percentage of total length)

#### Timestamp Estimates Are Approximate

**Important**: Timestamps in placeholders are estimates based on content flow. During illustration extraction:

1. **Extract frames** around the estimated timestamp (±5 seconds)
2. **Verify the frame matches** the article description - read the article text around the placeholder to understand what should be visible
3. **If mismatch occurs**: The timestamp estimate was wrong. Search the transcript for relevant keywords discussed in that section to find the actual timestamp
4. **Re-extract** from the corrected timestamp

Example: Article says "taxi trip data visualization" at ~12:00, but extracted frame shows GitHub repo. Search transcript for "taxi data" or "NYC taxi" to find where that topic was actually discussed (may be ~10:50 instead).
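
The keyword search described in the example could be sketched as follows. This assumes the transcript layout mentioned under Steps (a timestamp such as `0:00` or `1:23` on its own line, followed by the spoken text); the `find_keyword_timestamps` helper and the sample transcript are hypothetical.

```python
import re

# Assumption: a timestamp line contains only M:SS, MM:SS or H:MM:SS.
TIMESTAMP_LINE = re.compile(r"^\d{1,2}:\d{2}(?::\d{2})?$")


def find_keyword_timestamps(transcript_text, keywords):
    """Return timestamps of transcript segments that mention any keyword."""
    hits = []
    current = None  # most recent timestamp line seen so far
    for line in transcript_text.splitlines():
        line = line.strip()
        if TIMESTAMP_LINE.match(line):
            current = line
        elif current and any(k.lower() in line.lower() for k in keywords):
            if current not in hits:
                hits.append(current)
    return hits


# Made-up transcript, for illustration only.
transcript = """10:50
Here we look at the NYC taxi trip data.
12:00
Now let's open the GitHub repo.
"""
print(find_keyword_timestamps(transcript, ["taxi data", "NYC taxi"]))  # ['10:50']
```

The match is a plain case-insensitive substring test, so it is worth trying a few keyword variants when the first search returns nothing.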

## Steps

0. **Check for timestamps**: The transcript MUST contain timestamps for illustration placement. If the transcript doesn't have timestamps (format like `0:00`, `1:23`, etc. on separate lines), DO NOT proceed - illustrations cannot be extracted without accurate timestamps.

1. Read the transcript file at `{{transcript}}`

2. Analyze the content and identify:
   - Main topics and sections
   - Key points and quotes
   - Natural breaks for illustrations
   - **Actual timestamps** from the transcript for each illustration topic
   - Overall narrative arc

3. Structure the article:
   - Title (clear, descriptive)
   - Introduction (what is this about)
   - Main sections (group related content)
   - Subsections as needed
   - Conclusion/final thoughts

4. Write the content:
   - Use clear, concise language
   - Preserve key quotes accurately
   - Maintain speaker voice where appropriate
   - Add illustration placeholders with **ACTUAL TIMESTAMPS FROM TRANSCRIPT** (not estimates)

5. Save to `_temp/` directory with descriptive filename
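
The precondition in step 0 can be checked before any writing starts. Below is a minimal sketch, assuming timestamps sit alone on their own lines as step 0 describes; the helper names and the `minimum=2` threshold are assumptions made for illustration.

```python
import re

# A timestamp line holds only M:SS, MM:SS or H:MM:SS (e.g. "0:00", "1:23").
TIMESTAMP_LINE = re.compile(r"^\s*\d{1,2}:\d{2}(?::\d{2})?\s*$")


def count_timestamp_lines(transcript_text):
    """Count lines that consist solely of a timestamp."""
    return sum(
        1 for line in transcript_text.splitlines() if TIMESTAMP_LINE.match(line)
    )


def has_timestamps(transcript_text, minimum=2):
    """True if the transcript has at least `minimum` timestamp lines.

    The threshold is an arbitrary choice for this sketch.
    """
    return count_timestamp_lines(transcript_text) >= minimum


with_ts = "0:00\nWelcome everyone.\n1:23\nLet's start.\n"
without_ts = "Welcome everyone.\nLet's start at 10:30 tomorrow.\n"
print(has_timestamps(with_ts), has_timestamps(without_ts))  # True False
```

A time mentioned inside a sentence ("at 10:30 tomorrow") is deliberately not counted, since only standalone timestamp lines mark positions in the video.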

## Output Format

Save as markdown file in `_temp/`:
- Filename: `[topic-name].md`
- Title: `# Title`
- Sections: `## Section Name`
- Subsections: `### Subsection Name`
- Illustration placeholders: `**[Illustration placeholder: description - timestamp ~MM:SS]**`

## Content Guidelines

- Prioritize clarity over completeness
- Group related ideas together
- Use lists for sequential information
- Preserve important quotes in blockquotes
- Keep paragraphs short (3-5 sentences)
- Use tables for structured comparisons when appropriate
Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@
---
description: Extract frames from video around timestamp and select best illustration
arguments:
  - name: video
    description: Path to video file
    required: true
  - name: timestamp
    description: Timestamp in MM:SS or HH:MM:SS format (estimated)
    required: true
  - name: name
    description: Illustration name (used for directory: _temp/frames/{name}/)
    required: true
  - name: description
    description: What content we're looking for (helps selection)
    required: true
  - name: context
    description: Article text around the illustration (for verification)
    required: true
---

# Extract Illustration Frames from Video

## Dependencies

- **ffmpeg** - For extracting keyframes from video
- **ImageMagick** (convert command) - For cropping and JPEG conversion

Install if needed:
```bash
# Windows (with Chocolatey)
choco install ffmpeg imagemagick

# macOS (with Homebrew)
brew install ffmpeg imagemagick

# Linux (Ubuntu/Debian)
sudo apt install ffmpeg imagemagick
```

## Process Overview

Extract keyframes in a narrow window around the timestamp, remove duplicates, then select the best frame for the article illustration.

## Steps

### 0. Setup Aliases

```bash
# Arguments: video=$1, timestamp=$2, name=$3, description=$4, context=$5
video=$1
timestamp=$2
name=$3
description=$4
context=$5
```

### 1. Create Output Directories (Parallel-Safe)

```bash
# Create dedicated directory for this illustration extraction
mkdir -p "_temp/frames/$name"
mkdir -p _temp/illustrations
```

### 2. Extract Frames Using FFmpeg

**First, try keyframes (natural scene changes):**

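```bash
# Extract keyframes from a ±5 second window around the timestamp.
# $start is a helper variable (not one of the command arguments): set it to
# $timestamp minus 5 seconds, e.g. start=00:09:55 for timestamp 10:00.
ffmpeg -ss "$start" -i "$video" -t 00:00:10 \
  -vf "select='eq(pict_type,I)'" -vsync 0 \
  "_temp/frames/$name/keyframe_%04d.png"
```

This produces files like `keyframe_0001.png`, numbered in playback order within the window. The numbers are sequence numbers, not video times: ffmpeg's image output pattern expands frame counters, not timestamps.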
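
The start of the ±5 second window has to be derived from the target timestamp. A minimal sketch of that arithmetic follows, assuming timestamps in `MM:SS` or `HH:MM:SS` form as the command arguments describe; the helper names `to_seconds`, `to_hms`, and `window` are illustrative, not part of the command.

```python
def to_seconds(timestamp):
    """Convert MM:SS or HH:MM:SS to whole seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def to_hms(seconds):
    """Format whole seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def window(timestamp, half_width=5):
    """Return (start, duration) for a ±half_width second window, clamped at 0."""
    center = to_seconds(timestamp)
    start = max(0, center - half_width)
    end = center + half_width
    return to_hms(start), to_hms(end - start)


print(window("10:00"))  # ('00:09:55', '00:00:10')
print(window("0:03"))   # ('00:00:00', '00:00:08')
```

Clamping at zero matters for timestamps in the first few seconds of a video, where the window would otherwise start before the beginning.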

**Check if we got any keyframes:**
```bash
ls "_temp/frames/$name/" | wc -l
```

**If 0 keyframes found, extract 7 specific frames at exact offsets:**

```bash
# Extract frames at: -2, -1, -0.5, 0, +0.5, +1, +2 seconds
# Example: for timestamp 10:00, extract at 9:58, 9:59, 9:59.5, 10:00, 10:00.5, 10:01, 10:02

# Calculate timestamps based on $timestamp - adjust these values for your timestamp
# For $timestamp = 10:00:
ffmpeg -ss 00:09:58 -i "$video" -vframes 1 "_temp/frames/$name/00_09_58.000.png"
ffmpeg -ss 00:09:59 -i "$video" -vframes 1 "_temp/frames/$name/00_09_59.000.png"
ffmpeg -ss 00:09:59.5 -i "$video" -vframes 1 "_temp/frames/$name/00_09_59.500.png"
ffmpeg -ss 00:10:00 -i "$video" -vframes 1 "_temp/frames/$name/00_10_00.000.png"
ffmpeg -ss 00:10:00.5 -i "$video" -vframes 1 "_temp/frames/$name/00_10_00.500.png"
ffmpeg -ss 00:10:01 -i "$video" -vframes 1 "_temp/frames/$name/00_10_01.000.png"
ffmpeg -ss 00:10:02 -i "$video" -vframes 1 "_temp/frames/$name/00_10_02.000.png"
```
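
Rather than adjusting the seven timestamps by hand, the offsets can be computed. The sketch below only builds the argument lists for the same `ffmpeg -ss ... -vframes 1` calls shown above and does not run them; `fallback_commands`, `fmt`, and the sample file names are hypothetical.

```python
OFFSETS = (-2, -1, -0.5, 0, 0.5, 1, 2)  # seconds relative to the target


def to_seconds(timestamp):
    """Convert MM:SS or HH:MM:SS to whole seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def fmt(total_ms, sep):
    """Format milliseconds as HH<sep>MM<sep>SS.mmm."""
    secs, ms = divmod(total_ms, 1000)
    hours, rest = divmod(secs, 3600)
    minutes, s = divmod(rest, 60)
    return f"{hours:02d}{sep}{minutes:02d}{sep}{s:02d}.{ms:03d}"


def fallback_commands(video, timestamp, name):
    """Build one ffmpeg argument list per offset, skipping negative times."""
    base_ms = to_seconds(timestamp) * 1000
    commands = []
    for offset in OFFSETS:
        t_ms = base_ms + int(offset * 1000)
        if t_ms < 0:
            continue
        out = f"_temp/frames/{name}/{fmt(t_ms, '_')}.png"
        commands.append(
            ["ffmpeg", "-ss", fmt(t_ms, ":"), "-i", video, "-vframes", "1", out]
        )
    return commands


# "lecture.mp4" and "taxi-data" are placeholder values.
for cmd in fallback_commands("lecture.mp4", "10:00", "taxi-data"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` as is; keeping the arguments as a list avoids shell quoting problems with video paths that contain spaces.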

### 3. Remove Duplicate Frames (keyframes only)

First pass: Remove likely exact duplicates, detected by identical file size:

```python
from pathlib import Path

# Substitute the illustration name for $name before running.
frames_dir = Path("_temp/frames/$name")
sizes = {}  # file size in bytes -> first frame seen with that size
for f in sorted(frames_dir.glob("*.png")):
    size = f.stat().st_size
    if size not in sizes:
        sizes[size] = f
    else:
        f.unlink()  # Same byte size as an earlier frame: treat as duplicate

print(f"After dedup: {len(sizes)} unique frames")
```
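
Two different frames can happen to have the same file size, so a size comparison may delete a frame that is not a duplicate. One alternative is to compare content hashes instead. This is a sketch of that swapped-in technique, not the method the command uses; `remove_exact_duplicates` is a hypothetical helper, and the demo writes placeholder bytes rather than real PNG data.

```python
import hashlib
import tempfile
from pathlib import Path


def remove_exact_duplicates(frames_dir):
    """Delete files whose bytes match an earlier file; return the kept paths."""
    seen = {}  # SHA-256 digest -> first path with that content
    for path in sorted(Path(frames_dir).glob("*.png")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()
        else:
            seen[digest] = path
    return sorted(seen.values())


# Demo on throwaway files (placeholder bytes, not real images).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "a.png").write_bytes(b"frame-1")
    (tmp_dir / "b.png").write_bytes(b"frame-1")  # exact copy of a.png
    (tmp_dir / "c.png").write_bytes(b"frame-2")  # same size, different content
    kept = remove_exact_duplicates(tmp_dir)
    print([p.name for p in kept])  # ['a.png', 'c.png']
```

In the demo, `c.png` has the same size as `a.png` but different content, so it survives; a size-only comparison would have removed it.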

Second pass: Remove visually similar frames (optional, if still too many):

```python
from pathlib import Path

import numpy as np
from PIL import Image

# Substitute the illustration name for $name before running.
frames = sorted(Path("_temp/frames/$name").glob("*.png"))
to_remove = set()

for i in range(len(frames) - 1):
    img1 = np.array(Image.open(frames[i]))
    img2 = np.array(Image.open(frames[i + 1]))

    # Simple difference: mean absolute error
    diff = np.abs(img1.astype(float) - img2.astype(float)).mean()

    # Threshold: if difference < 5% of pixel range, consider duplicate
    if diff < 12.75:  # 255 * 0.05
        to_remove.add(frames[i + 1])

for f in to_remove:
    f.unlink()

print(f"After visual dedup: {len(frames) - len(to_remove)} frames remaining")
```

### 4. Review Remaining Frames and Verify Match

Read each frame and evaluate based on:

**Selection Criteria:**
- **Clarity**: Text is readable, not motion-blurred
- **Completeness**: Full content visible (no cut-off elements)
- **Relevance**: Shows exactly what the description asks for
- **Visual Quality**: Good contrast, no visual artifacts
- **UI State**: Buttons/menus in clear, useful state

**Verification Step (CRITICAL - MUST use analyze_image tool):**

The `Read` tool alone CANNOT reliably verify image content. You MUST use the `analyze_image` tool with a detailed prompt.

**What is analyze_image?**
- A tool that analyzes images and returns detailed text descriptions
- Can read text, identify UI elements, describe layouts, and understand content
- Takes two parameters:
  - `imageSource`: URL of the image to analyze
  - `prompt`: What question to ask about the image

**Verification Process:**

1. **Use analyze_image with a dynamic prompt based on description and context:**

   ```
   We expect this image to show: $description

   Context from article: $context

   Please analyze:
   1. What does this image actually show? (describe type of page, main text, content)
   2. Does this match what we expect? If not, what DOES it show?
   3. For cropping: any browser chrome at top (how many pixels to remove)? Sidebars to crop?
   ```

2. **Compare the analyze_image output:**
   - The prompt tells the tool what we EXPECT (from $description)
   - The tool tells us what it ACTUALLY sees
   - Compare: Does the actual content match the expected content?

3. **Decision based on comparison:**
   - If the content matches → Proceed to save
   - If the content does NOT match → Wrong timestamp. Search transcript for keywords to find correct time.
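
The dynamic prompt from step 1 is a fixed template with two substitutions. A minimal sketch of assembling it is shown below; `build_verification_prompt` is a hypothetical helper, and how the resulting string is handed to `analyze_image` is not covered here.

```python
# Template text copied from the verification prompt above.
PROMPT_TEMPLATE = """We expect this image to show: {description}

Context from article: {context}

Please analyze:
1. What does this image actually show? (describe type of page, main text, content)
2. Does this match what we expect? If not, what DOES it show?
3. For cropping: any browser chrome at top (how many pixels to remove)? Sidebars to crop?"""


def build_verification_prompt(description, context):
    """Fill the template with the expected content and surrounding article text."""
    return PROMPT_TEMPLATE.format(
        description=description.strip(), context=context.strip()
    )


prompt = build_verification_prompt(
    "Certificate example showing requirements",
    "To get a certificate, complete the final project and the peer reviews.",
)
print(prompt)
```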

**Example verification using analyze_image:**

Prompt expects: "Certificate example showing requirements: complete final project successfully, participate in peer reviews"

analyze_image result: "This is a Q&A interface from Slido... showing anonymous user questions about getting DE jobs without degrees"

Verdict: MISMATCH - The image shows Slido Q&A, NOT certificate requirements. Timestamp is wrong.

**Common verification failures (detected by analyze_image):**
- Article says "Docker & Infrastructure" → Image shows "Course logistics"
- Article says "Certificate example" → Image shows "Slido Q&A interface"
- Article says "YouTube channel" → Image shows only "LinkedIn"
- Article says "Project pipeline" → Image shows "GitHub repo/commits"

**Timestamp Proximity Rule:**
- We ONLY look within ±5 seconds of target timestamp - never more
- Frames extracted at fixed offsets are named after their ACTUAL timestamp (e.g., `00_09_58.000.png`, `00_10_00.000.png`, `00_10_02.000.png`)
- This makes the exact video time of each such frame obvious
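
Because those file names encode the video time, the ±5 second rule can be checked from the name alone. The sketch below assumes the `HH_MM_SS.mmm.png` naming used for the fixed-offset frames in step 2; `frame_seconds` and `within_window` are illustrative helpers.

```python
import re

# Assumed naming scheme: HH_MM_SS.mmm.png (as in the fixed-offset examples).
FRAME_NAME = re.compile(r"^(\d{2})_(\d{2})_(\d{2})\.(\d{3})\.png$")


def frame_seconds(filename):
    """Parse HH_MM_SS.mmm.png into seconds; return None if the name differs."""
    m = FRAME_NAME.match(filename)
    if not m:
        return None
    hours, minutes, secs, millis = (int(g) for g in m.groups())
    return hours * 3600 + minutes * 60 + secs + millis / 1000


def within_window(filename, target_seconds, tolerance=5.0):
    """True if the frame's time lies within ±tolerance of the target."""
    t = frame_seconds(filename)
    return t is not None and abs(t - target_seconds) <= tolerance


print(within_window("00_09_58.000.png", 600))  # True
print(within_window("00_10_07.250.png", 600))  # False
```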

### 5. Select Best Frame

After reviewing all frames:
1. Identify the best frame
2. Explain why it was chosen
3. Note if cropping is needed

### 6. Crop if Necessary (using ImageMagick)

If the best frame needs cropping:

```bash
# First crop to temp filename to assess (+repage drops the leftover canvas offset)
convert "_temp/frames/$name/keyframe_XXXX.png" -crop {width}x{height}+{x}+{y} +repage "_temp/frames/$name/keyframe_XXXX-cropped.png"

# Read and assess the cropped version, then finalize
convert "_temp/frames/$name/keyframe_XXXX-cropped.png" -quality 85 "_temp/illustrations/$name.jpg"
```

Common crop patterns:
- Browser chrome removal: `-crop 1280x650+0+70` (remove ~70px from the top of a 1280x720 frame)
- Sidebars: Adjust width/x-offset to crop left or right
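
The `-crop` geometry string follows directly from the frame size and the margins to remove. A small sketch of that arithmetic is given below; `crop_geometry` is a hypothetical helper, and the 1280x720 frame size is an assumption taken from the browser chrome example.

```python
def crop_geometry(width, height, top=0, bottom=0, left=0, right=0):
    """Return an ImageMagick geometry string WxH+X+Y for the given margins."""
    new_w = width - left - right
    new_h = height - top - bottom
    if new_w <= 0 or new_h <= 0:
        raise ValueError("crop margins remove the whole frame")
    return f"{new_w}x{new_h}+{left}+{top}"


print(crop_geometry(1280, 720, top=70))            # 1280x650+0+70
print(crop_geometry(1280, 720, top=70, left=200))  # 1080x650+200+70
```

The X and Y offsets are the left and top margins; right and bottom margins only reduce the width and height.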

### 7. Clean Up

```bash
rm -rf "_temp/frames/$name"
```

## Selection Guidelines by Content Type

| Content Type | What to Look For |
|--------------|------------------|
| **UI Screenshots** | No loading spinners, fully populated data, clear labels |
| **Diagrams** | Complete diagram, no cutting off edges, clear text |
| **Code/Terminal** | Complete commands visible, no partial lines |
| **People** | Faces visible, not mid-blink, natural expression |
| **Data Visualizations** | Axes/labels visible, clear data points |
| **Websites/Pages** | Fully loaded, no broken images, header visible |

## Output Format

Save final illustration as:
- Filename: `[descriptive-name].jpg`
- Location: `_temp/illustrations/`
- Format: JPEG at quality 85 (~65% smaller than PNG)
- Reasoning: Document why this frame was chosen
