Commit 2e2515d

Fred add post about upcoming demo 2025082901 (#4)
* adds office plant pic * adds next steps post --------- Co-authored-by: Fred McDavid <fred@rejoiner.com>
1 parent 0a2fe66 commit 2e2515d

2 files changed

Lines changed: 36 additions & 0 deletions

@@ -0,0 +1,36 @@
1+
---
2+
title: "Demo post-mortem and next steps: GPU/CPU, VRAM and Context Length"
3+
description: Continuing studies in resources vs requirements for Gen AI.
4+
date: 2025-08-30
5+
---
6+
I just finished up a [demo](/posts/notes/post-2-demo-pgvector-2025082001/) that primarily looked at using PostgreSQL as a
7+
vector database to power similarity search for a simple RAG pipeline. Though
8+
it was very much a success, I hit a couple of issues during implementation that
9+
I want to explore in future demos.
10+
11+
## Reflecting on previous demos
12+
![](images/office_plant_compressed.jpg "I have an office plant now!!")
13+
Running large language models on consumer hardware is an ongoing experiment, and
14+
each trial uncovers something new. My first retrieval-augmented generation demo
15+
used a 7B-parameter model but quickly ran into GPU memory limits, forcing me to
16+
run inference on the CPU. That wasn’t the performance boost I had hoped for
17+
after investing in new GPUs, but it confirmed that local LLMs were at least
18+
feasible.
19+
20+
My [most recent demo](/posts/notes/post-2-demo-pgvector-2025082001/) with DeepSeek’s distilled Qwen-1.5B model highlighted a different challenge:
21+
RAM exhaustion when multiple models were loaded at once. That limitation came to
22+
light after I realized my singleton was two-timing me (the model was being
23+
instantiated more than once). Until then, I'd been kinda thinking I might fit several of those into VRAM at one time.
24+
25+
I also encountered issues with context length during ingestion, which I solved
26+
by tuning the `max_model_len` parameter in vLLM. While that fix worked, it raised
27+
a broader question about how advertised context windows relate to configurable
28+
parameters.
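
As a rough illustration of where that setting lives, here is a minimal sketch built around vLLM's offline `LLM` class. `max_model_len` and `gpu_memory_utilization` are real vLLM engine arguments. The helper names `engine_kwargs` and `load_engine`, the model id placeholder, and the value 4096 are assumptions for illustration, not what the demo actually used.

```python
def engine_kwargs(model: str, max_model_len: int,
                  gpu_memory_utilization: float = 0.90) -> dict:
    """Build keyword arguments for vllm.LLM (hypothetical helper).

    max_model_len caps prompt + generated tokens per request, so it also
    bounds how much KV cache a single request can claim.
    """
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return {
        "model": model,
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def load_engine(**kwargs):
    """Create the engine; vLLM is imported lazily so the helper above
    can be used (and tested) on a machine without vLLM installed."""
    from vllm import LLM
    return LLM(**kwargs)


# Illustrative values only: substitute your own model id and length.
kwargs = engine_kwargs("your-org/your-model", max_model_len=4096)
# llm = load_engine(**kwargs)  # uncomment on a machine with vLLM and a GPU
```

Keeping the arguments in a plain dict makes it easy to log exactly which context length a given run used, which helps when comparing it against the window the model card advertises.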
29+
30+
## Next demo(s): managing resources
31+
To get a better handle on the gen AI capabilities of consumer-grade hardware, I
32+
want to:
33+
* run multiple models at once
34+
* run inference with the GPU making use of both VRAM and system DRAM
35+
* figure out which parameters are required to handle longer context windows
36+
and what the memory implications are when scaling these up
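
For the last item, a back-of-the-envelope estimate of KV cache size gives a feel for how memory grows with context length. This is a sketch under stated assumptions: standard attention that caches one key and one value vector per token, per layer, per KV head; fp16 values; and placeholder architecture numbers that are not taken from any model in this post (the real ones live in a model's `config.json`). It ignores model weights, activations, and runtime overhead.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes fp16/bf16 cache entries.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


# Placeholder architecture (an assumption, for illustration only).
ARCH = {"num_layers": 28, "num_kv_heads": 2, "head_dim": 128}

for context_len in (2048, 4096, 8192, 32768):
    size = kv_cache_bytes(context_len=context_len, **ARCH)
    print(f"{context_len:>6} tokens: {size / 2**20:8.1f} MiB")
```

With these placeholder numbers the cache costs 28 KiB per token, so 4,096 tokens need about 112 MiB and the total grows linearly with context length and with the number of concurrent sequences.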
