| 1 | +--- |
| 2 | +title: "Demo post-mortem and next steps: GPU/CPU, VRAM and Context Length" |
| 3 | +description: Continuing studies in resources vs requirements for Gen AI. |
| 4 | +date: 2025-08-30 |
| 5 | +--- |
| 6 | +I just finished up a [demo](/posts/notes/post-2-demo-pgvector-2025082001/) that primarily looked at using PostgreSQL as a |
| 7 | +vector database to power similarity search for a simple RAG pipeline. Though |
| 8 | +it was very much a success, I hit a couple of issues during implementation that |
| 9 | +I want to explore in future demos. |
| 10 | + |
| 11 | +## Reflecting on previous demos |
| 12 | + |
| 13 | +Running large language models on consumer hardware is an ongoing experiment, and |
| 14 | +each trial uncovers something new. My first retrieval-augmented generation demo |
| 15 | +used a 7B parameter model but quickly ran into GPU memory limits, forcing me to |
| 16 | +run inference on the CPU. That wasn’t the performance boost I had hoped for |
| 17 | +after investing in new GPUs, but it confirmed that local LLMs were at least |
| 18 | +feasible. |
| 19 | + |
| 20 | +My [most recent demo](/posts/notes/post-2-demo-pgvector-2025082001/) with DeepSeek’s Qwen-1.5B highlighted a different challenge: |
| 21 | +RAM exhaustion when multiple models were loaded at once. That limitation only
| 22 | +came to light after I realized my singleton was two-timing me; until then, I'd
| 23 | +been kinda thinking I might fit several of those into VRAM at one time.
| 24 | + |
| 25 | +I also encountered issues with context length during ingestion, which I solved |
| 26 | +by tuning the `max_model_len` parameter in vLLM. While that fix worked, it raised
| 27 | +a broader question about how advertised context windows relate to configurable |
| 28 | +parameters. |
| 29 | + |
| 30 | +## Next demo(s): managing resources |
| 31 | +To get a better handle on the Gen AI capabilities of consumer-grade hardware, I
| 32 | +want to try:
| 33 | +* running multiple models at once
| 34 | +* running with the GPU making use of both VRAM and system DRAM
| 35 | +* figuring out what parameters are required to handle longer context windows,
| 36 | +  and what the memory implications are when scaling these up
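As a starting point for the last bullet, here is a rough sketch of how KV-cache memory grows with context length. It uses the common back-of-envelope estimate (one key and one value vector per token, per layer, per KV head) and ignores model weights, activations, and any overhead the serving framework adds. The function name `kv_cache_bytes` and the model shape are hypothetical, picked only for illustration and not read from any real model's config.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2, batch_size=1):
    """Rough KV-cache size in bytes for a decoder-only transformer."""
    # Each token stores one key and one value vector per layer and per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


# Hypothetical model shape, chosen only for illustration (assumption).
LAYERS, KV_HEADS, HEAD_DIM = 28, 2, 128

for context_len in (2048, 8192, 32768):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_len)
    print(f"{context_len:>6} tokens -> {size / 2**20:.0f} MiB")
```

Under these assumptions the estimate scales linearly with context length, so a 4x longer window needs roughly 4x the cache. That makes it a quick sanity check before raising a context-length cap, though real usage will differ from this estimate.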