Common issues and solutions when running PortOS.
Symptom: Server fails to start with EADDRINUSE error.
Solution:
# Find what's using the port
lsof -i :5554
lsof -i :5555
# Kill the process or choose different ports in ecosystem.config.cjs
Symptom: pm2 start ecosystem.config.cjs shows process but status is errored.
Solution:
# Check PM2 logs for errors
pm2 logs portos-server --lines 100
# Common causes:
# - Missing dependencies: npm run install:all
# - Missing data directory: mkdir -p data
# - Port conflict: check EADDRINUSE errors
Symptom: Server crashes with ENOENT errors about files in data/.
Solution:
# Copy sample data files
cp -r data.reference/* data/
Symptom: PortOS works on localhost but not from phone/tablet.
Causes and Solutions:
- Tailscale not connected: Ensure both devices are on same Tailscale network
- Firewall blocking: Check local firewall allows ports 5554-5555
- Server bound to localhost: PortOS should bind to 0.0.0.0 (default)
# Verify server is listening on all interfaces
netstat -an | grep 5555
# Should show: *.5555 or 0.0.0.0:5555
Symptom: Real-time features (logs, CoS updates) stop working.
Solution:
- Check browser console for WebSocket errors
- Verify server is running: pm2 status
- Restart server: pm2 restart ecosystem.config.cjs
Symptom: DevTools runs fail with "command not found".
Solution:
# Install Claude Code globally
npm install -g @anthropic-ai/claude-code
# Verify installation
which claude
claude --version
Symptom: AI runs fail with authentication errors.
Solution:
- Check provider configuration in PortOS Settings
- Verify API key is valid and has credits
- For Claude: ensure ANTHROPIC_API_KEY is set
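A quick way to confirm the key is visible in the current shell is sketched below. `require_env` is a hypothetical helper, not part of PortOS, and it assumes PortOS reads the key from the process environment rather than only from its Settings page.

```shell
# Return 0 if the named environment variable is set and non-empty.
# Prints only the length of the value, never the secret itself.
require_env() {
  name=$1
  # Indirect lookup that works in POSIX sh (no bash-only ${!name}).
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "missing: $name" >&2
    return 1
  fi
  echo "ok: $name is set (${#value} chars)"
}
```

Usage: `require_env ANTHROPIC_API_KEY`. If the key was exported after the server started, PM2 may still hold the old environment; `pm2 restart ecosystem.config.cjs --update-env` asks PM2 to refresh it.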
Symptom: Error "model: xyz not found" or similar.
Solution:
- Verify model name matches provider's available models
- Check provider documentation for correct model identifiers
- Common models:
  - Claude: claude-sonnet-5, claude-opus-4-8, claude-haiku-4-5-20251001
  - OpenAI: gpt-5, gpt-5-mini
  - Ollama: Model must be pulled first (ollama pull llama3)
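For the Ollama case, the check-then-pull step can be sketched as follows. `has_ollama_model` is a hypothetical helper that reads `ollama list` output on stdin, and it assumes the usual layout of that output: one header row, then one model per row with the name in the first column.

```shell
# Succeeds if the `ollama list` output on stdin contains the given model.
# Matches either the full NAME column ("llama3:latest") or the part before ":".
has_ollama_model() {
  awk -v want="$1" '
    NR > 1 {                     # skip the header row
      split($1, parts, ":")
      if ($1 == want || parts[1] == want) found = 1
    }
    END { exit found ? 0 : 1 }
  '
}
```

Usage: `ollama list | has_ollama_model llama3 || ollama pull llama3`.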
Symptom: CoS page shows "Stopped" status.
Solution:
- Click "Start" button in CoS UI
- Or enable alwaysOn: true in CoS config
- Check server logs for startup errors
Symptom: Tasks stay in "pending" status, no agents start.
Solution:
# Check CoS runner is running
pm2 status | grep portos-cos
# Check runner logs
pm2 logs portos-cos --lines 100
# Verify Claude CLI is available
which claude
Symptom: Added tasks to TASKS.md but CoS ignores them.
Solution:
- Verify task format matches expected syntax:
  ## Pending
  - [ ] #task-001 | HIGH | Task description
- Check file path in CoS config matches your TASKS.md location
- Trigger manual evaluation via UI
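A rough lint for the task syntax above, as a sketch only. The pattern is inferred from the single example line; the real CoS parser may accept other forms. Treating `x` as the checked state and an uppercase word as the priority are assumptions, and both function names are hypothetical.

```shell
# Succeeds if a line looks like:  - [ ] #task-001 | HIGH | Task description
is_task_line() {
  printf '%s\n' "$1" | grep -Eq '^- \[[ xX]\] #[A-Za-z0-9_-]+ \| [A-Z]+ \| .+$'
}

# Print every checkbox line in a TASKS.md-style file that does not match.
lint_tasks() {
  grep -E '^- \[' "$1" | while IFS= read -r line; do
    is_task_line "$line" || printf 'suspect: %s\n' "$line"
  done
}
```

Usage: `lint_tasks TASKS.md`. No output means every checkbox line matched the assumed pattern.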
Symptom: Memory search returns no results, embeddings fail.
Solution:
- Ensure LM Studio is running on port 1234
- Load an embedding model in LM Studio (e.g., nomic-embed-text)
- Check memory embeddings status: GET /api/memory/embeddings/status
Symptom: PM2 shows high restart count, app unstable.
Solution:
# Check for crash reason
pm2 logs portos-server --lines 200
# Common causes:
# - Unhandled exceptions (check error handling)
# - Memory limit exceeded (increase max_memory_restart)
# - Missing environment variables
Symptom: pm2 stop doesn't work or processes restart.
Solution:
# Stop specific ecosystem
pm2 stop ecosystem.config.cjs
# Never use these (affects all PM2 apps):
# pm2 kill ← Don't use
# pm2 delete all ← Don't use
Symptom: Code changes don't take effect.
Solution:
# Restart to pick up changes
pm2 restart ecosystem.config.cjs
# For frontend changes, Vite hot-reload should work
# For server changes, PM2 watch mode can help (if enabled)
Symptom: Server fails fast at startup with a database health error.
PostgreSQL is a mandatory dependency (see STORAGE.md) — there is no silent file fallback.
Solution:
# Re-run DB provisioning (system pg on :5432 or Docker on :5561)
npm run setup:db
# Docker mode: make sure the container is up
docker compose up -d
# Check what's answering
pg_isready -h localhost -p 5432 || pg_isready -h localhost -p 5561
Universes, series, catalog ingredients, memories, and other relational records live in PostgreSQL, not data/ files. Inspect them via the Database settings tab or psql. To recover, restore a snapshot's portos-db.sql from the Backup tab (see BACKUP.md).
Symptom: Apps disappear after restart.
Causes:
- data/apps.json was deleted or corrupted (app registry is file-backed)
- File permissions prevent writing
Solution:
# Check file exists and is valid JSON
cat data/apps.json | jq .
# If corrupted, restore from backup or recreate
Symptom: Action history clears on restart.
Solution:
- Check data/history.jsonl exists and is writable
- Verify disk space available
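Both checks can be run in one step, as sketched below. `check_writable_file` is a hypothetical helper; it reports free space on the file's filesystem but does not judge whether that amount is enough.

```shell
# Report whether a file exists, is writable, and how much space its filesystem has.
check_writable_file() {
  file=$1
  if [ ! -e "$file" ]; then echo "missing: $file"; return 1; fi
  if [ ! -w "$file" ]; then echo "not writable: $file"; return 1; fi
  # df -P gives stable POSIX columns; with -k, column 4 is available space in KB.
  avail_kb=$(df -Pk "$file" | awk 'NR == 2 { print $4 }')
  echo "ok: $file (${avail_kb} KB free on its filesystem)"
}
```

Usage: `check_writable_file data/history.jsonl`.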
Causes and Solutions:
- Large log files: Clear old logs with pm2 flush
- Many apps: Pagination added in recent versions
- Network latency: Use local access when possible
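Before flushing, it can help to see which logs are actually large. This sketch assumes PM2's default log directory (`~/.pm2/logs`, used when `PM2_HOME` is not overridden). The 50 MiB default is arbitrary and `large_logs` is a hypothetical helper.

```shell
# List *.log files larger than a size threshold (in MiB) under a directory.
large_logs() {
  dir=${1:-"$HOME/.pm2/logs"}
  min_mib=${2:-50}
  find "$dir" -type f -name '*.log' -size +"${min_mib}"M 2>/dev/null
}
```

Usage: `large_logs` for the defaults, or `large_logs ~/.pm2/logs 10` for a lower threshold.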
Solution:
# Check PM2 memory usage
pm2 monit
# Set memory limits in ecosystem.config.cjs
max_memory_restart: '500M'
Symptom: AI runs hit timeout before completing.
Solution:
- Increase timeout in provider settings
- Break large tasks into smaller chunks
- Check network connectivity to AI provider
Symptom: Frontend changes require manual refresh.
Solution:
- Check Vite is running: pm2 logs portos-client
- Ensure file watchers aren't exhausted: fs.inotify.max_user_watches
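On Linux, the watcher limit can be read from `/proc`. The sketch below flags a low limit. The 524288 threshold is a commonly used value, not a documented PortOS requirement, and `inotify_limit_low` is a hypothetical helper.

```shell
# Succeeds (exit 0) when the inotify watch limit is below a threshold.
# The limit file path is a parameter so the check can be exercised anywhere.
inotify_limit_low() {
  limit_file=${1:-/proc/sys/fs/inotify/max_user_watches}
  threshold=${2:-524288}
  [ -r "$limit_file" ] || return 2      # not Linux, or file unreadable
  limit=$(cat "$limit_file")
  [ "$limit" -lt "$threshold" ]
}
```

Usage: `inotify_limit_low && sudo sysctl fs.inotify.max_user_watches=524288`. Exit status 2 means the limit file is not readable, as on macOS, where this setting does not apply.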
Solution:
cd server
npm test
# For specific test file
npm test -- taskParser.test.js
# Watch mode for development
npm run test:watch
Symptom: The whole machine hard-reboots while an mflux LoRA training run is
active. After reboot you may see downstream PortOS errors — training failed with
SIGINT/KeyboardInterrupt, Tombstone sweep failed: timeout … connect, CoS
xhr poll error. The crash report under /Library/Logs/DiagnosticReports/ reads:
panic(cpu N caller 0x…): watchdog timeout: no checkins from watchdogd in 90 seconds
Cause: A system-level hang (not a PortOS or training-script bug) — the machine
stopped making forward progress long enough that the hardware watchdog
force-rebooted it. On new Apple Silicon (M5 / Mac17,7) under sustained Metal/GPU
load this is most likely a GPU/Metal driver hang; thermal/power or severe swap
thrash are secondary possibilities. First observed 2026-06-13 (twice in one day).
Upstream root cause (mlx #3267 / #3186): This is an Apple Metal/IOGPU driver
behavior, confirmed across M2/M3/M4/M5 hardware, not a PortOS bug. Two related
forms: the GPU watchdog kills a training process whose command buffers compete
with active-display WindowServer compositing
(mlx #3267,
kIOGPUCommandBufferCallbackErrorImpactingInteractivity), escalating on M5-class
silicon to the full watchdogd kernel panic above; and a separate IOGPU
memory-management panic (mlx #3186,
filed with Apple as FB22091885). The MLX maintainer's confirmed workaround is the
AGX_RELAX_CDM_CTXSTORE_TIMEOUT=1 env var — PortOS now sets this automatically
for the trainer (scripts/train_mflux_lora.py). Note per #3267 the kill is at the
IOGPU layer above the process boundary, so process-teardown segmentation alone
does not prevent it; the env var attacks the actual cause.
Strongest user-side mitigation — keep the display off the GPU. The watchdog
fires hardest when training competes with the active display. Run training with the
display asleep / lid closed under caffeinate -s &, or drive the box headless over
SSH from another machine. With no WindowServer compositing the watchdog has nothing
to protect and largely stops firing.
Validated on M5 Max (2026-06-27): keeping the display off is what works. A clean A/B on the same 9B-bf16 segmentation-OFF config, both with AGX_RELAX_CDM_CTXSTORE_TIMEOUT=1 set:
- Display active → hard-rebooted within minutes (at step 0). Telemetry: Nominal thermal pressure (a GPU driver hang, not heat); paniclog top-CPU thread was WindowServer (active-display contention).
- Display asleep (pmset displaysleepnow, lid open) → reached step 302, clearing the documented 150–300 panic window with no panic.
So for a 9B / heavy bf16 run, the env var is necessary but not sufficient; you must also keep the display off: drive the box over SSH (cleanest), or run pmset displaysleepnow right after launching. caffeinate -s alone does not turn the display off; it only prevents system sleep, and closing the lid sleeps the whole machine (suspending the run) unless an external display is attached. Also prefer segmentation ON (the shipped default), which completed a full 4B run cleanly on this box.
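If a long run is started from a terminal or SSH session rather than from the PortOS UI, the two steps (hold the system awake, then sleep the display) can be combined as sketched here. `run_display_off` and `DISPLAY_OFF_DELAY` are hypothetical names, and the command to run is whatever you would otherwise launch by hand. `caffeinate` and `pmset` are macOS tools, so on other systems the wrapper just runs the command.

```shell
# Run a long command while keeping the system awake and the display asleep.
run_display_off() {
  if command -v caffeinate >/dev/null 2>&1; then
    caffeinate -s "$@" &           # -s: prevent system sleep (effective on AC power)
  else
    "$@" &
  fi
  job=$!
  sleep "${DISPLAY_OFF_DELAY:-5}"  # hypothetical knob: give the run time to start
  if command -v pmset >/dev/null 2>&1; then
    pmset displaysleepnow          # sleep the display; the machine stays awake
  fi
  wait "$job"                      # block until the wrapped command finishes
}
```

Usage: `run_display_off <your long-running command>`. Waking the display mid-run brings WindowServer compositing back into contention, so leave it asleep until the run finishes.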
Mitigations already in place:
- AGX_RELAX_CDM_CTXSTORE_TIMEOUT=1 is set automatically by the trainer (the maintainer-confirmed workaround for mlx #3267); the run log shows STATUS:watchdog mitigation · AGX_RELAX_CDM_CTXSTORE_TIMEOUT=1 so a paniclog records whether it was active. Set it to 0 in the environment to disable.
- The mlx/mlx-metal backend is pinned to the validated 0.31.2 trio in scripts/setup-image-video.sh (the original panics were on 0.30.6).
- Training checkpoints at least every ceil(totalSteps/4) steps (MFLUX_MIN_CHECKPOINTS), so a crash loses at most ~¼ of a run. Resume from the newest checkpoints/*.zip via the UI's resume action or --resume-checkpoint.
- Each run captures GPU/thermal/power telemetry to <run>/powermetrics.log (a resume rolls to a timestamped powermetrics.<ts>.log so the pre-crash log is preserved) when passwordless powermetrics is configured (see the incident record for the sudoers rule).
- Memory pressure is bounded before each run (memoryPrep.js): resident Ollama / LM Studio models on loopback-local backends are unloaded to free unified memory (a remote LAN backend is left untouched), the encoded-dataset cache always spills to disk (low_ram), the quantize tier is sized to available memory rather than total RAM, and a run refuses to start when under ~24 GB is free, so an oversubscribed run can't swap-thrash the box into a reboot. If a run won't start with a "not enough free memory" error, stop other model servers or close apps and retry.
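To estimate how many steps a crash can cost, the ceil(totalSteps/4) rule can be worked out in shell arithmetic. This is a sketch of the arithmetic only. Treating MFLUX_MIN_CHECKPOINTS as the divisor with a default of 4 is an assumption drawn from the description above, and `checkpoint_interval` is a hypothetical name.

```shell
# Ceiling division in integer shell arithmetic: ceil(a / b) == (a + b - 1) / b
checkpoint_interval() {
  total_steps=$1
  min_checkpoints=${2:-4}   # assumed meaning of MFLUX_MIN_CHECKPOINTS
  echo $(( (total_steps + min_checkpoints - 1) / min_checkpoints ))
}
```

For example, `checkpoint_interval 1000` prints 250: under this rule a 1000-step run checkpoints at least every 250 steps, which is the most a crash would lose.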
What to do / how to investigate: see the full incident record and checklist in
docs/research/2026-06-13-mflux-training-watchdog-panic.md.
Short version: read the run's newest powermetrics*.log (climbing GPU temp → cooling/power;
log just stops at normal temps → driver hang), reduce batch size/resolution/rank as
a test, and update macOS + mflux/mlx.
- Check logs: pm2 logs shows all process output
- Browser console: F12 → Console for frontend errors
- Server logs: Look for emoji prefixes (❌ errors, ⚠️ warnings)
- GitHub Issues: Report bugs at https://github.com/atomantic/PortOS/issues