[Enhancement] Enable LibreOffice document conversion in the sandbox #116

@djuillard

Hi @usnavy13,

I have a small patch (~76 lines, 4 commits) that enables LibreOffice document conversion to work inside the existing nsjail sandbox, plus a couple of related fixes I hit along the way. I wanted to discuss the direction before opening a PR.

Context

LibreOffice is already installed in the upstream image (Dockerfile lines 54-55: libreoffice-impress libreoffice-writer libreoffice-calc libreoffice-common), but invoking soffice --headless --convert-to pdf from the sandboxed Python/Bash runtime currently fails with three distinct errors:

  1. ERROR: /proc not mounted - LibreOffice is unlikely to work well if at all — soffice requires /proc to be visible.
  2. seccomp bind syscall blocked — soffice's oosplash and soffice.bin communicate via AF_UNIX sockets internally.
  3. Font discovery fails — /etc/fonts and /usr/share/fonts are not bind-mounted into the sandbox, so soffice falls back to a single internal font and produces unreadable output.

The consequence is that the LibreOffice binary ships in the image but is unusable from the sandbox in practice.
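For reference, the kind of call that currently fails can be sketched from the Python runtime as follows. This is a minimal sketch, not code from the repo: the helper names `build_convert_command` and `convert_to_pdf` and the 120-second timeout are hypothetical, and only the `--headless`, `--convert-to` and `--outdir` options of soffice are relied on.

```python
import subprocess
from pathlib import Path


def build_convert_command(src: str, outdir: str) -> list[str]:
    """Build the soffice argv for a headless PDF conversion (hypothetical helper)."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, src]


def convert_to_pdf(src: str, outdir: str, timeout: float = 120.0) -> Path:
    """Run the conversion and return the path of the produced PDF.

    The timeout value is an assumption chosen for illustration.
    """
    result = subprocess.run(
        build_convert_command(src, outdir),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"soffice failed: {result.stderr.strip()}")
    # Check the output file as well as the exit code.
    pdf = Path(outdir) / (Path(src).stem + ".pdf")
    if not pdf.exists():
        raise RuntimeError(f"soffice produced no output for {src}")
    return pdf
```

Inside the unpatched sandbox, a call such as `convert_to_pdf("/tmp/report.docx", "/tmp/out")` would be expected to raise because of the three errors listed above.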

What this enhancement would enable

Once these gaps are closed, LibreChat can deliver document-processing skills that shell out to soffice (DOCX → PDF, XLSX formula recalc, PPTX thumbnailing, etc.) without the user shipping a custom image. The Anthropic-style office skills (docx/pptx/xlsx) are the obvious consumers.

Proposed changes

Branch: https://github.com/On-Behalf-AI/LibreCodeInterpreter/tree/enhanced-runtime
Diff vs. usnavy13/main: 9 files changed, ~60 lines added.

It took me 4 commits, but the main additions are:

Commit 1 (On-Behalf-AI@c4573c0) + Commit 4 (On-Behalf-AI@291fe71): runtime dependencies

  • Dockerfile (+2 lines, into the existing apt-install block): add fonts-liberation, fonts-dejavu-core (for Latin-script rendering), qpdf (PDF post-processing).
  • docker/requirements/python-documents.txt (+3 lines): add pypdf, pdfplumber, markitdown[pptx]; remove pdfminer>=20191125 and keep only pdfminer.six>=20221105.
    The legacy pdfminer package (last release 2019, Python 2 era) and pdfminer.six (its maintained successor) both install to /pdfminer/, so the older one shadows the newer one. This breaks pdfplumber (and any consumer that imports recent symbols) with:
    ImportError: cannot import name 'PDFStackT' from 'pdfminer.pdfinterp'
    pdfminer.six is the correct dependency for modern PDF tooling; the legacy pdfminer entry appears to have been an oversight in the initial commit.

  • docker/requirements/nodejs.txt (+2 lines): docx, pptxgenjs.
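The pdfminer / pdfminer.six collision described above can be detected before it surfaces as an ImportError. The following is a minimal sketch using only the standard library; the function names are hypothetical and it is not part of the proposed patch.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    """Normalize a distribution name so that 'pdfminer.six' becomes 'pdfminer-six'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pdfminer_conflict(dist_names) -> bool:
    """Return True when both the legacy and the maintained package are present."""
    names = {normalize(n) for n in dist_names}
    return {"pdfminer", "pdfminer-six"} <= names


def installed_distribution_names() -> list[str]:
    """List the names of all distributions visible to the current interpreter."""
    return [
        dist.metadata["Name"]
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    ]
```

Running `pdfminer_conflict(installed_distribution_names())` in the image would flag the situation where both entries from the requirements file ended up installed.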

Commit 2 — sandbox patches (On-Behalf-AI@51365ef)

  • docker/nsjail-base.cfg (+26 lines): 3 read-only bind-mounts for /etc/libreoffice, /etc/fonts, /usr/share/fonts.
  • src/services/sandbox/executor.py: add py/python to the languages with /proc unmasked ({java, rs, bash} → {java, rs, py, python, bash}).
    Add XDG_CONFIG_HOME=/tmp/.config so soffice can write its first-run profile.
  • src/services/sandbox/nsjail.py: allow the bind syscall for py, python, and java in addition to bash ({bash} → {py, python, java, bash}).
  • src/services/sandbox/pool.py + src/services/programmatic.py: remove the explicit /proc mask so REPL/PTC can invoke soffice too.
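The per-language gating in Commit 2 amounts to two membership checks. The sketch below is illustrative only and does not reproduce the actual code in executor.py or nsjail.py: the constant and function names are hypothetical, and applying XDG_CONFIG_HOME to every language is an assumption. The two language sets are the ones listed above.

```python
# Languages for which /proc is left unmasked after the patch.
PROC_UNMASKED = frozenset({"java", "rs", "py", "python", "bash"})

# Languages for which the bind syscall is allowed after the patch.
BIND_ALLOWED = frozenset({"py", "python", "java", "bash"})

# Extra environment so soffice can write its first-run profile.
# Assumption: set for every language, which the issue does not specify.
SOFFICE_ENV = {"XDG_CONFIG_HOME": "/tmp/.config"}


def sandbox_policy(language: str) -> dict:
    """Return the relaxations that apply to a given language (hypothetical helper)."""
    return {
        "unmask_proc": language in PROC_UNMASKED,
        "allow_bind": language in BIND_ALLOWED,
        "env": dict(SOFFICE_ENV),
    }
```

Note that under these sets `rs` gets /proc but not bind, while any language outside both sets keeps the stricter defaults.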

Commit 3 — tesseract languages + raised memory limit (On-Behalf-AI@b263215)

  • Dockerfile: add tesseract-ocr-{fra,deu,spa,ita} language packs. The base tesseract-ocr ships only English; non-English OCR fails silently. The four common Western European packs add ~30 MB to the image and unlock pytesseract.image_to_string(lang="fra+eng") etc.
  • docker/nsjail-base.cfg: raise rlimit_as from 512 MiB to 1024 MiB. PDF OCR workflows render pages at 200 DPI (pdf2image → tesseract), which can momentarily allocate 400-700 MiB per page on a 4-page multilingual PDF. With the 512 MiB limit the process was OOM-killed even with del + gc.collect() between pages; 1024 MiB is sufficient.
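Because a missing language pack is easy to overlook, a caller could check the requested languages up front. This is a minimal sketch with hypothetical helper names; it assumes pytesseract is installed and relies on its `get_languages` function, imported lazily so the pure helper stays usable without it.

```python
def missing_languages(lang_spec: str, installed) -> list[str]:
    """Return the codes from a spec like 'fra+eng' that are not installed."""
    requested = [part for part in lang_spec.split("+") if part]
    available = set(installed)
    return [part for part in requested if part not in available]


def check_ocr_languages(lang_spec: str) -> None:
    """Raise early if tesseract lacks any requested language pack."""
    import pytesseract  # imported here so missing_languages has no dependency

    missing = missing_languages(lang_spec, pytesseract.get_languages(config=""))
    if missing:
        raise RuntimeError(f"tesseract language packs not installed: {missing}")
```

Calling `check_ocr_languages("fra+eng")` before `pytesseract.image_to_string(..., lang="fra+eng")` turns a missing pack into an explicit error.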

Security considerations

The two sandbox relaxations extend existing carve-outs (/proc visibility for java, rs, and bash; the bind syscall for bash) to other interpreter runtimes. It's worth being explicit about why this stays acceptable:

  1. /proc visibility: nsjail's PID namespace already restricts /proc to processes inside the sandbox. The only host information that leaks (via /proc/cpuinfo and /proc/meminfo) was already exposed to Bash users under the existing model, so extending it to Python and the REPL doesn't change the threat surface for the trusted-tenant deployment model these languages target.
  2. bind(2) syscall: it was blocked to prevent server sockets, but the network namespace isolation (--iface_no_lo in the existing config) already prevents external connections. Allowing AF_UNIX bind, which is what soffice needs, therefore does not re-enable any network reachability path that the kernel-level network isolation hasn't already closed.
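To make the second point concrete, here is what an AF_UNIX bind looks like: the socket is bound to a filesystem path rather than to a network address. This is a minimal standard-library sketch with a hypothetical function name; it illustrates the socket type only and is not how soffice itself is implemented.

```python
import os
import socket
import tempfile


def unix_socket_roundtrip(payload: bytes) -> bytes:
    """Send a payload over an AF_UNIX stream socket bound to a temporary path."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            # bind(2) on AF_UNIX takes a filesystem path, not an IP and port.
            server.bind(path)
            server.listen(1)
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(path)
                conn, _ = server.accept()
                with conn:
                    client.sendall(payload)
                    return conn.recv(len(payload))
```

Only a process that can see the socket path inside the sandbox filesystem can connect to it.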

The three new bind-mounts (/etc/libreoffice, /etc/fonts, /usr/share/fonts) are all read-only and standard system locations — they expose no user data and no writable surface.

Next: your guidance

If you're broadly OK with the direction, I'll open a PR with the same four commits. If you'd prefer, I can also add a build-arg opt-in (e.g. DOCUMENT_PROCESSING=1) so that users who don't need soffice in-sandbox keep the stricter security envelope.

And many thanks for all the work on this repo. It's been a great foundation for our LibreChat deployment!
