Support pg_search as an additional BM25 full-text search backend #21

@Sanderhoff-alt

Hi pg0 maintainers,

I'd like to request support for the pg_search PostgreSQL extension from ParadeDB as an additional BM25 full-text search backend.

Why this would be valuable for pg0

pg0 already provides a strong zero-config PostgreSQL experience for local development, testing, CI, and AI/vector workloads by bundling PostgreSQL with pgvector.

Adding pg_search support would extend that same zero-config experience to modern full-text and hybrid search workloads:

  • BM25-ranked full-text search inside PostgreSQL
  • Configurable tokenizers for multilingual and CJK search
  • Better local/dev parity for applications that use Postgres for both vector search and keyword search
  • Fewer cases where local development needs a separate Elasticsearch/OpenSearch service
  • A stronger out-of-the-box story for AI/RAG/memory applications that combine dense and sparse retrieval

In other words, pgvector covers the dense/semantic side of retrieval, while a BM25 backend covers the sparse/keyword side. Supporting pg_search would make pg0 a more complete embedded/local Postgres backend for search-heavy applications.

Why pg_search in addition to pg_textsearch?

pg0 already documents support for installing pg_textsearch, which is valuable and covers BM25-style full-text search use cases.

This request is not meant to replace pg_textsearch, but to give pg0 users another important backend option with different tradeoffs:

  • pg_search is the ParadeDB search extension built around Elastic-style full-text search inside Postgres.
  • It provides BM25 scoring and a rich tokenizer story, including tokenizers useful for multilingual and CJK workloads.
  • It is useful for applications that want local/dev parity with deployments already targeting the ParadeDB pg_search API.
  • It is relevant for hybrid search applications that combine pgvector dense retrieval with pg_search sparse/BM25 retrieval.

So pg_textsearch and pg_search can coexist as complementary options. pg_textsearch gives pg0 users one BM25 path today; pg_search would make pg0 more flexible for applications that specifically need ParadeDB-compatible search syntax, tokenizer configuration, or parity with pg_search deployments.

Preferred implementation direction

Bundling would likely be the preferred path if feasible, because it fits pg0's current architecture and zero-config/offline model.

From the current build.rs, pg0 does not vendor PostgreSQL or pgvector source directly. Instead, it downloads prebuilt artifacts at build time and embeds them into the pg0 binary:

  • PostgreSQL from theseus-rs/postgresql-binaries
  • pgvector from nicoloboschi/pgvector_compiled

Given that model, pg_search support could follow a similar pattern:

  1. Maintain or consume a prebuilt pg_search artifact release keyed by PostgreSQL major version and target platform.

  2. Have pg0 download the matching tarball during build.

  3. Bundle it into the pg0 binary.

  4. Extract the extension files into the embedded PostgreSQL installation at runtime.

  5. Let users enable it with:

    CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;

That would preserve pg0's zero-config/offline/predictable experience better than downloading arbitrary upstream artifacts at install time. An on-demand pg0 install-extension pg_search path could still be useful as a fallback if bundling is too large or too costly to support across all platforms.
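
As a rough illustration of steps 1 and 2, here is a minimal sketch of how build.rs might resolve which prebuilt pg_search tarball to download. Everything named here is hypothetical: the `Artifact` struct, the `pg_search_artifact` and `pg_major` helpers, the file naming scheme, and the `example.invalid` base URL are assumptions for illustration only, not an existing pg0 or ParadeDB API or release layout.

```rust
/// Hypothetical description of one prebuilt pg_search artifact.
#[derive(Debug, PartialEq)]
struct Artifact {
    file_name: String,
    url: String,
}

/// Placeholder release location (assumption): no such repository is known to exist.
const BASE_URL: &str = "https://example.invalid/pg_search_compiled/releases/download";

/// Extract the major version from a full PostgreSQL version such as "17.2.0".
/// Returns None if the first dot-separated component is not a number.
fn pg_major(pg_version: &str) -> Option<u32> {
    pg_version.split('.').next()?.parse().ok()
}

/// Build the artifact name and URL keyed by PostgreSQL major version,
/// extension version, and Rust target triple (naming scheme is an assumption).
fn pg_search_artifact(pg_version: &str, ext_version: &str, target: &str) -> Option<Artifact> {
    let major = pg_major(pg_version)?;
    let file_name = format!("pg_search-{ext_version}-pg{major}-{target}.tar.gz");
    let url = format!("{BASE_URL}/v{ext_version}/{file_name}");
    Some(Artifact { file_name, url })
}
```

In a build script, the target triple would typically come from the `TARGET` environment variable that Cargo sets for build scripts, and a `None` result could be turned into a build error so that unsupported PostgreSQL versions fail early rather than at runtime.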

Requested support

It would be useful if pg0 could support pg_search in one of these forms:

  1. Preferably bundle pg_search in pg0 releases, so users can run:

    CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;

  2. Or provide an official on-demand installer that downloads pinned, pg0-tested artifacts known to match pg0's bundled PostgreSQL version and platform.

  3. Or document whether pg_search is intentionally out of scope due to build, licensing, portability, binary size, or maintenance constraints.

Example downstream use case: Hindsight

Hindsight currently uses pg0 as its default embedded PostgreSQL backend. The two projects are decoupled, but Hindsight is a concrete example of the kind of application that would benefit from this support.

Hindsight's retrieval pipeline combines:

  • semantic/vector retrieval through pgvector
  • keyword/BM25 retrieval through a configurable text-search backend

For multilingual memory banks, especially Chinese/Japanese/Korean or mixed-language content, pg_search is attractive because it provides BM25 scoring plus configurable tokenizers such as jieba, chinese_compatible, lindera(...), icu, ngram(...), and others.

A desired local setup would be:

CREATE EXTENSION IF NOT EXISTS vector CASCADE;
CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;

with an application configured to use pgvector for vector search and pg_search for BM25 text search.

Thanks for considering it.
