# pfc: DuckDB Extension for PFC-JSONL

You have compressed log archives on disk. To query them, you normally decompress everything first, even if you only need one hour out of thirty days.

This extension changes that: query `.pfc` files directly from DuckDB SQL. A block index tells the extension exactly which chunks of the file to decompress, and the rest stays compressed.

> **Requires:** The `pfc_jsonl` binary installed on your machine (Step 1 below). The extension calls it for decompression.
>
> **Platform:** Linux x86_64 and macOS Apple Silicon (ARM64). There is no native Windows binary; Windows users must use WSL2 or a Linux machine.

```sql
INSTALL pfc FROM community;
LOAD pfc;
LOAD json;

SELECT
    line->>'$.level'   AS level,
    line->>'$.message' AS message
FROM read_pfc_jsonl('/var/log/events.pfc')
WHERE line->>'$.level' = 'ERROR';
```

[awesome-duckdb](https://github.com/davidgasquez/awesome-duckdb)

## What is PFC-JSONL?

[PFC-JSONL](https://github.com/ImpossibleForge/pfc-jsonl) is a high-performance compressed log format built for structured (JSONL) data. It achieves **better compression than gzip and zstd** on real log data while supporting **random block access**, meaning you can decompress only the time range you need.

Key properties:
- Each file is split into independently compressed blocks
- A `.pfc.bidx` binary index stores the byte offset and timestamp range of every block
- The PFC binary can decompress any subset of blocks in a single call
- **Free for personal and open-source use**: no account, no signup required

## How It Works (Architecture)

```
┌──────────────────────────────────────────────────────────────┐
│  DuckDB                                                      │
│                                                              │
│  SELECT * FROM read_pfc_jsonl('events.pfc', ts_from=...)     │
│           │                                                  │
│  ┌────────▼──────────┐    reads     ┌─────────────────────┐  │
│  │  pfc extension    │─────────────▶│  events.pfc.bidx    │  │
│  │  (MIT, open src)  │ block index  │  (block timestamps) │  │
│  └────────┬──────────┘              └─────────────────────┘  │
│           │  popen() / subprocess                            │
└───────────┼──────────────────────────────────────────────────┘
            │
            ▼
 ┌─────────────────────┐
 │  pfc_jsonl binary   │  ← proprietary, closed source
 │  (v3.4+, local)     │    contains BWT+rANS compression
 └─────────────────────┘
            │
            ▼
 decompressed JSON lines → back to DuckDB
```

The extension is a **thin open-source wrapper**: it reads the `.bidx` index in C++ to select which blocks are needed, then calls the PFC binary once to decompress only those blocks. The compression algorithm stays closed.

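The "one subprocess call, stream the lines back" pattern can be sketched in Python. This is an illustration only: the extension does this in C++ via `popen()`, and the real `pfc_jsonl` arguments for block selection are not shown in this README. The sketch therefore takes the whole command as a parameter, and the usage example substitutes a stand-in command (`fake_decompressor`, a hypothetical name) for the real binary.

```python
import subprocess
import sys
from typing import Iterator, Sequence


def stream_lines(command: Sequence[str]) -> Iterator[str]:
    """Run one subprocess and yield its stdout line by line.

    `command` stands in for the real pfc_jsonl invocation, whose
    arguments are not documented in this README.
    """
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for raw in proc.stdout:
            line = raw.rstrip("\n")
            if line:  # skip blank lines between records
                yield line
    # Leaving the `with` block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"decompressor exited with status {proc.returncode}")


# Stand-in command that prints two JSON lines, in place of the real binary.
fake_decompressor = [
    sys.executable,
    "-c",
    "print('{\"level\": \"INFO\"}'); print('{\"level\": \"ERROR\"}')",
]
rows = list(stream_lines(fake_decompressor))
```

Each yielded string corresponds to one row of the `line` column that the table function returns.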
## Installation

### Step 1: Install the PFC binary (once per machine)

The extension calls the `pfc_jsonl` binary for decompression.
Download the latest release for your platform:

**Linux x64:**
```bash
# Prefix the commands with sudo if /usr/local/bin is not writable by your user
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
     -o /usr/local/bin/pfc_jsonl
chmod +x /usr/local/bin/pfc_jsonl
pfc_jsonl --help   # verify install
```

**macOS (Apple Silicon M1/M2/M3/M4):**
```bash
# Prefix the commands with sudo if /usr/local/bin is not writable by your user
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
     -o /usr/local/bin/pfc_jsonl
chmod +x /usr/local/bin/pfc_jsonl
pfc_jsonl --help   # verify install
```

> **macOS Intel (x64):** Binary coming soon.

> **Custom path:** Set `PFC_JSONL_BINARY=/path/to/pfc_jsonl` in your environment to override the default `/usr/local/bin/pfc_jsonl`.

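The lookup order described in the note above (environment override first, then the default path) can be sketched as follows. This is a minimal illustration, not the extension's C++ code. The helper names `resolve_binary` and `is_usable` are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
import tempfile

DEFAULT_BINARY = "/usr/local/bin/pfc_jsonl"


def resolve_binary(env=None) -> str:
    """Return the binary path, preferring the PFC_JSONL_BINARY override."""
    env = os.environ if env is None else env
    # Assumption: an empty value is treated the same as an unset variable.
    return env.get("PFC_JSONL_BINARY") or DEFAULT_BINARY


def is_usable(path: str) -> bool:
    """True if `path` is an existing file with the execute bit set."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# Usage: point the override at a temporary executable file.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "pfc_jsonl")
    with open(fake, "w") as fh:
        fh.write("#!/bin/sh\n")
    os.chmod(fake, 0o755)
    chosen = resolve_binary({"PFC_JSONL_BINARY": fake})
    usable = is_usable(chosen)
```

A check like `is_usable` covers both causes named in the troubleshooting entry for a missing binary: the file is absent, or it is present but not executable.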
### Step 2: Install the DuckDB extension

```sql
INSTALL pfc FROM community;
LOAD pfc;
```

### Build from source (developers / early access)

```bash
git clone --recurse-submodules https://github.com/ImpossibleForge/pfc-duckdb
cd pfc-duckdb
GEN=ninja make release
# Extension at: build/release/extension/pfc/pfc.duckdb_extension
```

## Usage

### Basic query

```sql
LOAD pfc;

SELECT line FROM read_pfc_jsonl('/path/to/file.pfc');
```

Each row contains one raw JSON string in the `line` column.
Use the DuckDB `json` extension to parse fields:

```sql
LOAD json;

SELECT
    line->>'$.timestamp' AS ts,
    line->>'$.level'     AS level,
    line->>'$.message'   AS message,
    line->>'$.service'   AS service
FROM read_pfc_jsonl('/path/to/file.pfc');
```

### Timestamp-based block filtering

PFC files include a `.pfc.bidx` index with the timestamp range of each block.
Pass `ts_from` and/or `ts_to` (Unix seconds) to skip entire blocks before decompression:

```sql
-- Only decompress blocks that overlap the given time window
SELECT line
FROM read_pfc_jsonl(
    '/path/to/file.pfc',
    ts_from = 1767225600,  -- 2026-01-01 00:00:00 UTC
    ts_to   = 1767311999   -- 2026-01-01 23:59:59 UTC
);
```

Convert a timestamp string to Unix seconds with `epoch()`:

```sql
SELECT line
FROM read_pfc_jsonl(
    '/path/to/file.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-01-01 00:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-01-02 00:00:00+00')
);
```

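If the bounds are computed outside DuckDB, for example in a script that builds the query, the same conversion can be done in Python. The helper name `unix_seconds` is hypothetical. The sketch insists on timezone-aware datetimes so the result does not depend on the machine's local time zone.

```python
from datetime import datetime, timedelta, timezone


def unix_seconds(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole Unix seconds."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time surprises")
    return int(dt.timestamp())


day_start = datetime(2026, 1, 1, tzinfo=timezone.utc)
ts_from = unix_seconds(day_start)
ts_to = unix_seconds(day_start + timedelta(days=1)) - 1  # last second of the day

print(ts_from, ts_to)  # 1767225600 1767311999
```

The two values can then be passed as `ts_from` and `ts_to` to `read_pfc_jsonl`.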
### Combining block filter and row filter

`ts_from`/`ts_to` skip entire **blocks** (coarse, fast).
Add a `WHERE` clause for **row-level** precision:

```sql
LOAD json;

SELECT line->>'$.message' AS msg
FROM read_pfc_jsonl(
    '/var/log/api.pfc',
    ts_from = epoch(TIMESTAMPTZ '2026-03-15 08:00:00+00'),
    ts_to   = epoch(TIMESTAMPTZ '2026-03-15 10:00:00+00')
)
WHERE line->>'$.level' = 'ERROR';
```

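Because block selection works on whole blocks, a selected block can also carry rows from just outside the requested window. The Python sketch below illustrates the row-level pass on a handful of made-up lines. The field names `ts` and `level`, and the sample data, are assumptions about the log schema and not something the format prescribes.

```python
import json
from datetime import datetime, timezone


def ts(hour: int, minute: int = 0) -> int:
    """Unix seconds for a time on 2026-03-15 UTC."""
    return int(datetime(2026, 3, 15, hour, minute, tzinfo=timezone.utc).timestamp())


# Hypothetical lines from one block that overlaps the 08:00-10:00 window.
block_lines = [
    json.dumps({"ts": ts(7, 45), "level": "ERROR", "message": "before window"}),
    json.dumps({"ts": ts(8, 30), "level": "INFO", "message": "in window, not an error"}),
    json.dumps({"ts": ts(9, 10), "level": "ERROR", "message": "in window"}),
    json.dumps({"ts": ts(10, 20), "level": "ERROR", "message": "after window"}),
]


def row_filter(lines, ts_from, ts_to, level):
    """Row-level pass: keep only rows inside the window with the given level."""
    for line in lines:
        row = json.loads(line)
        if ts_from <= row["ts"] <= ts_to and row["level"] == level:
            yield row["message"]


matches = list(row_filter(block_lines, ts(8), ts(10), "ERROR"))
```

In SQL terms, this suggests adding a timestamp condition to the `WHERE` clause alongside the level check whenever the exact window matters.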
### Analytics examples

```sql
LOAD json;

-- Error count and total line count per hour
SELECT
    strftime(to_timestamp((line->>'$.ts')::BIGINT), '%Y-%m-%d %H:00') AS hour,
    count(*) FILTER (WHERE line->>'$.level' = 'ERROR') AS errors,
    count(*) AS total
FROM read_pfc_jsonl('/var/log/api.pfc')
GROUP BY hour ORDER BY hour;

-- Top 10 slowest endpoints by average duration
SELECT
    line->>'$.path' AS endpoint,
    avg((line->>'$.duration_ms')::DOUBLE) AS avg_ms,
    count(*) AS requests
FROM read_pfc_jsonl('/var/log/api.pfc')
GROUP BY endpoint ORDER BY avg_ms DESC LIMIT 10;
```

## API Reference

### `read_pfc_jsonl(path [, ts_from, ts_to])`

| Parameter | Type    | Default    | Description |
|-----------|---------|------------|-------------|
| `path`    | VARCHAR | (required) | Path to the `.pfc` file. A `.pfc.bidx` index must exist at `path + ".bidx"`. |
| `ts_from` | BIGINT  | 0          | Lower bound for block selection (Unix seconds). `0` = no lower bound. |
| `ts_to`   | BIGINT  | 0          | Upper bound for block selection (Unix seconds). `0` = no upper bound. |

**Returns:** table with one column `line VARCHAR`, one row per decompressed JSON line.

**Block filtering semantics:**
A block is included if its timestamp range `[ts_start, ts_end]` overlaps `[ts_from, ts_to]`.
Blocks with unknown timestamps are always included.
If both `ts_from` and `ts_to` are `0`, all blocks are read.

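The selection rules above can be sketched in Python. This is an illustration of the stated semantics, not the extension's C++ code. The `Block` fields are illustrative rather than the actual `.bidx` layout, unknown timestamps are modelled as `None`, and inclusive bounds are assumed from the closed-interval notation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """One index entry. Field names are illustrative, not the .bidx layout."""
    offset: int
    ts_start: Optional[int]  # None models an unknown timestamp
    ts_end: Optional[int]


def select_blocks(blocks: List[Block], ts_from: int = 0, ts_to: int = 0) -> List[Block]:
    """Return blocks overlapping [ts_from, ts_to]; a bound of 0 means unbounded."""
    if ts_from and ts_to and ts_from > ts_to:
        raise ValueError(f"ts_from ({ts_from}) must be <= ts_to ({ts_to})")
    selected = []
    for block in blocks:
        if block.ts_start is None or block.ts_end is None:
            selected.append(block)  # unknown timestamps are always included
            continue
        starts_in_time = ts_to == 0 or block.ts_start <= ts_to
        ends_in_time = ts_from == 0 or block.ts_end >= ts_from
        if starts_in_time and ends_in_time:
            selected.append(block)
    return selected


index = [
    Block(offset=0, ts_start=100, ts_end=199),
    Block(offset=4096, ts_start=200, ts_end=299),
    Block(offset=8192, ts_start=None, ts_end=None),
    Block(offset=12288, ts_start=300, ts_end=399),
]
picked = [b.offset for b in select_blocks(index, ts_from=250, ts_to=320)]
print(picked)  # [4096, 8192, 12288]
```

The first block ends before `ts_from` and is skipped. The block with unknown timestamps is kept because the index cannot rule it out.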
## File Requirements

| File            | Required | Description |
|-----------------|----------|-------------|
| `file.pfc`      | yes      | Compressed PFC-JSONL file |
| `file.pfc.bidx` | yes      | Binary block index (requires PFC-JSONL v3.4+) |

Generate both with the PFC binary:

```bash
pfc_jsonl compress input.jsonl output.pfc
# Produces: output.pfc + output.pfc.bidx
```

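Before pointing DuckDB at an archive, a script can confirm that both files are in place. The sketch below assumes only the naming rule stated in this README (the index lives at `path + ".bidx"`). The helper name `missing_files` is hypothetical, and the empty placeholder files are not valid PFC data.

```python
import tempfile
from pathlib import Path


def missing_files(pfc_path: str) -> list:
    """Return the required files that are absent for `pfc_path`."""
    data = Path(pfc_path)
    index = Path(pfc_path + ".bidx")  # the index lives at path + ".bidx"
    return [str(p) for p in (data, index) if not p.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "output.pfc"
    archive.write_bytes(b"")  # placeholder, not a real PFC file
    only_index_missing = missing_files(str(archive))
    Path(str(archive) + ".bidx").write_bytes(b"")
    nothing_missing = missing_files(str(archive))
```

An archive that reports a missing `.bidx` file was likely produced by an older PFC-JSONL version and needs to be recompressed.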
> **Note:** The Docker image on Docker Hub (`impossibleforge/pfc-jsonl`) is a server-side compression tool. It is **not** required for using the DuckDB extension; you only need the standalone `pfc_jsonl` binary from GitHub Releases.

## Performance

Block-level filtering can skip the majority of a file.
Example: for a 30-day log file with 720 hourly blocks, a 1-hour query reads **1 block** instead of 720.

| Query range | Blocks read | Reduction in blocks decompressed (720-block file) |
|-------------|-------------|---------------------------------------------------|
| 30 days     | 720/720     | 1×                                                |
| 1 day       | ~24/720     | ~30×                                              |
| 1 hour      | ~1/720      | ~720×                                             |

These figures count blocks decompressed, so they are an upper bound on the query speedup: fixed costs such as reading the index and starting the subprocess do not shrink with the query range.

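The numbers in the table follow from simple arithmetic, sketched below under the assumption of fixed one-hour blocks. The `aligned` flag is a hypothetical parameter: a window that does not start on a block boundary can touch one extra block, which is why the table uses approximate values.

```python
import math


def blocks_read(query_hours: float, block_hours: float = 1.0, aligned: bool = True) -> int:
    """Number of blocks a time window touches, for fixed-duration blocks."""
    touched = math.ceil(query_hours / block_hours)
    # A window that does not start on a block boundary can touch one more block.
    return touched if aligned else touched + 1


TOTAL_BLOCKS = 720  # 30 days of hourly blocks

for label, hours in [("30 days", 720), ("1 day", 24), ("1 hour", 1)]:
    n = min(blocks_read(hours), TOTAL_BLOCKS)
    print(f"{label:8s} {n:3d}/{TOTAL_BLOCKS}  {TOTAL_BLOCKS / n:.0f}x fewer blocks")
```

For an unaligned 1-hour window, `blocks_read(1, aligned=False)` gives 2, which halves the best-case reduction for that row.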
---

## Disclaimer

PFC-DuckDB is an independent open-source project and is not affiliated with, endorsed by, or associated with the DuckDB Foundation or DuckDB Labs.

## License

The PFC-JSONL binary is **free for personal and open-source use**: no account, no signup, no phone-home.

Commercial use requires a license. Contact: [info@impossibleforge.com](mailto:info@impossibleforge.com)

## Troubleshooting

**`Cannot open index file: /path/to/file.pfc.bidx`**
The `.pfc.bidx` index is missing. Recompress the source data with PFC-JSONL v3.4+ to generate it:
```bash
pfc_jsonl compress input.jsonl output.pfc
```

**`PFC binary not found at '/usr/local/bin/pfc_jsonl'`**
The binary is missing or not executable. Re-run the curl install command, or set `PFC_JSONL_BINARY=/path/to/pfc_jsonl`.

**`popen() failed — could not start PFC binary subprocess`**
The extension uses `popen()` to call the PFC binary. Windows is not supported; use WSL2 or a Linux machine.

**`ts_from (...) must be <= ts_to (...)`**
You passed an inverted time range. Swap the values so `ts_from` comes before `ts_to`.

## Related Projects

| Project | Description |
|---------|-------------|
| [pfc-jsonl](https://github.com/ImpossibleForge/pfc-jsonl) | The core binary: compress, decompress, query |
| [pfc-fluentbit](https://github.com/ImpossibleForge/pfc-fluentbit) | Stream Fluent Bit logs directly to `.pfc` archives |
| [pfc-migrate](https://github.com/ImpossibleForge/pfc-migrate) | Convert existing gzip/zstd/lz4 archives to PFC (local, S3, Azure, GCS) |
| [pfc-jsonl (PyPI)](https://pypi.org/project/pfc-jsonl/) | Python package (`pip install pfc-jsonl`) |
| [pfc-vector](https://github.com/ImpossibleForge/pfc-vector) | High-performance Rust ingest daemon for Vector.dev and Telegraf |
| [pfc-otel-collector](https://github.com/ImpossibleForge/pfc-otel-collector) | OpenTelemetry OTLP/HTTP log exporter |
| [pfc-kafka-consumer](https://github.com/ImpossibleForge/pfc-kafka-consumer) | Kafka / Redpanda consumer |
| [pfc-telegraf](https://github.com/ImpossibleForge/pfc-telegraf) | Telegraf HTTP output plugin → PFC |
| [pfc-grafana](https://github.com/ImpossibleForge/pfc-grafana) | Grafana data source plugin for PFC archives |

---

## License

The **pfc DuckDB extension** (this repository) is released under the **MIT License**; see [LICENSE](https://github.com/ImpossibleForge/pfc-duckdb/blob/main/LICENSE).

The **PFC-JSONL binary** (`pfc_jsonl`) is proprietary software that is free for personal and open-source use. Commercial use requires a license: [info@impossibleforge.com](mailto:info@impossibleforge.com)