Commit 358ef6d

fix: pfc-vector description — Telegraf has its own repo (pfc-telegraf)
1 parent f745e8f commit 358ef6d

1 file changed

Lines changed: 0 additions & 288 deletions

README.md

@@ -1,288 +0,0 @@
1-
# pfc — DuckDB Extension for PFC-JSONL
2-
3-
You have compressed log archives on disk. To query them you normally decompress everything first — even if you only need one hour out of thirty days.
4-
5-
This extension changes that. Query `.pfc` files directly from DuckDB SQL. A block index tells the extension exactly which chunks of the file to decompress — the rest stays compressed.
6-
7-
> **Requires:** The `pfc_jsonl` binary installed on your machine (Step 1 below). The extension calls it for decompression.
8-
>
9-
> **Platform:** Linux x86_64 and macOS Apple Silicon (ARM64). No native Windows binary — Windows users must use WSL2 or a Linux machine.
10-
11-
```sql
12-
INSTALL pfc FROM community;
13-
LOAD pfc;
14-
LOAD json;
15-
16-
SELECT
17-
line->>'$.level' AS level,
18-
line->>'$.message' AS message
19-
FROM read_pfc_jsonl('/var/log/events.pfc')
20-
WHERE line->>'$.level' = 'ERROR';
21-
```
22-
23-
[![Awesome DuckDB](https://awesome.re/mentioned-badge.svg)](https://github.com/davidgasquez/awesome-duckdb)
24-
25-
## What is PFC-JSONL?
26-
27-
[PFC-JSONL](https://github.com/ImpossibleForge/pfc-jsonl) is a compressed log format built for structured (JSONL) data. The project reports **better compression than gzip and zstd** on real log data, and the format supports **random block access**, meaning you can decompress only the time range you need.
28-
29-
Key properties:
30-
- Each file is split into independently compressed blocks
31-
- A `.pfc.bidx` binary index stores the byte offset and timestamp range of every block
32-
- The PFC binary can decompress any subset of blocks in a single call
33-
- **Free for personal and open-source use** — no account, no signup required
34-
35-
## How It Works (Architecture)
36-
37-
```
┌──────────────────────────────────────────────────────────────┐
│  DuckDB                                                      │
│                                                              │
│  SELECT * FROM read_pfc_jsonl('events.pfc', ts_from=...)     │
│           │                                                  │
│  ┌────────▼──────────┐    reads     ┌─────────────────────┐  │
│  │  pfc extension    │─────────────▶│  events.pfc.bidx    │  │
│  │  (MIT, open src)  │  block index │  (block timestamps) │  │
│  └────────┬──────────┘              └─────────────────────┘  │
│           │  popen() / subprocess                            │
└───────────┼──────────────────────────────────────────────────┘
            │
            ▼
 ┌─────────────────────┐
 │  pfc_jsonl binary   │  ← proprietary, closed source
 │  (v3.4+, local)     │    contains BWT+rANS compression
 └─────────────────────┘
            │
            ▼
  decompressed JSON lines → back to DuckDB
```
59-
60-
The extension is a **thin open-source wrapper** — it reads the `.bidx` index in C++ to select which blocks are needed, then calls the PFC binary once to decompress only those blocks. The compression algorithm stays closed.
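
The subprocess hand-off described above can be pictured with a small Python sketch. This is not the extension's code (which is C++ and uses `popen()`), and the child process here is a stand-in Python one-liner rather than `pfc_jsonl`, whose command-line arguments for block selection are not documented in this README. The sketch only shows the pattern of starting one child process and streaming its stdout line by line; `stream_lines` is a hypothetical helper name.

```python
import subprocess
import sys


def stream_lines(argv):
    """Start one child process and yield its stdout one line at a time."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        for raw in proc.stdout:
            yield raw.rstrip("\n")
    # Leaving the with-block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"child exited with status {proc.returncode}")


# Stand-in for the real decompressor: a child that prints two JSON lines.
child = [
    sys.executable,
    "-c",
    "print('{\"level\": \"INFO\"}'); print('{\"level\": \"ERROR\"}')",
]
lines = list(stream_lines(child))
print(lines)
```

Starting the child once and reading its output as a stream keeps process start-up cost constant regardless of how many blocks are selected.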
61-
62-
## Installation
63-
64-
### Step 1 — Install the PFC binary (once per machine)
65-
66-
The extension calls the `pfc_jsonl` binary for decompression.
67-
Download the latest release for your platform:
68-
69-
**Linux x64:**
70-
```bash
71-
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
72-
-o /usr/local/bin/pfc_jsonl
73-
chmod +x /usr/local/bin/pfc_jsonl
74-
pfc_jsonl --help # verify install
75-
```
76-
77-
**macOS (Apple Silicon M1/M2/M3/M4):**
78-
```bash
79-
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
80-
-o /usr/local/bin/pfc_jsonl
81-
chmod +x /usr/local/bin/pfc_jsonl
82-
pfc_jsonl --help # verify install
83-
```
84-
85-
> **macOS Intel (x64):** Binary coming soon.
86-
87-
> **Custom path:** Set `PFC_JSONL_BINARY=/path/to/pfc_jsonl` in your environment to override the default `/usr/local/bin/pfc_jsonl`.
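
As a pre-flight check before loading the extension, the path lookup can be mimicked in a few lines of Python. This is a sketch under assumptions: the README only states that `PFC_JSONL_BINARY` overrides the default path, so treating an empty value as unset is a guess, and `resolve_binary` and `is_usable` are hypothetical helper names, not part of any PFC tool.

```python
import os

DEFAULT_BINARY = "/usr/local/bin/pfc_jsonl"  # default path stated in this README


def resolve_binary(env=None):
    """Return the path from PFC_JSONL_BINARY, or the default if unset or empty."""
    env = os.environ if env is None else env
    # Assumption: an empty variable is treated the same as an unset one.
    return env.get("PFC_JSONL_BINARY") or DEFAULT_BINARY


def is_usable(path):
    """True if the path is an existing file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


path = resolve_binary()
print(path, "ok" if is_usable(path) else "missing or not executable")
```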
88-
89-
### Step 2 — Install the DuckDB extension
90-
91-
```sql
92-
INSTALL pfc FROM community;
93-
LOAD pfc;
94-
```
95-
96-
### Build from source (developers / early access)
97-
98-
```bash
99-
git clone --recurse-submodules https://github.com/ImpossibleForge/pfc-duckdb
100-
cd pfc-duckdb
101-
GEN=ninja make release
102-
# Extension at: build/release/extension/pfc/pfc.duckdb_extension
103-
```
104-
105-
## Usage
106-
107-
### Basic query
108-
109-
```sql
110-
LOAD pfc;
111-
112-
SELECT line FROM read_pfc_jsonl('/path/to/file.pfc');
113-
```
114-
115-
Each row contains one raw JSON string in the `line` column.
116-
Use the DuckDB `json` extension to parse fields:
117-
118-
```sql
119-
LOAD json;
120-
121-
SELECT
122-
line->>'$.timestamp' AS ts,
123-
line->>'$.level' AS level,
124-
line->>'$.message' AS message,
125-
line->>'$.service' AS service
126-
FROM read_pfc_jsonl('/path/to/file.pfc');
127-
```
128-
129-
### Timestamp-based block filtering
130-
131-
Each `.pfc` file is accompanied by a `.pfc.bidx` index that records the timestamp range of each block.
132-
Pass `ts_from` and/or `ts_to` (Unix seconds) to skip entire blocks before decompression:
133-
134-
```sql
135-
-- Only decompress blocks that overlap the given time window
136-
SELECT line
137-
FROM read_pfc_jsonl(
138-
'/path/to/file.pfc',
139-
ts_from = 1767225600, -- 2026-01-01 00:00:00 UTC
140-
ts_to = 1767311999 -- 2026-01-01 23:59:59 UTC
141-
);
142-
```
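
When the bounds are computed outside SQL, it is easy to be off by a year or a time zone, so a quick check helps. The following Python sketch derives the first and last second of a UTC calendar day; `day_bounds_utc` is a hypothetical helper, not part of any PFC tool.

```python
from datetime import datetime, timedelta, timezone


def day_bounds_utc(year, month, day):
    """Return (first_second, last_second) of a UTC calendar day as Unix seconds."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1) - timedelta(seconds=1)
    return int(start.timestamp()), int(end.timestamp())


ts_from, ts_to = day_bounds_utc(2026, 1, 1)
print(ts_from, ts_to)  # 1767225600 1767311999
```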
143-
144-
Convert a timestamp string to Unix seconds with `epoch()`:
145-
146-
```sql
147-
SELECT line
148-
FROM read_pfc_jsonl(
149-
'/path/to/file.pfc',
150-
ts_from = epoch(TIMESTAMPTZ '2026-01-01 00:00:00+00'),
151-
ts_to = epoch(TIMESTAMPTZ '2026-01-02 00:00:00+00')
152-
);
153-
```
154-
155-
### Combining block filter and row filter
156-
157-
`ts_from`/`ts_to` skip entire **blocks** (coarse, fast).
158-
Add a `WHERE` clause for **row-level** precision, since a selected block can still contain rows from outside the requested window:
159-
160-
```sql
161-
LOAD json;
162-
163-
SELECT line->>'$.message' AS msg
164-
FROM read_pfc_jsonl(
165-
'/var/log/api.pfc',
166-
ts_from = epoch(TIMESTAMPTZ '2026-03-15 08:00:00+00'),
167-
ts_to = epoch(TIMESTAMPTZ '2026-03-15 10:00:00+00')
168-
)
169-
WHERE line->>'$.level' = 'ERROR';
170-
```
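
Why a row filter is still needed can be seen in a small Python sketch. It assumes each record carries a numeric `ts` field in Unix seconds (the field name and the sample rows are made up for illustration, with shortened timestamps). A block that merely overlaps the window is decompressed in full, so rows just before or after the window come along with it.

```python
import json


def rows_in_window(lines, ts_from, ts_to, field="ts"):
    """Keep only JSON records whose timestamp field lies in [ts_from, ts_to]."""
    kept = []
    for line in lines:
        record = json.loads(line)
        if ts_from <= record[field] <= ts_to:
            kept.append(record)
    return kept


# Hypothetical rows from blocks that overlap the window [3600, 7200].
block_rows = [
    '{"ts": 3500, "level": "ERROR", "message": "before window"}',
    '{"ts": 3700, "level": "ERROR", "message": "inside window"}',
    '{"ts": 4000, "level": "INFO", "message": "inside, not an error"}',
    '{"ts": 7300, "level": "ERROR", "message": "after window"}',
]

in_window = rows_in_window(block_rows, 3600, 7200)
errors = [r["message"] for r in in_window if r["level"] == "ERROR"]
print(errors)  # ['inside window']
```

In SQL, the same idea would be an extra predicate on the record's own timestamp field alongside the `level` test.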
171-
172-
### Analytics examples
173-
174-
```sql
175-
LOAD json;
176-
177-
-- Error rate per hour
178-
SELECT
179-
strftime(to_timestamp((line->>'$.ts')::BIGINT), '%Y-%m-%d %H:00') AS hour,
180-
count(*) FILTER (WHERE line->>'$.level' = 'ERROR') AS errors,
181-
count(*) AS total
182-
FROM read_pfc_jsonl('/var/log/api.pfc')
183-
GROUP BY hour ORDER BY hour;
184-
185-
-- Top 10 slowest endpoints
186-
SELECT
187-
line->>'$.path' AS endpoint,
188-
avg((line->>'$.duration_ms')::DOUBLE) AS avg_ms,
189-
count(*) AS requests
190-
FROM read_pfc_jsonl('/var/log/api.pfc')
191-
GROUP BY endpoint ORDER BY avg_ms DESC LIMIT 10;
192-
```
193-
194-
## API Reference
195-
196-
### `read_pfc_jsonl(path [, ts_from, ts_to])`
197-
198-
| Parameter | Type | Default | Description |
199-
|-----------|---------|---------|-------------|
200-
| `path` | VARCHAR | (required) | Path to the `.pfc` file. A `.pfc.bidx` index must exist at `path + ".bidx"`. |
201-
| `ts_from` | BIGINT | 0 | Lower bound for block selection (Unix seconds). `0` = no lower bound. |
202-
| `ts_to` | BIGINT | 0 | Upper bound for block selection (Unix seconds). `0` = no upper bound. |
203-
204-
**Returns:** table with one column `line VARCHAR` — one row per decompressed JSON line.
205-
206-
**Block filtering semantics:**
207-
A block is included if its timestamp range `[ts_start, ts_end]` overlaps `[ts_from, ts_to]`.
208-
Blocks with unknown timestamps are always included.
209-
If both `ts_from` and `ts_to` are `0`, all blocks are read.
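
The selection rules above can be sketched in Python. This is an illustration of the stated semantics, not the extension's C++ code: the `Block` tuple, the use of `None` for an unknown timestamp, and the choice to reject an inverted range only when both bounds are non-zero are all assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    index: int
    ts_start: Optional[int]  # None models an unknown timestamp (assumption)
    ts_end: Optional[int]


def select_blocks(blocks, ts_from=0, ts_to=0):
    """Return indices of blocks whose [ts_start, ts_end] overlaps [ts_from, ts_to]."""
    if ts_from and ts_to and ts_from > ts_to:
        raise ValueError(f"ts_from ({ts_from}) must be <= ts_to ({ts_to})")
    selected = []
    for block in blocks:
        if block.ts_start is None or block.ts_end is None:
            selected.append(block.index)  # unknown range: always included
            continue
        after_lower = ts_from == 0 or block.ts_end >= ts_from  # 0 = no lower bound
        before_upper = ts_to == 0 or block.ts_start <= ts_to   # 0 = no upper bound
        if after_lower and before_upper:
            selected.append(block.index)
    return selected


blocks = [
    Block(0, 100, 199),
    Block(1, 200, 299),
    Block(2, None, None),
    Block(3, 300, 399),
]
print(select_blocks(blocks, 250, 320))   # [1, 2, 3]
print(select_blocks(blocks, ts_to=150))  # [0, 2]
print(select_blocks(blocks))             # [0, 1, 2, 3]
```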
210-
211-
## File Requirements
212-
213-
| File | Required | Description |
214-
|------|----------|-------------|
215-
| `file.pfc` | yes | Compressed PFC-JSONL file |
216-
| `file.pfc.bidx` | yes | Binary block index (requires PFC-JSONL v3.4+) |
217-
218-
Generate both with the PFC binary:
219-
220-
```bash
221-
pfc_jsonl compress input.jsonl output.pfc
222-
# Produces: output.pfc + output.pfc.bidx
223-
```
224-
225-
> **Note:** The Docker image on Docker Hub (`impossibleforge/pfc-jsonl`) is a server-side compression tool. It is **not** required for using the DuckDB extension — you only need the standalone `pfc_jsonl` binary from GitHub Releases.
226-
227-
## Performance
228-
229-
Block-level filtering can skip the majority of a file.
230-
Example: 30-day log file, 720 hourly blocks — a 1-hour query reads **1 block** instead of 720.
231-
232-
| Query range | Blocks read | Speedup (720-block file) |
233-
|-------------|-------------|--------------------------|
234-
| 30 days | 720/720 | 1× (baseline) |
235-
| 1 day | ~24/720 | ~30× |
236-
| 1 hour | ~1/720 | ~720× |
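
The figures in the table follow from simple block counting, sketched below in Python. The sketch assumes equal-length hourly blocks and a cost proportional to the number of blocks decompressed; it ignores fixed costs such as process start-up, so real speedups may be lower. `blocks_touched` is a hypothetical helper name.

```python
import math


def blocks_touched(window_seconds, block_seconds, aligned=True):
    """Upper bound on how many equal-length blocks a contiguous window overlaps."""
    full = math.ceil(window_seconds / block_seconds)
    # A window that straddles block boundaries can touch one extra block.
    return full if aligned else full + 1


HOUR = 3600
total_blocks = 30 * 24  # 30 days of hourly blocks = 720

for label, window in [("30 days", 30 * 24 * HOUR), ("1 day", 24 * HOUR), ("1 hour", HOUR)]:
    n = min(blocks_touched(window, HOUR), total_blocks)
    print(f"{label}: {n}/{total_blocks} blocks, ratio {total_blocks // n}x")
```

A 1-hour window that does not start on a block boundary (for example 08:30 to 09:30) overlaps two hourly blocks, which is why the table's block counts are approximate.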
237-
238-
239-
---
240-
241-
## Disclaimer
242-
243-
PFC-DuckDB is an independent open-source project and is not affiliated with, endorsed by, or associated with the DuckDB Foundation or DuckDB Labs.
244-
## License
245-
246-
The PFC-JSONL binary is **free for personal and open-source use** — no account, no signup, no phone-home.
247-
248-
Commercial use requires a license. Contact: [info@impossibleforge.com](mailto:info@impossibleforge.com)
249-
250-
## Troubleshooting
251-
252-
**`Cannot open index file: /path/to/file.pfc.bidx`**
253-
The `.pfc.bidx` index is missing. Compress with PFC-JSONL v3.4+:
254-
```bash
255-
pfc_jsonl compress input.jsonl output.pfc
256-
```
257-
258-
259-
**`PFC binary not found at '/usr/local/bin/pfc_jsonl'`**
260-
Binary is missing or not executable. Re-run the curl install command, or set `PFC_JSONL_BINARY=/path/to/pfc_jsonl`.
261-
262-
**`popen() failed — could not start PFC binary subprocess`**
263-
The extension uses `popen()` to call the PFC binary. Windows is not supported — use WSL2 or a Linux machine.
264-
265-
**`ts_from (...) must be <= ts_to (...)`**
266-
You passed an inverted time range. Swap the values so `ts_from` comes before `ts_to`.
267-
268-
## Related Projects
269-
270-
| Project | Description |
271-
|---------|-------------|
272-
| [pfc-jsonl](https://github.com/ImpossibleForge/pfc-jsonl) | The core binary — compress, decompress, query |
273-
| [pfc-fluentbit](https://github.com/ImpossibleForge/pfc-fluentbit) | Stream Fluent Bit logs directly to `.pfc` archives |
274-
| [pfc-migrate](https://github.com/ImpossibleForge/pfc-migrate) | Convert existing gzip/zstd/lz4 archives to PFC — local, S3, Azure, GCS |
275-
| [pfc-jsonl (PyPI)](https://pypi.org/project/pfc-jsonl/) | Python package — `pip install pfc-jsonl` |
276-
| [pfc-vector](https://github.com/ImpossibleForge/pfc-vector) | High-performance Rust ingest daemon for Vector.dev and Telegraf |
277-
| [pfc-otel-collector](https://github.com/ImpossibleForge/pfc-otel-collector) | OpenTelemetry OTLP/HTTP log exporter |
278-
| [pfc-kafka-consumer](https://github.com/ImpossibleForge/pfc-kafka-consumer) | Kafka / Redpanda consumer |
279-
| [pfc-telegraf](https://github.com/ImpossibleForge/pfc-telegraf) | Telegraf HTTP output plugin → PFC |
280-
| [pfc-grafana](https://github.com/ImpossibleForge/pfc-grafana) | Grafana data source plugin for PFC archives |
281-
282-
---
283-
284-
## License
285-
286-
The **pfc DuckDB extension** (this repository) is released under the **MIT License** — see [LICENSE](https://github.com/ImpossibleForge/pfc-duckdb/blob/main/LICENSE).
287-
288-
The **PFC-JSONL binary** (`pfc_jsonl`) is proprietary software — free for personal and open-source use. Commercial use requires a license: [info@impossibleforge.com](mailto:info@impossibleforge.com)
