Commit 6c2c2e7 (parent 22386fb)

Enhance dataset generation with new export formats and update documentation. Added support for 'chat_template' export format in dataset generation, updated README to reflect new output options, and improved dataset writer to handle dynamic schemas for Parquet format.

7 files changed

Lines changed: 329 additions & 31 deletions

packages/torque/AGENTS.md

Lines changed: 7 additions & 1 deletion
```diff
@@ -1,24 +1,28 @@
 # @qforge/torque Agent Guide
 
 ## Product Surface
+
 - Declarative DSL for composing LLM datasets (messages, tools, metadata) with deterministic RNG helpers.
 - Shipping artifacts: `dist/` bundle, README examples, StackBlitz templates under `stackblitz-templates/`.
 - Consumers rely on stable builder APIs (`generatedUser`, `oneOf`, `times`, `metadata`, schema helpers) and Bun-friendly ESM output.
 
 ## Code Map
+
 - `src/generators.ts`, `schema.ts`, `schema-rng.ts`: core composition primitives and RNG utilities.
 - `src/faker.ts`, `src/seed.ts`, `src/utils.ts`: deterministic seeding & Faker wiring.
-- `src/writer.ts`, `src/dataset.ts`, `src/cli-renderer.ts`: dataset materialization, Parquet/JSONL writers, CLI UX.
+- `src/writer.ts`, `src/dataset.ts`, `src/cli-renderer.ts`, `src/formatter.ts`: dataset materialization, formatters, Parquet/JSONL writers, CLI UX.
 - Tests co-located in `src/*.test.ts` using `bun:test`; keep new tests near the code they cover.
 
 ## Implementation Guardrails
+
 1. **Determinism first** – Always thread `seed` + `withSeed` helpers through new flows; never call `Math.random` or instantiate Faker ad-hoc.
 2. **Immutable schemas** – Treat schema objects as frozen after the `check` phase; copy before mutating and respect `phase` on `IMessageSchemaContext`.
 3. **Types are the contract** – Update `src/types.ts` alongside behavior changes and re-run `tsc -p tsconfig.build.json`.
 4. **Error messaging** – Use descriptive errors (see `schema.ts` for tone) and prefer `ZodError`-style aggregates when validating user structures.
 5. **CLI/story templates** – If a change affects example output, refresh snippets in the README and regenerate StackBlitz templates (`bun run generate:templates`).
 
 ## Testing & Verification
+
 - Unit tests: `bun test packages/torque/src` (Bun discovers `*.test.ts`).
 - Type check + build: `bun run --filter @qforge/torque build`.
 - For RNG-sensitive code, add golden tests that fix a seed and assert exact arrays/messages.
@@ -40,11 +44,13 @@ const mockModel = new MockLanguageModelV2({
 - Document manual verification steps (e.g., running `examples/*.ts`) when automated tests are insufficient.
 
 ## When to Loop In a Human
+
 - Introducing new public builder APIs or altering existing function signatures.
 - Changes that risk breaking template compatibility, dataset schemas, or CLI output formats.
 - Work that requires new dependencies, native bindings, or non-Bun tooling.
 
 ## Definition of Done
+
 - Code is deterministic, typed, and tested.
 - README + templates reflect surface changes.
 - `dist/` is regenerated only during release—do not commit build artifacts.
```
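The "determinism first" guardrail above can be illustrated with a small sketch. This uses a generic mulberry32 PRNG as a stand-in; torque's actual `seed`/`withSeed` helpers are not reproduced here, so `withSeedSketch` is a hypothetical name:

```typescript
// Sketch of the "determinism first" guardrail: a seeded PRNG threaded
// explicitly through the flow instead of calling Math.random.
// mulberry32 is a generic stand-in, not torque's actual implementation.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    // Normalize to a float in [0, 1)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical analogue of a withSeed-style helper: the RNG is created
// from the seed and passed down, never pulled from global state.
function withSeedSketch<T>(seed: number, fn: (rng: () => number) => T): T {
  return fn(mulberry32(seed));
}

// Same seed, same sequence — the property golden tests rely on.
const a = withSeedSketch(42, (rng) => [rng(), rng(), rng()]);
const b = withSeedSketch(42, (rng) => [rng(), rng(), rng()]);
```

Golden tests can then fix a seed and assert the exact output arrays, as the Testing & Verification section recommends.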

packages/torque/README.md

Lines changed: 13 additions & 7 deletions
````diff
@@ -354,31 +354,37 @@ When Bun workers are unavailable, Torque automatically falls back to in-process
 
 ### Output Formats
 
-Choose your preferred output format for generated datasets:
+Choose your preferred output file format and data structure:
 
 ```typescript
-// Export as JSONL (default - line-delimited JSON)
+// Export as JSONL with default ai-sdk structure (default)
 await generateDataset(schema, {
   count: 100,
   model: openai("gpt-4o-mini"),
-  format: "jsonl", // default, can be omitted
+  format: "jsonl",
   output: "data/dataset.jsonl",
 });
 
-// Export as Parquet (columnar format, efficient for analytics)
+// Export in OpenAI Chat Completions format (tools + messages structure)
 await generateDataset(schema, {
   count: 100,
   model: openai("gpt-4o-mini"),
-  format: "parquet",
-  output: "data/dataset.parquet",
+  format: "jsonl",
+  exportFormat: "chat_template",
+  output: "data/finetune.jsonl",
 });
 ```
 
-**Supported formats:**
+**Supported File Formats (`format`):**
 
 - **`jsonl`** (default) - JSON Lines format, one row per line. Best for streaming and line-by-line processing.
 - **`parquet`** - Apache Parquet columnar format. More efficient for large datasets and analytics tools (e.g., Pandas, DuckDB, Apache Spark).
 
+**Supported Data Structures (`exportFormat`):**
+
+- **`ai-sdk`** (default) - Internal Torque format, compatible with Vercel AI SDK. Includes schema metadata, tool definitions, and full message objects.
+- **`chat_template`** - OpenAI Chat Completions compatible format. Flattened message structure with `tools` and `messages` top-level keys. Ideal for fine-tuning or direct API usage.
+
 Both formats write rows incrementally as they're generated, so large datasets won't consume excessive memory.
 
 > 💡 When `format` is specified without `output`, the file extension is automatically set based on the format.
````
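For reference, a `chat_template` JSONL row carries only the flattened `tools` and `messages` keys. The sketch below builds one such row by hand; the field values (`calculator`, `call_1`, etc.) are hypothetical, but the shape follows the formatter introduced in this commit:

```typescript
// Hand-built sketch of one chat_template row. Values are illustrative;
// the shape mirrors the "tools" + "messages" structure described above.
const row = {
  tools: [
    {
      type: "function",
      function: {
        name: "calculator",
        description: "Performs math",
        parameters: {
          type: "object",
          properties: { a: { type: "number" } },
          required: ["a"],
        },
      },
    },
  ],
  messages: [
    { role: "user", content: "What is 1 + 1?" },
    {
      role: "assistant",
      content: null,
      tool_calls: [
        {
          id: "call_1",
          type: "function",
          function: { name: "calculator", arguments: { a: 1 } },
        },
      ],
    },
    { role: "tool", tool_call_id: "call_1", name: "calculator", content: "2" },
  ],
};

// One JSONL line is simply the JSON-serialized row.
const jsonlLine = JSON.stringify(row);
```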

packages/torque/src/dataset.ts

Lines changed: 6 additions & 2 deletions
```diff
@@ -29,6 +29,7 @@ import type {
   IGenerateDatasetArgsMultiSchema,
 } from "./types";
 import { createWriter } from "./writer";
+import { createFormatter } from "./formatter";
 import { TokenCounterPool } from "./token-counting/tokenCounterPool";
 import { hoistSystemMessages } from "./ai-message-order";
 
@@ -67,6 +68,7 @@ export async function generateDataset(
     seed,
     output,
     format = "jsonl",
+    exportFormat = "ai-sdk",
     model,
     concurrency = 5,
     generationContext,
@@ -99,7 +101,8 @@ export async function generateDataset(
   await fsp.mkdir(outputDir, { recursive: true });
 
   // Initialize the writer for the specified format
-  const writer = createWriter(format, outputPath);
+  const formatter = createFormatter(exportFormat);
+  const writer = createWriter(format, outputPath, formatter.parquetSchema);
   await writer.init();
 
   // Initialize the CLI renderer
@@ -174,7 +177,8 @@ export async function generateDataset(
 
       // Write row immediately after generation
       // Thread-safety is handled internally by the writer
-      await writer.appendRow(row);
+      const formattedRow = formatter.format(row);
+      await writer.appendRow(formattedRow);
 
       // Mark generation as completed
       renderer.completeGeneration(task.index);
```
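The change above threads every generated row through the formatter before the writer sees it. A minimal standalone sketch of that format-then-append pipeline (the `Formatter`/`ArrayWriter` types here are stubs for illustration, not the package's real classes):

```typescript
// Stand-in for torque's IDatasetFormatter: transforms a row before writing.
interface Formatter {
  format(row: Record<string, any>): Record<string, any>;
}

// Toy formatter so the pipeline has an observable effect.
class UppercaseRoleFormatter implements Formatter {
  format(row: Record<string, any>): Record<string, any> {
    return { ...row, role: String(row.role).toUpperCase() };
  }
}

// Stand-in writer that collects rows in memory instead of JSONL/Parquet.
class ArrayWriter {
  rows: Record<string, any>[] = [];
  async appendRow(row: Record<string, any>): Promise<void> {
    this.rows.push(row);
  }
}

// Mirror of the dataset.ts change: format each row, then append immediately,
// so rows stream out incrementally rather than accumulating in memory.
async function writeAll(
  rows: Record<string, any>[],
  formatter: Formatter,
  writer: ArrayWriter
): Promise<void> {
  for (const row of rows) {
    const formatted = formatter.format(row);
    await writer.appendRow(formatted);
  }
}

const writer = new ArrayWriter();
await writeAll(
  [{ role: "user" }, { role: "tool" }],
  new UppercaseRoleFormatter(),
  writer
);
```

Keeping the formatter a separate step means the writer never needs to know which export structure was requested.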
packages/torque/src/formatter.test.ts

Lines changed: 149 additions & 0 deletions (new file)

```typescript
import { describe, it, expect } from "bun:test";
import { ChatTemplateFormatter } from "./formatter";
import type { IDatasetRow } from "./types";

describe("ChatTemplateFormatter", () => {
  const formatter = new ChatTemplateFormatter();

  it("should transform tools to OpenAI format", () => {
    const row: IDatasetRow = {
      messages: [],
      tools: [
        {
          name: "calculator",
          description: "Performs math",
          parameters: {
            type: "object",
            properties: { a: { type: "number" } },
            required: ["a"],
          },
          output: {},
        },
      ],
      schema: {} as any,
      meta: {} as any,
    };

    const result = formatter.format(row);
    expect(result.tools).toHaveLength(1);
    expect(result.tools[0]).toEqual({
      type: "function",
      function: {
        name: "calculator",
        description: "Performs math",
        parameters: {
          type: "object",
          properties: { a: { type: "number" } },
          required: ["a"],
        },
      },
    });
  });

  it("should transform user messages", () => {
    const row: IDatasetRow = {
      messages: [
        {
          role: "user",
          content: "Hello",
          generationId: "1",
        },
      ],
      tools: [],
      schema: {} as any,
      meta: {} as any,
    };

    const result = formatter.format(row);
    expect(result.messages).toHaveLength(1);
    expect(result.messages[0]).toEqual({
      role: "user",
      content: "Hello",
    });
  });

  it("should transform assistant messages with tool calls", () => {
    const row: IDatasetRow = {
      messages: [
        {
          role: "assistant",
          content: [
            { type: "text", text: "Thinking..." },
            {
              type: "tool-call",
              toolCallId: "call_1",
              toolName: "calc",
              input: { a: 1 },
            },
          ],
          generationId: "1",
        } as any, // Casting because IDatasetMessage content type is strict in tests
      ],
      tools: [],
      schema: {} as any,
      meta: {} as any,
    };

    const result = formatter.format(row);
    expect(result.messages).toHaveLength(1);
    expect(result.messages[0]).toEqual({
      role: "assistant",
      content: "Thinking...",
      tool_calls: [
        {
          id: "call_1",
          type: "function",
          function: {
            name: "calc",
            arguments: { a: 1 },
          },
        },
      ],
    });
  });

  it("should flatten tool result messages", () => {
    const row: IDatasetRow = {
      messages: [
        {
          role: "tool",
          content: [
            {
              type: "tool-result",
              toolCallId: "call_1",
              toolName: "calc",
              result: 2,
              output: 2, // dataset.ts populates output
            },
            {
              type: "tool-result",
              toolCallId: "call_2",
              toolName: "calc",
              result: 4,
              output: 4,
            },
          ],
          generationId: "1",
        } as any,
      ],
      tools: [],
      schema: {} as any,
      meta: {} as any,
    };

    const result = formatter.format(row);
    expect(result.messages).toHaveLength(2);
    expect(result.messages[0]).toEqual({
      role: "tool",
      tool_call_id: "call_1",
      name: "calc",
      content: "2",
    });
    expect(result.messages[1]).toEqual({
      role: "tool",
      tool_call_id: "call_2",
      name: "calc",
      content: "4",
    });
  });
});
```

packages/torque/src/formatter.ts

Lines changed: 124 additions & 0 deletions
```typescript
import type { IDatasetRow, DatasetExportFormat } from "./types";

export interface IDatasetFormatter {
  format(row: IDatasetRow): Record<string, any>;
  parquetSchema: Record<string, any>;
}

export class AiSdkFormatter implements IDatasetFormatter {
  parquetSchema = {
    messages: { type: "UTF8" }, // JSON string
    tools: { type: "UTF8" }, // JSON string
    schema: { type: "UTF8" }, // JSON string
    meta: { type: "UTF8" }, // JSON string
  };

  format(row: IDatasetRow): Record<string, any> {
    return row as unknown as Record<string, any>;
  }
}

export class ChatTemplateFormatter implements IDatasetFormatter {
  parquetSchema = {
    tools: { type: "UTF8" }, // JSON string
    messages: { type: "UTF8" }, // JSON string
  };

  format(row: IDatasetRow): Record<string, any> {
    const tools = row.tools.map((tool) => ({
      type: "function",
      function: {
        name: tool.name,
        description: tool.description,
        parameters: tool.parameters,
      },
    }));

    const messages = row.messages.flatMap((msg) => {
      if (msg.role === "tool") {
        // Flatten tool results
        if (Array.isArray(msg.content)) {
          return msg.content
            .map((part: any) => {
              if (part.type === "tool-result") {
                return {
                  role: "tool",
                  tool_call_id: part.toolCallId,
                  name: part.toolName,
                  content: JSON.stringify(part.output),
                };
              }
              return null;
            })
            .filter(Boolean);
        }
        return [];
      }

      if (msg.role === "assistant") {
        const toolCalls: any[] = [];
        let contentString = "";

        if (Array.isArray(msg.content)) {
          for (const part of msg.content) {
            if (part.type === "tool-call") {
              toolCalls.push({
                id: part.toolCallId,
                type: "function",
                function: {
                  name: part.toolName,
                  arguments: part.input,
                },
              });
            } else if (part.type === "text") {
              contentString += part.text;
            }
            // Skip reasoning for chat_template
          }
        } else if (typeof msg.content === "string") {
          contentString = msg.content;
        }

        const newMsg: any = {
          role: "assistant",
          content: contentString || null,
        };
        if (toolCalls.length > 0) {
          newMsg.tool_calls = toolCalls;
        }
        return [newMsg];
      }

      // User / System
      let content = msg.content;
      if (Array.isArray(content)) {
        // Ensure content parts are compatible.
        // OpenAI accepts an array of text/image parts; for now, pass through.
      }

      return [
        {
          role: msg.role,
          content,
        },
      ];
    });

    return { tools, messages };
  }
}

export function createFormatter(
  format: DatasetExportFormat
): IDatasetFormatter {
  switch (format) {
    case "ai-sdk":
      return new AiSdkFormatter();
    case "chat_template":
      return new ChatTemplateFormatter();
    default:
      throw new Error(`Unsupported export format: ${format}`);
  }
}
```
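The tool mapping in `ChatTemplateFormatter` can be exercised in isolation. The sketch below re-declares a simplified version of that mapping (`toOpenAiTools` is an illustrative name, not a package export) to show how a torque tool definition becomes an OpenAI-style function tool:

```typescript
// Simplified stand-in for the torque tool shape (name, description, parameters).
type TorqueTool = {
  name: string;
  description: string;
  parameters: Record<string, any>;
};

// Re-creation of the chat_template tool mapping, for illustration only:
// each tool is wrapped as { type: "function", function: {...} }.
function toOpenAiTools(tools: TorqueTool[]) {
  return tools.map((tool) => ({
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.parameters,
    },
  }));
}

const converted = toOpenAiTools([
  { name: "calc", description: "Adds numbers", parameters: { type: "object" } },
]);
```

Keeping the wrapper logic in one place (the formatter) means downstream writers stay format-agnostic, which is why `createFormatter` is the only switch point for `exportFormat`.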
