Commit 182cfdf

Copilot and pelikhan authored
Add line option to def() for dynamic range extraction with token budget support (#1823)
* Initial plan
* Initial exploration and setup - understanding the codebase structure
* Implement line option for def() - dynamic range around center line
* Address review feedback: Add maxTokens support, use RangeOptions type, clarify slice usage
* Refactor token budget logic to eliminate code duplication
* Remove encoder from RangeOptions interface - encoder now passed separately by promptdom
* Update documentation for new line option with token budget support in def() function

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: Peli de Halleux <pelikhan@users.noreply.github.com>
1 parent af6b5f7 commit 182cfdf

9 files changed

Lines changed: 671 additions & 10 deletions


docs/public/genaiscript.d.ts

Lines changed: 11 additions & 0 deletions
Some generated files are not rendered by default.

docs/src/content/docs/fr/reference/scripts/context.md

Lines changed: 47 additions & 0 deletions
@@ -178,6 +178,51 @@ en définissant `ignoreEmpty` à `true`.
 def("FILE", env.files, { endsWith: ".md", ignoreEmpty: true })
 ```
 
+### Extraction basée sur les lignes
+
+Vous pouvez extraire du contenu autour d'un numéro de ligne spécifique en utilisant l'option `line`. Ceci est particulièrement utile lorsque vous voulez vous concentrer sur une zone d'intérêt spécifique dans de gros fichiers.
+
+```js "line: 25"
+// Se concentrer sur la ligne 25 avec un contexte dynamique
+def("FUNCTION_CODE", fileContent, { line: 25 })
+```
+
+L'option `line` calcule dynamiquement le contexte environnant basé sur la taille du fichier :
+- Très petits fichiers (≤20 lignes) : Inclure la plupart du contenu
+- Petits fichiers (≤100 lignes) : 15 lignes de chaque côté
+- Fichiers moyens (≤500 lignes) : 25 lignes de chaque côté
+- Gros fichiers (≤2000 lignes) : 50 lignes de chaque côté
+- Très gros fichiers (>2000 lignes) : 75 lignes de chaque côté
+
+#### Support du budget de tokens
+
+Lorsque combinée avec `maxTokens`, l'option `line` effectue un calcul intelligent de la plage basé sur les tokens :
+
+```js "line: 25, maxTokens: 500"
+// Se concentrer sur la ligne 25 avec contrainte de budget de tokens
+def("FUNCTION_CODE", fileContent, { line: 25, maxTokens: 500 })
+```
+
+L'implémentation :
+- **Expansion intelligente** : Commence avec la ligne centrale et s'étend alternativement vers le haut/bas jusqu'à atteindre le budget de tokens
+- **Comptage précis** : Utilise une estimation précise des tokens pour un meilleur contrôle
+- **Fallback gracieux** : Revient au calcul basé sur la taille du fichier quand aucun `maxTokens` n'est spécifié
+- **Dépassement de budget** : Retourne juste la ligne centrale si elle dépasse déjà le budget de tokens
+
+#### Règles de priorité
+
+Les plages de lignes explicites ont la priorité sur l'option `line` :
+
+```js
+// lineStart/lineEnd remplacent l'option line et maxTokens
+def("EXPLICIT_WINS", codeFile, {
+    lineStart: 10,
+    lineEnd: 20,
+    line: 50,
+    maxTokens: 100
+}) // Utilise les lignes 10-20
+```
+
 ### `maxTokens`
 
 Il est possible de limiter le nombre de tokens générés par la fonction `def`. Cela peut être utile lorsque la sortie est trop volumineuse et que le modèle a une limite de tokens.
@@ -187,6 +232,8 @@ L'option `maxTokens` peut être définie à un nombre afin de limiter le nombre
 def("FILE", env.files, { maxTokens: 100 })
 ```
 
+Lorsque utilisée avec l'option `line`, `maxTokens` contrôle la taille totale de la plage extraite autour de la ligne centrale plutôt que de tronquer les fichiers individuels.
+
 ### Filtres de données
 
 La fonction `def` traite de façon spéciale les fichiers de données comme [CSV](../../../reference/reference/scripts/csv/) et [XLSX](../../../reference/reference/scripts/xlsx/). Elle convertit automatiquement les données au format tableau markdown pour améliorer la tokenisation.

docs/src/content/docs/reference/scripts/context.md

Lines changed: 47 additions & 0 deletions
@@ -221,6 +221,51 @@ by setting `ignoreEmpty` to `true`.
 def("FILE", env.files, { endsWith: ".md", ignoreEmpty: true })
 ```
 
+### Line-based extraction
+
+You can extract content around a specific line number using the `line` option. This is particularly useful when you want to focus on a specific area of interest in large files.
+
+```js "line: 25"
+// Focus on line 25 with dynamic context
+def("FUNCTION_CODE", fileContent, { line: 25 })
+```
+
+The `line` option dynamically calculates the surrounding context based on file size:
+- Very small files (≤20 lines): Include most content
+- Small files (≤100 lines): 15 lines on each side
+- Medium files (≤500 lines): 25 lines on each side
+- Large files (≤2000 lines): 50 lines on each side
+- Very large files (>2000 lines): 75 lines on each side
+
+#### Token budget support
+
+When combined with `maxTokens`, the `line` option performs intelligent token-aware range calculation:
+
+```js "line: 25, maxTokens: 500"
+// Focus on line 25 with token budget constraint
+def("FUNCTION_CODE", fileContent, { line: 25, maxTokens: 500 })
+```
+
+The implementation:
+- **Smart Expansion**: Starts with the center line and expands alternately up/down until token budget is reached
+- **Accurate Counting**: Uses precise token estimation for better control
+- **Graceful Fallback**: Falls back to file-size-based calculation when no `maxTokens` specified
+- **Budget Overflow**: Returns just the center line if it already exceeds the token budget
+
+#### Priority rules
+
+Explicit line ranges take precedence over the `line` option:
+
+```js
+// lineStart/lineEnd override line option and maxTokens
+def("EXPLICIT_WINS", codeFile, {
+    lineStart: 10,
+    lineEnd: 20,
+    line: 50,
+    maxTokens: 100
+}) // Uses lines 10-20
+```
+
 ### `maxTokens`
 
 It is possible to limit the number of tokens that are generated by the `def` function. This can be useful when the output is too large and the model has a token limit.
@@ -230,6 +275,8 @@ The `maxTokens` option can be set to a number to limit the number of tokens gene
 def("FILE", env.files, { maxTokens: 100 })
 ```
 
+When used with the `line` option, `maxTokens` controls the total size of the extracted range around the center line rather than truncating individual files.
+
 ### Data filters
 
 The `def` function treats data files such as [CSV](/genaiscript/reference/scripts/csv) and [XLSX](/genaiscript/reference/scripts/xlsx) specially. It will automatically convert the data into a
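
As a rough illustration of the file-size tiers documented in this file, the sketch below recomputes the 1-based line window that a given `line` value would select when no `maxTokens` is set. The helper names `contextLinesFor` and `windowAround` are hypothetical; they only mirror the thresholds listed in the docs and are not part of the GenAIScript API.

```typescript
// Hypothetical helpers mirroring the documented tiers; not the GenAIScript API.
function contextLinesFor(totalLines: number): number {
  if (totalLines <= 20) return Math.floor(totalLines / 2); // very small: most of the file
  if (totalLines <= 100) return 15;
  if (totalLines <= 500) return 25;
  if (totalLines <= 2000) return 50;
  return 75;
}

// Returns the inclusive 1-based [start, end] window, clamped to the file bounds.
function windowAround(totalLines: number, line: number): [number, number] {
  const context = contextLinesFor(totalLines);
  return [Math.max(1, line - context), Math.min(totalLines, line + context)];
}

console.log(windowAround(300, 25)); // window is lines 1 to 50
console.log(windowAround(5000, 1000)); // window is lines 925 to 1075
```

Note that the window is clamped rather than shifted: a center line near the start or end of the file yields fewer lines than twice the context size.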

genaiscript.d.ts

Lines changed: 11 additions & 0 deletions
Some generated files are not rendered by default.

packages/core/src/liner.ts

Lines changed: 163 additions & 8 deletions
@@ -7,6 +7,8 @@
 import { llmifyDiff } from "./llmdiff.js";
 import { MIN_LINE_NUMBER_LENGTH } from "./constants.js";
 import { tryDiffParse } from "./diff.js";
+import type { RangeOptions, TokenEncoder } from "./types.js";
+import { approximateTokens } from "./tokens.js";
 
 /**
  * Adds 1-based line numbers to each line of the input text.
@@ -55,21 +57,174 @@ export function removeLineNumbers(text: string) {
  * Extracts a line range from the text using 1-based inclusive line numbers.
  *
  * @param text - The input text from which to extract the range.
- * @param options - An object specifying the line range.
- *   - lineStart: The 1-based starting line number of the range.
- *   - lineEnd: The 1-based ending line number of the range.
+ * @param options - Range options specifying line numbers or center line.
+ * @param encoder - Optional token encoder for accurate token counting.
  * @returns The extracted range of text or the original text if no valid range is provided.
  */
-export function extractRange(text: string, options?: { lineStart?: number; lineEnd?: number }) {
-  const { lineStart, lineEnd } = options || {};
-  if (isNaN(lineStart) && isNaN(lineEnd)) return text;
+export function extractRange(text: string, options?: RangeOptions, encoder?: TokenEncoder) {
+  const { lineStart, lineEnd, line, maxTokens } = options || {};
+
+  // Handle existing lineStart/lineEnd logic first (takes priority)
+  if (!isNaN(lineStart) || !isNaN(lineEnd)) {
+    const lines = text.split("\n");
+    const startLine = lineStart || 1;
+    const endLine = lineEnd || lines.length;
+    return lines.slice(startLine - 1, endLine).join("\n");
+  }
+
+  // Handle center line option if lineStart/lineEnd not provided
+  if (!isNaN(line)) {
+    return extractRangeAroundLine(text, line, maxTokens, encoder);
+  }
+
+  // If no valid range is provided, return original text
+  return text;
+}
 
+/**
+ * Extracts a dynamic range around a center line.
+ * The range size is calculated based on maxTokens budget and file size.
+ *
+ * @param text - The input text from which to extract the range.
+ * @param centerLine - The 1-based center line number.
+ * @param maxTokens - Optional maximum token budget for the extracted range.
+ * @param encoder - Optional token encoder for accurate token counting.
+ * @returns The extracted range of text around the center line.
+ */
+export function extractRangeAroundLine(
+  text: string,
+  centerLine: number,
+  maxTokens?: number,
+  encoder?: TokenEncoder
+): string {
   const lines = text.split("\n");
-  const startLine = lineStart || 1;
-  const endLine = lineEnd || lines.length;
+  const totalLines = lines.length;
+
+  // Validate center line
+  if (centerLine < 1 || centerLine > totalLines) {
+    return text; // Return original text if center line is out of bounds
+  }
+
+  // If maxTokens budget is specified, compute range based on token constraints
+  if (maxTokens && maxTokens > 0) {
+    return extractRangeWithTokenBudget(lines, centerLine, maxTokens, encoder);
+  }
+
+  // Fallback to dynamic range based on file size
+  const contextLines = calculateContextLines(totalLines);
+
+  // Calculate start and end lines around center
+  const startLine = Math.max(1, centerLine - contextLines);
+  const endLine = Math.min(totalLines, centerLine + contextLines);
+
+  // Extract the range (convert to 0-based indexing for slice)
+  // Note: slice(start, end) where end is exclusive position, not length
   return lines.slice(startLine - 1, endLine).join("\n");
 }
 
+/**
+ * Extracts a range around a center line based on a token budget.
+ * Expands symmetrically around the center line until the token budget is reached.
+ *
+ * @param lines - Array of text lines.
+ * @param centerLine - The 1-based center line number.
+ * @param maxTokens - Maximum token budget for the extracted range.
+ * @param encoder - Optional token encoder for accurate counting.
+ * @returns The extracted range of text that fits within the token budget.
+ */
+function extractRangeWithTokenBudget(
+  lines: string[],
+  centerLine: number,
+  maxTokens: number,
+  encoder?: TokenEncoder
+): string {
+  const totalLines = lines.length;
+  const centerIndex = centerLine - 1; // Convert to 0-based index
+
+  // Start with just the center line
+  let startIndex = centerIndex;
+  let endIndex = centerIndex;
+  let currentContent = lines[centerIndex];
+  let currentTokens = approximateTokens(currentContent, { encoder });
+
+  // If center line already exceeds budget, return just that line
+  if (currentTokens >= maxTokens) {
+    return currentContent;
+  }
+
+  // Expand around the center line alternately (up and down)
+  let expandUp = true;
+
+  while (currentTokens < maxTokens) {
+    let nextStartIndex = startIndex;
+    let nextEndIndex = endIndex;
+
+    if (expandUp && startIndex > 0) {
+      // Try expanding upward
+      nextStartIndex = startIndex - 1;
+    } else if (!expandUp && endIndex < totalLines - 1) {
+      // Try expanding downward
+      nextEndIndex = endIndex + 1;
+    } else if (startIndex > 0) {
+      // If can't expand in preferred direction, try the other
+      nextStartIndex = startIndex - 1;
+    } else if (endIndex < totalLines - 1) {
+      nextEndIndex = endIndex + 1;
+    } else {
+      // Can't expand further in either direction
+      break;
+    }
+
+    // Compute content for the new range
+    const nextContent = lines.slice(nextStartIndex, nextEndIndex + 1).join("\n");
+
+    const nextTokens = approximateTokens(nextContent, { encoder });
+
+    // If adding this line would exceed the budget, stop expanding
+    if (nextTokens > maxTokens) {
+      break;
+    }
+
+    // Accept the expansion
+    currentContent = nextContent;
+    currentTokens = nextTokens;
+    startIndex = nextStartIndex;
+    endIndex = nextEndIndex;
+
+    // Alternate expansion direction for next iteration
+    expandUp = !expandUp;
+  }
+
+  return currentContent;
+}
+
+/**
+ * Calculates the number of context lines to include around a center line
+ * based on the total file size and other factors.
+ *
+ * @param totalLines - Total number of lines in the file.
+ * @returns Number of lines to include on each side of the center line.
+ */
+function calculateContextLines(totalLines: number): number {
+  // Dynamic calculation based on file size
+  if (totalLines <= 20) {
+    // For very small files, include most content
+    return Math.floor(totalLines / 2);
+  } else if (totalLines <= 100) {
+    // For small files, include a reasonable chunk
+    return 15;
+  } else if (totalLines <= 500) {
+    // For medium files, focus on the area around the line
+    return 25;
+  } else if (totalLines <= 2000) {
+    // For large files, be more conservative
+    return 50;
+  } else {
+    // For very large files, be very conservative
+    return 75;
+  }
+}
+
 /**
  * Converts a string position index to a line number.
  * @param text - The text in which to find the line number.
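
To make the alternating expansion in `extractRangeWithTokenBudget` easier to follow, here is a minimal standalone sketch of the same loop. It assumes a stand-in counter, `countTokens`, that treats each whitespace-separated word as one token; the real code calls `approximateTokens` with an optional encoder instead. `expandWithinBudget` is a hypothetical name, and the caller is assumed to have validated `centerLine` already.

```typescript
// Stand-in token counter (assumption): one token per whitespace-separated word.
function countTokens(text: string): number {
  return text.split(/\s+/).filter((word) => word.length > 0).length;
}

// Grows an inclusive 0-based [start, end] window around the center line,
// alternating up and down, and stops at the first expansion over budget.
function expandWithinBudget(lines: string[], centerLine: number, maxTokens: number): string {
  let start = centerLine - 1; // convert the 1-based line to a 0-based index
  let end = start;
  let content = lines[start];
  let tokens = countTokens(content);
  if (tokens >= maxTokens) return content; // center line alone fills the budget

  let expandUp = true;
  while (tokens < maxTokens) {
    let nextStart = start;
    let nextEnd = end;
    const canUp = start > 0;
    const canDown = end < lines.length - 1;
    if (expandUp && canUp) nextStart--;
    else if (!expandUp && canDown) nextEnd++;
    else if (canUp) nextStart--; // preferred side exhausted, try the other
    else if (canDown) nextEnd++;
    else break; // whole file is already included

    const candidate = lines.slice(nextStart, nextEnd + 1).join("\n");
    const candidateTokens = countTokens(candidate);
    if (candidateTokens > maxTokens) break; // this line would overflow the budget

    content = candidate;
    tokens = candidateTokens;
    start = nextStart;
    end = nextEnd;
    expandUp = !expandUp;
  }
  return content;
}

const sampleLines = ["a b", "c d", "e f", "g h", "i j"];
// With a budget of 6 word-tokens around line 3, lines 2 to 4 are kept.
console.log(expandWithinBudget(sampleLines, 3, 6));
```

One consequence of this design is that the loop stops as soon as one candidate line overflows the budget, even if a shorter line on the opposite side would still have fit.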

packages/core/src/promptdom.ts

Lines changed: 2 additions & 2 deletions
@@ -836,7 +836,7 @@ async function resolvePromptNode(
 names.add(n.name);
 const value = await n.value;
 n.resolved = value;
-n.resolved.content = extractRange(n.resolved.content, n);
+n.resolved.content = extractRange(n.resolved.content, n, encoder);
 const rendered = renderDefNode(n);
 n.preview = rendered;
 n.tokens = approximateTokens(rendered);
@@ -1344,7 +1344,7 @@ export async function renderPromptNode(
     else
       prediction = {
         type: "content",
-        content: extractRange(value.content, n),
+        content: extractRange(value.content, n, encoder),
       };
   }
 }

packages/core/src/types.ts

Lines changed: 11 additions & 0 deletions
@@ -1764,6 +1764,17 @@ export interface RangeOptions {
    * The inclusive end of the line range, with a 1-based index
    */
   lineEnd?: number;
+  /**
+   * Center line number around which the file will be truncated.
+   * Dynamically calculates the range around this line.
+   * This is different from lineStart/lineEnd as it specifies a center point.
+   */
+  line?: number;
+  /**
+   * Maximum token budget for the extracted range when using line option.
+   * If specified, the range will be computed to fit within this token limit.
+   */
+  maxTokens?: number;
 }
 
 export interface GitIgnoreFilterOptions {
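
The interaction between the `RangeOptions` fields can be summarized with a small classifier that follows the dispatch order in `extractRange`: an explicit `lineStart`/`lineEnd` wins, then `line` (with or without `maxTokens`), otherwise the text is returned unchanged. This is only a sketch; `RangeOptionsSketch` and `describeRange` are hypothetical names, and the helper reports which branch would be chosen without extracting anything.

```typescript
// Local copy of the option shape for illustration (assumption: mirrors RangeOptions).
interface RangeOptionsSketch {
  lineStart?: number;
  lineEnd?: number;
  line?: number;
  maxTokens?: number;
}

type RangeMode = "explicit-range" | "token-budget" | "file-size-window" | "full-text";

// Hypothetical helper: reports which extraction branch the options would select.
function describeRange(options: RangeOptionsSketch = {}): RangeMode {
  const { lineStart, lineEnd, line, maxTokens } = options;
  const isSet = (n?: number): n is number => typeof n === "number" && !Number.isNaN(n);
  if (isSet(lineStart) || isSet(lineEnd)) return "explicit-range"; // takes priority
  if (isSet(line)) {
    return maxTokens && maxTokens > 0 ? "token-budget" : "file-size-window";
  }
  return "full-text";
}

console.log(describeRange({ lineStart: 10, lineEnd: 20, line: 50, maxTokens: 100 })); // explicit-range
console.log(describeRange({ line: 25, maxTokens: 500 })); // token-budget
console.log(describeRange({ line: 25 })); // file-size-window
```

The sketch does not model the bounds check in `extractRangeAroundLine`, which returns the original text when the center line falls outside the file.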

0 commit comments
