Commit dd50ee5

feat: add preserve duplicate paths option (#9)
Add the preserve_duplicate_paths option for virtual-file/editor workflows that need distinct logical paths even when content is identical or empty.

When enabled with SELECT memory_set_option('preserve_duplicate_paths', 1), storage hashes are scoped by path so dbmem_content can keep separate rows while the embedding cache still reuses chunk embeddings by text.

Fix empty content handling so memory_add_content() and memory_add_file() can store zero-length entries without producing chunks, and keep default deduplication behavior unchanged when the option is 0.

Document the option, bump the extension version to 1.3.2, and cover default dedupe, duplicate preservation, and empty file/content behavior with unit tests.
1 parent 04a6e39 commit dd50ee5
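
The path scoping described in the commit message can be illustrated in isolation. The following is a minimal sketch, not the extension's code: `demo_hash` is a stand-in (64-bit FNV-1a) for `dbmem_hash_compute`, whose implementation is not part of this commit, and `demo_storage_hash` and `demo_self_check` are hypothetical names that only mirror the shape of `dbmem_storage_hash_compute`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in content hash (assumption: the real dbmem_hash_compute may use a
   different algorithm). This is plain 64-bit FNV-1a. */
static uint64_t demo_hash(const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV-1a prime */
    }
    return h;
}

/* Hypothetical mirror of dbmem_storage_hash_compute: with the option off, or
   with no usable path, the storage hash is just the content hash. Otherwise
   the content hash and the path hash are combined and hashed again. */
static uint64_t demo_storage_hash(const char *buf, size_t len,
                                  const char *path, bool preserve) {
    uint64_t content_hash = demo_hash(buf, len);
    if (!preserve || !path || !path[0]) return content_hash;

    uint64_t parts[2] = { content_hash, demo_hash(path, strlen(path)) };
    return demo_hash(parts, sizeof(parts));
}

/* Checks the two properties that hold regardless of the hash function used. */
static void demo_self_check(void) {
    const char *text = "same text";
    size_t n = strlen(text);

    /* Option off: the path is ignored, so identical content shares one hash. */
    assert(demo_storage_hash(text, n, "a.md", false) ==
           demo_storage_hash(text, n, "b.md", false));

    /* Option on but empty or NULL path: falls back to the content hash. */
    assert(demo_storage_hash(text, n, "", true) == demo_hash(text, n));
    assert(demo_storage_hash(text, n, NULL, true) == demo_hash(text, n));
}
```

With the option on and two different non-empty paths, the two storage hashes are expected to differ (barring a hash collision), which is what lets `dbmem_content` keep one row per path even for empty content.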

5 files changed

Lines changed: 274 additions & 20 deletions

API.md

Lines changed: 7 additions & 2 deletions
@@ -35,7 +35,7 @@ sqlite-memory enables semantic search over text content stored in SQLite. It:

 ## Sync Behavior

-All `memory_add_*` functions use **content-hash change detection** to avoid redundant embedding computation. Each piece of content is hashed before processing — if the hash already exists in the database, the content is skipped.
+By default, all `memory_add_*` functions use **content-hash change detection** to avoid redundant embedding computation. Each piece of content is hashed before processing — if the hash already exists in the database, the content is skipped. Set `preserve_duplicate_paths=1` to store distinct logical paths even when their content is identical or empty.

 ### Change Detection

@@ -197,6 +197,9 @@ SELECT memory_set_option('engine_warmup', 1);

 -- Set minimum score threshold
 SELECT memory_set_option('min_score', 0.75);
+
+-- Preserve separate logical paths even when content is identical
+SELECT memory_set_option('preserve_duplicate_paths', 1);
 ```

 ---
@@ -210,7 +213,7 @@ Retrieves a configuration option value.
 |-----------|------|-------------|
 | `key` | TEXT | Option name |

-**Returns:** ANY - Option value, or NULL if not set
+**Returns:** ANY - Option value, or NULL if not set. `preserve_duplicate_paths` returns `0` by default.

 **Example:**
 ```sql
@@ -303,6 +306,7 @@ Indexes caller-provided file content without reading from the filesystem.
 - No row is added to `dbmem_content_source` because content was supplied by the caller rather than read from the local filesystem
 - If the path was previously indexed with different content, the old entry (chunks, embeddings, FTS) is deleted and new content is reindexed
 - If the new content is already indexed under another path, the stale path is removed and the existing content entry is reused
+- Set `preserve_duplicate_paths=1` to preserve separate rows for distinct paths with identical or empty content
 - Available even when compiled with `DBMEM_OMIT_IO`

 **Example:**
@@ -828,6 +832,7 @@ sqlite3_memory_register_provider(db, "my-engine", &provider);
 | `embedding_cache` | INTEGER | 1 | Cache embeddings to avoid redundant computation |
 | `cache_max_entries` | INTEGER | 0 | Max cache entries (0 = no limit). When exceeded, oldest entries are evicted |
 | `search_oversample` | INTEGER | 0 | Search oversampling multiplier (0 = no oversampling). When set, retrieves N * multiplier candidates from each index before merging down to N final results |
+| `preserve_duplicate_paths` | INTEGER | 0 | Preserve distinct logical paths for identical or empty content. When enabled, `dbmem_content.hash` is path-scoped and identifies an entry rather than only the raw content |

 ---

README.md

Lines changed: 10 additions & 1 deletion
@@ -210,7 +210,7 @@ memories = recall("what's the project timeline")

 ## Intelligent Sync

-All `memory_add_*` functions use content-hash change detection to avoid redundant work:
+By default, all `memory_add_*` functions use content-hash change detection to avoid redundant work:

 - **`memory_add_text`**: Computes a hash of the content. If the same content was already indexed, it is skipped entirely. No duplicate embeddings are ever created.
 - **`memory_add_file`**: Reads the file and hashes its content. If the file was previously indexed with different content, the old entry (chunks, embeddings, FTS) is atomically replaced. Unchanged files are skipped. Absolute file paths are stored as portable logical suffixes, while the original local path is retained only in local metadata.
@@ -219,6 +219,14 @@ All `memory_add_*` functions use content-hash change detection to avoid redundan
 1. **Cleanup**: Removes database entries for files that no longer exist on disk
 2. **Scan**: Recursively processes all matching files - adding new ones, replacing modified ones, and skipping unchanged ones. Stored paths are relative to the scanned directory root, with local provenance retained only in local metadata.

+For virtual-file or editor workflows that need separate logical paths even when content is identical or empty, enable path-preserving storage:
+
+```sql
+SELECT memory_set_option('preserve_duplicate_paths', 1);
+```
+
+In this mode, `dbmem_content.hash` identifies the stored entry and is scoped by path.
+
 `memory_add_text()`, `memory_add_file()`, and `memory_add_content()` each run inside a SQLite SAVEPOINT transaction. `memory_add_directory()` performs its cleanup pass transactionally and then processes each file in its own transaction. If one file fails, that file rolls back cleanly and previously-committed files remain valid; there are no partially-indexed rows or orphaned chunk/FTS entries for the failed file.

 This makes all sync functions safe to call repeatedly - for example, on a cron schedule or at agent startup - with minimal overhead.
@@ -300,6 +308,7 @@ SELECT memory_set_option('search_oversample', 4); -- Fetch 4x candidates before

 -- File processing
 SELECT memory_set_option('extensions', 'md,txt,rst'); -- File types to index
+SELECT memory_set_option('preserve_duplicate_paths', 1); -- Keep duplicate/empty virtual paths

 -- Embedding cache (enabled by default)
 SELECT memory_set_option('embedding_cache', 0); -- Disable cache

src/sqlite-memory.c

Lines changed: 46 additions & 16 deletions
@@ -64,6 +64,7 @@ SQLITE_EXTENSION_INIT1
 #define DBMEM_SETTINGS_KEY_EMBEDDING_CACHE "embedding_cache"
 #define DBMEM_SETTINGS_KEY_CACHE_MAX_ENTRIES "cache_max_entries"
 #define DBMEM_SETTINGS_KEY_SEARCH_OVERSAMPLE "search_oversample"
+#define DBMEM_SETTINGS_KEY_PRESERVE_DUP_PATHS "preserve_duplicate_paths"
 #define DBMEM_SETTINGS_KEY_SCHEMA_VERSION "schema_version"

 #define DBMEM_SCHEMA_VERSION 4
@@ -126,6 +127,7 @@ struct dbmem_context {
     bool embedding_cache; // Enable/disable embedding cache (default: true)
     int cache_max_entries; // Max cache entries (0 = no limit)
     int search_oversample; // Search oversampling multiplier (0 = no oversampling)
+    bool preserve_duplicate_paths; // Keep separate rows for distinct paths with identical content

     // Cache
     float *cache_buffer; // Reusable buffer for cache hits
@@ -181,6 +183,16 @@ static bool dbmem_value_hash (sqlite3_value *value, uint64_t *hash) {
     }
 }

+static uint64_t dbmem_storage_hash_compute (const char *buffer, size_t len, const char *path, bool preserve_duplicate_paths) {
+    uint64_t content_hash = dbmem_hash_compute(buffer, len);
+    if (!preserve_duplicate_paths || !path || !path[0]) return content_hash;
+
+    uint64_t parts[2];
+    parts[0] = content_hash;
+    parts[1] = dbmem_hash_compute(path, strlen(path));
+    return dbmem_hash_compute(parts, sizeof(parts));
+}
+
 // MARK: - Settings -

 static int dbmem_settings_write (sqlite3 *db, const char *key, const char *text_value, sqlite3_int64 int_value, const sqlite3_value *sql_value, int bind_type) {
@@ -326,6 +338,12 @@ static int dbmem_settings_sync (dbmem_context *ctx, const char *key, sqlite3_val
         return 0;
     }

+    if (strcasecmp(key, DBMEM_SETTINGS_KEY_PRESERVE_DUP_PATHS) == 0) {
+        int n = sqlite3_value_int(value);
+        ctx->preserve_duplicate_paths = (n > 0) ? 1 : 0;
+        return 0;
+    }
+
     if (strcasecmp(key, DBMEM_SETTINGS_KEY_PROVIDER) == 0) {
         char *provider = dbmem_strdup((const char *)sqlite3_value_text(value));
         if (provider) {
@@ -668,10 +686,10 @@ static bool dbmem_database_check_if_stored (sqlite3 *db, uint64_t hash, int64_t
     rc = sqlite3_step(vm);
     if (rc == SQLITE_DONE) rc = SQLITE_OK;
     else if (rc != SQLITE_ROW) goto cleanup;
-
-    // SQLITE_ROW case
-    sqlite3_int64 saved_len = sqlite3_column_int64(vm, 0);
-    result = (saved_len == len);
+    else {
+        sqlite3_int64 saved_len = sqlite3_column_int64(vm, 0);
+        result = (saved_len == len);
+    }

 cleanup:
     if (vm) sqlite3_finalize(vm);
@@ -2390,7 +2408,11 @@ static void dbmem_get_option (sqlite3_context *context, int argc, sqlite3_value

     rc = sqlite3_step(vm);
     if (rc == SQLITE_DONE) {
-        sqlite3_result_null(context);
+        if (strcasecmp(key, DBMEM_SETTINGS_KEY_PRESERVE_DUP_PATHS) == 0) {
+            sqlite3_result_int(context, 0);
+        } else {
+            sqlite3_result_null(context);
+        }
         rc = SQLITE_OK;
     } else if (rc == SQLITE_ROW) {
         sqlite3_result_value(context, sqlite3_column_value(vm, 0));
@@ -2616,7 +2638,7 @@ static int dbmem_process_callback (const char *text, size_t len, size_t offset,
 }

 static int dbmem_process_buffer (dbmem_context *ctx, const char *buffer, int64_t len) {
-    uint64_t hash = dbmem_hash_compute(buffer, (size_t)len);
+    uint64_t hash = dbmem_storage_hash_compute(buffer, (size_t)len, ctx->path, ctx->preserve_duplicate_paths);
     const char *saved_path = ctx->path;
     char *unique_path = NULL;
     bool transaction_started = false;
@@ -2625,6 +2647,7 @@ static int dbmem_process_buffer (dbmem_context *ctx, const char *buffer, int64_t
         unique_path = dbmem_path_unique_storage_copy(ctx->db, ctx->path, ctx->source_path);
         if (!unique_path) return SQLITE_NOMEM;
         ctx->path = unique_path;
+        hash = dbmem_storage_hash_compute(buffer, (size_t)len, ctx->path, ctx->preserve_duplicate_paths);
     }

     sqlite3 *db = ctx->db;
@@ -2638,7 +2661,7 @@ static int dbmem_process_buffer (dbmem_context *ctx, const char *buffer, int64_t
     }
     dbmem_database_delete_stale_path(db, ctx->path, hash);

-    if (dbmem_database_check_if_stored(ctx->db, hash, len)) {
+    if (!ctx->preserve_duplicate_paths && dbmem_database_check_if_stored(ctx->db, hash, len)) {
         if (ctx->source_path) {
             char *stored_path = dbmem_database_path_for_hash_copy(ctx->db, hash);
             if (!stored_path) {
@@ -2670,6 +2693,8 @@ static int dbmem_process_buffer (dbmem_context *ctx, const char *buffer, int64_t
         if (rc != SQLITE_OK) goto cleanup;
     }

+    if (len == 0) goto cleanup;
+
     rc = dbmem_parse(buffer, (size_t)len, &settings);

     if (rc == SQLITE_OK && !ctx->dimension_saved) {
@@ -3529,20 +3554,25 @@ static void dbmem_sql_reindex (sqlite3_context *context, int argc, sqlite3_value
             break;
         }

-        uint64_t value_hash = dbmem_hash_compute(value, (size_t)value_len);
-        bool hash_matches = (stored_hash == value_hash);
-        bool value_has_vault = dbmem_database_hash_has_vault(db, value_hash);
-        bool needs_reindex = !hash_matches || !value_has_vault;
+        uint64_t content_hash = dbmem_hash_compute(value, (size_t)value_len);
+        uint64_t scoped_hash = dbmem_storage_hash_compute(value, (size_t)value_len, path, true);
+        bool hash_matches = (stored_hash == content_hash || stored_hash == scoped_hash);
+        uint64_t target_hash = hash_matches
+            ? stored_hash
+            : dbmem_storage_hash_compute(value, (size_t)value_len, path, ctx->preserve_duplicate_paths);
+        bool target_has_vault = (value_len == 0) || dbmem_database_hash_has_vault(db, target_hash);
+        bool needs_hash_update = !hash_matches;
+        bool needs_reindex = (value_len > 0) && (!hash_matches || !target_has_vault);

-        if (needs_reindex && !value_has_vault) {
+        if (needs_reindex) {
             ctx->path = path;
             ctx->context = ctx_name;
             rc = dbmem_process_buffer(ctx, value, value_len);
         }

-        if (rc == SQLITE_OK && needs_reindex) {
-            rc = dbmem_database_update_content_hash(db, path, value_hash);
-            if (rc == SQLITE_OK && !hash_matches) {
+        if (rc == SQLITE_OK && needs_hash_update) {
+            rc = dbmem_database_update_content_hash(db, path, target_hash);
+            if (rc == SQLITE_OK) {
                 rc = dbmem_database_delete_index_hash(db, stored_hash);
             }
         }
@@ -3566,7 +3596,7 @@ static void dbmem_sql_reindex (sqlite3_context *context, int argc, sqlite3_value
         dbmemory_free(ctx_name);

         if (rc != SQLITE_OK) break;
-        if (needs_reindex) processed++;
+        if (needs_reindex || needs_hash_update) processed++;
     }

 done:
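
The new per-row decision in `dbmem_sql_reindex` is easier to follow as a pure function. The sketch below is an illustration under assumptions, not code from the repository: `reindex_plan`, `plan_reindex`, and `plan_self_check` are hypothetical names, the hashes are passed in as plain integers, and the `target_has_vault_lookup` flag stands in for the result of `dbmem_database_hash_has_vault` on the chosen target hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of the per-row decision (hypothetical type for illustration). */
typedef struct {
    uint64_t target_hash;     /* hash the row should carry afterwards */
    bool needs_hash_update;   /* stored hash matches neither form */
    bool needs_reindex;       /* content must be chunked and embedded again */
} reindex_plan;

/* Mirrors the boolean logic added to dbmem_sql_reindex in this commit.
   scoped_hash is the path-scoped hash; for an empty path it equals
   content_hash, so "preserve ? scoped : content" matches the original call. */
static reindex_plan plan_reindex(uint64_t stored_hash, uint64_t content_hash,
                                 uint64_t scoped_hash, bool preserve,
                                 size_t value_len, bool target_has_vault_lookup) {
    reindex_plan p;
    /* A row written in either mode is accepted as up to date. */
    bool hash_matches = (stored_hash == content_hash) || (stored_hash == scoped_hash);
    p.target_hash = hash_matches ? stored_hash
                                 : (preserve ? scoped_hash : content_hash);
    /* Empty content never has chunks, so it is treated as already indexed. */
    bool target_has_vault = (value_len == 0) || target_has_vault_lookup;
    p.needs_hash_update = !hash_matches;
    p.needs_reindex = (value_len > 0) && (!hash_matches || !target_has_vault);
    return p;
}

/* Walks through the four cases the commit distinguishes. */
static void plan_self_check(void) {
    /* Up-to-date row with index data: nothing to do. */
    reindex_plan a = plan_reindex(10, 10, 20, false, 5, true);
    assert(!a.needs_hash_update && !a.needs_reindex && a.target_hash == 10);

    /* Path-scoped row whose index data is missing: reindex, keep the hash. */
    reindex_plan b = plan_reindex(20, 10, 20, true, 5, false);
    assert(!b.needs_hash_update && b.needs_reindex && b.target_hash == 20);

    /* Stale hash with the option on: move to the scoped hash and reindex. */
    reindex_plan c = plan_reindex(99, 10, 20, true, 5, false);
    assert(c.needs_hash_update && c.needs_reindex && c.target_hash == 20);

    /* Stale hash on empty content: update the hash only, no chunks to build. */
    reindex_plan d = plan_reindex(99, 10, 20, false, 0, false);
    assert(d.needs_hash_update && !d.needs_reindex && d.target_hash == 10);
}
```

Because a stored hash is accepted in either form, toggling the option does not by itself force existing rows to be rewritten; only rows whose hash matches neither form are moved to the hash for the current mode.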

src/sqlite-memory.h

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@
 extern "C" {
 #endif

-#define SQLITE_DBMEMORY_VERSION "1.3.1"
+#define SQLITE_DBMEMORY_VERSION "1.3.2"

 // public API
 SQLITE_DBMEMORY_API int sqlite3_memory_init (sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi);
