
Commit 671c8f5

Enhance index handling in Python bindings: introduce HASH index type and update examples and documentation
1 parent ab70166 commit 671c8f5

23 files changed

Lines changed: 194 additions & 104 deletions

bindings/python/docs/api/schema.md

Lines changed: 24 additions & 1 deletion
@@ -233,7 +233,8 @@ Create an index on a type.
233233
- `type_name` (str): Name of the type
234234
- `property_names` (List[str]): List of property names to index
235235
- `unique` (bool): Whether the index should enforce uniqueness (default: `False`)
236-
- `index_type` (str or IndexType): Type of index (`"LSM_TREE"`, `"FULL_TEXT"`)
236+
- `index_type` (str or IndexType): Type of index (`"LSM_TREE"`, `"HASH"`, `"FULL_TEXT"`,
237+
`"LSM_VECTOR"`, `"GEOSPATIAL"`)
237238

238239
**Returns:**
239240

@@ -250,13 +251,35 @@ Create an index on a type.
250251
# Unique index on username
251252
schema.create_index("User", ["username"], unique=True)
252253

254+
# Exact-match lookup index
255+
schema.create_index("Order", ["customerId"], index_type="HASH")
256+
253257
# Composite index
254258
schema.create_index("Event", ["userId", "timestamp"])
255259

256260
# Full-text index
257261
schema.create_index("Article", ["content"], index_type="FULL_TEXT")
258262
```
259263

264+
**Index choice rules of thumb:**
265+
266+
- Use `HASH` for exact-match lookups when you do not need ranges or ordered scans.
267+
- Use `LSM_TREE` when you need ranges, sorting, or a safe general-purpose default.
268+
- Use `FULL_TEXT`, `LSM_VECTOR`, and `GEOSPATIAL` only for their specialized query
269+
types.
270+
- `HASH` can be unique or non-unique. The index structure and uniqueness constraint are
271+
separate choices.
272+
273+
**SQL DSL equivalents:**
274+
275+
- `CREATE INDEX ON User (email) UNIQUE` -> unique `LSM_TREE`
276+
- `CREATE INDEX ON User (email) NOTUNIQUE` -> non-unique `LSM_TREE`
277+
- `CREATE INDEX ON User (email) UNIQUE_HASH` -> unique `HASH`
278+
- `CREATE INDEX ON Order (customerId) NOTUNIQUE_HASH` -> non-unique `HASH`
279+
- `CREATE INDEX ON Article (content) FULL_TEXT` -> `FULL_TEXT`
280+
- `CREATE INDEX ON Doc (embedding) LSM_VECTOR ...` -> `LSM_VECTOR`
281+
- `CREATE INDEX ON Place (location) GEOSPATIAL` -> `GEOSPATIAL`
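As a rough illustration of the SQL DSL equivalents listed above, the four general-purpose keywords can be written as a lookup table. `SQL_KEYWORD_TO_INDEX` and `create_index_kwargs` are hypothetical names and not part of the bindings; only the `index_type` and `unique` parameter names come from the `create_index` documentation in this hunk.

```python
# Hypothetical helper: maps a SQL index keyword to the documented
# create_index() arguments. The keyword/structure pairs are copied
# from the "SQL DSL equivalents" list; nothing here calls ArcadeDB.
SQL_KEYWORD_TO_INDEX = {
    "UNIQUE": ("LSM_TREE", True),
    "NOTUNIQUE": ("LSM_TREE", False),
    "UNIQUE_HASH": ("HASH", True),
    "NOTUNIQUE_HASH": ("HASH", False),
}


def create_index_kwargs(sql_keyword: str) -> dict:
    """Return index_type/unique keyword arguments for a SQL keyword."""
    index_type, unique = SQL_KEYWORD_TO_INDEX[sql_keyword.upper()]
    return {"index_type": index_type, "unique": unique}
```

The result could then be passed along in a call such as `schema.create_index("Order", ["customerId"], **create_index_kwargs("NOTUNIQUE_HASH"))`, assuming `create_index` accepts both keyword arguments together as documented.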
282+
260283
**Vector (JVector) Parameters:**
261284

262285
- **max_connections**: Max connections per node (default: 16; typical 8-32). Maps to JVector `maxConnections`.

bindings/python/docs/examples/02_social_network_graph.md

Lines changed: 2 additions & 2 deletions
@@ -278,10 +278,10 @@ Create indexes on frequently queried properties:
278278

279279
```python
280280
# Index on person names for fast lookups
281-
db.command("sql", "CREATE INDEX ON Person (name) NOTUNIQUE")
281+
db.command("sql", "CREATE INDEX ON Person (name) NOTUNIQUE_HASH")
282282

283283
# For unique identifiers
284-
db.command("sql", "CREATE INDEX ON Person (person_id) UNIQUE")
284+
db.command("sql", "CREATE INDEX ON Person (person_id) UNIQUE_HASH")
285285
```
286286

287287
### Batch Operations

bindings/python/docs/examples/04_csv_import_documents.md

Lines changed: 4 additions & 2 deletions
@@ -533,7 +533,7 @@ The database is preserved for inspection after the example completes.
533533

534534
⚠️ **Note**: Database files are larger than source CSVs due to:
535535

536-
- Index structures (LSM-Tree buffers and sorted data)
536+
- Index structures (LSM-tree, hash, or other index data depending on what you create)
537537
- Transaction logs and metadata
538538
- Internal data structures for document storage
539539
- WAL (Write-Ahead Log) files for durability
@@ -545,7 +545,9 @@ The database is preserved for inspection after the example completes.
545545
3. **NULL value handling** works seamlessly across all data types (STRING, INTEGER, etc.)
546546
4. **Batch processing** (`commit_every`) dramatically improves import performance
547547
5. **Create indexes AFTER import** - 2-3x faster than indexing during import
548-
6. **LSM_TREE indexes** provide massive performance gains (up to 14,836x speedup!)
548+
6. **Indexes** provide massive performance gains (up to 14,836x speedup!). For
549+
exact-match lookups, prefer `UNIQUE_HASH` / `NOTUNIQUE_HASH`; for ranges and ordered
550+
scans, prefer `UNIQUE` / `NOTUNIQUE` (`LSM_TREE`).
549551
7. **Statistical validation** (10 runs) ensures reliable performance measurements
550552
8. **Result validation** compares actual data values, not just row counts
551553
9. **Multi-bucket architecture** creates 15 buckets per type, 1 index file per bucket per property

bindings/python/docs/examples/05_csv_import_graph.md

Lines changed: 2 additions & 2 deletions
@@ -306,12 +306,12 @@ LIMIT 10
306306
db.command("sql", "CREATE VERTEX TYPE User IF NOT EXISTS")
307307
db.command("sql", "CREATE PROPERTY User.userId LONG")
308308
db.command("sql", "CREATE PROPERTY User.name STRING")
309-
db.command("sql", "CREATE INDEX ON User (userId) UNIQUE")
309+
db.command("sql", "CREATE INDEX ON User (userId) UNIQUE_HASH")
310310

311311
db.command("sql", "CREATE VERTEX TYPE Movie IF NOT EXISTS")
312312
db.command("sql", "CREATE PROPERTY Movie.movieId LONG")
313313
db.command("sql", "CREATE PROPERTY Movie.title STRING")
314-
db.command("sql", "CREATE INDEX ON Movie (movieId) UNIQUE")
314+
db.command("sql", "CREATE INDEX ON Movie (movieId) UNIQUE_HASH")
315315

316316
# Create edge types
317317
db.command("sql", "CREATE EDGE TYPE RATED UNIDIRECTIONAL IF NOT EXISTS")

bindings/python/docs/guide/core/queries.md

Lines changed: 36 additions & 0 deletions
@@ -331,6 +331,42 @@ for row in result:
331331
print(row.get("content"), row.get("$score"))
332332
```
333333

334+
### Choosing index types in SQL DSL
335+
336+
When you create indexes through SQL, the index keyword controls both the index
337+
structure and uniqueness.
338+
339+
```python
340+
# General-purpose ordered index (LSM_TREE)
341+
db.command("sql", "CREATE INDEX ON User (email) UNIQUE")
342+
db.command("sql", "CREATE INDEX ON Event (createdAt) NOTUNIQUE")
343+
344+
# Exact-match hash index
345+
db.command("sql", "CREATE INDEX ON User (email) UNIQUE_HASH")
346+
db.command("sql", "CREATE INDEX ON Order (customerId) NOTUNIQUE_HASH")
347+
348+
# Specialized indexes
349+
db.command("sql", "CREATE INDEX ON Article (content) FULL_TEXT")
350+
db.command("sql", "CREATE INDEX ON Doc (embedding) LSM_VECTOR METADATA {\"dimensions\": 128}")
351+
db.command("sql", "CREATE INDEX ON Place (location) GEOSPATIAL")
352+
```
353+
354+
Rules of thumb:
355+
356+
- Use `UNIQUE_HASH` or `NOTUNIQUE_HASH` for exact-match lookups only.
357+
- Use `UNIQUE` or `NOTUNIQUE` for `LSM_TREE` indexes when you need ranges, ordering, or a safe general-purpose default.
358+
- Use `FULL_TEXT` for tokenized text search, not normal equality lookups.
359+
- Use `LSM_VECTOR` for embeddings and nearest-neighbor search.
360+
- Use `GEOSPATIAL` for spatial predicates.
361+
362+
Examples:
363+
364+
- `email = ?`, `userId = ?`, `movieId = ?`: usually `UNIQUE_HASH` or `NOTUNIQUE_HASH`
365+
- `createdAt BETWEEN ? AND ?`, `price > ?`, ordered scans: usually `UNIQUE` or `NOTUNIQUE`
366+
367+
`HASH` does not imply uniqueness. A non-unique hash index still makes sense when many
368+
records share the same exact-match value, such as `customerId`, `status`, or `country`.
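The rules of thumb above can be condensed into a small decision helper. This is only a sketch: `suggest_index_keyword` is a hypothetical function, and it covers just the four general-purpose keywords, not `FULL_TEXT`, `LSM_VECTOR`, or `GEOSPATIAL`.

```python
def suggest_index_keyword(needs_range_or_order: bool, unique: bool) -> str:
    """Pick a SQL index keyword from the query pattern (hypothetical helper).

    needs_range_or_order: True for predicates such as BETWEEN, >, or
        ordered scans; False for exact-match lookups only.
    unique: whether the indexed values must be unique.
    """
    if needs_range_or_order:
        # Ordered structure (LSM_TREE) is required for ranges and sorting.
        return "UNIQUE" if unique else "NOTUNIQUE"
    # Exact-match only: a hash index is sufficient.
    return "UNIQUE_HASH" if unique else "NOTUNIQUE_HASH"


# Many orders share one customerId and lookups are by equality only.
keyword = suggest_index_keyword(needs_range_or_order=False, unique=False)
print(f"CREATE INDEX ON Order (customerId) {keyword}")
```

Keeping structure and uniqueness as two separate inputs mirrors the point made above that `HASH` does not imply uniqueness.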
369+
334370
### ResultSet Methods
335371

336372
```python

bindings/python/docs/java-api-coverage.md

Lines changed: 4 additions & 4 deletions
@@ -1,23 +1,23 @@
1-
## Java API Coverage Analysis
1+
# Java API Coverage Analysis
22

33
This section provides a practical mapping between the ArcadeDB Java API and the
44
Python bindings surface in this repository. It reflects the current code in
55
`arcadedb_embedded` rather than a theoretical, full Java surface comparison.
66

7-
### Executive Summary
7+
## Executive Summary
88

99
The Python bindings expose the **core database, schema, graph, vector, async,
1010
import/export, and server workflows** needed for typical application usage. Most
1111
omissions are **low-level JVM internals** (WAL details, bucket scanning, binary
1212
protocol, server plugins, clustering) that are not typically used from Python.
1313

14-
#### Coverage by Area (Qualitative)
14+
### Coverage by Area (Qualitative)
1515

1616
| Area | Status | Notes |
1717
| --- | --- | --- |
1818
| Core Database | ✅ Supported | `DatabaseFactory`, `Database`, transactions, lookups, async helpers |
1919
| Query Execution | ✅ Supported | SQL, OpenCypher, MongoDB, GraphQL passthrough |
20-
| Schema & Indexes | ✅ Supported | Types, properties, LSM/FULL_TEXT/Vector indexes |
20+
| Schema & Indexes | ✅ Supported | Types, properties, LSM_TREE/HASH/FULL_TEXT/LSM_VECTOR/GEOSPATIAL indexes |
2121
| Graph API | ✅ Supported | SQL/OpenCypher graph workflows plus `Document`/`Vertex`/`Edge` wrapper compatibility |
2222
| Vector Search | ✅ Supported | JVector indexes + NumPy conversion helpers |
2323
| Async Execution | ✅ Supported | `AsyncExecutor` plus record-level and SQL/Cypher async flows |

bindings/python/examples/02_social_network_graph.py

Lines changed: 1 addition & 2 deletions
@@ -139,8 +139,7 @@ def create_schema(db):
139139
db.command("sql", "CREATE PROPERTY FRIEND_OF.closeness STRING")
140140
print(" ✓ Created FRIEND_OF properties")
141141

142-
# Create indexes for better performance using Schema API
143-
db.command("sql", "CREATE INDEX ON Person (name) NOTUNIQUE")
142+
db.command("sql", "CREATE INDEX ON Person (name) NOTUNIQUE_HASH")
144143
print(" ✓ Created index on Person.name")
145144

146145
print(f" ⏱️ Time: {time.time() - step_start:.3f}s")

bindings/python/examples/03_vector_search.py

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@
9999
db.command("sql", "CREATE PROPERTY Article.id STRING")
100100

101101
# Create standard index on ID for fast lookups
102-
db.command("sql", "CREATE INDEX ON Article (id) UNIQUE")
102+
db.command("sql", "CREATE INDEX ON Article (id) UNIQUE_HASH")
103103

104104
print(" ✅ Schema created: Article vertex type")
105105
print(" 💡 Vector property type: ARRAY_OF_FLOATS")

bindings/python/examples/04_csv_import_documents.py

Lines changed: 25 additions & 7 deletions
@@ -618,8 +618,14 @@ def create_indexes(db, indexes, verbose=True):
618618
# Convert uniqueness string to SQL index creation
619619
if uniqueness == "UNIQUE":
620620
db.command("sql", f"CREATE INDEX ON {table} ({column}) UNIQUE")
621+
elif uniqueness == "UNIQUE_HASH":
622+
db.command("sql", f"CREATE INDEX ON {table} ({column}) UNIQUE_HASH")
621623
elif uniqueness == "FULL_TEXT":
622624
db.command("sql", f"CREATE INDEX ON {table} ({column}) FULL_TEXT")
625+
elif uniqueness == "NOTUNIQUE_HASH":
626+
db.command(
627+
"sql", f"CREATE INDEX ON {table} ({column}) NOTUNIQUE_HASH"
628+
)
623629
else: # NOTUNIQUE
624630
db.command("sql", f"CREATE INDEX ON {table} ({column}) NOTUNIQUE")
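Each branch of the chain above emits the same statement with a different trailing keyword, so one possible alternative is a table-driven builder. The sketch below is not what the commit does; `SUPPORTED_KEYWORDS` and `build_index_sql` are hypothetical names, and the fallback to `NOTUNIQUE` mirrors the `else` branch.

```python
# Keywords handled explicitly by the if/elif chain in this hunk.
SUPPORTED_KEYWORDS = {
    "UNIQUE",
    "UNIQUE_HASH",
    "FULL_TEXT",
    "NOTUNIQUE_HASH",
    "NOTUNIQUE",
}


def build_index_sql(table: str, column: str, uniqueness: str) -> str:
    """Build the CREATE INDEX statement for one (table, column) pair."""
    # Anything unrecognized falls back to NOTUNIQUE, like the else branch.
    keyword = uniqueness if uniqueness in SUPPORTED_KEYWORDS else "NOTUNIQUE"
    return f"CREATE INDEX ON {table} ({column}) {keyword}"
```

The returned string would then be passed to `db.command("sql", ...)` exactly as the existing branches do.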
625631

@@ -1469,15 +1475,16 @@ def _flush_batch(batch_rows):
14691475

14701476
# Check all existing indexes
14711477
#
1472-
# Note: ArcadeDB has 3 index engine types: LSM_TREE, FULL_TEXT, VECTOR
1478+
# Note: ArcadeDB exposes multiple index engines, including LSM_TREE, HASH,
1479+
# FULL_TEXT, and VECTOR.
14731480
# The schema metadata query only exposes a boolean 'unique' field, not the engine type.
14741481
# Therefore:
1475-
# - UNIQUE indexes → unique=true, engine=LSM_TREE
1476-
# - NOTUNIQUE indexes → unique=false, engine=LSM_TREE
1482+
# - UNIQUE / UNIQUE_HASH indexes → unique=true
1483+
# - NOTUNIQUE / NOTUNIQUE_HASH indexes → unique=false
14771484
# - FULL_TEXT indexes → unique=false, engine=FULL_TEXT (appears as NOTUNIQUE!)
14781485
#
1479-
# This means FULL_TEXT indexes show as NOTUNIQUE in the metadata, so we need to
1480-
# check for both when validating expected FULL_TEXT indexes.
1486+
# This means HASH and FULL_TEXT indexes must be validated by semantics rather than
1487+
# engine type alone.
14811488
for idx in existing_indexes:
14821489
idx_dict = json.loads(idx.to_json())
14831490
index_type = "UNIQUE" if idx_dict.get("unique") else "NOTUNIQUE"
@@ -1505,7 +1512,7 @@ def _flush_batch(batch_rows):
15051512
candidate_columns.append(prop)
15061513

15071514
if isinstance(name, str) and "[" in name and name.endswith("]"):
1508-
raw_props = name[name.find("[") + 1 : -1]
1515+
raw_props = name[name.find("[") + 1 : -1].strip()
15091516
for col in raw_props.split(","):
15101517
col = col.strip()
15111518
if col:
@@ -1514,15 +1521,23 @@ def _flush_batch(batch_rows):
15141521
candidate_columns = [c for c in candidate_columns if c]
15151522

15161523
for column_name in candidate_columns:
1517-
# Try matching as the reported type (UNIQUE/NOTUNIQUE)
1524+
# Try matching as the reported uniqueness semantics.
15181525
key = (type_name, column_name, index_type)
15191526
if key in expected_indexes:
15201527
expected_indexes[key] = True
15211528

1529+
if index_type == "UNIQUE":
1530+
unique_hash_key = (type_name, column_name, "UNIQUE_HASH")
1531+
if unique_hash_key in expected_indexes:
1532+
expected_indexes[unique_hash_key] = True
1533+
15221534
# FULL_TEXT indexes appear as NOTUNIQUE in metadata, so also check for FULL_TEXT
15231535
# This is expected behavior since FULL_TEXT is a different index engine type,
15241536
# not a variant of LSM_TREE, but metadata only exposes the 'unique' boolean.
15251537
if index_type == "NOTUNIQUE":
1538+
notunique_hash_key = (type_name, column_name, "NOTUNIQUE_HASH")
1539+
if notunique_hash_key in expected_indexes:
1540+
expected_indexes[notunique_hash_key] = True
15261541
fulltext_key = (type_name, column_name, "FULL_TEXT")
15271542
if fulltext_key in expected_indexes:
15281543
expected_indexes[fulltext_key] = True
@@ -1534,6 +1549,9 @@ def _flush_batch(batch_rows):
15341549
unique_key = (type_name, column_name, "UNIQUE")
15351550
if unique_key in expected_indexes:
15361551
expected_indexes[unique_key] = True
1552+
unique_hash_key = (type_name, column_name, "UNIQUE_HASH")
1553+
if unique_hash_key in expected_indexes:
1554+
expected_indexes[unique_hash_key] = True
15371555

15381556
# Validate all expected indexes were created
15391557
print("\n ✅ Validating expected indexes:")
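Because the schema metadata reports only a `unique` boolean, the matching logic in this hunk treats several expected kinds as compatible with each reported value. A condensed sketch of that idea follows; `compatible_expected_kinds` and `mark_expected` are hypothetical helpers, and `expected_indexes` is assumed to be a dict keyed by `(type_name, column_name, kind)` as in the example script.

```python
def compatible_expected_kinds(reported_unique: bool) -> tuple:
    """Expected index kinds that can appear with this 'unique' flag."""
    if reported_unique:
        return ("UNIQUE", "UNIQUE_HASH")
    # FULL_TEXT also shows up as unique=false in the metadata.
    return ("NOTUNIQUE", "NOTUNIQUE_HASH", "FULL_TEXT")


def mark_expected(expected_indexes, type_name, column_name, reported_unique):
    """Flag every expected index compatible with the reported metadata."""
    for kind in compatible_expected_kinds(reported_unique):
        key = (type_name, column_name, kind)
        if key in expected_indexes:
            expected_indexes[key] = True


expected = {
    ("User", "email", "UNIQUE_HASH"): False,
    ("Article", "content", "FULL_TEXT"): False,
}
mark_expected(expected, "User", "email", reported_unique=True)
```

After this call only the `User.email` entry is marked as found; the `FULL_TEXT` entry stays unmatched until metadata for `Article.content` is processed. Note that this approach cannot distinguish a hash index from an LSM-tree index with the same uniqueness.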

bindings/python/examples/05_csv_import_graph.py

Lines changed: 2 additions & 2 deletions
@@ -1429,8 +1429,8 @@ def create_schema(db: Any, create_indexes: bool = True):
14291429
print("Creating indexes...")
14301430
# Idempotent index creation
14311431
for command in (
1432-
"CREATE INDEX ON User (userId) UNIQUE",
1433-
"CREATE INDEX ON Movie (movieId) UNIQUE",
1432+
"CREATE INDEX ON User (userId) UNIQUE_HASH",
1433+
"CREATE INDEX ON Movie (movieId) UNIQUE_HASH",
14341434
):
14351435
try:
14361436
db.command("sql", command)
