Commit 28cc7f3

Author: Zhe Yu
docs(database): Document the database connector API
1 parent 0354806 commit 28cc7f3

3 files changed

Lines changed: 125 additions & 87 deletions

docs/CONTRIBUTING.md

Lines changed: 6 additions & 0 deletions
@@ -40,6 +40,12 @@ You may also find it helpful to
4040
[enable logging](https://github.com/Davidyz/VectorCode/blob/main/docs/cli.md#debugging-and-diagnosing)
4141
for the CLI when developing new features or working on fixes.
4242

43+
### Database Connectors
44+
45+
Please take a look at [the database documentation](../src/vectorcode/database/README.md),
46+
which contains a brief introduction to the API design and explains what you'd need
47+
to do to add support for a new database.
48+
4349
## Neovim Plugin
4450

4551
At the moment, there isn't much to cover on here. As long as the code is

src/vectorcode/database/README.md

Lines changed: 60 additions & 78 deletions
@@ -2,88 +2,70 @@
22

33
A database connector is a compatibility layer that converts data structures that a
44
database natively works with to the ones that VectorCode works with. The connector
5-
classes provides abstractions for VectorCode operations (`vectorise`, `query`, etc.),
5+
classes provide abstractions for VectorCode operations (`vectorise`, `query`, etc.),
66
which enables the use of different database backends.
77

88
<!-- mtoc-start -->
99

10-
* [Creating Database Connectors](#creating-database-connectors)
11-
* [Implementation Details](#implementation-details)
12-
* [Connector Configuration](#connector-configuration)
13-
* [Database Settings](#database-settings)
14-
* [Documenting the Database Settings](#documenting-the-database-settings)
15-
* [CRUD Operations](#crud-operations)
10+
* [Adding a New Database Connector](#adding-a-new-database-connector)
11+
* [Key Implementation Details](#key-implementation-details)
12+
* [The `Config` Object](#the-config-object)
13+
* [Implementing Abstract Methods](#implementing-abstract-methods)
14+
* [Error Handling](#error-handling)
15+
* [Testing](#testing)
1616

1717
<!-- mtoc-end -->
1818

19-
# Creating Database Connectors
20-
21-
To add support for a new database backend, you'd need to:
22-
23-
1. Implement a child class of `vectorcode.database.base.DatabaseConnectorBase` and all
24-
of its abstract methods, and put it under this directory.
25-
2. Add a new entry in the [`get_database_connector`](./__init__.py) function that
26-
initialises your new database connector when the `configs.db_type` points to the new
27-
database.
28-
3. Add tests for your new database connector. The new tests should verify that your
29-
connector correctly converts between the native data structures from the database and
30-
the VectorCode data structures that the rest of the codebase (embedding function,
31-
reranker, etc.)can work with.
32-
33-
# Implementation Details
34-
35-
> Apart from this document, you may refer to [the `DatabaseConnectorBase`](./base.py)
36-
> and [the `ChromaDB0Connector`](./chroma0.py) implementations as reference designs of
37-
> a new database connector.
38-
39-
In the following sections, I'll use the term _database_ to refer to the actual database
40-
backends (chromadb, pgvector, etc.) that holds the data and performs the CRUD operations,
41-
and the term _connector_ to refer to our compatibility layer (child classes of
42-
`vectorcode.database.base.DatabaseConnectorBase`).
43-
44-
## Connector Configuration
45-
46-
The connector has a private attribute (that is, the attribute name is prefixed by a `_`)
47-
`self._configs`. This is a `vectorcode.cli_utils.Config` object that holds various
48-
configuration options, including the database settings used to initialise the
49-
connections to the database and the parameters used for the CRUD operations with the
50-
database. This attribute is **mutable** and _should_ be updated before calling a CRUD
51-
method using the `self.update_config(new_config)` or the `self.replace_config(new_config)`
52-
methods. However, the database-related settings shouldn't be changed. A new connector
53-
instance should be created for that purpose.
54-
55-
## Database Settings
56-
57-
The database settings are configured in the JSON configuration file, and will be parsed
58-
and stored in the `config.db_type` and `config.db_params` attributes of the
59-
`self._configs` object.
60-
61-
The `db_type` attribute is a string that indicates the type of the database backend
62-
(for example, `ChromaDB0` for Chromadb 0.6.3).
63-
64-
The `db_params` attribute is a dictionary that holds some database-specific settings
65-
(for example, the database API endpoint URL and/or database directory).
66-
67-
### Documenting the Database Settings
68-
69-
Please document about the database-specific settings (`db_params`) in the doc-string
70-
of your database connector. This doc-string will be presented in the error message when
71-
the database fails to initialise, and should provide instructions to help the user
72-
debug their configuration.
73-
74-
## CRUD Operations
75-
76-
Historically, the parameters of VectorCode operations have been stored and propagated
77-
in a `vectorcode.cli_utils.Config` object. The database connectors continue to follow
78-
this pattern. That is, each of the abstract methods that represent an abstracted
79-
database operation (`query()`, `vectorise()`, `list()`, etc.) should read the necessary
80-
parameters (`project_root`, file paths, query keywords, etc.) from the `self._configs`
81-
attribute. Note that the `self._configs` attribute is mutable, so you should always read
82-
the parameters from it directly for each of the operations.
83-
84-
> Some methods support keyword arguments that allows temporarily overriding some
85-
> parameters. For example, the `list_collection_content` method supports overriding
86-
> `self._configs` by passing `_collection_id` and `collection_path`. The idea is that
87-
> these methods can usually be used by the implementation of other methods or subcommands
88-
> (for example, `list_collection_content` is used in `count` and `check_orphanes`),
89-
> and being able to pass such parameters are convenient when writing those implementations.
19+
# Adding a New Database Connector
20+
21+
To add support for a new database backend, you will need to:
22+
23+
1. **Implement a connector class**: Create a new file in this directory and implement a child class of `vectorcode.database.base.DatabaseConnectorBase`. You must implement all of its abstract methods.
24+
2. **Write tests**: Add tests for your new connector in the `tests/database/` directory. The tests should mock the database's API and verify that your connector correctly converts data between the database's native format and VectorCode's data structures.
25+
3. **Register your connector**: Add a new entry in the `get_database_connector` function in `src/vectorcode/database/__init__.py` to initialize your new connector.
26+
27+
For a concrete example, refer to the implementation of `DatabaseConnectorBase` and the `ChromaDB0Connector`.
28+
29+
# Key Implementation Details
30+
31+
## The `Config` Object
32+
33+
All settings for a connector are passed through a single `vectorcode.cli_utils.Config` object, which is available as `self._configs`. This includes:
34+
35+
- **Database Settings**: The `db_type` string and `db_params` dictionary are used to configure the connection to the database backend. As a contributor, you should document the specific `db_params` your connector requires in the class's docstring.
36+
- **Operation Parameters**: Parameters for operations like `query` or `vectorise` are also present in this object.
37+
38+
The `self._configs` attribute is mutable and can be updated for subsequent operations, but the database connection settings (`db_type`, `db_params`) should not be changed after initialization.
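The rule about immutable database settings can be illustrated with a minimal sketch. The `Config` and `SketchConnector` classes below are hypothetical stand-ins (not the real `vectorcode` classes), the `url` key in `db_params` is made up for illustration, and the body of `replace_config` is an assumption; only `_check_new_config` follows the logic shown in `base.py`.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical stand-in for `vectorcode.cli_utils.Config`."""

    db_type: str = "ChromaDB0"
    db_params: dict = field(default_factory=dict)
    query: list[str] = field(default_factory=list)


class SketchConnector:
    """Stand-in that mirrors only the config handling of the base class."""

    def __init__(self, configs: Config):
        self._configs = configs

    def _check_new_config(self, new_config: Config) -> bool:
        # Copy the database settings over so that they cannot be changed.
        new_config.db_type = self._configs.db_type
        new_config.db_params = self._configs.db_params
        return True

    async def replace_config(self, new_config: Config) -> "SketchConnector":
        assert self._check_new_config(new_config)
        # Assumed behaviour: swap in the new config wholesale.
        self._configs = new_config
        return self


async def demo() -> Config:
    connector = SketchConnector(Config(db_params={"url": "http://localhost:8000"}))
    # The attempt to change `db_type` is silently overridden.
    await connector.replace_config(Config(db_type="Other", query=["hello"]))
    return connector._configs


result = asyncio.run(demo())
```

In this sketch the operation parameter (`query`) changes while `db_type` and `db_params` keep their original values, which is why a new connector instance is needed to point at a different database.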
39+
40+
## Implementing Abstract Methods
41+
42+
When implementing the abstract methods from `DatabaseConnectorBase`, you should:
43+
44+
- Read the necessary parameters from the `self._configs` object.
45+
- Perform the corresponding operation against the database.
46+
- Return data in the format specified by the method's type hints (e.g., `QueryResult`, `CollectionInfo`).
47+
48+
**Please refer to the docstrings in `DatabaseConnectorBase` for the specific API contract of each method.** They contain detailed information about what each method is expected to do and what parameters it uses from the `Config` object.
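The three steps above can be sketched with a toy in-memory backend. Everything here is a hypothetical stand-in: `ConnectorBase` replaces `DatabaseConnectorBase`, the `QueryResult` fields and the `n_result` option are invented for the example, and the substring match stands in for a real similarity search.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical stand-in for `vectorcode.cli_utils.Config`."""

    query: list[str] = field(default_factory=list)
    n_result: int = 1


@dataclass
class QueryResult:
    """Hypothetical stand-in; the real type defines its own fields."""

    path: str
    text: str


class ConnectorBase(ABC):
    """Stand-in for `DatabaseConnectorBase`, reduced to one abstract method."""

    def __init__(self, configs: Config):
        self._configs = configs

    @abstractmethod
    async def query(self) -> list[QueryResult]: ...


class InMemoryConnector(ConnectorBase):
    # A toy "database": native rows are (path, text) tuples.
    _rows = [("a.py", "def add(x, y)"), ("b.py", "class Parser")]

    async def query(self) -> list[QueryResult]:
        # 1. Read the parameters from `self._configs`.
        keywords = self._configs.query
        limit = self._configs.n_result
        # 2. Perform the operation against the database (substring match here).
        hits = [row for row in self._rows if any(k in row[1] for k in keywords)]
        # 3. Convert the native rows to the declared return type.
        return [QueryResult(path=p, text=t) for p, t in hits[:limit]]


results = asyncio.run(InMemoryConnector(Config(query=["Parser"])).query())
```

A real connector would follow the same shape, with step 2 delegating to the database client and step 3 mapping whatever that client returns.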
49+
50+
## Error Handling
51+
52+
If the underlying database library raises a specific exception (e.g., for a collection not being found), you should consider catching it and re-raising it as one of VectorCode's custom database exceptions from `vectorcode.database.errors`. This ensures consistent error handling in the CLI and other clients.
53+
54+
For example:
55+
```python
56+
from vectorcode.database.errors import CollectionNotFoundError
57+
58+
try:
59+
    some_action_here()
60+
except SomeCustomException as e:
61+
    raise CollectionNotFoundError("The collection was not found.") from e
62+
```
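A self-contained variant of the same pattern may make the exception chaining easier to see. Both exception classes are defined locally as stand-ins: `CollectionNotFoundError` mimics the one in `vectorcode.database.errors`, while `NativeMissingCollection` and `native_get_collection` are hypothetical names for whatever the database client library provides.

```python
class CollectionNotFoundError(Exception):
    """Stand-in for the class in `vectorcode.database.errors`."""


class NativeMissingCollection(Exception):
    """Hypothetical exception raised by some database client library."""


def native_get_collection(name: str) -> dict:
    # Hypothetical client call that always fails, to exercise the error path.
    raise NativeMissingCollection(name)


def get_collection(name: str) -> dict:
    try:
        return native_get_collection(name)
    except NativeMissingCollection as e:
        # `from e` keeps the native exception available as `__cause__`.
        raise CollectionNotFoundError(f"Collection {name!r} was not found.") from e


try:
    get_collection("demo")
except CollectionNotFoundError as err:
    caught = err
```

Callers only need to handle the VectorCode exception type, and the original error remains reachable through `caught.__cause__` for debugging.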
63+
64+
# Testing
65+
66+
The unit tests for database backends should go under [`tests/database/`](../../../tests/database/).
67+
The tests should mock the request body and return values of the database. Integration
68+
tests that interact with an actual database are out of scope for now.
69+
70+
> The tests for the subcommands currently use mocked database connectors. They're not
71+
> supposed to interact with live databases.
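
As a rough illustration of the mocking approach described in this section, the sketch below uses `unittest.mock.AsyncMock` from the standard library. `ToyConnector`, `list_paths`, and the client's `fetch_all` method are hypothetical names invented for the example; a real test would target the connector's actual methods and the real client API.

```python
import asyncio
from unittest.mock import AsyncMock


class ToyConnector:
    """Hypothetical connector wrapping an async database client."""

    def __init__(self, client):
        self._client = client

    async def list_paths(self) -> list[str]:
        # Convert the native response (a list of dicts) to plain paths.
        rows = await self._client.fetch_all("files")
        return [row["path"] for row in rows]


def test_list_paths_converts_native_rows():
    # Mock the database's return value instead of talking to a live database.
    client = AsyncMock()
    client.fetch_all.return_value = [{"path": "a.py"}, {"path": "b.py"}]
    connector = ToyConnector(client)

    assert asyncio.run(connector.list_paths()) == ["a.py", "b.py"]
    # Verify the request that was sent to the (mocked) database.
    client.fetch_all.assert_awaited_once_with("files")


test_list_paths_converts_native_rows()
```

The test checks both directions of the conversion: what the connector sends to the database and how it translates what comes back.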

src/vectorcode/database/base.py

Lines changed: 59 additions & 9 deletions
@@ -44,6 +44,10 @@
4444
class DatabaseConnectorBase(ABC): # pragma: nocover
4545
@classmethod
4646
def create(cls, configs: Config):
47+
"""
48+
Create a new instance of the database connector.
49+
This classmethod will add the docstring of the child class to the exception if the initialisation fails.
50+
"""
4751
try:
4852
return cls(configs)
4953
except Exception as e: # pragma: nocover
@@ -54,14 +58,17 @@ def create(cls, configs: Config):
5458

5559
def __init__(self, configs: Config):
5660
"""
57-
Use the `create` classmethod so that you get docs
58-
when something's wrong during the database initialisation.
61+
Initialises the database connector with the given configs.
62+
It is recommended to use the `create` classmethod instead of calling this directly,
63+
as it provides better error handling during initialisation.
5964
"""
6065
self._configs = configs
6166

6267
async def count(self, what: ResultType = ResultType.chunk) -> int:
6368
"""
6469
Returns the chunk count or file count of the given collection, depending on the value passed for `what`.
70+
This method is implemented in the base class and relies on `list_collection_content`.
71+
Child classes should not need to override this method if `list_collection_content` is implemented correctly.
6572
"""
6673
collection_content = await self.list_collection_content(what=what)
6774
match what:
@@ -74,6 +81,11 @@ async def count(self, what: ResultType = ResultType.chunk) -> int:
7481
async def query(
7582
self,
7683
) -> list[QueryResult]:
84+
"""
85+
Query the database for similar chunks.
86+
The query keywords are stored in `self._configs.query`.
87+
The implementation of this method should handle the conversion from the native database query result to a list of `vectorcode.database.types.QueryResult` objects.
88+
"""
7789
pass
7890

7991
@abstractmethod
@@ -85,13 +97,21 @@ async def vectorise(
8597
"""
8698
Vectorise the given file and add it to the database.
8799
The duplicate checking (using file hash) should be done outside of this function.
100+
101+
For developers:
102+
The implementation should chunk the file, generate embeddings for the chunks, and store them in the database.
103+
It should return a `VectoriseStats` object to report the outcome.
88104
"""
89105
pass
90106

91107
@abstractmethod
92108
async def list_collections(self) -> Sequence[CollectionInfo]:
93109
"""
94-
List collections in the database.
110+
List all collections available in the database.
111+
112+
For developers:
113+
The implementation should retrieve all collections and return them as a sequence of `CollectionInfo` objects.
114+
This includes metadata about each collection like its ID, path, and size.
95115
"""
96116
pass
97117

@@ -119,6 +139,10 @@ async def delete(
119139
"""
120140
Delete files from the database (doesn't remove files on disk).
121141
Returns the actual number of files deleted.
142+
143+
For developers:
144+
The file paths to be deleted are stored in `self._configs.rm_paths`.
145+
The implementation should remove all chunks associated with these files from the database.
122146
"""
123147
pass
124148

@@ -127,20 +151,29 @@ async def drop(
127151
self, *, collection_id: str | None = None, collection_path: str | None = None
128152
):
129153
"""
130-
Delete a collection (`self._configs.project_root`) from the database.
154+
Delete a collection from the database.
155+
The collection to be dropped is specified by `collection_id` or `collection_path`.
156+
If not provided, it defaults to `self._configs.project_root`.
131157
"""
132158
pass
133159

134160
def _check_new_config(self, new_config: Config) -> bool:
135161
"""
136-
Cleanup the `new_config` so that the database config matches the existing one.
162+
Ensures that the new config does not attempt to change database-specific settings.
163+
It copies the `db_type` and `db_params` from the existing config to the new one.
164+
This is a helper method for `update_config` and `replace_config`.
137165
"""
138166
assert isinstance(new_config, Config), "`new_config` is not a `Config` object."
139167
new_config.db_type = self._configs.db_type
140168
new_config.db_params = self._configs.db_params
141169
return True
142170

143171
async def update_config(self, new_config: Config) -> Self:
172+
"""
173+
Merge the new config with the existing one.
174+
This method will not change the database configs.
175+
Child classes should not need to override this method.
176+
"""
144177
assert self._check_new_config(new_config), (
145178
"The new config has different database configs."
146179
)
@@ -150,6 +183,11 @@ async def update_config(self, new_config: Config) -> Self:
150183
return self
151184

152185
async def replace_config(self, new_config: Config) -> Self:
186+
"""
187+
Replace the existing config with the new one.
188+
This method will not change the database configs.
189+
Child classes should not need to override this method.
190+
"""
153191
assert self._check_new_config(new_config), (
154192
"The new config has different database configs."
155193
)
@@ -158,7 +196,10 @@ async def replace_config(self, new_config: Config) -> Self:
158196

159197
async def check_orphanes(self) -> int:
160198
"""
161-
Check for files that are in the database, but no longer on the disk, and remove them.
199+
Check for files that are in the database but no longer on disk, and remove them.
200+
Returns the number of orphaned files removed.
201+
This method relies on `list_collection_content` and `delete`.
202+
Child classes should not need to override this.
162203
"""
163204

164205
orphanes: list[str] = []
@@ -178,7 +219,10 @@ async def check_orphanes(self) -> int:
178219

179220
def get_embedding(self, texts: str | list[str]) -> list[NDArray]:
180221
"""
181-
Generate embeddings and truncate them to `self._configs.embedding_dims` if needed.
222+
Generate embeddings for the given texts.
223+
It uses the embedding function specified in `self._configs.embedding_function`.
224+
If `self._configs.embedding_dims` is set, it truncates the embeddings.
225+
Child classes should use this method to get embeddings.
182226
"""
183227
if isinstance(texts, str):
184228
texts = [texts]
@@ -194,14 +238,20 @@ def get_embedding(self, texts: str | list[str]) -> list[NDArray]:
194238
@abstractmethod
195239
async def get_chunks(self, file_path) -> list[Chunk]:
196240
"""
197-
Return chunks for the provided file, if any.
198-
If not found, return an empty list.
241+
Retrieve all chunks for a given file from the database.
242+
If the file is not found in the database, it should return an empty list.
243+
244+
For developers:
245+
This is useful for operations that need to inspect the chunked content of a file, for example, for debugging or analysis.
199246
"""
200247
pass
201248

202249
async def cleanup(self) -> list[str]:
203250
"""
204251
Remove empty collections from the database.
252+
Returns a list of paths of the removed collections.
253+
This method relies on `list_collections` and `drop`.
254+
Child classes should not need to override this.
205255
"""
206256
removed: list[str] = []
207257
for collection in await self.list_collections():
