Skip to content

Commit 799727e

Browse files
authored
feat: add storage control to DocumentDeleter and refactor service exports
This pull request introduces several improvements and refactors to the `admin-api-lib` and `extractor-api-lib` libraries, focusing on better modularization through re-exports, enhanced document deletion logic, and improved development and testing workflows. The most significant changes are grouped below. **1. Codebase Modularization and Re-exports** - The `FileService`, `S3Service`, and `S3Settings` classes in `admin-api-lib` and `extractor-api-lib` now re-export their implementations from `rag-core-lib`, removing local duplicate definitions and ensuring consistency across libraries. [[1]](diffhunk://#diff-fb36b82cd5dc6c295361b85e59ed77dba4bad6a8e53d9d9e79392e48bb38861aL1-R5) [[2]](diffhunk://#diff-cc8cac6dab28e557fdb760b7827590c39bb1b739f1f480913235c45c01f14cc3L1-R5) [[3]](diffhunk://#diff-32903fe42be8331be09b17e8c3f0184d74e84e4c7b71ed37b9e707ac8b47a54aL1-R5) **2. Document Deletion API Enhancements** - The `DocumentDeleter` interface and its default implementation now accept a new `remove_from_storage` argument, allowing callers to control whether the file is deleted from storage in addition to the key-value store. The logic for file deletion was refactored to handle storage keys more robustly and to provide clearer error handling and logging. [[1]](diffhunk://#diff-d841957e5971661a4b5d8cd4a648283115db1a26e067a2dd7349bafd45dc4753L10-R15) [[2]](diffhunk://#diff-d841957e5971661a4b5d8cd4a648283115db1a26e067a2dd7349bafd45dc4753R25-R26) [[3]](diffhunk://#diff-0ff2bf24abd3de9d5ff96aed7ba588f84b6eeba77d74cc692570f2aaea305b6cL44-R58) [[4]](diffhunk://#diff-0ff2bf24abd3de9d5ff96aed7ba588f84b6eeba77d74cc692570f2aaea305b6cR74-R75) [[5]](diffhunk://#diff-0ff2bf24abd3de9d5ff96aed7ba588f84b6eeba77d74cc692570f2aaea305b6cL70-R91) [[6]](diffhunk://#diff-0ff2bf24abd3de9d5ff96aed7ba588f84b6eeba77d74cc692570f2aaea305b6cR101-R111) **3. Upstream Usage and Test Updates** - All usages of `adelete_document` in file and source uploaders, as well as related tests, have been updated to use the new `remove_from_storage` parameter, ensuring correct behavior and test coverage for the enhanced deletion API. [[1]](diffhunk://#diff-2f472a7af7eccd952a8fc1daa5e849c9cd0649fae490380ff2c421bb57e16feeL191-R195) [[2]](diffhunk://#diff-3c2003f1db64f5b9610d691e451bb2ea5f65a92f1a81b5b210588931a4ecb455L200-R204) [[3]](diffhunk://#diff-0228f4f920d182dd8dae216a018d88c345d3aa732337348ad1c0030060450f55L56-R60) [[4]](diffhunk://#diff-83cebf407211b80a50570753869679e1eb801b09712b51f121a440548296aa30L76-R80) **4. Docker and Build System Improvements** - The Docker build configuration for the extractor service and the `libs/Dockerfile` have been updated to include and install `rag-core-lib` as a dependency where needed, improving development workflows and ensuring all necessary dependencies are available for testing and running the services. [[1]](diffhunk://#diff-c2ee8653e1d6b85f0aadf87cd438a9250806c052877248442be4d434cbc52425L397-R405) [[2]](diffhunk://#diff-4886ca734bb30da5c5ab3bca8bac9607d0a7c7a80c399eea76b24ae67c8509b1L26-R33) [[3]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323R96) These changes collectively improve code maintainability, modularity, and the flexibility of document management operations across the codebase. ** issue: #224
1 parent 3b7693c commit 799727e

File tree

32 files changed

+2001
-714
lines changed

32 files changed

+2001
-714
lines changed

Tiltfile

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -394,14 +394,15 @@ extractor_docker_build_config = {
394394
"ref": document_extractor_full_image_name,
395395
"context": ".",
396396
"dockerfile": extractor_dockerfile,
397-
"ignore": IGNORE_BASE + libs_ignore_except(["extractor-api-lib"]),
397+
"ignore": IGNORE_BASE + libs_ignore_except(["extractor-api-lib", "rag-core-lib"]),
398398
}
399399

400400
# Add build args and live_update based on dev mode
401401
if dev_mode:
402402
extractor_docker_build_config["live_update"] = [
403403
sync(extractor_context, "/app/services/document-extractor"),
404404
sync(core_library_context +"/extractor-api-lib", "/app/libs/extractor-api-lib"),
405+
sync(core_library_context +"/rag-core-lib", "/app/libs/rag-core-lib"),
405406
]
406407
else:
407408
# Use prod-local for Tilt with production Dockerfile

libs/Dockerfile

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,11 +23,14 @@ RUN poetry config virtualenvs.create false
2323
RUN if [ "$TEST" = "1" ]; then \
2424
rm ../poetry.lock; rm ../pyproject.toml; \
2525
# Install rag-core-lib dependencies if testing libraries that depend on it \
26-
if [ "${DIRECTORY}" = "admin-api-lib" ] || [ "${DIRECTORY}" = "rag-core-api" ]; then \
26+
if [ "${DIRECTORY}" = "admin-api-lib" ] || [ "${DIRECTORY}" = "rag-core-api" ] || [ "${DIRECTORY}" = "extractor-api-lib" ]; then \
2727
cd ../rag-core-lib && poetry install --with dev,test; \
2828
cd ../${DIRECTORY}; \
2929
fi; \
3030
poetry install --with dev,test; \
31+
if [ "${DIRECTORY}" = "admin-api-lib" ] || [ "${DIRECTORY}" = "rag-core-api" ] || [ "${DIRECTORY}" = "extractor-api-lib" ]; then \
32+
pip install -e ../rag-core-lib; \
33+
fi; \
3134
else \
3235
poetry install --with dev,lint; \
3336
fi

libs/admin-api-lib/poetry.lock

Lines changed: 3 additions & 3 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

libs/admin-api-lib/src/admin_api_lib/api_endpoints/document_deleter.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,12 @@ class DocumentDeleter(ABC):
77
"""Abstract base class for document deletion endpoint."""
88

99
@abstractmethod
10-
async def adelete_document(self, identification: str, remove_from_key_value_store: bool = True) -> None:
10+
async def adelete_document(
11+
self,
12+
identification: str,
13+
remove_from_key_value_store: bool = True,
14+
remove_from_storage: bool = True,
15+
) -> None:
1116
"""
1217
Delete a document by its identification asynchronously.
1318
@@ -17,6 +22,8 @@ async def adelete_document(self, identification: str, remove_from_key_value_stor
1722
The unique identifier of the document to be deleted.
1823
remove_from_key_value_store : bool, optional
1924
If True, the document will also be removed from the key-value store (default is True).
25+
remove_from_storage : bool, optional
26+
If True, the document will also be removed from the file storage (default is True).
2027
2128
Returns
2229
-------
Lines changed: 3 additions & 75 deletions
Original file line numberDiff line numberDiff line change
@@ -1,77 +1,5 @@
1-
"""Abstract class for dealing with I/O."""
1+
"""Re-export core file service interface."""
22

3-
import abc
4-
from abc import ABC
5-
from pathlib import Path
6-
from typing import BinaryIO
3+
from rag_core_lib.file_services.file_service import FileService
74

8-
9-
class FileService(ABC):
10-
"""Abstract class for dealing with I/O."""
11-
12-
@abc.abstractmethod
13-
def download_folder(self, source: str, target: Path) -> None:
14-
"""Download the remote folder on "source" to the local "target" directory.
15-
16-
Parameters
17-
----------
18-
source: str
19-
Path to the remote folder.
20-
target: Path
21-
Download destination path.
22-
"""
23-
24-
@abc.abstractmethod
25-
def download_file(self, source: str, target_file: BinaryIO) -> None:
26-
"""Read a single remote file "source" into the local "target_file" file-like object.
27-
28-
Example usage
29-
=============
30-
```
31-
s3_settings: S3Settings = get_s3_settings()
32-
s3_service = S3Service(endpoint="endpoint", username="username", password="password", bucket_name="bucket")
33-
34-
with tempfile.SpooledTemporaryFile(max_size=self._iot_forecast_settings.max_model_size) as temp_file:
35-
s3_service.download_file("remote_file", temp_file)
36-
# do stuff with temp_file
37-
```
38-
39-
Parameters
40-
----------
41-
source: str
42-
Path to the remote folder.
43-
target_file: BinaryIO
44-
File-like object to save the data to.
45-
"""
46-
47-
@abc.abstractmethod
48-
def upload_file(self, file_path: str, file_name: str) -> None:
49-
"""Upload a local file to the Fileservice.
50-
51-
Parameters
52-
----------
53-
file_path : str
54-
The path to the local file to be uploaded.
55-
file_name : str
56-
The target path in the file storage where the file will be stored.
57-
"""
58-
59-
@abc.abstractmethod
60-
def get_all_sorted_file_names(self) -> list[str]:
61-
"""Retrieve all file names stored in the file storage.
62-
63-
Returns
64-
-------
65-
list[str]
66-
A list of file names stored in the file storage.
67-
"""
68-
69-
@abc.abstractmethod
70-
def delete_file(self, file_name: str) -> None:
71-
"""Delete a file from the file storage.
72-
73-
Parameters
74-
----------
75-
file_name : str
76-
The name of the file to be deleted from the file storage.
77-
"""
5+
__all__ = ["FileService"]

libs/admin-api-lib/src/admin_api_lib/impl/api_endpoints/default_document_deleter.py

Lines changed: 34 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,21 @@ def __init__(self, file_service: FileService, rag_api: RagApi, key_value_store:
4141
self._rag_api = rag_api
4242
self._key_value_store = key_value_store
4343

44-
async def adelete_document(self, identification: str, remove_from_key_value_store: bool = True) -> None:
44+
@staticmethod
45+
def _storage_key_from_identification(identification: str) -> str | None:
46+
if identification.startswith("file:"):
47+
storage_key = identification[len("file:") :]
48+
return storage_key or None
49+
if ":" in identification:
50+
return None
51+
return identification or None
52+
53+
async def adelete_document(
54+
self,
55+
identification: str,
56+
remove_from_key_value_store: bool = True,
57+
remove_from_storage: bool = True,
58+
) -> None:
4559
"""
4660
Asynchronously delete a document identified by the given identification string.
4761
@@ -57,6 +71,8 @@ async def adelete_document(self, identification: str, remove_from_key_value_stor
5771
The unique identifier of the document to be deleted.
5872
remove_from_key_value_store : bool, optional
5973
If True, the document will also be removed from the key-value store (default is True).
74+
remove_from_storage : bool, optional
75+
If True, the document will also be removed from the file storage (default is True).
6076
6177
Raises
6278
------
@@ -67,12 +83,12 @@ async def adelete_document(self, identification: str, remove_from_key_value_stor
6783
error_messages = ""
6884
# Delete the document from file service and vector database
6985
logger.debug("Deleting existing document: %s", identification)
70-
try:
71-
if remove_from_key_value_store:
72-
self._key_value_store.remove(identification)
73-
self._file_service.delete_file(identification)
74-
except Exception as e:
75-
error_messages += f"Error while deleting {identification} from file storage\n {str(e)}\n"
86+
if remove_from_key_value_store:
87+
self._key_value_store.remove(identification)
88+
89+
if remove_from_storage:
90+
error_messages = self._delete_from_storage(identification, error_messages)
91+
7692
try:
7793
self._rag_api.remove_information_piece(
7894
DeleteRequest(metadata=[KeyValuePair(key="document", value=json.dumps(identification))])
@@ -82,3 +98,14 @@ async def adelete_document(self, identification: str, remove_from_key_value_stor
8298
error_messages += f"Error while deleting {identification} from vector db\n{str(e)}"
8399
if error_messages:
84100
raise HTTPException(404, error_messages)
101+
102+
def _delete_from_storage(self, identification: str, error_messages: str) -> str:
103+
try:
104+
storage_key = self._storage_key_from_identification(identification)
105+
if storage_key:
106+
self._file_service.delete_file(storage_key)
107+
else:
108+
logger.debug("Skipping file storage deletion for non-file source: %s", identification)
109+
except Exception as e:
110+
error_messages += f"Error while deleting {identification} from file storage\n {str(e)}\n"
111+
return error_messages

libs/admin-api-lib/src/admin_api_lib/impl/api_endpoints/default_file_uploader.py

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -188,7 +188,11 @@ async def _handle_source_upload(
188188
# Replace old document
189189
# deletion is allowed to fail
190190
with suppress(Exception):
191-
await self._document_deleter.adelete_document(source_name, remove_from_key_value_store=False)
191+
await self._document_deleter.adelete_document(
192+
source_name,
193+
remove_from_key_value_store=False,
194+
remove_from_storage=False,
195+
)
192196

193197
# Run blocking RAG API call in thread pool to avoid blocking event loop
194198
await asyncio.to_thread(self._rag_api.upload_information_piece, rag_information_pieces)

libs/admin-api-lib/src/admin_api_lib/impl/api_endpoints/default_source_uploader.py

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -197,7 +197,11 @@ async def _handle_source_upload(
197197
rag_information_pieces.append(self._information_mapper.document2rag_information_piece(doc))
198198

199199
with suppress(Exception):
200-
await self._document_deleter.adelete_document(source_name, remove_from_key_value_store=False)
200+
await self._document_deleter.adelete_document(
201+
source_name,
202+
remove_from_key_value_store=False,
203+
remove_from_storage=False,
204+
)
201205

202206
# Run blocking RAG API call in thread pool to avoid blocking event loop
203207
await asyncio.to_thread(self._rag_api.upload_information_piece, rag_information_pieces)
Lines changed: 3 additions & 128 deletions
Original file line numberDiff line numberDiff line change
@@ -1,130 +1,5 @@
1-
"""Class to handle I/O with S3 storage."""
1+
"""Re-export core S3 service implementation."""
22

3-
import logging
4-
from pathlib import Path
5-
from typing import BinaryIO
3+
from rag_core_lib.impl.file_services.s3_service import S3Service
64

7-
import boto3
8-
9-
from admin_api_lib.file_services.file_service import FileService
10-
from admin_api_lib.impl.settings.s3_settings import S3Settings
11-
12-
logger = logging.getLogger(__name__)
13-
14-
15-
class S3Service(FileService):
16-
"""Class to handle I/O with S3 storage."""
17-
18-
def __init__(self, s3_settings: S3Settings):
19-
"""Class to handle I/O with S3 storage.
20-
21-
Parameters
22-
----------
23-
s3_settings: S3Settings
24-
Settings for the s3. Must contain at least the endpoint, access_key_id, secret_access_key and bucket.
25-
"""
26-
self._s3_settings = s3_settings
27-
self._s3_client = boto3.client(
28-
"s3",
29-
endpoint_url=s3_settings.endpoint,
30-
aws_access_key_id=s3_settings.access_key_id,
31-
aws_secret_access_key=s3_settings.secret_access_key,
32-
aws_session_token=None,
33-
config=boto3.session.Config(signature_version="s3v4"),
34-
verify=False,
35-
)
36-
37-
def download_folder(self, source: str, target: Path) -> None:
38-
"""Download the remote folder on "source" to the local "target" directory.
39-
40-
Parameters
41-
----------
42-
source: str
43-
Path to the remote folder.
44-
target: Path
45-
Download destination path.
46-
"""
47-
target.mkdir(parents=True, exist_ok=True)
48-
49-
search_response = self._s3_client.list_objects_v2(
50-
Bucket=self._s3_settings.bucket,
51-
Prefix=source,
52-
)
53-
for found_content in search_response.get("Contents", []):
54-
file_source = found_content["Key"]
55-
target_path = target / file_source[len(source) :]
56-
target_path.parent.mkdir(parents=True, exist_ok=True)
57-
with open(target_path, "wb") as local_file:
58-
self.download_file(file_source, local_file)
59-
60-
def download_file(self, source: str, target_file: BinaryIO) -> None:
61-
"""Read a single remote file "source" into the local "target_file" file-like object.
62-
63-
Example usage
64-
=============
65-
```
66-
s3_settings: S3Settings = get_s3_settings()
67-
s3_service = S3Service(endpoint="endpoint", username="username", password="password", bucket_name="bucket")
68-
69-
with tempfile.SpooledTemporaryFile(max_size=self._iot_forecast_settings.max_model_size) as temp_file:
70-
s3_service.download_file("remote_file", temp_file)
71-
# do stuff with temp_file
72-
```
73-
74-
Parameters
75-
----------
76-
source: str
77-
Path to the remote folder.
78-
target_file: BinaryIO
79-
File-like object to save the data to.
80-
"""
81-
self._s3_client.download_fileobj(self._s3_settings.bucket, source, target_file)
82-
83-
def upload_file(self, file_path: str, file_name: str) -> None:
84-
"""
85-
Upload a local file to the S3 bucket.
86-
87-
Parameters
88-
----------
89-
source : Path
90-
The path to the local file to upload.
91-
target : str
92-
The target path in the S3 bucket where the file will be stored.
93-
"""
94-
self._s3_client.upload_file(
95-
Filename=file_path,
96-
Bucket=self._s3_settings.bucket,
97-
Key=file_name,
98-
)
99-
100-
def get_all_sorted_file_names(self) -> list[str]:
101-
"""Retrieve all file names stored in the S3 bucket.
102-
103-
Returns
104-
-------
105-
list[str]
106-
A list of file names stored in the S3 bucket.
107-
"""
108-
file_names = []
109-
110-
resp = self._s3_client.list_objects_v2(Bucket=self._s3_settings.bucket)
111-
if resp.get("Contents"):
112-
for obj in resp["Contents"]:
113-
file_names.append(obj["Key"])
114-
return file_names
115-
116-
def delete_file(self, file_name: str) -> None:
117-
"""Delete a file from the S3 bucket.
118-
119-
Parameters
120-
----------
121-
file_name : str
122-
The name of the file to be deleted from the S3 bucket.
123-
"""
124-
try:
125-
file_name = f"/{file_name}" if not file_name.startswith("/") else file_name
126-
self._s3_client.delete_object(Bucket=self._s3_settings.bucket, Key=file_name)
127-
logger.info("File %s successfully deleted.", file_name)
128-
except Exception:
129-
logger.exception("Error deleting file %s", file_name)
130-
raise
5+
__all__ = ["S3Service"]

0 commit comments

Comments
 (0)