| 1 | +--- |
| 2 | +# frontmatter |
| 3 | +path: "/tutorial-couchbase-autovectorization-workflows-with-unstructured-data-and-langchain" |
| 4 | +title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services |
| 5 | +short_title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets |
| 6 | +description: |
| 7 | + - Learn how to use Couchbase Capella's AI Services Auto-Vectorization feature to automatically process unstructured data from S3 buckets. |
| 8 | + - Configure workflows to chunk and vectorize documents (PDFs, images, etc.) and import them into Capella collections. |
| 9 | + - Perform semantic vector search using LangChain and the generated embeddings. |
| 10 | +content_type: tutorial |
| 11 | +filter: sdk |
| 12 | +technology: |
| 13 | + - vector search |
| 14 | +tags: |
| 15 | + - Hyperscale Vector Index |
| 16 | + - Artificial Intelligence |
| 17 | + - LangChain |
| 18 | +sdk_language: |
| 19 | + - python |
| 20 | +length: 20 Mins |
| 21 | +--- |
| 22 | + |
| 23 | + |
| 24 | +<!--- *** WARNING ***: Autogenerated markdown file from jupyter notebook. ***DO NOT EDIT THIS FILE***. Changes should be made to the original notebook file. See commit message for source repo. --> |
| 25 | + |
| 26 | + |
| 27 | +[View Source](https://github.com/couchbase-examples/vector-search-cookbook/tree/main/autovec_unstructured/autovec_unstructured.ipynb) |
| 28 | + |
| 29 | +# Create and Deploy Your Operational cluster on Capella |
| 30 | +To get started with Couchbase Capella, create an account and use it to deploy a cluster. |
| 31 | + |
| 32 | +Make sure that you deploy a `Multi-node` cluster with the `data`, `index`, `query` and `eventing` services enabled. To learn more, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html). |
| 33 | + |
| 34 | +## Couchbase Capella Configuration |
| 35 | +When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met. |
| 36 | +- Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket you will be using for this tutorial (e.g., `Unstructured_data_bucket`) with Read and Write permissions. |
| 37 | +- [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running. |
| 38 | + |
| 39 | +# Deploying the Model |
| 40 | +Before creating embeddings for the documents, we need to deploy a model that will generate them. Make sure the model is deployed in the same region as the database; otherwise the workflows will not work. To learn more about model services, see the [model deployment documentation](https://docs.couchbase.com/ai/build/model-service/deploy-embed-model.html). |
| 41 | +## Selecting the Model |
| 42 | +1. To select the model, you first need to navigate to the "<B>AI Services</B>" tab, then select "<B>Models</B>" and click on "<B>Deploy New Model</B>". |
| 43 | + |
| 44 | + <img src="./img/importing_model.png" width="950px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 45 | + |
| 46 | +2. Enter the <B>model name</B>, and choose the model that you want to deploy. After selecting your model, choose the <B>model infrastructure</B> and <B>region</B> where the model will be deployed. |
| 47 | + |
| 48 | + <img src="./img/deploying_model.png" width="800px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 49 | + |
| 50 | +## Access Control to the Model |
| 51 | + |
| 52 | +1. After deploying the model, go to the "<B>Models</B>" tab in the <B>AI Services</B> and click on "<B>Setup Access</B>". |
| 53 | + |
| 54 | + <img src="./img/model_setup_access.png" width="1100px" height="400px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 55 | + |
| 56 | +2. Enter your <B>API key name</B>, <B>expiration time</B> and the <B>IP address</B> from which you will be accessing the model. |
| 57 | + |
| 58 | + <img src="./img/model_api_key_form.png" width="1100px" height="600px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 59 | + |
| 60 | +3. Download your API key |
| 61 | + |
| 62 | + <img src="./img/download_api_key_details.png" width="1200px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 63 | + |
| 64 | +# Data upload from S3 bucket to Couchbase (with chunking and vectorization) |
| 65 | + |
| 66 | +In order to import unstructured data from the S3 bucket, you need to create a workflow that connects to your S3 bucket and chunks your unstructured data before importing it into the collections. To do so, please follow the steps mentioned below: |
| 67 | +1) Let's start by creating a new workflow. This can be done by clicking on the <B>`AI Services`</B> tab, then click on <B>`Workflows`</B>, and then click on <B>`Create New Workflow`</B>. |
| 68 | + |
| 69 | + <img src="./img/workflow.png" width="1000px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 70 | + |
| 71 | +2) Start your workflow deployment by selecting how your data will be provided to the Auto-Vectorization service. There are currently three options: <B>`pre-processed data (JSON format) from Capella`</B>, <B>`pre-processed data (JSON format) from external sources (S3 buckets)`</B> and <B>`unstructured data from external sources (S3 buckets)`</B>. For this tutorial, we will choose the third option, unstructured data from external sources (S3 buckets). After selecting the option, enter the workflow name and click on <B>`Start Workflow`</B>. |
| 72 | + |
| 73 | + <img src="./img/start_workflow.png" width="1000px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 74 | + |
| 75 | +3) To proceed, Capella needs to connect to the S3 bucket that will be the source of the data. To set this up, click on <B>`+ Add New S3 Bucket`</B>. |
| 76 | + |
| 77 | + <img src="./img/addS3bucket.png" width="1000px" height="300px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 78 | + |
| 79 | +4) Upon clicking <B>`+ Add New S3 Bucket`</B> a new sidebar will appear that asks for the credentials of your S3 bucket. |
| 80 | + |
| 81 | + <img src="./img/S3credentials.png" width="1000px" height="800px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 82 | + |
| 83 | + - Enter <B>`Integration Name`</B>, which will be later used to select your S3 Bucket. |
| 84 | + - Select the AWS Region where the bucket is deployed. |
| 85 | + - Enter the name of the S3 bucket deployed in AWS. |
| 86 | + - Enter the path where your unstructured data is stored. |
| 87 | + - Enter your S3 bucket credentials. |
| 88 | + - Click on ADD Credentials. |
| 89 | +5) If the steps mentioned above are followed correctly then you should see a success pop-up as shown below and then the S3 bucket can be selected from the drop-down menu. |
| 90 | + |
| 91 | + <img src="./img/S3bucketsuccess.png" width="800px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 92 | + |
| 93 | +6) On selection of the S3 bucket, various options will be displayed as described below. |
| 94 | + |
| 95 | + <img src="./img/configure_data_source.png" width="900px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 96 | +- <B>`Index Configuration`</B> allows the workflow to **automatically create a Hyperscale Vector index** on the generated embeddings. This index is essential for performing efficient vector similarity searches. |
| 97 | + - If you enable this option (recommended), the workflow will create a properly configured vector index on the embedding field. |
| 98 | + - If you skip this step, you'll need to manually create a vector index later to perform optimised vector searches. See the [Vector Search Index Creation Guide](https://docs.couchbase.com/server/current/vector-index/vectors-and-indexes-overview.html) for manual setup instructions. |
| 99 | +- `Destination Cluster` lets you choose the cluster, bucket, scope and collection into which the data will be imported. |
| 100 | +- The `Estimated Cost` dialogue box (in blue, on the right) shows the cost of the operation per document. |
| 101 | +- Click on `Next`. |
| 102 | + |
| 103 | +7) <B>`Configure Data Preprocessing`</B> allows you to perform various operations on the data being imported from the S3 bucket. The options are described below. |
| 104 | + |
| 105 | + <img src="./img/data_processing.png" width="600px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 106 | +- <B>`Page Range selection`</B> allows you to select a custom page range when working with PDFs. (Optional) |
| 107 | +- <B>`Layout Exclusions`</B> allows you to skip various unnecessary objects in your unstructured data. (Optional) |
| 108 | +- <B>`Optical Character Recognition (OCR)`</B> allows you to detect text in images and PDFs. (Optional) |
| 109 | +- <B>`Chunking Strategy`</B> is an important step for importing data and creating embeddings (vectors) in Capella. Its settings are described below. |
| 110 | + - The `Strategy` dropdown menu selects the strategy used to chunk the data in the S3 bucket. The best choice depends on the kind of data stored there. |
| 111 | + - `Max Token in Chunk` sets the maximum number of tokens in a chunk. |
| 112 | + - `Chunk Overlap` sets the number of tokens shared between consecutive chunks, which helps preserve context across chunk boundaries. |
| 113 | +- Click `Next` once the options above have been set according to your requirements. |
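
To make the effect of `Max Token in Chunk` and `Chunk Overlap` concrete, here is a minimal sketch of overlapping, fixed-size chunking. It is not the algorithm Capella uses internally: the function name `chunk_tokens` is hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
def chunk_tokens(tokens, max_tokens, overlap):
    """Split a token list into windows of at most `max_tokens` tokens.

    Each window shares `overlap` tokens with the previous one.
    Illustrative only; this is not Capella's internal chunker.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
text = "one two three four five six seven eight nine ten"
chunks = chunk_tokens(text.split(), max_tokens=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With an overlap of 1, the last token of each chunk reappears as the first token of the next, so a sentence that straddles a boundary is still partly visible in both chunks.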
| 114 | + |
| 115 | +8) Select the model which will be used to create the embeddings. There are two options to create the embeddings, `Capella-based` and `external model`. |
| 116 | + |
| 117 | + <img src="./img/Select_embedding_model.png" width="600px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 118 | + |
| 119 | + - This tutorial uses a Capella-based embedding model, as shown in the image above. API credentials can be uploaded using the file downloaded in the model deployment section, or entered manually. |
| 120 | + - You can choose between private and insecure networking. |
| 121 | + - Clicking `Next` takes you to the final page of the workflow. |
| 122 | + |
| 123 | +9) <B>`Workflow Summary`</B> will display all the necessary details of the workflow including `Data Source`, `Model Service`, `Unstructured Data Service` and `Billing Overview` as shown in image below. |
| 124 | + |
| 125 | + <img src="./img/workflow_summary.png" width="800px" height="500px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 126 | + |
| 127 | +10) <B>`Workflow Deployed`</B> The deployed workflow now appears in the `workflow` tab, where you can check the status of the workflow run. |
| 128 | + |
| 129 | + <img src="./img/workflow_deployed.png" width="950px" height="350px" style="padding: 5px; border-radius: 10px 20px 30px 40px; border: 2px solid #555;"> |
| 130 | + |
| 131 | + |
| 132 | + After this step, your vector embeddings should be ready, and you can inspect them in the Capella UI. In the next step, we will demonstrate how to use the generated vectors to perform vector search. |
| 133 | + |
| 134 | +# Vector Search Using Couchbase Query Service |
| 135 | + |
| 136 | +The following code cells implement semantic vector search against the embeddings generated by the Auto-Vectorization workflow. These searches are powered by **Couchbase's Query service** and the Hyperscale Vector index. |
| 137 | + |
| 138 | +Before you proceed, make sure the following packages are installed by running: |
| 139 | + |
| 140 | + |
| 141 | +```python |
| 142 | +!pip install langchain-couchbase==1.0.1 langchain-openai |
| 143 | +``` |
| 144 | + |
| 145 | +`langchain-couchbase` - Version: 1.0.1 or later \ |
| 146 | +`langchain-openai` - Version: 0.3.34 |
| 147 | + |
| 148 | +Now, please proceed to execute the cells in order to run the vector similarity search. |
| 149 | + |
| 150 | +# Importing Required Packages |
| 151 | + |
| 152 | + |
| 153 | +```python |
| 154 | +from couchbase.cluster import Cluster |
| 155 | +from couchbase.auth import PasswordAuthenticator |
| 156 | +from couchbase.options import ClusterOptions |
| 157 | + |
| 158 | +from langchain_openai import OpenAIEmbeddings |
| 159 | +from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore |
| 160 | +from langchain_couchbase.vectorstores import DistanceStrategy |
| 161 | + |
| 162 | +from datetime import timedelta |
| 163 | +``` |
| 164 | + |
| 165 | +# Cluster Connection Setup |
| 166 | + - Defines the secure connection string, user credentials, and creates a `Cluster` object. |
| 167 | + |
| 168 | + |
| 169 | +```python |
| 170 | +endpoint = "COUCHBASE_CAPELLA_ENDPOINT" # Replace this with Connection String |
| 171 | +username = "COUCHBASE_CAPELLA_USERNAME" |
| 172 | +password = "COUCHBASE_CAPELLA_PASSWORD" |
| 173 | + |
| 174 | +auth = PasswordAuthenticator(username, password) |
| 175 | +options = ClusterOptions(auth) |
| 176 | +cluster = Cluster(endpoint, options) |
| 177 | +cluster.wait_until_ready(timedelta(seconds=10)) |
| 178 | +print("Connected!") |
| 179 | +``` |
| 180 | + |
| 181 | + Connected! |
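
If you would rather not hardcode credentials in the notebook, one option is to read them from environment variables. The sketch below is an assumption, not part of the original tutorial: the helper `load_connection_settings` is hypothetical, and the variable names simply reuse the placeholder strings from the cell above.

```python
import os

# Assumed environment variable names, reusing the placeholders from the cell above.
REQUIRED = (
    "COUCHBASE_CAPELLA_ENDPOINT",
    "COUCHBASE_CAPELLA_USERNAME",
    "COUCHBASE_CAPELLA_PASSWORD",
)


def load_connection_settings(env=None):
    """Return the connection settings, failing early if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


# Example with an explicit mapping; the values are dummies for illustration.
example = load_connection_settings({
    "COUCHBASE_CAPELLA_ENDPOINT": "couchbases://cb.example.invalid",
    "COUCHBASE_CAPELLA_USERNAME": "demo_user",
    "COUCHBASE_CAPELLA_PASSWORD": "demo_password",
})
print(sorted(example))
```

The returned values can then be passed to `PasswordAuthenticator` and `Cluster` exactly as in the cell above.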
| 182 | + |
| 183 | + |
| 184 | +# Selection of Buckets / Scope / Collection / Index / Embedder |
| 185 | + - Sets the bucket, scope, and collection where the documents (with vector fields) live. |
| 186 | + - `index_name` specifies the name of the **Hyperscale Vector index**. This is the index created automatically in the workflow setup section, or manually as described in the same step. You can find this index name in the index listing of your Capella cluster. |
| 187 | + - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time. |
| 188 | + - `openai_api_key` is the API key created in the model deployment section. |
| 189 | + - `openai_api_base` is the Capella model services endpoint found in the models section. |
| 190 | + - For more details, see [OpenAIEmbeddings](https://docs.langchain.com/oss/python/integrations/text_embedding/openai). |
| 191 | + |
| 192 | +`Note that the Capella AI endpoint also requires /v1 to be appended to it if the UI does not already show it.` |
| 193 | + |
| 194 | + |
| 195 | +```python |
| 196 | +bucket_name = "Unstructured_data_bucket" |
| 197 | +scope_name = "_default" |
| 198 | +collection_name = "_default" |
| 199 | +index_name = "hyperscale_autovec_workflow_text-embedding" # Name of the index created in step 6 of the workflow setup. |
| 200 | + |
| 201 | +# Capella model services are compatible with the OpenAI API, so the OpenAIEmbeddings class in LangChain can be used to create the embeddings. |
| 202 | +embedder = OpenAIEmbeddings( |
| 203 | + model="nvidia/llama-3.2-nv-embedqa-1b-v2", # This is the model that will be used to create the embedding of the query. |
| 204 | + openai_api_key="COUCHBASE_CAPELLA_MODEL_API_KEY", |
| 205 | + openai_api_base="COUCHBASE_CAPELLA_MODEL_ENDPOINT/v1", |
| 206 | + check_embedding_ctx_length=False, |
| 207 | + tiktoken_enabled=False, |
| 208 | +) |
| 209 | +``` |
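
The `/v1` note above can be handled with a small helper, so the endpoint works whether or not the suffix was copied from the UI. This is a sketch only: `ensure_v1` is a hypothetical helper name and the URL is a dummy value.

```python
def ensure_v1(endpoint):
    """Return the endpoint with exactly one trailing '/v1' path segment."""
    trimmed = endpoint.rstrip("/")
    return trimmed if trimmed.endswith("/v1") else trimmed + "/v1"


# Dummy endpoint for illustration; use your own model services endpoint.
print(ensure_v1("https://models.example.invalid"))
print(ensure_v1("https://models.example.invalid/v1/"))
```

The result can be passed as `openai_api_base` when constructing `OpenAIEmbeddings`.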
| 210 | + |
| 211 | +# VectorStore Construction |
| 212 | + - Creates a [CouchbaseQueryVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-query-vector-store) instance that interfaces with **Couchbase's Query service** to perform vector similarity searches using [Hyperscale/Composite](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) indexes. |
| 213 | + - The vector store: |
| 214 | + * Knows where to read documents (`bucket/scope/collection`). |
| 215 | + * Knows the embedding field (the vector produced by the Auto-Vectorization workflow). |
| 216 | + * Uses the provided embedder to embed queries on-demand for similarity search. |
| 217 | + - If your Auto-Vectorization workflow produced a different vector field name, update `embedding_key` accordingly. |
| 218 | + - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields. |
| 219 | + |
| 220 | + |
| 221 | +```python |
| 222 | +vector_store = CouchbaseQueryVectorStore( |
| 223 | + cluster=cluster, |
| 224 | + bucket_name=bucket_name, |
| 225 | + scope_name=scope_name, |
| 226 | + collection_name=collection_name, |
| 227 | + embedding=embedder, |
| 228 | + text_key="text-to-embed", # Your document's text field |
| 229 | + embedding_key="text-embedding", # This is the field in which your vector (embedding) is stored in the cluster. |
| 230 | + distance_metric=DistanceStrategy.COSINE # Distance metric used to compare the query vector with the stored vectors. |
| 231 | +) |
| 232 | +``` |
| 233 | + |
| 234 | +# Performing a Similarity Search |
| 235 | + - Defines a natural language query (e.g., "What are the pre-requisite for java SDK?"). |
| 236 | + - Calls `similarity_search(query, k=3)` to retrieve the top 3 most semantically similar documents using the **Hyperscale Vector index**. |
| 237 | + - The Query service performs efficient vector similarity search using the index created earlier. |
| 238 | + - Prints ranked results, extracting the chosen `text_key` (here `text-to-embed`). |
| 239 | + - Change `query` to any descriptive phrase that relates to your documents (e.g., "How do I connect the SDK to my cluster?"). |
| 240 | + - Adjust `k` for more or fewer results. |
| 241 | + |
| 242 | + |
| 243 | +```python |
| 244 | +query = "What are the pre-requisite for java SDK?" |
| 245 | +results = vector_store.similarity_search(query, k=3) |
| 246 | +for i, doc in enumerate(results, 1): |
| 247 | + print(f"\n--- Result {i} ---") |
| 248 | + print(doc.page_content) |
| 249 | +``` |
| 250 | + |
| 251 | + |
| 252 | + --- Result 1 --- |
| 253 | + Section Title: Prerequisites |
| 254 | + Content: You have installed the Java Software Development Kit (version 8, 11, 17, or 21). The recommended version is the latest Java LTS release. Make sure to install the highest available patch for the LTS version. |
| 255 | + |
| 256 | + --- Result 2 --- |
| 257 | + Section Title: Connect the SDK to Your Cluster |
| 258 | + Content: Important: directory whenever you |
| 259 | + |
| 260 | + --- Result 3 --- |
| 261 | + Section Title: Set Up the Java SDK |
| 262 | + Content: To set up the Java SDK: Create the following directory structure on your computer: In the student directory, create a new file called pom.xml . Paste the following code block into your pom.xml file: Open a terminal window and navigate to your student directory. Run the command mvn install to pull in all the dependencies and finish your SDK setup. Next, connect the Java SDK to your cluster. |
| 263 | + |
| 264 | + |
| 265 | +### How the Ranking Works with the Query Service |
| 266 | +1. Your natural language query (e.g., `query = "What are the pre-requisite for java SDK?"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`). |
| 267 | +2. The query embedding is compared against the vectors stored in the `embedding_key` field (`text-embedding`), using the cosine distance metric configured on the vector store. |
| 268 | +3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning. |
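
The ranking steps above can be sketched in plain Python. This is only an illustration of cosine-similarity ranking: the three-dimensional vectors are made-up toy values (real embeddings have far more dimensions), the helper names are hypothetical, and the actual search is performed by Couchbase using the vector index rather than by a brute-force loop like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank(query_vec, docs, k):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (assumed values, for illustration only).
docs = {
    "prerequisites": [0.9, 0.1, 0.0],
    "setup": [0.6, 0.4, 0.1],
    "billing": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
top = rank(query_vec, docs, k=2)
print(top)
```

Cosine distance is one minus this similarity, so sorting by ascending distance gives the same order as sorting by descending similarity.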
| 269 | + |
| 270 | + |
| 271 | +> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query, even when exact keywords do not match. Feel free to experiment with increasingly descriptive queries to observe the semantic power of the embeddings, served by Couchbase's Query service and the Hyperscale Vector index. |