---
path: /tutorial-huggingface-couchbase-vector-search-with-fts
title: Using Hugging Face Embeddings with Couchbase Vector Search using FTS Service
short_title: Hugging Face with Couchbase Vector Search using FTS Service
description: |
  Learn how to generate embeddings using Hugging Face and store them in Couchbase.
  This tutorial demonstrates how to use Couchbase's vector search capabilities with Hugging Face embeddings.
  You'll understand how to perform vector search to find relevant documents based on similarity using FTS Service.
content_type: tutorial
filter: sdk
technology:
  - vector search
tags:
  - FTS
  - Artificial Intelligence
  - Hugging Face
sdk_language:
  - python
length: 30 Mins
---


Introduction

In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and Hugging Face as the AI-powered embedding model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the GSI index, please take a look at this.

How to run this tutorial

This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run interactively. You can access the original notebook here.

You can either download the notebook file and run it on Google Colab or run it on your system by setting up the Python environment.

Before you start

Create and Deploy Your Free Tier Operational cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.

To know more, please follow the instructions.

Couchbase Capella Configuration

When running Couchbase using Capella, the following prerequisites need to be met.

  • Create the database credentials to access the travel-sample bucket (Read and Write) used in the application.
  • Allow access to the Cluster from the IP on which the application is running.

Install necessary libraries

!pip --quiet install couchbase==4.4.0 transformers==4.56.1 sentence_transformers==5.1.0 langchain-community==0.3.29 langchain_huggingface==0.3.1 python-dotenv==1.1.1 ipywidgets

Imports

from pathlib import Path
from datetime import timedelta
from transformers import pipeline, AutoModel, AutoTokenizer
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import (ClusterOptions, ClusterTimeoutOptions,
                               QueryOptions)
import couchbase.search as search
from couchbase.options import SearchOptions
from couchbase.vector_search import VectorQuery, VectorSearch
import uuid
import os
from dotenv import load_dotenv
import getpass

Prerequisites

In order to run this tutorial, you will need access to a Couchbase cluster with the Full Text Search service enabled, either through Couchbase Capella or by running it locally, and credentials to access a collection on that cluster:

# Load environment variables
load_dotenv("./.env")

# Configuration
couchbase_cluster_url = os.getenv('CB_CLUSTER_URL') or input("Couchbase Cluster URL:")
couchbase_username = os.getenv('CB_USERNAME') or input("Couchbase Username:")
couchbase_password = os.getenv('CB_PASSWORD') or getpass.getpass("Couchbase password:")
couchbase_bucket = os.getenv('CB_BUCKET') or input("Couchbase Bucket:")
couchbase_scope = os.getenv('CB_SCOPE') or input("Couchbase Scope:")
couchbase_collection = os.getenv('CB_COLLECTION') or input("Couchbase Collection:")
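Instead of typing the values interactively, you can place them in a .env file next to the notebook, which load_dotenv picks up automatically. A minimal sketch, where every value is a placeholder you must replace with your own cluster details:

```
# .env — example values only; replace with your own cluster details
CB_CLUSTER_URL=couchbases://cb.example.cloud.couchbase.com
CB_USERNAME=your_username
CB_PASSWORD=your_password
CB_BUCKET=huggingface
CB_SCOPE=_default
CB_COLLECTION=huggingface
```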

Couchbase Connection

In this section, we first need to create a PasswordAuthenticator object that holds our Couchbase credentials:

auth = PasswordAuthenticator(
    couchbase_username,
    couchbase_password
)

Then, we use this object to connect to the Couchbase cluster and select the bucket, scope and collection specified above:

print("Connecting to cluster")
cluster = Cluster(couchbase_cluster_url, ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))

bucket = cluster.bucket(couchbase_bucket)
scope = bucket.scope(couchbase_scope)
collection = scope.collection(couchbase_collection)
print("Connected to the cluster")
Connecting to cluster
Connected to the cluster

Creating Couchbase Vector Search Index

In order to store embeddings generated with Hugging Face on a Couchbase cluster, a vector search index needs to be created first. We included a sample index definition that will work with this tutorial in a file named huggingface_index.json, located in the folder with this tutorial. The definition can be used to create a vector index using the Couchbase Server web console; for more information on vector indexes, please read Create a Vector Search Index with the Server Web Console. Please note that the index is configured for documents from the bucket huggingface, scope _default and collection huggingface, so you will have to edit the source and document type names in the index definition file if your bucket, scope or collection names are different.
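For orientation, an FTS vector index definition for a setup like this generally follows the shape sketched below. This is an illustrative outline, not the exact contents of huggingface_index.json — the index name, source name, type mapping, dims (768 corresponds to the default HuggingFaceEmbeddings model) and similarity metric all have to match your own setup:

```json
{
  "name": "vector_test",
  "type": "fulltext-index",
  "sourceType": "gocbcore",
  "sourceName": "huggingface",
  "params": {
    "doc_config": { "mode": "scope.collection.type_field" },
    "mapping": {
      "types": {
        "_default.huggingface": {
          "enabled": true,
          "properties": {
            "vector": {
              "fields": [
                {
                  "name": "vector",
                  "type": "vector",
                  "dims": 768,
                  "similarity": "dot_product",
                  "index": true
                }
              ]
            }
          }
        }
      }
    }
  }
}
```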

Here, our code verifies the existence of the index and will throw an exception if the index has not been found:

search_index_name = couchbase_bucket + "._default.vector_test"
search_index = cluster.search_indexes().get_index(search_index_name)
print("Found index: " + search_index_name)
Found index: huggingface._default.vector_test

Hugging Face Initialization

embedding_model = HuggingFaceEmbeddings()
print("Initialized successfully")
Initialized successfully

Embedding Documents

After initializing the Hugging Face embedding model, it can be used to generate vector embeddings for user input or a predefined set of phrases. Here, we generate embeddings for the strings contained in the texts array (two predefined phrases plus one entered by the user):

texts = [
    "Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.",
    "It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.",
    input("Enter custom embedding text:")
]
embeddings = []
for text in texts:
    embeddings.append(embedding_model.embed_query(text))
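Each embedding is a list of floats whose direction encodes the meaning of the text, and vector search ranks documents by how close their vectors are to the query vector. The dependency-free sketch below illustrates cosine similarity, a measure commonly used for this comparison; it is for intuition only, since the actual scoring in this tutorial is done server-side by the FTS service:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors: 1.0 for
    identical direction, 0.0 for orthogonal (unrelated) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]
print(cosine_similarity(v1, v2))  # identical vectors → 1.0
print(cosine_similarity(v1, v3))  # orthogonal vectors → 0.0
```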

Storing Embeddings in Couchbase

The generated embeddings are then stored as vector fields inside documents that can contain additional information about the vector, including the original text. The documents are then upserted into the Couchbase cluster:

for text, embedding in zip(texts, embeddings):
    doc = {
        "id": str(uuid.uuid4()),
        "text": text,
        "vector": embedding,
    }
    collection.upsert(doc["id"], doc)

Searching For Embeddings

After the documents are upserted to the cluster, their vector fields are indexed by the vector search index created earlier. New embeddings can then be generated and used to perform a similarity search over the previously added documents:

def search_similar(text):
    print("Vector similarity search for phrase: \"" + text + "\"")
    # Embed the search phrase with the same model that produced the
    # stored document vectors
    search_embedding = embedding_model.embed_query(text)

    # Combine a match-none FTS query with a nearest-neighbour vector
    # query against the "vector" field of the indexed documents
    search_req = search.SearchRequest.create(search.MatchNoneQuery()).with_vector_search(
        VectorSearch.from_vector_query(
            VectorQuery(
                "vector", search_embedding, num_candidates=1
            )
        )
    )
    result = scope.search(
        "vector_test",
        search_req,
        SearchOptions(
            limit=13,
            fields=["vector", "id", "text"]
        )
    )
    # Print each matching document's id, score and original text
    for row in result.rows():
        print("Found answer: " + row.id + "; score: " + str(row.score))
        doc = collection.get(row.id)
        print("Answer text: " + doc.value["text"])
        
search_similar("name a multipurpose database with distributed capability")
print("------")
search_similar(input("Enter custom search phrase:"))
Vector similarity search for phrase: "name a multipurpose database with distributed capability"
Found answer: 3993ec2e-c184-4d7f-8fc3-55961afe264c; score: 0.9256534967756203
Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.
------
Vector similarity search for phrase: "What is the data in the sample text?"
Found answer: a7748fac-b41f-4846-bebc-d89bdcd645e3; score: 1.0016003788325407
Answer text: this is a sample text with the data "Qwerty"