|
| 1 | +--- |
| 2 | +layout: integration |
| 3 | +name: Valyu Search |
| 4 | +description: Search and content extraction components using Valyu's API for web and proprietary sources |
| 5 | +authors: |
| 6 | +  - name: Valyu |
| 7 | +    socials: |
| 8 | +      github: valyu-network |
| 9 | +pypi: https://pypi.org/project/valyu-search-haystack |
| 10 | +repo: https://github.com/valyu-network/valyu-search-haystack |
| 11 | +type: Search & Extraction |
| 12 | +report_issue: https://github.com/valyu-network/valyu-search-haystack/issues |
| 13 | +version: Haystack 2.0 |
| 14 | +toc: true |
| 15 | +--- |
| 16 | + |
| 17 | +### Table of Contents |
| 18 | + |
| 19 | +- [Overview](#overview) |
| 20 | +- [Installation](#installation) |
| 21 | +- [Usage](#usage) |
| 22 | +  - [ValyuSearch](#valyusearch) |
| 23 | +  - [ValyuContentFetcher](#valyucontentfetcher) |
| 24 | +  - [Pipeline Examples](#pipeline-examples) |
| 25 | +  - [Advanced Configuration](#advanced-configuration) |
| 26 | +- [API Integration Details](#api-integration-details) |
| 27 | +  - [Authentication](#authentication) |
| 28 | +  - [License](#license) |
| 29 | + |
| 30 | +## Overview |
| 31 | + |
| 34 | + |
| 35 | +Haystack components that integrate [Valyu](https://docs.valyu.ai/overview)'s search and content extraction APIs into Haystack pipelines. |
| 36 | + |
| 37 | +This package provides two main components: |
| 38 | + |
| 39 | +- **`ValyuSearch`** - Search component that queries the Valyu DeepSearch API and returns documents with content already included |
| 40 | +- **`ValyuContentFetcher`** - Content extraction component that fetches and cleans content from URLs |
| 41 | + |
| 42 | +**Key Features:** |
| 43 | + |
| 44 | +- Search across web and proprietary sources |
| 45 | +- Full content included in search results |
| 46 | +- AI-powered content extraction and summarization |
| 47 | + |
| 48 | +--- |
| 49 | + |
| 50 | +## Installation |
| 51 | + |
| 52 | +Use `pip` to install Valyu Search for Haystack: |
| 53 | + |
| 54 | +```console |
| 55 | +pip install valyu-search-haystack |
| 56 | +``` |
| 57 | + |
| 58 | +Or install from source, from the root of a cloned copy of the repository: |
| 59 | + |
| 60 | +```console |
| 61 | +pip install -e . |
| 62 | +``` |
| 63 | + |
| 64 | +**Requirements:** |
| 65 | + |
| 66 | +- Python 3.8+ |
| 67 | +- haystack-ai >= 2.0.0 |
| 68 | +- valyu >= 2.2.1 |
| 69 | + |
| 70 | +## Usage |
| 71 | + |
| 72 | +Set your Valyu API key as an environment variable: |
| 73 | + |
| 74 | +```bash |
| 75 | +export VALYU_API_KEY="your-api-key" |
| 76 | +``` |
| 77 | + |
| 78 | +### ValyuSearch |
| 79 | + |
| 80 | +The `ValyuSearch` component integrates with the Valyu DeepSearch API. Unlike many search APIs, Valyu returns full content by default, so the results can feed a RAG pipeline without a separate fetching step. |
| 81 | + |
| 82 | +**Basic Usage:** |
| 83 | + |
| 84 | +```python |
| 85 | +from valyu_haystack import ValyuSearch |
| 86 | +from haystack import Pipeline |
| 87 | + |
| 88 | +# Create a search component (API key from VALYU_API_KEY env var) |
| 89 | +search = ValyuSearch( |
| 90 | +    top_k=5, |
| 91 | +    search_type="all",  # "web", "proprietary", or "all" |
| 92 | +    relevance_threshold=0.5 |
| 93 | +) |
| 94 | + |
| 95 | +# Create and run a pipeline |
| 96 | +pipeline = Pipeline() |
| 97 | +pipeline.add_component("search", search) |
| 98 | + |
| 99 | +result = pipeline.run({"search": {"query": "What is Haystack AI?"}}) |
| 100 | +documents = result["search"]["documents"] |
| 101 | +links = result["search"]["links"] |
| 102 | +``` |
| 103 | + |
| 104 | +**Component Parameters:** |
| 105 | + |
| 106 | +- `api_key` (Secret): Your Valyu API key. Defaults to `VALYU_API_KEY` environment variable |
| 107 | +- `top_k` (int, default=10): Maximum number of results to return |
| 108 | +- `api_base_url` (str): Base URL for the Valyu API |
| 109 | +- `search_type` (Literal["web", "proprietary", "all"], default="all"): Type of search |
| 110 | +- `relevance_threshold` (float, default=0.5): Minimum relevance score (0.0-1.0) |
| 111 | +- `max_price` (int, default=100): Maximum price per thousand queries in cents |
| 112 | + |
| 113 | +**Output:** |
| 114 | + |
| 115 | +- `documents` (List[Document]): Documents with content and rich metadata |
| 116 | +- `links` (List[str]): List of URLs from search results |
| 117 | + |
| 118 | +**Metadata included:** |
| 119 | + |
| 120 | +- `title`: Page title |
| 121 | +- `url`: Source URL |
| 122 | +- `description`: Page description |
| 123 | +- `source`: Data source identifier |
| 124 | +- `relevance_score`: Relevance score (0.0-1.0) |
| 125 | +- `price`: Cost of this result |
| 126 | +- `length`: Content length in characters |
| 127 | +- `data_type`: Type of data ("structured" or "unstructured") |
| 128 | +- `image_url`: Associated image URL (if any) |
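
Because every result carries a `relevance_score` in its metadata, results can be filtered or re-ranked on the client after the search returns. The following is a minimal sketch of that idea, not part of the package: `Doc` is a stand-in for Haystack's `Document` (only `content` and `meta` are modeled so the snippet runs without dependencies), and `filter_by_relevance` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for haystack.Document: only `content` and `meta` are modeled."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def _score(doc: Doc) -> float:
    # Documents without a score are treated as 0.0 (an assumption of this sketch).
    return float(doc.meta.get("relevance_score", 0.0))


def filter_by_relevance(docs: List[Doc], min_score: float) -> List[Doc]:
    """Keep documents scoring at least `min_score`, highest score first."""
    kept = [d for d in docs if _score(d) >= min_score]
    return sorted(kept, key=_score, reverse=True)


# Illustrative data only; real documents would come from result["search"]["documents"].
docs = [
    Doc("a", {"url": "https://example.com/a", "relevance_score": 0.42}),
    Doc("b", {"url": "https://example.com/b", "relevance_score": 0.91}),
    Doc("c", {"url": "https://example.com/c", "relevance_score": 0.77}),
]
top = filter_by_relevance(docs, min_score=0.7)
print([d.meta["url"] for d in top])
```

With real search output, the same function should work on Haystack `Document` objects, since they expose `content` and `meta` as well.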
| 129 | + |
| 130 | +### ValyuContentFetcher |
| 131 | + |
| 132 | +The `ValyuContentFetcher` component extracts clean, readable content from URLs using the Valyu Contents API. It supports batch processing and AI-powered summarization. |
| 133 | + |
| 134 | +**Basic Usage:** |
| 135 | + |
| 136 | +```python |
| 137 | +from valyu_haystack import ValyuContentFetcher |
| 138 | +from haystack import Pipeline |
| 139 | + |
| 140 | +# Create a content fetcher component |
| 141 | +fetcher = ValyuContentFetcher( |
| 142 | +    extract_effort="normal",  # "normal", "high", or "auto" |
| 143 | +    response_length="short",  # "short", "medium", "large", "max", or int |
| 144 | +    summary=True  # Enable AI summarization |
| 145 | +) |
| 146 | + |
| 147 | +# Create and run a pipeline |
| 148 | +pipeline = Pipeline() |
| 149 | +pipeline.add_component("fetcher", fetcher) |
| 150 | + |
| 151 | +urls = ["https://example.com/article1", "https://example.com/article2"] |
| 152 | +result = pipeline.run({"fetcher": {"urls": urls}}) |
| 153 | +documents = result["fetcher"]["documents"] |
| 154 | +``` |
| 155 | + |
| 156 | +**Component Parameters:** |
| 157 | + |
| 158 | +- `api_key` (Secret): Your Valyu API key. Defaults to `VALYU_API_KEY` environment variable |
| 159 | +- `api_base_url` (str): Base URL for the Valyu API |
| 160 | +- `timeout` (int, default=30): Request timeout in seconds |
| 161 | +- `extract_effort` (Literal["normal", "high", "auto"], optional): Extraction thoroughness |
| 162 | +- `response_length` (Union[Literal["short", "medium", "large", "max"], int], optional): Content length per URL |
| 163 | +- `summary` (Union[bool, str, Dict], optional): AI summary config |
| 164 | +  - `False` or `None`: No AI processing (raw content) |
| 165 | +  - `True`: Basic automatic summarization |
| 166 | +  - `str`: Custom instructions (max 500 chars) |
| 167 | +  - `dict`: JSON schema for structured extraction |
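
The four accepted shapes of `summary` can be checked before a component is constructed. The sketch below only mirrors the rules listed above; `describe_summary_config` and its return labels are hypothetical and are not part of the package, which may validate the value differently.

```python
from typing import Any, Dict, Optional, Union

SummaryConfig = Optional[Union[bool, str, Dict[str, Any]]]
MAX_INSTRUCTION_CHARS = 500  # limit stated in the parameter list above


def describe_summary_config(summary: SummaryConfig) -> str:
    """Classify a `summary` value as 'raw', 'auto', 'instructions', or 'schema'."""
    if summary is None or summary is False:
        return "raw"  # no AI processing
    if summary is True:
        return "auto"  # basic automatic summarization
    if isinstance(summary, str):
        if len(summary) > MAX_INSTRUCTION_CHARS:
            raise ValueError(
                f"custom instructions exceed {MAX_INSTRUCTION_CHARS} characters"
            )
        return "instructions"
    if isinstance(summary, dict):
        return "schema"  # JSON schema for structured extraction
    raise TypeError(f"unsupported summary type: {type(summary).__name__}")


print(describe_summary_config("Summarize in three bullet points"))
```

Failing early on an over-long instruction string avoids spending an API call on a request that would be rejected.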
| 168 | + |
| 169 | +**Input:** |
| 170 | + |
| 171 | +- `urls` (List[str], optional): List of URLs to fetch |
| 172 | +- `documents` (List[Document], optional): Documents with URLs in metadata |
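
Since the fetcher accepts either raw URLs or documents that carry a URL in their metadata, it can help to see how the two inputs might be merged into one list. This is a sketch under assumptions: it assumes the URL lives under `meta["url"]` (the key listed in the search metadata), `Doc` is a stand-in for Haystack's `Document`, and `collect_urls` is a hypothetical helper. How the component itself combines the inputs is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Stand-in for haystack.Document: only `content` and `meta` are modeled."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def collect_urls(
    urls: Optional[List[str]] = None,
    documents: Optional[List[Doc]] = None,
) -> List[str]:
    """Merge explicit URLs with meta['url'] values, dropping blanks and duplicates."""
    candidates = list(urls or [])
    candidates += [d.meta.get("url") for d in (documents or [])]
    seen = set()
    ordered = []
    for url in candidates:
        if url and url not in seen:  # skip missing URLs and repeats, keep first-seen order
            seen.add(url)
            ordered.append(url)
    return ordered


merged = collect_urls(
    urls=["https://example.com/a"],
    documents=[
        Doc("", {"url": "https://example.com/a"}),
        Doc("", {"url": "https://example.com/b"}),
        Doc("", {}),  # no URL in metadata
    ],
)
print(merged)
```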
| 173 | + |
| 174 | +**Output:** |
| 175 | + |
| 176 | +- `documents` (List[Document]): Documents with extracted content |
| 177 | + |
| 178 | +**Metadata included:** |
| 179 | + |
| 180 | +- `url`: Source URL |
| 181 | +- `title`: Page title |
| 182 | +- `length`: Content length in characters |
| 183 | +- `source`: Data source identifier |
| 184 | +- `data_type`: Type of content |
| 185 | + |
| 186 | +### Pipeline Examples |
| 187 | + |
| 188 | +**RAG Pipeline with Search and Chat:** |
| 189 | + |
| 190 | +```python |
| 191 | +from haystack import Pipeline |
| 192 | +from haystack.utils import Secret |
| 193 | +from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder |
| 194 | +from haystack.components.generators.chat import OpenAIChatGenerator |
| 195 | +from haystack.dataclasses import ChatMessage |
| 196 | +from valyu_haystack import ValyuSearch |
| 197 | + |
| 198 | +# Create components |
| 199 | +web_search = ValyuSearch(top_k=3) |
| 200 | + |
| 201 | +prompt_template = [ |
| 202 | +    ChatMessage.from_system("You are a helpful assistant."), |
| 203 | +    ChatMessage.from_user( |
| 204 | +        "Given the information below:\n" |
| 205 | +        "{% for document in documents %}{{ document.content }}{% endfor %}\n" |
| 206 | +        "Answer question: {{ query }}.\nAnswer:" |
| 207 | +    ) |
| 208 | +] |
| 209 | + |
| 210 | +prompt_builder = ChatPromptBuilder(template=prompt_template, required_variables={"query", "documents"}) |
| 211 | +llm = OpenAIChatGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-4o-mini") |
| 212 | + |
| 213 | +# Build pipeline |
| 214 | +pipe = Pipeline() |
| 215 | +pipe.add_component("search", web_search) |
| 216 | +pipe.add_component("prompt_builder", prompt_builder) |
| 217 | +pipe.add_component("llm", llm) |
| 218 | + |
| 219 | +# Connect components |
| 220 | +pipe.connect("search.documents", "prompt_builder.documents") |
| 221 | +pipe.connect("prompt_builder.messages", "llm.messages") |
| 222 | + |
| 223 | +# Run pipeline |
| 224 | +query = "What is the most famous landmark in Berlin?" |
| 225 | +result = pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}}) |
| 226 | +``` |
| 227 | + |
| 228 | +**Indexing Pipeline with Content Fetcher:** |
| 229 | + |
| 230 | +```python |
| 231 | +from haystack import Pipeline |
| 232 | +from haystack.document_stores.in_memory import InMemoryDocumentStore |
| 233 | +from haystack.components.writers import DocumentWriter |
| 234 | +from valyu_haystack import ValyuContentFetcher |
| 235 | + |
| 236 | +# Create components |
| 237 | +document_store = InMemoryDocumentStore() |
| 238 | +fetcher = ValyuContentFetcher() |
| 239 | +writer = DocumentWriter(document_store=document_store) |
| 240 | + |
| 241 | +# Build indexing pipeline |
| 242 | +indexing_pipeline = Pipeline() |
| 243 | +indexing_pipeline.add_component(instance=fetcher, name="fetcher") |
| 244 | +indexing_pipeline.add_component(instance=writer, name="writer") |
| 245 | + |
| 246 | +# Connect components |
| 247 | +indexing_pipeline.connect("fetcher.documents", "writer.documents") |
| 248 | + |
| 249 | +# Run pipeline |
| 250 | +indexing_pipeline.run(data={ |
| 251 | +    "fetcher": {"urls": ["https://haystack.deepset.ai/blog/guide-to-using-zephyr-with-haystack2"]} |
| 252 | +}) |
| 253 | +``` |
| 254 | + |
| 255 | +### Advanced Configuration |
| 256 | + |
| 257 | +**Structured data extraction with Content Fetcher:** |
| 258 | + |
| 259 | +```python |
| 260 | +from valyu_haystack import ValyuContentFetcher |
| 261 | + |
| 262 | +# Define JSON schema for structured extraction |
| 263 | +schema = { |
| 264 | +    "type": "object", |
| 265 | +    "properties": { |
| 266 | +        "title": {"type": "string"}, |
| 267 | +        "author": {"type": "string"}, |
| 268 | +        "published_date": {"type": "string"}, |
| 269 | +        "summary": {"type": "string"} |
| 270 | +    } |
| 271 | +} |
| 272 | + |
| 273 | +fetcher = ValyuContentFetcher(summary=schema) |
| 274 | +result = fetcher.run(urls=["https://example.com/article"]) |
| 275 | + |
| 276 | +# Extracted structured data will be in document metadata |
| 277 | +``` |
| 278 | + |
| 279 | +## API Integration Details |
| 280 | + |
| 281 | +### Authentication |
| 282 | + |
| 283 | +Both components manage the API key through Haystack's `Secret` class. The key is read from an environment variable and sent as a request header: |
| 284 | + |
| 285 | +- Header: `x-api-key: your-api-key` |
| 286 | +- Environment variable: `VALYU_API_KEY` |
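
To make the header and environment variable concrete, here is a minimal standard-library sketch of how the key could be turned into a request header. `build_auth_headers` is a hypothetical helper for illustration; the components handle this internally via `Secret`, and their actual implementation may differ.

```python
import os
from typing import Dict, Mapping, Optional


def build_auth_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read VALYU_API_KEY from `env` (default: os.environ) and return the auth header."""
    source = os.environ if env is None else env
    api_key = source.get("VALYU_API_KEY", "").strip()
    if not api_key:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError("VALYU_API_KEY is not set")
    return {"x-api-key": api_key}


# A plain mapping is passed here so the example does not depend on the real environment.
headers = build_auth_headers({"VALYU_API_KEY": "your-api-key"})
print(sorted(headers))
```

Passing the environment as a parameter keeps the helper easy to test and avoids printing or logging the key itself.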
| 287 | + |
| 288 | +### License |
| 289 | + |
| 290 | +`valyu-search-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license. |