Commit 4333403

alexngys and bilgeyucel authored
[FEAT] Added valyu integration (#359)
* [FEAT] Added valyu integration
* Apply suggestion from @bilgeyucel
* Apply suggestion from @bilgeyucel
* Apply suggestion from @bilgeyucel

---------

Co-authored-by: Bilge Yücel <bilge.yucel@deepset.ai>
1 parent 8e8a455 commit 4333403

1 file changed

Lines changed: 290 additions & 0 deletions

integrations/valyu.md

@@ -0,0 +1,290 @@
1+
---
layout: integration
name: Valyu Search
description: Search and content extraction components using Valyu's API for web and proprietary sources
authors:
  - name: Valyu
    socials:
      github: valyu-network
pypi: https://pypi.org/project/valyu-search-haystack
repo: https://github.com/valyu-network/valyu-search-haystack
type: Search & Extraction
report_issue: https://github.com/valyu-network/valyu-search-haystack/issues
version: Haystack 2.0
toc: true
---
16+
17+
### Table of Contents
18+
19+
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
  - [ValyuSearch](#valyusearch)
  - [ValyuContentFetcher](#valyucontentfetcher)
  - [Pipeline Examples](#pipeline-examples)
  - [Advanced Configuration](#advanced-configuration)
- [API Integration Details](#api-integration-details)
  - [Authentication](#authentication)
  - [License](#license)
29+
30+
## Overview
31+
32+
[![PyPI - Version](https://img.shields.io/pypi/v/valyu-search-haystack.svg)](https://pypi.org/project/valyu-search-haystack)
33+
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/valyu-search-haystack.svg)](https://pypi.org/project/valyu-search-haystack)
34+
35+
Haystack components for integrating [Valyu](https://docs.valyu.ai/overview)'s powerful search and content extraction APIs into your Haystack pipelines.
36+
37+
This package provides two main components:
38+
39+
- **`ValyuSearch`** - Search component that queries the Valyu DeepSearch API and returns documents with content already included
40+
- **`ValyuContentFetcher`** - Content extraction component that fetches and cleans content from URLs
41+
42+
**Key Features:**
43+
44+
- Search across web and proprietary sources
45+
- Full content included in search results
46+
- AI-powered content extraction and summarization
47+
48+
---
49+
50+
## Installation
51+
52+
Use `pip` to install Valyu Search for Haystack:
53+
54+
```console
pip install valyu-search-haystack
```
57+
58+
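Or install from source, from the root of a local clone of the [repository](https://github.com/valyu-network/valyu-search-haystack):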
59+
60+
```console
pip install -e .
```
63+
64+
**Requirements:**
65+
66+
- Python 3.8+
67+
- haystack-ai >= 2.0.0
68+
- valyu >= 2.2.1
69+
70+
## Usage
71+
72+
Set your Valyu API key as an environment variable:
73+
74+
```bash
export VALYU_API_KEY="your-api-key"
```
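It can help to confirm that the variable is visible to the Python process before building a pipeline. The sketch below uses only the standard library; `require_env` is a hypothetical helper for illustration and is not part of `valyu-search-haystack`.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error."""
    # `environ` is injectable so the check can be exercised without touching os.environ
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline")
    return value
```

Calling `require_env("VALYU_API_KEY")` at the top of a script surfaces a missing key immediately, with a message that names the variable.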
77+
78+
### ValyuSearch
79+
80+
The `ValyuSearch` component integrates with the Valyu DeepSearch API. Unlike many search APIs, Valyu returns the full content of each result by default, so a RAG pipeline can pass the documents straight to a prompt builder without a separate fetch step.
81+
82+
**Basic Usage:**
83+
84+
```python
from valyu_haystack import ValyuSearch
from haystack import Pipeline

# Create a search component (API key from VALYU_API_KEY env var)
search = ValyuSearch(
    top_k=5,
    search_type="all",  # "web", "proprietary", or "all"
    relevance_threshold=0.5
)

# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("search", search)

result = pipeline.run({"search": {"query": "What is Haystack AI?"}})
documents = result["search"]["documents"]
links = result["search"]["links"]
```
103+
104+
**Component Parameters:**
105+
106+
- `api_key` (Secret): Your Valyu API key. Defaults to `VALYU_API_KEY` environment variable
107+
- `top_k` (int, default=10): Maximum number of results to return
108+
- `api_base_url` (str): Base URL for the Valyu API
109+
- `search_type` (Literal["web", "proprietary", "all"], default="all"): Type of search
110+
- `relevance_threshold` (float, default=0.5): Minimum relevance score (0.0-1.0)
111+
- `max_price` (int, default=100): Maximum price per thousand queries in cents
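
As a worked example of the `max_price` unit, the sketch below converts a cap expressed in cents per thousand queries into dollars per single query. It assumes the description above is literal (cents per 1,000 queries); `per_query_cap_usd` is a hypothetical helper, not part of the package.

```python
from fractions import Fraction


def per_query_cap_usd(max_price_cents_per_thousand):
    """Convert a cap in cents per 1,000 queries to dollars per single query."""
    cents_per_query = Fraction(max_price_cents_per_thousand, 1000)
    return cents_per_query / 100  # 100 cents in a dollar


cap = per_query_cap_usd(100)  # 100 is the documented default for max_price
print(float(cap))  # 0.001
```

Under that reading, the default of 100 caps a single query at a tenth of a cent.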
112+
113+
**Output:**
114+
115+
- `documents` (List[Document]): Documents with content and rich metadata
116+
- `links` (List[str]): List of URLs from search results
117+
118+
**Metadata included:**
119+
120+
- `title`: Page title
121+
- `url`: Source URL
122+
- `description`: Page description
123+
- `source`: Data source identifier
124+
- `relevance_score`: Relevance score (0.0-1.0)
125+
- `price`: Cost of this result
126+
- `length`: Content length in characters
127+
- `data_type`: Type of data ("structured" or "unstructured")
128+
- `image_url`: Associated image URL (if any)
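
The metadata fields above can be used to post-filter results in plain Python. The following is a minimal sketch that keeps only high-scoring sources and orders them best first. `Doc` is a stand-in for Haystack's `Document` so the snippet runs without the package installed, and `top_sources` is a hypothetical helper; only the `url` and `relevance_score` keys are taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a Haystack Document (content plus a meta dict)."""
    content: str
    meta: dict = field(default_factory=dict)


def top_sources(documents, min_score=0.7):
    """Return (url, score) pairs at or above min_score, highest score first."""
    scored = [
        (doc.meta.get("url"), doc.meta.get("relevance_score", 0.0))
        for doc in documents
    ]
    kept = [pair for pair in scored if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


docs = [
    Doc("a", {"url": "https://example.com/a", "relevance_score": 0.62}),
    Doc("b", {"url": "https://example.com/b", "relevance_score": 0.91}),
    Doc("c", {"url": "https://example.com/c", "relevance_score": 0.78}),
]
print(top_sources(docs))
# [('https://example.com/b', 0.91), ('https://example.com/c', 0.78)]
```

The same function should work on the `documents` output of `ValyuSearch`, provided each document exposes a `meta` dict with these keys.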
129+
130+
### ValyuContentFetcher
131+
132+
The `ValyuContentFetcher` component extracts clean, readable content from URLs using the Valyu Contents API. It supports batch processing and AI-powered summarization.
133+
134+
**Basic Usage:**
135+
136+
```python
from valyu_haystack import ValyuContentFetcher
from haystack import Pipeline

# Create a content fetcher component
fetcher = ValyuContentFetcher(
    extract_effort="normal",  # "normal", "high", or "auto"
    response_length="short",  # "short", "medium", "large", "max", or int
    summary=True  # Enable AI summarization
)

# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)

urls = ["https://example.com/article1", "https://example.com/article2"]
result = pipeline.run({"fetcher": {"urls": urls}})
documents = result["fetcher"]["documents"]
```
155+
156+
**Component Parameters:**
157+
158+
- `api_key` (Secret): Your Valyu API key. Defaults to `VALYU_API_KEY` environment variable
159+
- `api_base_url` (str): Base URL for the Valyu API
160+
- `timeout` (int, default=30): Request timeout in seconds
161+
- `extract_effort` (Literal["normal", "high", "auto"], optional): Extraction thoroughness
162+
- `response_length` (Union[Literal["short", "medium", "large", "max"], int], optional): Content length per URL
163+
- `summary` (Union[bool, str, Dict], optional): AI summary config
  - `False` or `None`: No AI processing (raw content)
  - `True`: Basic automatic summarization
  - `str`: Custom instructions (max 500 chars)
  - `dict`: JSON schema for structured extraction
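
The four `summary` cases listed above can be checked before a fetcher is constructed. This is a sketch of such a pre-check, not the component's own validation: `describe_summary` and the labels it returns are hypothetical, and only the 500-character limit and the accepted types come from the list above.

```python
def describe_summary(summary):
    """Classify a summary setting according to the four documented cases."""
    if summary is None or summary is False:
        return "raw"  # no AI processing
    if summary is True:
        return "auto"  # basic automatic summarization
    if isinstance(summary, str):
        if len(summary) > 500:
            raise ValueError("custom instructions are limited to 500 characters")
        return "instructions"
    if isinstance(summary, dict):
        return "schema"  # JSON schema for structured extraction
    raise TypeError(f"unsupported summary value: {summary!r}")
```

Running the check first turns an over-long instruction string or an unsupported type into a local error instead of a failed API call.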
168+
169+
**Input:**
170+
171+
- `urls` (List[str], optional): List of URLs to fetch
172+
- `documents` (List[Document], optional): Documents with URLs in metadata
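
Because the fetcher accepts either explicit URLs or documents that carry URLs in their metadata, it can be useful to see the combined, de-duplicated list before sending a batch. The sketch below is illustrative only: `collect_urls` is a hypothetical helper, the `url` metadata key is an assumption based on the metadata lists in this document, and how the component itself merges the two inputs is not stated here.

```python
from types import SimpleNamespace


def collect_urls(urls=None, documents=None, meta_key="url"):
    """Merge explicit URLs with URLs found in document metadata, keeping order."""
    candidates = list(urls or [])
    for doc in documents or []:
        url = doc.meta.get(meta_key)
        if url:
            candidates.append(url)
    # dict.fromkeys drops duplicates while preserving first-seen order
    return list(dict.fromkeys(candidates))


# SimpleNamespace objects stand in for documents with a `meta` dict
docs = [
    SimpleNamespace(meta={"url": "https://example.com/b"}),
    SimpleNamespace(meta={}),
    SimpleNamespace(meta={"url": "https://example.com/c"}),
]
print(collect_urls(["https://example.com/a", "https://example.com/b"], docs))
```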
173+
174+
**Output:**
175+
176+
- `documents` (List[Document]): Documents with extracted content
177+
178+
**Metadata included:**
179+
180+
- `url`: Source URL
181+
- `title`: Page title
182+
- `length`: Content length in characters
183+
- `source`: Data source identifier
184+
- `data_type`: Type of content
185+
186+
### Pipeline Examples
187+
188+
**RAG Pipeline with Search and Chat:**
189+
190+
```python
from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from valyu_haystack import ValyuSearch

# Create components
web_search = ValyuSearch(top_k=3)

prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given the information below:\n"
        "{% for document in documents %}{{ document.content }}{% endfor %}\n"
        "Answer question: {{ query }}.\nAnswer:"
    )
]

prompt_builder = ChatPromptBuilder(template=prompt_template, required_variables={"query", "documents"})
llm = OpenAIChatGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-4o-mini")

# Build pipeline
pipe = Pipeline()
pipe.add_component("search", web_search)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)

# Connect components
pipe.connect("search.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.messages", "llm.messages")

# Run pipeline
query = "What is the most famous landmark in Berlin?"
result = pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})
```
227+
228+
**Indexing Pipeline with Content Fetcher:**
229+
230+
```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from valyu_haystack import ValyuContentFetcher

# Create components
document_store = InMemoryDocumentStore()
fetcher = ValyuContentFetcher()
writer = DocumentWriter(document_store=document_store)

# Build indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=writer, name="writer")

# Connect components
indexing_pipeline.connect("fetcher.documents", "writer.documents")

# Run pipeline
indexing_pipeline.run(data={
    "fetcher": {"urls": ["https://haystack.deepset.ai/blog/guide-to-using-zephyr-with-haystack2"]}
})
```
254+
255+
### Advanced Configuration
256+
257+
**Structured data extraction with Content Fetcher:**
258+
259+
```python
from valyu_haystack import ValyuContentFetcher

# Define JSON schema for structured extraction
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
        "summary": {"type": "string"}
    }
}

fetcher = ValyuContentFetcher(summary=schema)
result = fetcher.run(urls=["https://example.com/article"])

# Extracted structured data will be in document metadata
```
278+
279+
## API Integration Details
280+
281+
### Authentication
282+
283+
Both components accept the API key as a Haystack `Secret`, which by default is read from an environment variable, and send it in a request header:
284+
285+
- Header: `x-api-key: your-api-key`
286+
- Environment variable: `VALYU_API_KEY`
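
To make the header and environment variable concrete, the sketch below builds (but does not send) an HTTP request that carries the key in `x-api-key`, using only the standard library. The endpoint URL is a placeholder and `build_request` is a hypothetical helper; the components handle this internally, so the snippet only illustrates the shape of the request.

```python
import os
import urllib.request


def build_request(url, environ=None):
    """Build (but do not send) a request carrying the key in the x-api-key header."""
    environ = os.environ if environ is None else environ
    api_key = environ["VALYU_API_KEY"]
    return urllib.request.Request(url, headers={"x-api-key": api_key})


# Placeholder endpoint and key, used only to construct the request object.
request = build_request(
    "https://api.example.com/search",
    {"VALYU_API_KEY": "your-api-key"},
)
print(request.header_items())
```

Note that `urllib` normalizes header names when storing them, so the header is listed as `X-api-key`; HTTP header names are case-insensitive.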
287+
288+
### License
289+
290+
`valyu-search-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
