If executing bash commands, you have three very helpful sucommands installed within semtools:
parse-- converts any non grep-able format into markdown, outputs a filepath for a converted markdown file for every input file to stdinsearch-- performs a search using static embeddings on either stdin or a list of files (very similar to grep). Works best with keyword based search queries. Only works with text-based files so it may require theparsetool to help preprocess into markdown.workspace- Workspace management for accelerating search over large collections
These commands, combined with other CLI commands, you can ensure that you can search large amounts of files efficiently, while handling various formats of documents. Both parse and search can scale to hundreds of thousands of documents.
$ semtools parse --help
A CLI tool for parsing documents using various backends
Usage: semtools parse [OPTIONS] <FILES>...
Arguments:
<FILES>... Files to parse
Options:
-c, --config <CONFIG> Path to the config file. Defaults to ~/.semtools_config.json
-b, --backend <BACKEND> The backend type to use for parsing. Defaults to `llama-parse` [default: llama-parse]
-v, --verbose Verbose output while parsing
-h, --help Print help$ semtools search --help
A CLI tool for fast semantic keyword search
Usage: semtools search [OPTIONS] <QUERY> [FILES]...
Arguments:
<QUERY> Query to search for (positional argument)
[FILES]... Files to search, optional if using stdin
Options:
-n, --n-lines <N_LINES> How many lines before/after to return as context [default: 3]
--top-k <TOP_K> The top-k files or texts to return (ignored if max_distance is set) [default: 3]
-m, --max-distance <MAX_DISTANCE> Return all results with distance below this threshold (0.0+)
-i, --ignore-case Perform case-insensitive search (default is false)
-j, --json Output results in JSON format
-h, --help Print help$ semtools workspace --help
Manage semtools workspaces
Usage: semtools workspace [OPTIONS] <COMMAND>
Commands:
use Use or create a workspace (prints export command to run)
status Show active workspace and basic stats
prune Remove stale or missing files from store
help Print this message or the help of the given subcommand(s)
Options:
-j, --json Output results in JSON format
-h, --help Print helpHere's how to convert those standalone commands to semtools subcommands:
# Parse a PDF and search for specific content
semtools parse document.pdf | xargs cat | semtools search "error handling"
# Search within many files after parsing
semtools parse my_docs/*.pdf | xargs semtools search "API endpoints"
# Search with custom context and thresholds or distance thresholds
semtools search "machine learning" *.txt --n-lines 5 --max-distance 0.3
# Search from stdin
echo "some text content" | semtools search "content"
# Parse multiple documents
semtools parse report.pdf data.xlsx presentation.pptx
# Chain parsing with semantic search
semtools parse *.pdf | xargs semtools search "financial projections" --n-lines 3
# Search with distance threshold (lower = more similar)
semtools parse document.pdf | xargs cat | semtools search "revenue" --max-distance 0.2
# Search multiple files directly
semtools search "error handling" src/*.rs --top-k 5
# Combine with grep for exact-match pre-filtering and distance thresholding
semtools parse *.pdf | xargs cat | grep -i "error" | semtools search "network error" --max-distance 0.3
# Pipeline with content search (note the 'cat')
find . -name "*.md" | xargs semtools parse | xargs semtools search "installation"# Create or select a workspace
# Workspaces are stored in ~/.semtools/workspaces/
semtools workspace use my-workspace
> Workspace 'my-workspace' configured.
> To activate it, run:
> export SEMTOOLS_WORKSPACE=my-workspace
>
> Or add this to your shell profile (.bashrc, .zshrc, etc.)
# Activate the workspace
export SEMTOOLS_WORKSPACE=my-workspace
# All search commands will now use the workspace for caching embeddings
# The initial command is used to initialize the workspace
semtools search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10
# If documents change, they are automatically re-embedded and cached
echo "some new content" > ./some_large_dir/some_file.txt
semtools search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10
# A workspace example if you are using with parse
# create a workspace
semtools workspace use my-workspace2
export SEMTOOLS_WORKSPACE=my-workspace2
# parse files, and then search over the parsed files, and cache the file embeddings
semtools parse *.pdf | xargs semtools search "financial projections" --n-lines 3
# if you run the command with a different query (see option a and b), over the same set of files, then search will operate
# over the cached file embeddings
# option a - parse won't rerun since files already cached
semtools parse *.pdf | xargs semtools search "balance sheet" --n-lines 3
# option b - run search directly over the parse cache
xargs semtools search "balance sheet" /Users/jerryliu/.parse/*.pdf.md --n-lines 3
# If documents are removed, you can run prune to clean up stale files
semtools workspace prune
# You can see the stats of a workspace at any time
semtools workspace status
> Active workspace: arxiv
> Root: /Users/loganmarkewich/.semtools/workspaces/arxiv
> Documents: 3000
> Index: Yes (IVF_PQ)- If you have run / plan on running repeated
searchqueries over the same file or set of files, you SHOULD create a workspace (semtools workspace use) before running parse/search commands - otherwise you will be re-embedding the same document collections from scratch every time. Make sure the environment variable is set before downstream commands. - Before you create a workspace, you can check current workspace through
semtools workspace statuswhich will also give the directory where all workspaces are stored. - You can choose to add a new workspace or prune an existing one if you are changing to a different collection of files. You can check the status through
semtools workspace status. parsewill always output paths of parsed files to stdin. These parsed files represent the markdown version of their original file (for example, parsing a PDF or DOCX file into markdown).- ALWAYS call
parsefirst when interacting with PDF (or similar) formats so that you can get the paths to the markdown versions of those files searchonly works with text-based files (like markdown). It's a common pattern to first callparseand either feed files intosearchor cat files and search from stdinsearchworks best with keywords, or comma-separated inputs- By default the tokenizer for
searchis case sensitive, which may lead to unexpected results if you don't know capitalization beforehand. You should generally TRY to set--ignore-casefor more general case insensitive search. --n-lineson search controls how much context is shown around matching lines in the results- If
--n-linesreturns incomplete results, you may want to consider expanding--n-lines. - NOTE: by default --n-lines is too small. Consider setting n-lines to 30-50 at least always.
--max-distanceis useful on search for cases where you don't know a top-k value ahead of time and need relevant results from all files- That said if setting
--max-distancedoesn't return any results, you may want to try--top-kto double-check.