Commit a1b0efd

Sync documentation with current API and features
1 parent d2fd37d commit a1b0efd

4 files changed

Lines changed: 91 additions & 36 deletions

README.md

Lines changed: 20 additions & 9 deletions
@@ -17,14 +17,14 @@
1717
## Features
1818

1919
- **Blazing Fast**: C++ core for 2-5x speed improvement over pure Python alternatives.
20-
- **Multiple Scorers**: Support for Levenshtein, Jaccard, and Token Sort ratios.
21-
- **Partial Matching**: Find the best substring matches.
22-
- **Hybrid Scoring**: Combine multiple scorers with custom weights.
23-
- **Pandas & NumPy Integration**: Native support for Series and Arrays.
20+
- **Multiple Scorers**: Support for Levenshtein, Jaccard, Token Sort, Token Set, QRatio, WRatio, and Partial Ratio.
21+
- **Partial Matching**: Find the best substring matches using `mode="partial"`.
22+
- **Hybrid Scoring**: Combine multiple scorers with custom weights for complex matching tasks.
23+
- **Pandas & NumPy Integration**: Native support for Series and Arrays via a dedicated accessor.
2424
- **Batch Processing**: Parallelized matching for large datasets using OpenMP.
25-
- **Unicode Support**: Handles international characters and normalization.
26-
- **Benchmarking Tools**: Built-in utilities to measure performance.
27-
- **Thread Safe**: Releases the GIL in C++ for better multi-threading performance.
25+
- **Unicode Support**: Handles international characters and basic normalization.
26+
- **Benchmarking Tools**: Built-in utilities to measure and compare performance.
27+
- **Thread Safe**: Releases the GIL in C++ for optimal multi-threaded performance.
2828
- **Type Safe**: Includes PEP 561 type stubs for full IDE and MyPy support.
2929

3030
## Installation
@@ -51,7 +51,7 @@ results = fuzzybunny.rank("app", candidates, top_n=2)
5151
## Advanced Usage
5252

5353
### Hybrid Scorer
54-
Combine different algorithms to get better results:
54+
Combine different algorithms using custom weights:
5555

5656
```python
5757
results = fuzzybunny.rank(
@@ -62,8 +62,19 @@ results = fuzzybunny.rank(
6262
)
6363
```
6464

65+
### Partial Matching
66+
Find the best substring match:
67+
68+
```python
69+
score = fuzzybunny.partial_ratio("apple", "apple pie") # 1.0
70+
71+
# Using rank with partial mode
72+
results = fuzzybunny.rank("apple", ["apple pie", "banana"], mode="partial")
73+
# [('apple pie', 1.0), ('banana', 0.18)]
74+
```
75+
6576
### Pandas Integration
66-
Use the specialized accessor for clean code:
77+
Use the specialized `fuzzy` accessor:
6778

6879
```python
6980
import pandas as pd

docs/guide/advanced.md

Lines changed: 32 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -14,13 +14,36 @@ score = fuzzybunny.wratio("fuzzy bunny", "bunny fuzzy!!!")
1414
# 1.0 (Token sort/set will match and WRatio will pick the best)
1515
```
1616

17+
## Hybrid Scorer
18+
19+
The `hybrid` scorer allows you to define a custom weighted average of multiple built-in algorithms. This is useful when you have specific data requirements that a single algorithm can't fully capture.
20+
21+
To use it, set `scorer="hybrid"` and provide a `weights` dictionary in `rank` or `batch_match`.
22+
23+
```python
24+
import fuzzybunny
25+
26+
results = fuzzybunny.rank(
27+
"fuzzy bunny",
28+
["bunny fuzzy", "the fuzzy bunny", "rabbit"],
29+
scorer="hybrid",
30+
weights={
31+
"levenshtein": 0.2,
32+
"token_sort": 0.5,
33+
"token_set": 0.3
34+
}
35+
)
36+
```
37+
38+
**Supported weight keys:** `levenshtein`, `jaccard`, `token_sort`, `token_set`, `qratio`, `wratio`.
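
The weighted average described above can be sketched in plain Python. This is an illustration only: `hybrid_score` is a hypothetical helper, the per-scorer values are made up, and dividing by the sum of the weights is an assumption (the library may treat unnormalized weights differently).

```python
from typing import Dict


def hybrid_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-scorer results into one weighted average.

    `scores` maps a scorer name to its similarity in [0.0, 1.0].
    `weights` maps the same names to non-negative weights.
    Assumption: the result is divided by the total weight, so the
    weights do not have to sum to exactly 1.0.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total


# Made-up individual scores for a single query/candidate pair.
example_scores = {"levenshtein": 0.4, "token_sort": 1.0, "token_set": 1.0}
example_weights = {"levenshtein": 0.2, "token_sort": 0.5, "token_set": 0.3}

# 0.2 * 0.4 + 0.5 * 1.0 + 0.3 * 1.0, i.e. approximately 0.88
combined = hybrid_score(example_scores, example_weights)
```

A heavier weight on the token-based scorers, as in this example, makes word order matter less than character-level edits.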
39+
1740
## High-Performance Batch Matching
1841

1942
When comparing many queries against a common candidate set, `batch_match` is the most efficient choice.
2043

21-
It provides two major optimizations over calling `rank` in a loop:
22-
1. **Multi-threading (OpenMP)**: Automatically distributes work across all CPU cores.
23-
2. **Normalization Caching**: Normalizes the candidate set only once per batch.
44+
It provides two major optimizations:
45+
1. **Normalization Caching**: In a standard loop, each candidate is normalized once per query. `batch_match` normalizes each candidate only once for the entire batch.
46+
2. **Multi-threading (OpenMP)**: The C++ core uses OpenMP to parallelize the comparison loops across all available CPU cores.
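
The effect of normalization caching can be illustrated with a small pure-Python sketch. It does not use the library: `normalize`, `exact`, `loop_match`, and `cached_match` are hypothetical stand-ins, and the call counter exists only to make the difference visible.

```python
from typing import Callable, Dict, List, Tuple

# Counts how often the (hypothetical) normalization step runs.
calls: Dict[str, int] = {"normalize": 0}


def normalize(text: str) -> str:
    """Stand-in for a normalization step; here it only lowercases."""
    calls["normalize"] += 1
    return text.lower()


def exact(a: str, b: str) -> float:
    """Toy scorer used only for illustration."""
    return 1.0 if a == b else 0.0


Scorer = Callable[[str, str], float]
Results = List[List[Tuple[str, float]]]


def loop_match(queries: List[str], candidates: List[str], scorer: Scorer = exact) -> Results:
    """Naive loop: every candidate is re-normalized for every query."""
    results: Results = []
    for q in queries:
        nq = normalize(q)
        results.append([(c, scorer(nq, normalize(c))) for c in candidates])
    return results


def cached_match(queries: List[str], candidates: List[str], scorer: Scorer = exact) -> Results:
    """Batch style: normalize the candidate set once and reuse it."""
    prepared = [(c, normalize(c)) for c in candidates]
    results: Results = []
    for q in queries:
        nq = normalize(q)
        results.append([(c, scorer(nq, nc)) for c, nc in prepared])
    return results
```

With Q queries and C candidates, the naive loop performs Q + Q*C normalizations, while the cached version performs Q + C. Both return the same scores.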
2447

2548
```python
2649
import fuzzybunny
@@ -31,11 +54,14 @@ candidates = ["apple pie", "banana bread", "cherry tart", "apple turnover"]
3154
# Parallel matching
3255
results = fuzzybunny.batch_match(queries, candidates, top_n=2)
3356

34-
# Results is a list where each element matches the corresponding query
35-
for i, res in enumerate(results):
36-
print(f"Results for {queries[i]}: {res}")
57+
# results is a list of result lists
58+
# results[0] contains matches for "apple"
59+
# results[1] contains matches for "banana"
3760
```
3861

62+
!!! tip "Performance Hint"
63+
Parallel execution is triggered automatically when there are more than 5 queries. The C++ core releases the Python GIL during the matching loops, which allows true multi-core utilization.
64+
3965
## Custom Python Scorers
4066

4167
You can pass a custom Python function as the `scorer` argument.

docs/guide/basic_usage.md

Lines changed: 35 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -4,60 +4,77 @@ FuzzyBunny provides a simple and intuitive API for fuzzy string matching.
44

55
## Individual Scorers
66

7-
The library offers several algorithms to compare strings:
7+
The library offers several algorithms to compare two strings directly. These functions expect strings as input and return a score between 0.0 and 1.0.
88

99
```python
1010
import fuzzybunny
1111

12-
# Levenshtein Distance
13-
score = fuzzybunny.levenshtein("kitten", "sitting")
12+
# Levenshtein Ratio (edit distance)
13+
fuzzybunny.levenshtein("kitten", "sitting")
1414
# 0.5714...
1515

16-
# Token Sort Ratio
17-
# Good for strings with the same words but in different orders
18-
score = fuzzybunny.token_sort("apple banana", "banana apple")
16+
# Partial Ratio (best substring match)
17+
fuzzybunny.partial_ratio("apple", "apple pie")
1918
# 1.0
2019

21-
# Jaccard Similarity
22-
# Good for comparing sets of tokens
23-
score = fuzzybunny.jaccard("apple banana cherry", "banana apple")
20+
# Token Sort Ratio (alphabetical word ordering)
21+
fuzzybunny.token_sort("apple banana", "banana apple")
22+
# 1.0
23+
24+
# Token Set Ratio (set intersection/difference)
25+
# Good for strings with extra words or duplicates
26+
fuzzybunny.token_set("apple banana", "apple banana banana")
27+
# 1.0
28+
29+
# Jaccard Similarity (intersection over union)
30+
fuzzybunny.jaccard("apple banana cherry", "banana apple")
2431
# 0.666...
32+
33+
# WRatio (Weighted Ratio - Recommended for general use)
34+
fuzzybunny.wratio("fuzzy bunny", "bunny fuzzy!!!")
35+
# 1.0
2536
```
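
The Levenshtein and Jaccard scores in the comments above are consistent with a ratio of the form `1 - distance / max(len(a), len(b))` and with Jaccard computed on whitespace-separated tokens. The pure-Python sketch below reproduces that arithmetic under those assumptions. It uses the classic dynamic-programming edit distance rather than the library's C++ implementation, and all helper names are illustrative.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def jaccard_tokens(a: str, b: str) -> float:
    """Intersection over union of whitespace-separated tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# "kitten" -> "sitting" needs 3 edits over a longest length of 7.
kitten_ratio = levenshtein_ratio("kitten", "sitting")  # 1 - 3/7

# Two shared tokens out of three distinct tokens.
fruit_jaccard = jaccard_tokens("apple banana cherry", "banana apple")  # 2/3
```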
2637

38+
!!! info "Direct vs. Ranked Matching"
39+
Individual scorer functions (like `levenshtein`, `jaccard`, etc.) do **not** automatically normalize your strings. They perform a direct comparison. If you need automatic lowercasing or punctuation removal, use `rank` or `batch_match`, or preprocess your strings manually.
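
For manual preprocessing, a minimal sketch follows. The `normalize` helper is hypothetical and only approximates the lowercasing and punctuation removal described here. FuzzyBunny's own normalization rules may differ, for example in how non-ASCII punctuation is handled.

```python
import string

# Translation table that deletes ASCII punctuation characters only.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and trim surrounding whitespace."""
    return text.lower().translate(_PUNCTUATION_TABLE).strip()


cleaned = normalize("Bunny Fuzzy!!!")  # "bunny fuzzy"
```

Strings cleaned this way can then be passed to any individual scorer so that case and punctuation do not affect the result.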
40+
2741
## Ranking Candidates
2842

29-
To find the best matches from a list of strings, use the `rank` function:
43+
To find the best matches from a list of strings, use the `rank` function. This function *does* provide integrated normalization.
3044

3145
```python
3246
candidates = ["apple pie", "banana bread", "cherry tart", "apple turnover"]
3347

3448
# Find top 2 matches for "apple"
49+
# By default, it uses 'levenshtein' and 'process=True'
3550
results = fuzzybunny.rank("apple", candidates, top_n=2)
3651
# [('apple pie', 0.55), ('apple turnover', 0.35)]
3752
```
3853

3954
### Partial Matching
4055

41-
If you want to find if a query exists as a substring of a candidate, use `mode="partial"`:
56+
To check whether a query appears as a substring of a candidate, use `mode="partial"`. In `rank`, this uses the `partial_ratio` logic.
4257

4358
```python
4459
# Standard rank (full match)
4560
res_full = fuzzybunny.rank("apple", ["apple pie"], mode="full")
46-
# Score will be ~0.55
61+
# Score: 0.555...
4762

4863
# Partial rank (substring match)
4964
res_partial = fuzzybunny.rank("apple", ["apple pie"], mode="partial")
50-
# Score will be 1.0 because "apple" is exactly in "apple pie"
65+
# Score: 1.0
5166
```
5267

5368
## Normalization
5469

55-
By default, FuzzyBunny normalizes strings by lowercasing and removing punctuation. You can disable this by passing `process=False`:
70+
By default, `rank` and `batch_match` normalize strings by lowercasing and removing punctuation. You can disable this by passing `process=False`:
5671

5772
```python
58-
# Default (case-insensitive)
59-
fuzzybunny.levenshtein("APPLE", "apple", process=True) # 1.0
73+
# Default (case-insensitive & punctuation-agnostic)
74+
fuzzybunny.rank("APPLE!", ["apple"], process=True)
75+
# [('apple', 1.0)]
6076

61-
# Case-sensitive
62-
fuzzybunny.levenshtein("APPLE", "apple", process=False) # < 1.0
77+
# Case-sensitive and strict
78+
fuzzybunny.rank("APPLE!", ["apple"], process=False)
79+
# [('apple', 0.0)]
6380
```

docs/index.md

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -5,12 +5,13 @@ A high-performance, lightweight Python library for fuzzy string matching and ran
55
## Features
66

77
- **Blazing Fast**: Optimized C++ core (Myers' Bit-Parallel algorithm) for superior performance.
8-
- **Multiple Scorers**: Support for Levenshtein, Jaccard, Token Sort, Token Set, QRatio, and WRatio.
9-
- **Partial Matching**: Find the best substring matches.
8+
- **Multiple Scorers**: Support for Levenshtein, Jaccard, Token Sort, Token Set, QRatio, WRatio, and Partial Ratio.
9+
- **Partial Matching**: Find the best substring matches using `mode="partial"`.
1010
- **Hybrid Scoring**: Combine multiple scorers with custom weights.
1111
- **Python Callbacks**: Use your own Python functions as scorers.
12-
- **Pandas & NumPy Integration**: Native support for Series and Arrays.
12+
- **Pandas & NumPy Integration**: Native support for Series and Arrays via a dedicated accessor.
1313
- **Parallelized**: Parallel matching for large datasets using OpenMP.
14+
- **Unicode Support**: Handles international characters and basic normalization.
1415

1516
## Quick Start
1617
