
Commit 8c6ea90

Comprehensive documentation upgrade: improved docstrings, guides, and syntax highlighting
1 parent 173bdbe commit 8c6ea90

6 files changed

Lines changed: 310 additions & 14 deletions

docs/guide/advanced.md

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
# Advanced Scoring and Performance

FuzzyBunny provides several advanced tools for performance-sensitive and custom matching needs.

## WRatio (Weighted Similarity Ratio)

`WRatio` is the recommended general-purpose matcher. It combines Levenshtein, partial, and token-based ratios using heuristics to produce a more "intuitive" similarity score than any single algorithm.

```python
import fuzzybunny

# Matches well even when word order and length differ
score = fuzzybunny.wratio("fuzzy bunny", "bunny fuzzy!!!")
# 1.0 (the token sort/set ratios match exactly, and WRatio picks the best score)
```
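
To make the "pick the best score" idea concrete, here is a minimal pure-Python sketch. It is not FuzzyBunny's implementation: the helper names (`normalize`, `base_ratio`, `token_sort_ratio`, `wratio_sketch`) are hypothetical, `difflib.SequenceMatcher` stands in for a Levenshtein ratio, and combining the scores with a plain `max` is an assumption, since the library's actual weights and heuristics are not described here.

```python
import string
from difflib import SequenceMatcher

_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(s):
    # Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace.
    return " ".join(s.lower().translate(_PUNCT).split())

def base_ratio(a, b):
    # Stand-in similarity in [0.0, 1.0]; the library uses a Levenshtein ratio instead.
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a, b):
    # Sort the whitespace-separated tokens so that word order no longer matters.
    return base_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def wratio_sketch(a, b):
    # Hypothetical combination rule: take the best of the individual scores.
    a, b = normalize(a), normalize(b)
    return max(base_ratio(a, b), token_sort_ratio(a, b))

print(wratio_sketch("fuzzy bunny", "bunny fuzzy!!!"))  # 1.0
```

After normalization both inputs contain the same two tokens, so the token-sort comparison sees identical strings and the maximum is 1.0.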

## High-Performance Batch Matching

When comparing many queries against a common candidate set, `batch_match` is the most efficient choice.

It provides two major optimizations over calling `rank` in a loop:

1. **Multi-threading (OpenMP)**: Automatically distributes work across all available CPU cores.
2. **Normalization caching**: Normalizes the candidate set once per batch instead of once per query.

```python
import fuzzybunny

queries = ["apple", "banana", "cherry"]
candidates = ["apple pie", "banana bread", "cherry tart", "apple turnover"]

# Parallel matching
results = fuzzybunny.batch_match(queries, candidates, top_n=2)

# results is a list whose i-th element holds the matches for queries[i]
for query, res in zip(queries, results):
    print(f"Results for {query}: {res}")
```
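
The normalization-caching idea can be sketched in pure Python. This is only an illustration under stated assumptions: `batch_match_sketch` and `normalize` are hypothetical names, `difflib.SequenceMatcher` is a stand-in scorer, and the sketch runs sequentially rather than reproducing the OpenMP threading.

```python
from difflib import SequenceMatcher

def normalize(s):
    # Assumed normalization for the sketch: lowercase and trim whitespace.
    return s.lower().strip()

def batch_match_sketch(queries, candidates, top_n=2):
    # Normalize the candidate set once, up front, instead of once per query.
    normalized = [(c, normalize(c)) for c in candidates]
    results = []
    for query in queries:
        nq = normalize(query)
        scored = [
            (original, SequenceMatcher(None, nq, norm).ratio())
            for original, norm in normalized
        ]
        # Highest score first, then keep the top_n matches for this query.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scored[:top_n])
    return results

queries = ["apple", "banana", "cherry"]
candidates = ["apple pie", "banana bread", "cherry tart", "apple turnover"]
for query, matches in zip(queries, batch_match_sketch(queries, candidates)):
    print(query, matches)
```

The result has one entry per query, in query order, which mirrors the shape described for `batch_match` above.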

## Custom Python Scorers

You can pass a custom Python function as the `scorer` argument. The function receives two strings and should return a score between 0.0 and 1.0.

!!! warning "Performance"
    Custom Python scorers are significantly slower than the built-in C++ scorers because they must acquire the Python Global Interpreter Lock (GIL) for every comparison.

```python
import fuzzybunny

def my_custom_scorer(s1, s2):
    # Your custom logic here; return a score between 0.0 and 1.0.
    # Guard against empty strings before indexing the first character.
    if not s1 or not s2:
        return 0.0
    return 1.0 if s1[0] == s2[0] else 0.0

results = fuzzybunny.rank("apple", ["apricot", "banana"], scorer=my_custom_scorer)
```
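
As a slightly richer illustration, the sketch below scores two strings by the length of their shared prefix. The name `common_prefix_scorer` is hypothetical, and treating two empty strings as a perfect match is an assumption made for this example only.

```python
def common_prefix_scorer(s1, s2):
    """Score two strings by shared prefix length, scaled to [0.0, 1.0]."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    shared = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        shared += 1
    return shared / longest

# "apple" and "apricot" share the prefix "ap": 2 of 7 characters.
print(common_prefix_scorer("apple", "apricot"))
```

Because it takes two strings and returns a float in the expected range, a function like this should be usable in the same way as `my_custom_scorer` above, for example `scorer=common_prefix_scorer`.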

## Integration with Pandas and NumPy

FuzzyBunny integrates directly with common data science tools:

```python
import pandas as pd
import fuzzybunny

df = pd.DataFrame({"names": ["apple pie", "banana bread", "cherry tart"]})

# Use the pandas accessor
results = df["names"].fuzzy.match("apple")
```

docs/guide/basic_usage.md

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
# Basic Usage

FuzzyBunny provides a simple and intuitive API for fuzzy string matching.

## Individual Scorers

The library offers several algorithms for comparing strings:

```python
import fuzzybunny

# Levenshtein ratio: 1 - (distance / max_length)
score = fuzzybunny.levenshtein("kitten", "sitting")
# 0.5714... (distance 3, max length 7)

# Token Sort Ratio
# Good for strings with the same words in a different order
score = fuzzybunny.token_sort("apple banana", "banana apple")
# 1.0

# Jaccard Similarity
# Good for comparing sets of tokens
score = fuzzybunny.jaccard("apple banana cherry", "banana apple")
# 0.666... (2 shared tokens out of 3 unique tokens)
```
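
The numbers in the comments above can be checked by hand. The following pure-Python sketch recomputes two of them using the formulas given in the library's docstrings (`1 - distance / max_length` for Levenshtein, intersection over union for Jaccard). The function names are hypothetical, tokens are assumed to be split on whitespace, and this is a reference calculation rather than the library's C++ code.

```python
def levenshtein_distance(a, b):
    # Classic dynamic programming over edit operations, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def levenshtein_ratio(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest

def jaccard(a, b):
    # Intersection over union of the unique whitespace-separated tokens.
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return 1.0 if not union else len(ta & tb) / len(union)

print(round(levenshtein_ratio("kitten", "sitting"), 4))          # 0.5714
print(round(jaccard("apple banana cherry", "banana apple"), 4))  # 0.6667
```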

## Ranking Candidates

To find the best matches from a list of strings, use the `rank` function:

```python
candidates = ["apple pie", "banana bread", "cherry tart", "apple turnover"]

# Find the top 2 matches for "apple"
results = fuzzybunny.rank("apple", candidates, top_n=2)
# [('apple pie', 0.55), ('apple turnover', 0.35)]  (scores approximate)
```

### Partial Matching

To check whether a query appears as a substring of a candidate, use `mode="partial"`:

```python
# Standard rank (full match)
res_full = fuzzybunny.rank("apple", ["apple pie"], mode="full")
# Score will be ~0.55

# Partial rank (substring match)
res_partial = fuzzybunny.rank("apple", ["apple pie"], mode="partial")
# Score will be 1.0 because "apple" appears verbatim in "apple pie"
```
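
The difference between the two modes can be reproduced with a small sliding-window sketch. It follows the description in the `partial_ratio` docstring (best Levenshtein ratio between the shorter string and any equally long substring of the longer one). The function names are hypothetical, and the handling of empty input is an assumption.

```python
def levenshtein_ratio(a, b):
    # 1 - distance / max_length, using a single-row dynamic programming table.
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def partial_ratio_sketch(s1, s2):
    short, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    k = len(short)
    if k == 0:
        return 1.0 if not longer else 0.0  # assumption for empty input
    # Slide a window of length k across the longer string and keep the best score.
    return max(
        levenshtein_ratio(short, longer[i:i + k])
        for i in range(len(longer) - k + 1)
    )

print(round(levenshtein_ratio("apple", "apple pie"), 4))  # 0.5556
print(partial_ratio_sketch("apple", "apple pie"))         # 1.0
```

The full comparison is penalized for the four extra characters in "apple pie" (1 - 4/9), while the windowed comparison finds an exact match at the first position.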

## Normalization

By default, FuzzyBunny normalizes strings by lowercasing them and removing punctuation. To disable this, pass `process=False`:

```python
# Default (case-insensitive)
fuzzybunny.levenshtein("APPLE", "apple", process=True)  # 1.0

# Case-sensitive
fuzzybunny.levenshtein("APPLE", "apple", process=False)  # < 1.0
```
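
For intuition, lowercasing and punctuation removal might look like the sketch below. This is an assumption about the general technique, not the library's exact rules: `normalize` is a hypothetical helper, and the real implementation may treat whitespace, symbols, or non-ASCII text differently.

```python
import unicodedata

def normalize(text):
    lowered = text.lower()
    # Drop every character whose Unicode category starts with "P" (punctuation).
    kept = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any runs of whitespace left behind by the removed characters.
    return " ".join(kept.split())

print(normalize("APPLE") == normalize("apple"))  # True
print(normalize("Hello, World!"))                # hello world
```

Once both inputs pass through the same normalization, "APPLE" and "apple" become identical strings, which is why the default comparison scores 1.0.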

docs/guide/installation.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
# Installation

FuzzyBunny can be installed from PyPI using `pip`.

```bash
pip install fuzzybunny
```

## System Requirements

- **Python**: 3.8 or higher.
- **Compiler**: A C++17-compatible compiler (only needed when building from source).

## Platform Specifics

### macOS

For parallel processing via OpenMP, install `libomp` with Homebrew:

```bash
brew install libomp
```

FuzzyBunny will automatically detect `libomp` and enable multi-threading for `batch_match`.

### Linux

Most Linux distributions ship `libgomp` as part of `gcc`. No extra steps are typically required.

### Windows

OpenMP is supported via MSVC compiler flags.

mkdocs.yml

Lines changed: 45 additions & 2 deletions
@@ -4,17 +4,60 @@ repo_url: https://github.com/cachevector/fuzzybunny
 theme:
   name: material
   palette:
-    primary: deep purple
-    accent: pink
+    - media: "(prefers-color-scheme: light)"
+      scheme: default
+      primary: deep purple
+      accent: pink
+      toggle:
+        icon: material/brightness-7
+        name: Switch to dark mode
+    - media: "(prefers-color-scheme: dark)"
+      scheme: slate
+      primary: deep purple
+      accent: pink
+      toggle:
+        icon: material/brightness-4
+        name: Switch to light mode
+  features:
+    - navigation.tabs
+    - navigation.sections
+    - toc.follow
+    - content.code.annotate
+    - content.code.copy

 plugins:
   - search
   - mkdocstrings:
       handlers:
         python:
           paths: [src]
+          options:
+            show_source: true
+            show_root_heading: true
+            show_category_heading: true
+
+markdown_extensions:
+  - pymdownx.highlight:
+      anchor_linenums: true
+      pygments_lang_class: true
+  - pymdownx.inlinehilite
+  - pymdownx.snippets
+  - pymdownx.superfences:
+      custom_fences:
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format
+  - admonition
+  - pymdownx.details
+  - pymdownx.emoji:
+      emoji_index: !!python/name:pymdownx.emoji.twemoji
+      emoji_generator: !!python/name:pymdownx.emoji.to_svg

 nav:
   - Home: index.md
+  - User Guide:
+    - Installation: guide/installation.md
+    - Basic Usage: guide/basic_usage.md
+    - Advanced Scoring: guide/advanced.md
   - API Reference: api.md
   - Performance: performance.md

src/bindings.cpp

Lines changed: 40 additions & 7 deletions
@@ -20,31 +20,64 @@ PYBIND11_MODULE(_fuzzybunny, m) {

     m.def("levenshtein", [](const std::string& s1, const std::string& s2) {
         return levenshtein_ratio(utf8_to_u32(s1), utf8_to_u32(s2));
-    }, py::arg("s1"), py::arg("s2"), "Calculate Levenshtein ratio (0.0 - 1.0)");
+    }, py::arg("s1"), py::arg("s2"), R"pbdoc(
+        Calculate the Levenshtein similarity ratio between two strings.
+
+        Returns a score between 0.0 and 1.0, where 1.0 is an exact match.
+        The ratio is calculated as: 1 - (distance / max_length).
+    )pbdoc");

     m.def("partial_ratio", [](const std::string& s1, const std::string& s2) {
         return partial_ratio(utf8_to_u32(s1), utf8_to_u32(s2));
-    }, py::arg("s1"), py::arg("s2"), "Calculate Partial Levenshtein ratio (0.0 - 1.0)");
+    }, py::arg("s1"), py::arg("s2"), R"pbdoc(
+        Calculate the best substring similarity ratio.
+
+        If the shorter string has length k, this finds the best Levenshtein
+        ratio between the shorter string and any substring of length k
+        in the longer string.
+    )pbdoc");

     m.def("jaccard", [](const std::string& s1, const std::string& s2) {
         return jaccard_similarity(utf8_to_u32(s1), utf8_to_u32(s2));
-    }, py::arg("s1"), py::arg("s2"), "Calculate Jaccard similarity (0.0 - 1.0)");
+    }, py::arg("s1"), py::arg("s2"), R"pbdoc(
+        Calculate Jaccard similarity between token sets.
+
+        Tokenizes both strings and calculates the intersection over union
+        of the unique tokens.
+    )pbdoc");

     m.def("token_sort", [](const std::string& s1, const std::string& s2) {
         return token_sort_ratio(utf8_to_u32(s1), utf8_to_u32(s2));
-    }, py::arg("s1"), py::arg("s2"), "Calculate Token Sort ratio (0.0 - 1.0)");
+    }, py::arg("s1"), py::arg("s2"), R"pbdoc(
+        Calculate similarity ratio after sorting tokens.
+
+        Tokenizes both strings, sorts the tokens alphabetically, joins them
+        back with spaces, and then calculates the Levenshtein ratio.
+    )pbdoc");

     m.def("token_set", [](const std::string& s1, const std::string& s2) {
         return token_set_ratio(utf8_to_u32(s1), utf8_to_u32(s2));
-    }, py::arg("s1"), py::arg("s2"), "Calculate Token Set ratio (0.0 - 1.0)");
+    }, py::arg("s1"), py::arg("s2"), R"pbdoc(
+        Calculate similarity ratio while ignoring duplicates and token order.
+
+        Finds the intersection and differences between token sets and
+        compares them to find the best possible match.
+    )pbdoc");

     m.def("qratio", [](const std::string& s1, const std::string& s2) {
         return qratio(utf8_to_u32(s1), utf8_to_u32(s2));
-    }, py::arg("s1"), py::arg("s2"), "Calculate QRatio (0.0 - 1.0)");
+    }, py::arg("s1"), py::arg("s2"), R"pbdoc(
+        A simple Levenshtein ratio matching the behavior of other fuzzy libs.
+    )pbdoc");

     m.def("wratio", [](const std::string& s1, const std::string& s2) {
         return wratio(utf8_to_u32(s1), utf8_to_u32(s2));
-    }, py::arg("s1"), py::arg("s2"), "Calculate WRatio (0.0 - 1.0)");
+    }, py::arg("s1"), py::arg("s2"), R"pbdoc(
+        Weighted similarity ratio (recommended for general use).
+
+        Combines Levenshtein, partial ratio, and token-based ratios using
+        heuristics to provide the most 'intuitive' similarity score.
+    )pbdoc");

     m.def("rank", &rank,
         py::arg("query"),
