Commit e052cc0

Merge pull request #352 from urchade/feature/client
Implement a proper gliner client, fix bugs
2 parents 4b9e7f7 + a5a1ac3 commit e052cc0

5 files changed

Lines changed: 269 additions & 97 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ data.json
 logs/
 models/
 
+labels_trie.cpp
+
 # Distribution / packaging
 .Python
 build/

docs/serving.md

Lines changed: 123 additions & 10 deletions
@@ -46,10 +46,10 @@ from `asyncio`.
 ```python
 from gliner.serve import GLiNERClient
 
-client = GLiNERClient()
+client = GLiNERClient()  # defaults to http://localhost:8000/gliner
 result = client.predict(
     "John works at Google in Mountain View",
-    labels=["person", "organization", "location"]
+    labels=["person", "organization", "location"],
 )
 print(result)
 # {'entities': [
@@ -59,7 +59,35 @@ print(result)
 # ]}
 ```
 
-**HTTP request:**
+`GLiNERClient` is a pure HTTP client built on the Python standard library —
+it does **not** import `ray` and does **not** join the Ray cluster, so it
+runs from any Python process (including environments where `ray` is not
+installed). Construct it with a custom URL/prefix or timeout as needed:
+
+```python
+client = GLiNERClient(
+    base_url="http://gliner.internal:8000",
+    route_prefix="/gliner",
+    timeout=30.0,
+    max_concurrency=32,  # bound on concurrent in-flight HTTP requests
+)
+```
+
+Passing a list of texts preserves server-side dynamic batching — each text
+is dispatched as its own HTTP request concurrently (threads for `predict`,
+`asyncio.gather` for `predict_async`) so Ray Serve's `@serve.batch`
+coalesces them into a single forward pass:
+
+```python
+outputs = client.predict(
+    ["John works at Google", "Paris is in France"],
+    labels=["person", "organization", "location"],
+)  # → list[dict], one per input text
+```
+
+Network or server errors surface as `gliner.serve.client.GLiNERClientError`.
+
+**HTTP request (no client library):**
 ```bash
 curl -X POST http://localhost:8000/gliner \
   -H "Content-Type: application/json" \
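
The client-side fan-out described in this hunk (one HTTP request per text, threads for `predict`) can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: `fan_out`, `send_one`, and `fake_send` are hypothetical names, and a stub stands in for the real HTTP POST so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out(
    texts: list[str],
    labels: list[str],
    send_one: Callable[[str, list[str]], dict],
    max_concurrency: int = 32,
) -> list[dict]:
    """Send each text as its own request and return results in input order."""
    if not texts:
        return []
    # Never start more threads than there are texts to send.
    workers = min(max_concurrency, len(texts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, even when the
        # underlying calls complete out of order.
        return list(pool.map(lambda t: send_one(t, labels), texts))


def fake_send(text: str, labels: list[str]) -> dict:
    """Hypothetical stand-in for one HTTP POST to the server."""
    return {"text": text, "labels": labels, "entities": []}


outputs = fan_out(
    ["John works at Google", "Paris is in France"],
    ["person", "organization", "location"],
    fake_send,
)
```

Because the requests are in flight at the same time, a server that batches on arrival can group them; a sequential loop would give it only one request at a time to batch.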
@@ -128,16 +156,100 @@ result = ref.result()
 
 ## Relation Extraction
 
-For models that support relation extraction:
+GLiNER-RelEx models (e.g. `knowledgator/gliner-relex-large-v0.5`,
+`knowledgator/gliner-token-relex-v1.0`) jointly extract entities and the
+relations between them in a single forward pass. The server auto-detects
+relation support by inspecting `model.config.model_type` and enables the
+relex code path when it contains `"relex"` — no extra flag is needed.
+
+### Start a RelEx server
+
+```bash
+python -m gliner.serve \
+  --model knowledgator/gliner-relex-large-v1.0 \
+  --dtype bfloat16 \
+  --max-batch-size 16
+```
+
+### Predict via the client
 
 ```python
+from gliner.serve import GLiNERClient
+
+client = GLiNERClient()  # http://localhost:8000/gliner
+
+text = "Bill Gates founded Microsoft in 1975. The company is headquartered in Redmond."
+
 result = client.predict(
-    "John works at Google",
-    labels=["person", "organization"],
-    relations=["works_at", "founded_by"]
+    text,
+    labels=["person", "organization", "date", "location"],
+    relations=["founded", "founded_in", "headquartered_in"],
+    threshold=0.5,
+    relation_threshold=0.5,
 )
-# {'entities': [...], 'relations': [...]}
+
+for ent in result["entities"]:
+    print(f"  {ent['text']} ({ent['label']})")
+
+for rel in result["relations"]:
+    head = result["entities"][rel["head"]["entity_idx"]]
+    tail = result["entities"][rel["tail"]["entity_idx"]]
+    print(f"  {head['text']} --[{rel['relation']}]--> {tail['text']}")
+```
+
+For a batched call, pass a list of texts — each one dispatches as its own
+request so the server can coalesce them into a single relex forward pass:
+
+```python
+results = client.predict(
+    [
+        "Bill Gates founded Microsoft in 1975.",
+        "Apple is headquartered in Cupertino.",
+    ],
+    labels=["person", "organization", "location", "date"],
+    relations=["founded", "founded_in", "headquartered_in"],
+)
+# results == [ {"entities": [...], "relations": [...]}, {...} ]
+```
+
+### In-process (GLiNERFactory)
+
+```python
+from gliner.serve import GLiNERFactory
+
+with GLiNERFactory(model="knowledgator/gliner-relex-large-v0.5") as llm:
+    out = llm.predict(
+        "Bill Gates founded Microsoft in 1975.",
+        labels=["person", "organization", "date"],
+        relations=["founded", "founded_in"],
+    )
+```
+
+### HTTP (curl)
+
+```bash
+curl -X POST http://localhost:8000/gliner \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Bill Gates founded Microsoft in 1975.",
+    "labels": ["person", "organization", "date"],
+    "relations": ["founded", "founded_in"],
+    "threshold": 0.5,
+    "relation_threshold": 0.5
+  }'
+```
+
+**Response shape for RelEx models:**
+```python
+{
+    "entities": [{"start", "end", "text", "label", "score"}, ...],
+    "relations": [{"relation", "score",
+                   "head": {"entity_idx": int, ...},
+                   "tail": {"entity_idx": int, ...}}, ...],
+}
 ```
+For NER-only models the `"relations"` key is omitted; passing `relations=`
+to such a model is a no-op.
 
 ## All CLI Options
 
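
The response shape documented in this hunk links each relation to its entities through `entity_idx`. A short sketch of resolving those indices into readable triples is given here. `relation_triples` is a hypothetical helper and `sample` is hand-written illustrative data in the documented shape, not real model output.

```python
def relation_triples(result: dict) -> list[tuple[str, str, str]]:
    """Resolve entity_idx references into (head text, relation, tail text)."""
    entities = result.get("entities", [])
    triples = []
    # NER-only models omit the "relations" key, so default to an empty list.
    for rel in result.get("relations", []):
        head = entities[rel["head"]["entity_idx"]]
        tail = entities[rel["tail"]["entity_idx"]]
        triples.append((head["text"], rel["relation"], tail["text"]))
    return triples


# Hand-written sample in the documented response shape (scores are made up).
sample = {
    "entities": [
        {"start": 0, "end": 10, "text": "Bill Gates", "label": "person", "score": 0.9},
        {"start": 19, "end": 28, "text": "Microsoft", "label": "organization", "score": 0.9},
    ],
    "relations": [
        {
            "relation": "founded",
            "score": 0.8,
            "head": {"entity_idx": 0},
            "tail": {"entity_idx": 1},
        },
    ],
}

triples = relation_triples(sample)
ner_only = relation_triples({"entities": sample["entities"]})
```

Using `result.get("relations", [])` keeps the same helper usable against both RelEx and NER-only servers.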
@@ -152,7 +264,7 @@ Model Configuration:
 
 Batching:
   --max-batch-size           Max batch size (default: 32)
-  --batch-wait-timeout-ms    Batch wait timeout (default: 50)
+  --batch-wait-timeout-ms    Batch wait timeout (default: 10)
   --precompiled-batch-sizes  Comma-separated sizes (default: 1,2,4,8,16,32)
 
 Replicas:
@@ -165,7 +277,8 @@ Performance:
   --no-compile               Disable torch.compile
 
 Memory:
-  --target-memory-fraction   GPU memory fraction (default: 0.8)
+  --target-memory-fraction   GPU memory fraction (default: 0.9)
+  --memory-overhead-factor   Safety margin on memory estimates (default: 1.3)
 
 Server:
   --route-prefix             HTTP route (default: /gliner)

gliner/modeling/base.py

Lines changed: 2 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -2186,17 +2186,8 @@ def select_target_embedding(
21862186
- target_mask: Packed mask of shape (B, M).
21872187
"""
21882188
B, N, D = representations.shape
2189-
2190-
# ``lengths.max().item()`` would force a GPU→CPU sync to compress the output
2191-
# from ``N`` to ``max_len``. Under ``torch.compile`` with ``capture_scalar_outputs``
2192-
# the sync is traced symbolically and we keep the packing benefit; in eager
2193-
# execution we skip packing (use the full ``N``) to stay sync-free, at the cost
2194-
# of some downstream compute that masked positions would have saved.
2195-
if torch.compiler.is_compiling():
2196-
lengths = rep_mask.sum(dim=-1)
2197-
max_len = lengths.max().item()
2198-
else:
2199-
max_len = N
2189+
lengths = rep_mask.sum(dim=-1)
2190+
max_len = lengths.max().item()
22002191

22012192
if max_len != N:
22022193
target_rep = representations.new_zeros(B, max_len, D)
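
After this change `max_len` is always read from the mask, so the output is compressed from `N` to the longest row in eager mode as well, at the cost of the `.item()` sync that the removed comment describes. The idea of packing by mask can be illustrated without tensors. The sketch below is a toy under stated assumptions, not the body of `select_target_embedding`: `pack_by_mask` is a hypothetical helper, rows hold scalars instead of `D`-dimensional vectors, and it assumes that packing means moving kept positions to the front and padding the remainder.

```python
def pack_by_mask(rows, masks, pad=0.0):
    """Keep masked-in items of each row, left-aligned and padded to max_len."""
    # Per-row count of kept positions (analogous to rep_mask.sum(dim=-1)).
    lengths = [sum(1 for m in mask if m) for mask in masks]
    # Longest kept run across the batch (analogous to lengths.max().item()).
    max_len = max(lengths, default=0)

    packed, packed_mask = [], []
    for row, mask, n in zip(rows, masks, lengths):
        kept = [x for x, m in zip(row, mask) if m]
        packed.append(kept + [pad] * (max_len - n))
        packed_mask.append([True] * n + [False] * (max_len - n))
    return packed, packed_mask


rows = [[1, 2, 3], [4, 5, 6]]
masks = [[1, 0, 1], [0, 1, 0]]
packed, packed_mask = pack_by_mask(rows, masks)
```

Here the width shrinks from `N = 3` to `max_len = 2`, which is the saving in downstream compute that the old eager path gave up by keeping the full `N`.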
