Skip to content

Commit 2dda97b

Browse files
authored
Merge pull request #320 from sciknoworg/dev
documentation, text2onto (major), and requirements update
2 parents c5fc55b + 38c018c commit 2dda97b

21 files changed

Lines changed: 6722 additions & 223 deletions

.github/workflows/test-os-compatibility.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ on:
44
push:
55
branches: [main]
66
pull_request:
7-
branches: [main]
7+
branches: [main, dev]
88

99
jobs:
1010
os-compatibility-tests:

.github/workflows/test-package.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ on:
88
pull_request:
99
branches:
1010
- main
11+
- dev
1112

1213
jobs:
1314
build-and-test:

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ print(ontolearner.__version__)
7373
- **Unified API**: consistent `fit → predict → evaluate` interface across all learners.
7474
- **LearnerPipeline**: end-to-end pipeline in a single call.
7575
- **Extensible**: easily plug in custom ontologies, learners, or retrievers.
76+
- **Text2Onto generation**: synthetic document generation now uses a direct `transformers` backend with ontology-aware context enrichment.
7677

7778
---
7879

docs/source/learners/llms4ol_challenge/alexbek_learner.rst

Lines changed: 5 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -257,33 +257,20 @@ Text2Onto
257257
Loading Ontological Data
258258
~~~~~~~~~~~~~~~~~~~~~~~~~
259259

260-
For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and then generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).
260+
For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and then generate synthetic pseudo-sentences using a direct ``transformers`` backend.
261261

262262
.. code-block:: python
263263
264264
import os
265-
import dspy
266265
267266
# Ontology loader/manager
268267
from ontolearner.ontology import OM
269268
270269
# Text2Onto utilities: synthetic generation + dataset splitting
271270
from ontolearner.text2onto import SyntheticGenerator, SyntheticDataSplitter
272271
273-
# ---- DSPy -> Ollama (LiteLLM-style) ----
274-
LLM_MODEL_ID = "ollama/llama3.2:3b" # use your pulled Ollama model
275-
LLM_API_KEY = "NA" # local Ollama doesn't use a key
276-
LLM_BASE_URL = "http://localhost:11434" # default Ollama endpoint
277-
278-
dspy_llm = dspy.LM(
279-
model=LLM_MODEL_ID,
280-
cache=True,
281-
max_tokens=4000,
282-
temperature=0,
283-
api_key=LLM_API_KEY,
284-
base_url=LLM_BASE_URL,
285-
)
286-
dspy.configure(lm=dspy_llm)
272+
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
273+
HF_TOKEN = os.getenv("HF_TOKEN", "")
287274
288275
# ---- Synthetic generation configuration ----
289276
pseudo_sentence_batch_size = int(os.getenv("TEXT2ONTO_BATCH", "10"))
@@ -292,6 +279,8 @@ For the Text2Onto task, we load an ontology (via ``OM``), extract its structured
292279
text2onto_synthetic_generator = SyntheticGenerator(
293280
batch_size=pseudo_sentence_batch_size,
294281
worker_count=max_worker_count_for_llm_calls,
282+
model_id=MODEL_ID,
283+
token=HF_TOKEN,
295284
)
296285
297286
# ---- Load ontology and extract structured data ----

docs/source/learners/llms4ol_challenge/sbunlp_learner.rst

Lines changed: 5 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -188,31 +188,18 @@ Text2Onto
188188
Loading Ontological Data
189189
~~~~~~~~~~~~~~~~~~~~~~~~~~~
190190

191-
For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).
191+
For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and generate synthetic pseudo-sentences using a direct ``transformers`` backend.
192192

193193
.. code-block:: python
194194
195195
import os
196-
import dspy
197196
198197
# Import ontology loader/manager and Text2Onto utilities
199198
from ontolearner.ontology import OM
200199
from ontolearner.text2onto import SyntheticGenerator, SyntheticDataSplitter
201200
202-
# ---- DSPy -> Ollama (LiteLLM-style) ----
203-
LLM_MODEL_ID = "ollama/llama3.2:3b"
204-
LLM_API_KEY = "NA" # local Ollama doesn't use a key
205-
LLM_BASE_URL = "http://localhost:11434" # default Ollama endpoint
206-
207-
dspy_llm = dspy.LM(
208-
model=LLM_MODEL_ID,
209-
cache=True,
210-
max_tokens=4000,
211-
temperature=0,
212-
api_key=LLM_API_KEY,
213-
base_url=LLM_BASE_URL,
214-
)
215-
dspy.configure(lm=dspy_llm)
201+
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
202+
HF_TOKEN = os.getenv("HF_TOKEN", "")
216203
217204
# ---- Synthetic generation configuration ----
218205
batch_size = int(os.getenv("TEXT2ONTO_BATCH", "10"))
@@ -221,6 +208,8 @@ For the Text2Onto task, we load an ontology (via ``OM``), extract its structured
221208
text2onto_synthetic_generator = SyntheticGenerator(
222209
batch_size=batch_size,
223210
worker_count=worker_count,
211+
model_id=MODEL_ID,
212+
token=HF_TOKEN,
224213
)
225214
226215
# ---- Load ontology and extract structured data ----

docs/source/learners/rag.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,8 @@ Retrieval Augmented Generation
88

99
**Retrieval Augmented Generation (RAG)** learners combine the strengths of both retrieval models and LLMs to perform ontology learning tasks. This methodology is a hybrid approach designed to enhance ontology learning by addressing the limitations of using LLMs alone.
1010

11+
The same retrieval-and-context idea is also useful for Text2Onto generation: ontology graph context can be injected into the synthetic document prompt so the generated passages stay closer to the source ontology.
12+
1113
RAG learners operate in two main steps: 1) **Retrieval** component that finds the most relevant examples from the training data based on similarity to the input query. 2) **Generation** component that uses retrieved examples as context to generate a response. The methodology behind RAG learners combines vector retrieval with generative language modeling to enhance ontology learning tasks. This hybrid approach addresses the limitations of using LLMs alone by grounding the model's responses in specific ontological examples from the training data. By encoding ontological elements into a vector space, the retriever can identify semantically similar concepts, relations, or taxonomic structures. These retrieved examples serve as few-shot demonstrations that provide the LLM with domain-specific context, enabling more accurate and consistent ontological inferences. This approach is particularly effective for specialized domains where the model's pre-trained knowledge may be insufficient or where precise ontological alignments are critical.
1214

1315
Loading Ontological Data

docs/source/learning_tasks/learning_tasks.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,8 @@ Within the OntoLearner framework, the modularized ontologies are extended with O
1717

1818
Additionally, OntoLearner incorporates a **Text2Onto**, which focuses on extracting ontological terms and types directly from raw text. Notably, Text2Onto is designed to function independently of the LLMs4OL pipeline, and Ontologizer. Users can import or load LLMs4OL tasks as inputs to Text2Onto, enabling flexible and extensible data extraction workflows for OL.
1919

20+
Text2Onto synthetic generation now uses a direct ``transformers``-based backend, and the generator can enrich each passage with ontology-aware context pulled from the extracted term/type graph. This makes the generated text more faithful to the source ontology and easier to use for downstream training.
21+
2022
LLMs4OL Paradigm
2123
-------------------
2224

docs/source/learning_tasks/text2onto.rst

Lines changed: 27 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -13,61 +13,63 @@ The first step is to load the ontology data from the selected ontology.
1313
1414
conference = ConferenceOntology()
1515
conference.load()
16-
ontological_data = ontology.extract()
16+
ontological_data = conference.extract()
1717
1818
print(f"term types: {len(ontological_data.term_typings)}")
1919
print(f"taxonomic relations: {len(ontological_data.type_taxonomies.taxonomies)}")
2020
print(f"non-taxonomic relations: {len(ontological_data.type_non_taxonomic_relations.non_taxonomies)}")
2121
2222
23-
As the second step, an LLM is used to generate synthetic text documents. DSPy is used to connect to the LLM and parse the LLM outputs. You can use an LLM from an external provider
24-
or host an LLM locally using tools such as Ollama or vLLM.
23+
As the second step, an LLM is used to generate synthetic text documents. Text2Onto now uses a direct ``transformers`` backend for generation, so you can run it with a Hugging Face model locally or with a remote model that is accessible through the standard Transformers APIs.
2524

26-
.. note::
27-
28-
More details about all providers supported by ``DSPy`` (through *LiteLLM*) can be found in `this link <https://docs.litellm.ai/docs/providers>`_.
29-
30-
Information about the LLM is provided in a ``.env`` file similar to the following.
31-
32-
.. code-block::
25+
The generator also enriches the prompt with ontology-aware context derived from the extracted term typing, taxonomy, and non-taxonomic relation structure. This improves faithfulness and helps the model produce more coherent passages that stay closer to the source ontology.
3326

34-
"LLM_MODEL_ID"={model_id_from_provider}
35-
"LLM_BASE_URL"={llm_provider_base_url}
36-
"LLM_API_KEY"={api_key_for_the_provider}
27+
.. note::
3728

29+
Text2Onto works best with instruction-tuned Hugging Face models that can follow structured-output prompts. Smaller models can work for quick demos, while stronger instruction-tuned models usually produce cleaner passages and more consistent JSON.
3830

39-
Then you can configure DSPy to use the provided LLM and generate the synthetic text documents using the ontology data extracted before.
31+
You can configure the generator directly with a model identifier, optional Hugging Face token, and decoding settings.
4032

4133
.. code-block:: python
4234
4335
from dotenv import load_dotenv
44-
import dspy
36+
import os
4537
4638
from ontolearner.text2onto import SyntheticGenerator
4739
4840
load_dotenv(override=True)
4941
50-
dspy_llm = dspy.LM(
51-
model=os.environ["LLM_MODEL_ID"],
52-
cache=True,
53-
max_tokens=4000,
54-
temperature=0,
55-
api_key=os.environ["LLM_API_KEY"],
56-
base_url=os.environ["LLM_BASE_URL"])
57-
dspy.configure(lm=dspy_llm)
58-
5942
pseudo_sentence_batch_size = 50
6043
max_worker_count_for_llm_calls = 3
6144
text2onto_synthetic_generator = SyntheticGenerator(batch_size=pseudo_sentence_batch_size,
62-
worker_count=max_worker_count_for_llm_calls)
45+
worker_count=max_worker_count_for_llm_calls,
46+
model_id=os.getenv("TEXT2ONTO_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct"),
47+
token=os.getenv("HF_TOKEN", ""),
48+
device=os.getenv("TEXT2ONTO_DEVICE", "auto"),
49+
max_new_tokens=256)
6350
synthetic_data = text2onto_synthetic_generator.generate(ontological_data=ontological_data,
6451
topic=ontology.domain)
6552
53+
.. tip::
54+
55+
For better generation quality, use an instruction-tuned model, keep temperature low, and increase ``batch_size`` only when the ontology context still fits comfortably into the model context window.
56+
57+
.. note::
58+
59+
The generator does not rely on DSPy anymore. If you previously configured DSPy for Text2Onto, you can remove that setup and pass the model directly through ``SyntheticGenerator``.
60+
6661
Data Splitter
6762
------------------------
6863

6964
You can split the generated synthetic data for training, hyperparameter optimization (validation), and testing purposes.
7065

66+
If you want to improve the synthetic corpus further, the current generator can be extended with:
67+
68+
* richer ontology context retrieval from neighboring terms or parent chains,
69+
* stricter JSON/structured-output validation,
70+
* post-generation repair retries when required labels are missing,
71+
* and optional reranking of multiple candidate passages.
72+
7173
.. code-block:: python
7274
7375
from ontolearner.text2onto import SyntheticDataSplitter

docs/source/package_reference/text2onto.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,15 +18,15 @@ SyntheticGenerator
1818

1919
SyntheticDataSplitter
2020
-----------------------
21-
.. autoclass:: ontolearner.text2onto.spliter.SyntheticDataSplitter
21+
.. autoclass:: ontolearner.text2onto.splitter.SyntheticDataSplitter
2222
:members:
2323
:undoc-members:
2424
:show-inheritance:
2525

2626

2727
TaxonomyBatchifier
2828
-----------------------
29-
.. autoclass:: ontolearner.text2onto.batchifier.SyntheticDataSplitter
29+
.. autoclass:: ontolearner.text2onto.batchifier.TaxonomyBatchifier
3030
:members:
3131
:undoc-members:
3232
:show-inheritance:

docs/source/quickstart.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -143,6 +143,8 @@ Once the data is split into training and testing sets, you can apply learning mo
143143
- LLM-only: set ``llm_id``
144144
- RAG: set both ``retriever_id`` + ``llm_id`` for AutoRAGLearner. For prebuild RAG pass ``rag`` learner.
145145

146+
Text2Onto synthetic data generation follows a similar philosophy: the generator now uses a direct ``transformers`` backend and augments prompts with ontology graph context, which makes the generated passages more faithful and easier to validate.
147+
146148
In the example below, we configure a RAG-based learner by specifying the Qwen LLM (`Qwen/Qwen2.5-0.5B-Instruct <https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct>`_) and a retriever based on a sentence-transformer model (`all-MiniLM-L6-v2 <https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2>`_):
147149

148150
.. sidebar:: Other Learners

0 commit comments

Comments
 (0)