docs/source/learners/llms4ol_challenge/alexbek_learner.rst
+5 -16 (5 additions & 16 deletions)
@@ -257,33 +257,20 @@ Text2Onto
 Loading Ontological Data
 ~~~~~~~~~~~~~~~~~~~~~~~~~

-For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and then generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).
+For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and then generate synthetic pseudo-sentences using a direct ``transformers`` backend.
docs/source/learners/llms4ol_challenge/sbunlp_learner.rst
+5 -16 (5 additions & 16 deletions)
@@ -188,31 +188,18 @@ Text2Onto
 Loading Ontological Data
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

-For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).
+For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and generate synthetic pseudo-sentences using a direct ``transformers`` backend.

 .. code-block:: python

     import os
-    import dspy

     # Import ontology loader/manager and Text2Onto utilities
     from ontolearner.ontology import OM
     from ontolearner.text2onto import SyntheticGenerator, SyntheticDataSplitter
docs/source/learners/rag.rst
+2 -0 (2 additions & 0 deletions)
@@ -8,6 +8,8 @@ Retrieval Augmented Generation
 **Retrieval Augmented Generation (RAG)** learners combine the strengths of both retrieval models and LLMs to perform ontology learning tasks. This methodology is a hybrid approach designed to enhance ontology learning by addressing the limitations of using LLMs alone.

+The same retrieval-and-context idea is also useful for Text2Onto generation: ontology graph context can be injected into the synthetic document prompt so the generated passages stay closer to the source ontology.
+
 RAG learners operate in two main steps: 1) a **Retrieval** component finds the most relevant examples from the training data based on similarity to the input query; 2) a **Generation** component uses the retrieved examples as context to generate a response. The methodology behind RAG learners combines vector retrieval with generative language modeling, grounding the model's responses in specific ontological examples from the training data. By encoding ontological elements into a vector space, the retriever can identify semantically similar concepts, relations, or taxonomic structures. These retrieved examples serve as few-shot demonstrations that provide the LLM with domain-specific context, enabling more accurate and consistent ontological inferences. This approach is particularly effective for specialized domains where the model's pre-trained knowledge may be insufficient or where precise ontological alignments are critical.
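The two steps described in the paragraph above can be illustrated with a minimal, self-contained sketch. This is not OntoLearner's retriever: the toy 3-dimensional vectors, the example pairs, and the helper names ``cosine``, ``retrieve_top_k``, and ``build_few_shot_prompt`` are all hypothetical stand-ins for real sentence-transformer embeddings and a real prompt template.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, indexed_examples, k=2):
    """Retrieval step: return the k example texts most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed_examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy "embeddings"; a real setup would encode the training
# examples with a sentence-transformer model instead.
TRAINING_EXAMPLES = [
    ("aspirin -> Drug", [0.9, 0.1, 0.0]),
    ("ibuprofen -> Drug", [0.8, 0.2, 0.1]),
    ("Paris -> City", [0.0, 0.1, 0.9]),
]


def build_few_shot_prompt(query_term, query_vec, k=2):
    """Generation step input: retrieved examples become few-shot context."""
    shots = retrieve_top_k(query_vec, TRAINING_EXAMPLES, k=k)
    lines = ["Examples:"] + [f"- {shot}" for shot in shots]
    lines.append(f"Now assign a type to the term: {query_term}")
    return "\n".join(lines)


prompt = build_few_shot_prompt("paracetamol", [0.85, 0.15, 0.05])
print(prompt)
```

In this toy setup the two drug examples are retrieved and the unrelated city example is left out, which is the behaviour the few-shot context relies on.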
docs/source/learning_tasks/learning_tasks.rst
+2 -0 (2 additions & 0 deletions)
@@ -17,6 +17,8 @@ Within the OntoLearner framework, the modularized ontologies are extended with O
 Additionally, OntoLearner incorporates **Text2Onto**, which focuses on extracting ontological terms and types directly from raw text. Notably, Text2Onto is designed to function independently of the LLMs4OL pipeline and Ontologizer. Users can import or load LLMs4OL tasks as inputs to Text2Onto, enabling flexible and extensible data extraction workflows for OL.

+Text2Onto synthetic generation now uses a direct ``transformers``-based backend, and the generator can enrich each passage with ontology-aware context pulled from the extracted term/type graph. This makes the generated text more faithful to the source ontology and easier to use for downstream training.
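As a rough illustration of what "ontology-aware context" could look like in a prompt, here is a minimal sketch. It does not reproduce the ``SyntheticGenerator`` internals: the input structures (a term-to-types mapping, child/parent pairs, and relation triples) and the helpers ``build_ontology_context`` and ``build_prompt`` are assumptions made only for this example.

```python
def build_ontology_context(term, term_types, taxonomy, relations):
    """Collect short facts about `term` from three hypothetical structures.

    term_types: dict mapping a term to a list of type labels
    taxonomy:   list of (child, parent) pairs
    relations:  list of (head, relation, tail) triples
    """
    facts = []
    for type_label in term_types.get(term, []):
        facts.append(f"{term} is typed as {type_label}.")
    for child, parent in taxonomy:
        if child == term:
            facts.append(f"{term} is a kind of {parent}.")
    for head, relation, tail in relations:
        if term in (head, tail):
            facts.append(f"{head} {relation} {tail}.")
    return facts


def build_prompt(term, facts):
    """Inject the collected facts into a generation prompt."""
    context = "\n".join(f"- {fact}" for fact in facts) or "- (no context found)"
    return (
        f"Write a short passage about '{term}'.\n"
        "Stay consistent with these ontology facts:\n"
        f"{context}"
    )


# Hypothetical toy ontology content.
term_types = {"aspirin": ["Drug"]}
taxonomy = [("aspirin", "analgesic")]
relations = [("aspirin", "treats", "headache"), ("insulin", "treats", "diabetes")]

facts = build_ontology_context("aspirin", term_types, taxonomy, relations)
print(build_prompt("aspirin", facts))
```

Only facts that mention the target term are injected, so the prompt stays short while still anchoring the passage to the source ontology.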
-As the second step, an LLM is used to generate synthetic text documents. DSPy is used to connect to the LLM and parse the LLM outputs. You can use an LLM from an external provider
-or host an LLM locally using tools such as Ollama or vLLM.
+As the second step, an LLM is used to generate synthetic text documents. Text2Onto now uses a direct ``transformers`` backend for generation, so you can run it with a Hugging Face model locally or with a remote model that is accessible through the standard Transformers APIs.

-.. note::
-
-    More details about all providers supported by ``DSPy`` (through *LiteLLM*) can be found in `this link <https://docs.litellm.ai/docs/providers>`_.
-
-Information about the LLM is provided in a ``.env`` file similar to the following.
-
-.. code-block::
+The generator also enriches the prompt with ontology-aware context derived from the extracted term typing, taxonomy, and non-taxonomic relation structure. This improves faithfulness and helps the model produce more coherent passages that stay closer to the source ontology.

-    "LLM_MODEL_ID"={model_id_from_provider}
-    "LLM_BASE_URL"={llm_provider_base_url}
-    "LLM_API_KEY"={api_key_for_the_provider}
+.. note::

+    Text2Onto works best with instruction-tuned Hugging Face models that can follow structured-output prompts. Smaller models can work for quick demos, while stronger instruction-tuned models usually produce cleaner passages and more consistent JSON.

-Then you can configure DSPy to use the provided LLM and generate the synthetic text documents using the ontology data extracted before.
+You can configure the generator directly with a model identifier, optional Hugging Face token, and decoding settings.

 .. code-block:: python

     from dotenv import load_dotenv
-    import dspy
+    import os

     from ontolearner.text2onto import SyntheticGenerator
For better generation quality, use an instruction-tuned model, keep temperature low, and increase ``batch_size`` only when the ontology context still fits comfortably into the model context window.
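The advice about ``batch_size`` and the context window can be made concrete with a small budgeting sketch. It is an assumption-laden illustration and not part of OntoLearner: ``estimate_tokens`` uses a crude "about 4 characters per token" heuristic instead of a real tokenizer, and ``largest_fitting_batch`` with its parameters is a hypothetical helper.

```python
def estimate_tokens(text):
    """Very rough token estimate (heuristic: about 4 characters per token)."""
    return max(1, len(text) // 4)


def largest_fitting_batch(contexts, context_window, reserved_tokens, max_batch_size):
    """Return how many leading contexts fit into the remaining prompt budget.

    reserved_tokens covers the instructions and the expected model output.
    """
    budget = context_window - reserved_tokens
    used = 0
    count = 0
    for text in contexts[:max_batch_size]:
        cost = estimate_tokens(text)
        if used + cost > budget:
            break
        used += cost
        count += 1
    return count


# Ten hypothetical ontology contexts of roughly 100 tokens each.
contexts = ["a" * 400] * 10

# A 1000-token window with 600 tokens reserved leaves room for 4 contexts.
print(largest_fitting_batch(contexts, context_window=1000, reserved_tokens=600, max_batch_size=8))
```

For a real run, the estimate should come from the tokenizer of the chosen model, since character-based heuristics can be far off for code-like or non-English text.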
+
+.. note::
+
+    The generator does not rely on DSPy anymore. If you previously configured DSPy for Text2Onto, you can remove that setup and pass the model directly through ``SyntheticGenerator``.
+
 Data Splitter
 ------------------------

 You can split the generated synthetic data for training, hyperparameter optimization (validation), and testing purposes.

+If you want to improve the synthetic corpus further, the current generator can be extended with:
+
+* richer ontology context retrieval from neighboring terms or parent chains,
+* stricter JSON/structured-output validation,
+* post-generation repair retries when required labels are missing,
+* and optional reranking of multiple candidate passages.
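Two of the extensions listed above, stricter JSON validation and post-generation repair retries, could be sketched as follows. This is only one possible design under stated assumptions: the required keys in ``REQUIRED_KEYS`` are a hypothetical schema, ``generate`` is any callable that maps a prompt string to raw model text, and the fake model exists only to make the example runnable without downloading weights.

```python
import json

# Hypothetical output schema; the real generator may use different fields.
REQUIRED_KEYS = ("title", "text", "terms")


def parse_passage(raw):
    """Return the parsed dict if `raw` is a JSON object with all required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in REQUIRED_KEYS):
        return None
    return data


def generate_with_repair(generate, prompt, max_retries=2):
    """Call `generate` until the output validates or the retries are used up."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        parsed = parse_passage(generate(attempt_prompt))
        if parsed is not None:
            return parsed
        # Repair attempt: restate the format requirement explicitly.
        attempt_prompt = (
            prompt
            + "\nYour previous answer was not valid. Return only a JSON object with keys: "
            + ", ".join(REQUIRED_KEYS)
        )
    return None


def make_fake_model():
    """Stand-in for an LLM: fails once, then returns a valid JSON object."""
    replies = iter(["not json", '{"title": "T", "text": "Body", "terms": ["a"]}'])
    return lambda prompt: next(replies)


result = generate_with_repair(make_fake_model(), "Describe aspirin.")
print(result)
```

Returning ``None`` after the last retry keeps the failure explicit, so the caller can decide whether to skip the passage or log it for inspection.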
+
 .. code-block:: python

     from ontolearner.text2onto import SyntheticDataSplitter
docs/source/quickstart.rst
+2 -0 (2 additions & 0 deletions)
@@ -143,6 +143,8 @@ Once the data is split into training and testing sets, you can apply learning mo
 - LLM-only: set ``llm_id``
 - RAG: set both ``retriever_id`` + ``llm_id`` for AutoRAGLearner. For a prebuilt RAG, pass the ``rag`` learner.

+Text2Onto synthetic data generation follows a similar philosophy: the generator now uses a direct ``transformers`` backend and augments prompts with ontology graph context, which makes the generated passages more faithful and easier to validate.
+
 In the example below, we configure a RAG-based learner by specifying the Qwen LLM (`Qwen/Qwen2.5-0.5B-Instruct <https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct>`_) and a retriever based on a sentence-transformer model (`all-MiniLM-L6-v2 <https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2>`_):