sciknoworg
diff --git a/.github/workflows/test-os-compatibility.yml b/.github/workflows/test-os-compatibility.yml
Lines changed: 1 addition & 1 deletion
diff --git a/.github/workflows/test-package.yml b/.github/workflows/test-package.yml
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/docs/source/learners/llms4ol_challenge/alexbek_learner.rst b/docs/source/learners/llms4ol_challenge/alexbek_learner.rst
Lines changed: 5 additions & 16 deletions
diff --git a/docs/source/learners/llms4ol_challenge/sbunlp_learner.rst b/docs/source/learners/llms4ol_challenge/sbunlp_learner.rst
Lines changed: 5 additions & 16 deletions
diff --git a/docs/source/learners/rag.rst b/docs/source/learners/rag.rst
Lines changed: 2 additions & 0 deletions
diff --git a/docs/source/learning_tasks/learning_tasks.rst b/docs/source/learning_tasks/learning_tasks.rst
Lines changed: 2 additions & 0 deletions
diff --git a/docs/source/learning_tasks/text2onto.rst b/docs/source/learning_tasks/text2onto.rst
Lines changed: 27 additions & 25 deletions
diff --git a/docs/source/package_reference/text2onto.rst b/docs/source/package_reference/text2onto.rst
Lines changed: 2 additions & 2 deletions
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
Lines changed: 2 additions & 0 deletions
@@ -4,7 +4,7 @@ on:
   push:
     branches: [main]
   pull_request:
-    branches: [main]
+    branches: [main, dev]
 
 jobs:
   os-compatibility-tests:
 
@@ -8,6 +8,7 @@ on:
   pull_request:
     branches:
       - main
+      - dev
 
 jobs:
   build-and-test:
 
@@ -73,6 +73,7 @@ print(ontolearner.__version__)
 - **Unified API**: consistent `fit → predict → evaluate` interface across all learners.
 - **LearnerPipeline**: end-to-end pipeline in a single call.
 - **Extensible**: easily plug in custom ontologies, learners, or retrievers.
+- **Text2Onto generation**: synthetic document generation now uses a direct `transformers` backend with ontology-aware context enrichment.
 
 ---
 
 
@@ -257,33 +257,20 @@ Text2Onto
 Loading Ontological Data
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
-For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and then generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).
+For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and then generate synthetic pseudo-sentences using a direct ``transformers`` backend.
 
 .. code-block:: python
 
    import os
-   import dspy
 
    # Ontology loader/manager
    from ontolearner.ontology import OM
 
    # Text2Onto utilities: synthetic generation + dataset splitting
    from ontolearner.text2onto import SyntheticGenerator, SyntheticDataSplitter
 
-   # ---- DSPy -> Ollama (LiteLLM-style) ----
-   LLM_MODEL_ID = "ollama/llama3.2:3b"      # use your pulled Ollama model
-   LLM_API_KEY  = "NA"                      # local Ollama doesn't use a key
-   LLM_BASE_URL = "http://localhost:11434"  # default Ollama endpoint
-
-   dspy_llm = dspy.LM(
-       model=LLM_MODEL_ID,
-       cache=True,
-       max_tokens=4000,
-       temperature=0,
-       api_key=LLM_API_KEY,
-       base_url=LLM_BASE_URL,
-   )
-   dspy.configure(lm=dspy_llm)
+   MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
+   HF_TOKEN = os.getenv("HF_TOKEN", "")
 
    # ---- Synthetic generation configuration ----
    pseudo_sentence_batch_size = int(os.getenv("TEXT2ONTO_BATCH", "10"))
@@ -292,6 +279,8 @@ For the Text2Onto task, we load an ontology (via ``OM``), extract its structured
    text2onto_synthetic_generator = SyntheticGenerator(
        batch_size=pseudo_sentence_batch_size,
        worker_count=max_worker_count_for_llm_calls,
+       model_id=MODEL_ID,
+       token=HF_TOKEN,
    )
 
    # ---- Load ontology and extract structured data ----
 
@@ -188,31 +188,18 @@ Text2Onto
 Loading Ontological Data
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).
+For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and generate synthetic pseudo-sentences using a direct ``transformers`` backend.
 
 .. code-block:: python
 
    import os
-   import dspy
 
    # Import ontology loader/manager and Text2Onto utilities
    from ontolearner.ontology import OM
    from ontolearner.text2onto import SyntheticGenerator, SyntheticDataSplitter
 
-   # ---- DSPy -> Ollama (LiteLLM-style) ----
-   LLM_MODEL_ID = "ollama/llama3.2:3b"
-   LLM_API_KEY  = "NA"                      # local Ollama doesn't use a key
-   LLM_BASE_URL = "http://localhost:11434"  # default Ollama endpoint
-
-   dspy_llm = dspy.LM(
-       model=LLM_MODEL_ID,
-       cache=True,
-       max_tokens=4000,
-       temperature=0,
-       api_key=LLM_API_KEY,
-       base_url=LLM_BASE_URL,
-   )
-   dspy.configure(lm=dspy_llm)
+   MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
+   HF_TOKEN = os.getenv("HF_TOKEN", "")
 
    # ---- Synthetic generation configuration ----
    batch_size = int(os.getenv("TEXT2ONTO_BATCH", "10"))
@@ -221,6 +208,8 @@ For the Text2Onto task, we load an ontology (via ``OM``), extract its structured
    text2onto_synthetic_generator = SyntheticGenerator(
        batch_size=batch_size,
        worker_count=worker_count,
+       model_id=MODEL_ID,
+       token=HF_TOKEN,
    )
 
    # ---- Load ontology and extract structured data ----
 
@@ -8,6 +8,8 @@ Retrieval Augmented Generation
 
 **Retrieval Augmented Generation (RAG)** learners combine the strengths of both retrieval models and LLMs to perform ontology learning tasks. This methodology is a hybrid approach designed to enhance ontology learning by addressing the limitations of using LLMs alone.
 
+The same retrieval-and-context idea is also useful for Text2Onto generation: ontology graph context can be injected into the synthetic document prompt so the generated passages stay closer to the source ontology.
+
 RAG learners operate in two main steps: 1) **Retrieval** component that finds the most relevant examples from the training data based on similarity to the input query. 2) **Generation** component that uses retrieved examples as context to generate a response. The methodology behind RAG learners combines vector retrieval with generative language modeling to enhance ontology learning tasks. This hybrid approach addresses the limitations of using LLMs alone by grounding the model's responses in specific ontological examples from the training data. By encoding ontological elements into a vector space, the retriever can identify semantically similar concepts, relations, or taxonomic structures. These retrieved examples serve as few-shot demonstrations that provide the LLM with domain-specific context, enabling more accurate and consistent ontological inferences. This approach is particularly effective for specialized domains where the model's pre-trained knowledge may be insufficient or where precise ontological alignments are critical.
 
 Loading Ontological Data
 
@@ -17,6 +17,8 @@ Within the OntoLearner framework, the modularized ontologies are extended with O
 
 Additionally, OntoLearner incorporates a **Text2Onto**, which focuses on extracting ontological terms and types directly from raw text. Notably, Text2Onto is designed to function independently of the LLMs4OL pipeline, and Ontologizer. Users can import or load LLMs4OL tasks as inputs to Text2Onto, enabling flexible and extensible data extraction workflows for OL.
 
+Text2Onto synthetic generation now uses a direct ``transformers``-based backend, and the generator can enrich each passage with ontology-aware context pulled from the extracted term/type graph. This makes the generated text more faithful to the source ontology and easier to use for downstream training.
+
 LLMs4OL Paradigm
 -------------------
 
 
@@ -13,61 +13,63 @@ The first step is to load the ontology data from the selected ontology.
 
     conference = ConferenceOntology()
     conference.load()
-    ontological_data = ontology.extract()
+    ontological_data = conference.extract()
 
     print(f"term types: {len(ontological_data.term_typings)}")
     print(f"taxonomic relations: {len(ontological_data.type_taxonomies.taxonomies)}")
     print(f"non-taxonomic relations: {len(ontological_data.type_non_taxonomic_relations.non_taxonomies)}")
 
 
-As the second step, an LLM is used to generate synthetic text documents. DSPy is used to connect to the LLM and parse the LLM outputs. You can use an LLM from an external provider
-or host an LLM locally using tools such as Ollama or vLLM.
+As the second step, an LLM is used to generate synthetic text documents. Text2Onto now uses a direct ``transformers`` backend for generation, so you can run it with a Hugging Face model locally or with a remote model that is accessible through the standard Transformers APIs.
 
-.. note::
-
-     More details about all providers supported by ``DSPy`` (through *LiteLLM*) can be found in `this link <https://docs.litellm.ai/docs/providers>`_.
-
-Information about the LLM is provided in a ``.env`` file similar to the following.
-
-.. code-block::
+The generator also enriches the prompt with ontology-aware context derived from the extracted term typing, taxonomy, and non-taxonomic relation structure. This improves faithfulness and helps the model produce more coherent passages that stay closer to the source ontology.
 
-    "LLM_MODEL_ID"={model_id_from_provider}
-    "LLM_BASE_URL"={llm_provider_base_url}
-    "LLM_API_KEY"={api_key_for_the_provider}
+.. note::
 
+     Text2Onto works best with instruction-tuned Hugging Face models that can follow structured-output prompts. Smaller models can work for quick demos, while stronger instruction-tuned models usually produce cleaner passages and more consistent JSON.
 
-Then you can configure DSPy to use the provided LLM and generate the synthetic text documents using the ontology data extracted before.
+You can configure the generator directly with a model identifier, optional Hugging Face token, and decoding settings.
 
 .. code-block:: python
 
     from dotenv import load_dotenv
-    import dspy
+    import os
 
     from ontolearner.text2onto import SyntheticGenerator
 
     load_dotenv(override=True)
 
-    dspy_llm = dspy.LM(
-        model=os.environ["LLM_MODEL_ID"],
-        cache=True,
-        max_tokens=4000,
-        temperature=0,
-        api_key=os.environ["LLM_API_KEY"],
-        base_url=os.environ["LLM_BASE_URL"])
-    dspy.configure(lm=dspy_llm)
-
     pseudo_sentence_batch_size = 50
     max_worker_count_for_llm_calls = 3
     text2onto_synthetic_generator = SyntheticGenerator(batch_size=pseudo_sentence_batch_size,
-                                                   worker_count=max_worker_count_for_llm_calls)
+                                                        worker_count=max_worker_count_for_llm_calls,
+                                                        model_id=os.getenv("TEXT2ONTO_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct"),
+                                                        token=os.getenv("HF_TOKEN", ""),
+                                                        device=os.getenv("TEXT2ONTO_DEVICE", "auto"),
+                                                        max_new_tokens=256)
     synthetic_data = text2onto_synthetic_generator.generate(ontological_data=ontological_data,
                                                                     topic=ontology.domain)
 
+.. tip::
+
+   For better generation quality, use an instruction-tuned model, keep temperature low, and increase ``batch_size`` only when the ontology context still fits comfortably into the model context window.
+
+.. note::
+
+   The generator does not rely on DSPy anymore. If you previously configured DSPy for Text2Onto, you can remove that setup and pass the model directly through ``SyntheticGenerator``.
+
 Data Splitter
 ------------------------
 
 You can split the generated synthetic data for training, hyperparameter optimization (validation), and testing purposes.
 
+If you want to improve the synthetic corpus further, the current generator can be extended with:
+
+* richer ontology context retrieval from neighboring terms or parent chains,
+* stricter JSON/structured-output validation,
+* post-generation repair retries when required labels are missing,
+* and optional reranking of multiple candidate passages.
+
 .. code-block:: python
 
     from ontolearner.text2onto import SyntheticDataSplitter
 
@@ -18,15 +18,15 @@ SyntheticGenerator
 
 SyntheticDataSplitter
 -----------------------
-.. autoclass:: ontolearner.text2onto.spliter.SyntheticDataSplitter
+.. autoclass:: ontolearner.text2onto.splitter.SyntheticDataSplitter
    :members:
    :undoc-members:
    :show-inheritance:
 
 
 TaxonomyBatchifier
 -----------------------
-.. autoclass:: ontolearner.text2onto.batchifier.SyntheticDataSplitter
+.. autoclass:: ontolearner.text2onto.batchifier.TaxonomyBatchifier
    :members:
    :undoc-members:
    :show-inheritance:
@@ -143,6 +143,8 @@ Once the data is split into training and testing sets, you can apply learning mo
 - LLM-only: set ``llm_id``
 - RAG: set both ``retriever_id`` + ``llm_id`` for AutoRAGLearner. For prebuild RAG pass ``rag`` learner.
 
+Text2Onto synthetic data generation follows a similar philosophy: the generator now uses a direct ``transformers`` backend and augments prompts with ontology graph context, which makes the generated passages more faithful and easier to validate.
+
 In the example below, we configure a RAG-based learner by specifying the Qwen LLM (`Qwen/Qwen2.5-0.5B-Instruct <https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct>`_) and a retriever based on a sentence-transformer model (`all-MiniLM-L6-v2 <https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2>`_):
 
 .. sidebar:: Other Learners