BIOfid_Updates/Updates.json at main · texttechnologylab/BIOfid_Updates · GitHub

[
  {
    "title": "Add DUUI-based temporal expression detection components (duui-Time)",
    "content": "Implement Dockerized DUUI components for selected temporal expression detection backends, including Microsoft Recognizers-Text, Duckling, Stanford SUTime, German GELECTRA, BERT Got-a-Date, TEI2GO, Timexy, and generic Hugging Face token-classification models. Each Docker image now builds a single model- and language-specific TimeX3 service with DUUI endpoints for type system, Lua communication, documentation, and processing. Add model metadata, runtime parameters, TimeX3/ISO-TimeML annotation mapping, per-model Docker build support, external Duckling/CoreNLP service configuration, Java test coverage, and usage documentation for DUUI integration.",
    "date": "2026-06-10",
    "author": "Mevlüt Bagci",
    "email": "bagci@em.uni-frankfurt.de"
  },
  {
    "title": "Added DUUI-Based Transformer NER Components",
    "content": "This update adds Dockerized DUUI components for selected multilingual transformer-based Named Entity Recognition models, including GLiNER, GLiNER2, RoBERTa, WikiNEuRal, and XLM-R. Each Docker image now provides a dedicated model-specific NER service with DUUI endpoints for the type system, Lua communication layer, documentation, and text processing. The implementation also includes model metadata, configurable runtime parameters, DKPro and TTLab annotation mapping, and usage documentation for integration into DUUI pipelines.",
    "date": "2026-06-08",
    "author": "Mevlüt Bagci",
    "email": "bagci@em.uni-frankfurt.de"
  },
  {
    "title": "Added a DUUI-Based Component for Creating Embeddings Using Self-Hosted Models in Ollama",
    "content": "This update adds a Dockerized DUUI component for creating sentence embeddings using models hosted in Ollama. It will later be used to create embeddings of documents before they are added to UCE.",
    "date": "2026-06-03",
    "author": "Daniel Bundan",
    "email": "bundan@em.uni-frankfurt.de"
  },
  {
    "title": "Improved Update Overview and Detail View",
    "content": "The update display on this website has been redesigned to make the overview clearer and easier to navigate. Each update now includes a dedicated title that is shown directly in the list and calendar views. The full update description is no longer displayed immediately in the overview, but is available through the \"Open details\" button. This keeps the update list compact while still allowing users to access the complete content whenever needed.",
    "date": "2026-06-01",
    "author": "Mevlüt Bagci",
    "email": "bagci@em.uni-frankfurt.de"
  },
  {
    "title": "Pro-Mode Search Parser",
    "content": "A full recursive-descent parser was introduced that compiles boolean search expressions directly into PostgreSQL tsquery. The parser implements a formal grammar (AND, OR, NOT, FOLLOWED_BY with the adjacency proximity operator and configurable distance operator, plus parenthesized grouping as well as domain-specific annotation traversal) via a multi-pass AST pipeline: lexing, parsing, semantic expansion (CommandExpansionPass resolving taxonomic and geographic commands against the database and SPARQL endpoint, TaxonEnrichmentPass enriching plain taxon terms with alternative names from the knowledge graph), AST normalization (deduplication and canonical ordering), and finally tsquery compilation (ProTsQueryCompiler generating PostgreSQL tsquery strings with proper AND conjunction, OR disjunction, NOT negation, adjacency proximity, and distance N syntax). The backend passes proModeActivated=true through the search flow (SearchApi to EnrichedSearchQuery to Search_DefaultImpl to PostgresqlDataInterface) so that the database layer applies to_tsquery with the simple dictionary configuration to preserve operator semantics. The frontend delivers a chip-based visual query builder (proModeSearchBar FreeMarker template, JavaScript, and CSS) with inline editing, command indicators, grouping overlays, and real-time syntax error display fed from ProModeSyntaxException diagnostics.",
    "date": "2026-03-26",
    "author": "Dawit Terefe",
    "email": "s0424382@stud.uni-frankfurt.de"
  },
  {
    "title": "DUUI Importer Pipeline Architecture",
    "content": "The corpus import system was migrated from a legacy CompletableFuture-based batch processor to a staged DUUI pipeline framework. The new pipeline implements a typed document processing graph: a RuntimeSeedGenerator feeds into CorpusToDocuments fork, followed by sequential pipeline stages — document-read (wait-for-batch), document-jcas, parallel document-independent-extraction (UCE metadata, sentences, NER, sentiments, emotions, lemmata, SRL, times, taxonomy, wikilinks, negation, topics, images, permissions all executing concurrently), document-dependent-extraction (geonames, pages, logical-links), document-domain-capture (domain graph recording via AgeGraphService), document-persist (persist-document, persist-domain-association-graph), and finally corpus-finalization (refresh logical links, lexicon, geonames, postprocess, persist operation graph). A DUUI-as-a-service variant (DUUICorpusImporter) wraps this pipeline behind HTTP endpoints implementing the DUUI protocol (at paths for communication layer, type system, process, and input/output details) for integration with external DUUI orchestration. Concurrency robustness was added to all services (PostgresqlDataInterface, AgeGraphService, S3StorageService, JenaSparqlService, LexiconService, EmbeddingService) to handle parallel pipeline stages safely.",
    "date": "2026-05-14",
    "author": "Dawit Terefe",
    "email": "s0424382@stud.uni-frankfurt.de"
  },
  {
    "title": "Importer Connection Pooling and Service Resilience",
    "content": "The import pipeline's database access layer was hardened with HikariCP connection pooling configured through Hibernate (HibernateConf wiring through CommonConfig properties). Pool sizing (maximum 10 connections, minimum 2 idle) with configurable connection timeout (30s), idle timeout (10min), max lifetime (30min), and leak detection threshold (60s) prevents connection exhaustion during parallel pipeline stages. The architecture enforces a strict separation of concerns: pipeline stages define import semantics only and do not own service pool management — each service (PostgresqlDataInterface, AgeGraphService, S3StorageService, JenaSparqlService, LexiconService, EmbeddingService) is responsible for its own capacity, blocking behavior, and retry logic. Safe tsquery SQL functions were designed (15_safe_tsquery_functions.sql) with array-to-tsquery escaping, term length limits, and automatic fallback to plainto_tsquery on overflow. An enhanced two-phase search function (16_enhanced_search_function.sql) chunks large expanded term sets into batches of 50 terms, executes a first-phase query, and only fans out to additional chunked queries if insufficient results are returned, avoiding giant tsquery strings that would degrade PostgreSQL GIN index performance.",
    "date": "2026-05-14",
    "author": "Dawit Terefe",
    "email": "s0424382@stud.uni-frankfurt.de"
  },
  {
    "title": "Importer Pipelined Processing with Configurable Parallelism",
    "content": "The importer was restructured to support configurable thread-level parallelism (via the numThreads command-line flag) that controls DUUI document task concurrency independently of service pool capacity. Input handling was extended to support .xmi, .bz2, .zip, and .gz corpus formats with automatic decompression. The import flow produces S3-stored raw XMI backups when enableS3Storage is set, and executes post-processing continuations (embedding generation, topic modeling, batch management) through DocumentImportContinuation. An importer harness test framework was added with podman-compose orchestration for local profiling, correctness matrix testing across domain subsets, service health checking, and Jupyter-based import profiling notebooks for performance analysis.",
    "date": "2026-04-30",
    "author": "Dawit Terefe",
    "email": "s0424382@stud.uni-frankfurt.de"
  },
  {
    "title": "GNFinder Taxon Abbreviation Diagnostic",
    "content": "A diagnostic script scans imported UCE corpus XMI files for abbreviated gnfinder taxon annotations (e.g. 'C. muricata' where the genus is only an initial) that likely need repair. For each abbreviated gnfinder Taxon element found, the script identifies nearby fully-verified gnfinder VerifiedTaxon annotations by character offset proximitycan limits and GBIF request timeouts.",
    "date": "2026-05-29",
    "author": "Dawit Terefe",
    "email": "s0424382@stud.uni-frankfurt.de"
  },
  {
    "title": "FUSEKI GBIF Taxon Enrichment for RDF Patching",
    "content": "A generic taxon name enrichment tool queries the full GBIF taxonomic API surface to produce comprehensive JSON reports for comparison against Fuseki/TDB RDF store contents. For each input scientific name, the tool resolves exact homonym usages from GBIF species search (casefold-canonical matching), infers the full genus from the nearest verified match starting with the same initial letter, and queries the GBIF species API (species match and species search endpoints) to obtain canonical expansion candidates with confidence scores and taxonomic status. The output is a structured report per query with sections for GBIF match diagnostics, exact canonical usages, accepted usage summaries, source-to-accepted mappings, and a deduplicated expansionNames list used for patching stale SPARQL RDF artifacts.",
    "date": "2026-05-29",
    "author": "Dawit Terefe",
    "email": "s0424382@stud.uni-frankfurt.de"
  },
  {
    "title": "Podman Compose Deployment and Profile-Based Service Orchestration",
    "content": "A Podman-compatible compose file (podman-compose.yaml) was added, enabling rootless container deployment. A new 'web' Compose profile was introduced, allowing the UCE web application to be started standalone without authentication dependencies. The recently patched RDF artifacts from the Fuseki backend, SPARQL, enabled valid taxonomic expansions once aga",
    "date": "2026-05-29",
    "author": "Dawit Terefe",
    "email": "s0424382@stud.uni-frankfurt.de"
  },
  {
    "title": "Article metadata information reader",
    "content": "A reader for article metadata information: Article, Page, Journal, Volume, Issue, and Collection.",
    "date": "2026-05-11",
    "author": "Mevlüt Bagci",
    "email": "bagci@em.uni-frankfurt.de"
  },
  {
    "title": "Corpus meta information display",
    "content": "UCE displays meta information for individual corpora, such as the number of documents and pages, and which tools were used to process them.",
    "date": "2026-05-11",
    "author": "Mevlüt Bagci",
    "email": "bagci@em.uni-frankfurt.de"
  },
  {
    "title": "UB Journal DUUI BioFID pipeline processing",
    "content": "Processing of the complete UB Journal with the DUUI BioFID pipeline has been started.",
    "date": "2026-05-11",
    "author": "Mevlüt Bagci",
    "email": "bagci@em.uni-frankfurt.de"
  },
  {
    "title": "DUUI-Adapted Integration of Coreferee for Multilingual Coreference Resolution",
    "content": "The coreference resolution tool Coreferee has been integrated into DUUI and adapted to work with spaCy tokens generated within the DUUI pipeline. Instead of relying on Coreferee's own internal spaCy tokenization, the component uses the existing DUUI-generated spaCy token annotations as input for coreference resolution. This ensures better compatibility with DUUI processing workflows and allows coreference information to be added consistently to already processed documents. The integration supports English, German, French, and Polish. Two DUUI variants are available: the small variant (\"sm\") is intended for faster processing, while the large variant (\"lg\") provides a more accurate pipeline for use cases where higher-quality coreference resolution is required.",
    "date": "2026-05-29",
    "author": "Mevlüt Bagci",
    "email": "bagci@em.uni-frankfurt.de"
  }
]