Merged
185 changes: 185 additions & 0 deletions .docker/config/solr/config-set/accents_et.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,185 @@
# À => A
"\u00C0" => "A"
# Á => A
"\u00C1" => "A"
# Â => A
"\u00C2" => "A"
# Ã => A
"\u00C3" => "A"
# Ä => A
#"\u00C4" => "A"
# Å => A
#"\u00C5" => "A"
# Ą => A
"\u0104" => "A"
# Æ => AE
"\u00C6" => "AE"
# Ç => C
"\u00C7" => "C"
# Ć => C
"\U0106" => "C"

⚠️ Potential issue | 🔴 Critical

Fix the Unicode escape on line 20.

Line 20 uses uppercase \U0106, but the mapping file's escape syntax expects lowercase \u followed by four hex digits, as every other entry in this file does. The uppercase form is not a recognized escape here, so depending on the parser it either fails when Solr loads the file or is silently read as the literal text U0106, leaving Ć unmapped.

Apply this diff to fix the error:

-# Ć => C
-"\U0106" => "C"
+# Ć => C
+"\u0106" => "C"
🤖 Prompt for AI Agents
In .docker/config/solr/config-set/accents_et.txt around line 20, the Unicode
escape uses an invalid uppercase escape "\U0106"; replace it with the correct
lowercase "\u0106" (and scan other entries for any similar uppercase \U
occurrences), save the file in UTF-8, and re-run Solr config validation to
ensure the mapping parses correctly.
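
The scan for other uppercase \U occurrences suggested above could be sketched as follows. This is a minimal sketch, not Solr's own parser: the accepted escape set (\\, \", \n, \t, \r, \b, \f, and \u plus four hex digits) is an assumption, and find_bad_escapes is a hypothetical helper name.

```python
import re

# Escapes assumed to be valid in the mapping file (an assumption, not Solr's code):
# \\  \"  \n  \t  \r  \b  \f  and \uXXXX with exactly four hex digits.
VALID_ESCAPE = re.compile(r'\\(?:[\\"ntrbf]|u[0-9A-Fa-f]{4})')


def find_bad_escapes(text):
    """Return (line_number, escape) pairs for escapes outside the assumed set."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip comments, including commented-out rules
        pos = line.find("\\")
        while pos != -1:
            match = VALID_ESCAPE.match(line, pos)
            if match:
                pos = match.end()
            else:
                problems.append((number, line[pos:pos + 2]))
                pos += 2
            pos = line.find("\\", pos)
    return problems


# A few entries copied from the diff above, including the flagged one.
SAMPLE = r'''
# Ç => C
"\u00C7" => "C"
# Ć => C
"\U0106" => "C"
'''

if __name__ == "__main__":
    for number, escape in find_bad_escapes(SAMPLE):
        print(f"line {number}: unrecognized escape {escape}")
```

Running the same function over the full accents_et.txt contents would list every entry that needs the same fix.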

# È => E
"\u00C8" => "E"
# É => E
"\u00C9" => "E"
# Ê => E
"\u00CA" => "E"
# Ë => E
"\u00CB" => "E"
# Ę => E
"\u0118" => "E"
# Ì => I
"\u00CC" => "I"
# Í => I
"\u00CD" => "I"
# Î => I
"\u00CE" => "I"
# Ï => I
"\u00CF" => "I"
# Ĳ => IJ
"\u0132" => "IJ"
# Ð => D
"\u00D0" => "D"
# Ł => L
"\u0141" => "L"
# Ñ => N
"\u00D1" => "N"
# Ń => N
"\u0143" => "N"
# Ò => O
"\u00D2" => "O"
# Ó => O
"\u00D3" => "O"
# Ô => O
"\u00D4" => "O"
# Õ => O
"\u00D5" => "O"
# Ö => O
#"\u00D6" => "O"
# Ø => O
"\u00D8" => "O"
# Œ => OE
"\u0152" => "OE"
# Þ => TH
"\u00DE" => "TH"
# Ù => U
"\u00D9" => "U"
# Ú => U
"\u00DA" => "U"
# Û => U
"\u00DB" => "U"
# Ü => U
"\u00DC" => "U"
# Ý => Y
"\u00DD" => "Y"
# Ÿ => Y
"\u0178" => "Y"
# à => a
"\u00E0" => "a"
# á => a
"\u00E1" => "a"
# â => a
"\u00E2" => "a"
# ã => a
"\u00E3" => "a"
# ä => a
#"\u00E4" => "a"
# å => a
#"\u00E5" => "a"
# æ => ae
"\u00E6" => "ae"
# ç => c
"\u00E7" => "c"
# è => e
"\u00E8" => "e"
# é => e
"\u00E9" => "e"
# ê => e
"\u00EA" => "e"
# ë => e
"\u00EB" => "e"
# ì => i
"\u00EC" => "i"
# í => i
"\u00ED" => "i"
# î => i
"\u00EE" => "i"
# ï => i
"\u00EF" => "i"
# ĳ => ij
"\u0133" => "ij"
# ð => d
"\u00F0" => "d"
# ñ => n
"\u00F1" => "n"
# ò => o
"\u00F2" => "o"
# ó => o
"\u00F3" => "o"
# ô => o
"\u00F4" => "o"
# õ => o
"\u00F5" => "o"
# ö => o
#"\u00F6" => "o"
# ø => o
"\u00F8" => "o"
# œ => oe
"\u0153" => "oe"
# ß => ss
"\u00DF" => "ss"
# Ś => S
"\u015a" => "S"
# þ => th
"\u00FE" => "th"
# ù => u
"\u00F9" => "u"
# ú => u
"\u00FA" => "u"
# û => u
"\u00FB" => "u"
# ü => u
"\u00FC" => "u"
# ý => y
"\u00FD" => "y"
# ÿ => y
"\u00FF" => "y"
# Ź => Z
"\u0179" => "Z"
# Ż => Z
"\u017b" => "Z"
# ﬀ => ff
"\uFB00" => "ff"
# ﬁ => fi
"\uFB01" => "fi"
# ﬂ => fl
"\uFB02" => "fl"
# ﬃ => ffi
"\uFB03" => "ffi"
# ﬄ => ffl
"\uFB04" => "ffl"
# ﬅ => st
"\uFB05" => "st"
# ﬆ => st
"\uFB06" => "st"
# Māori macrons.
# Ā => A
"\u0100" => "A"
# Ē => E
"\u0112" => "E"
# Ī => I
"\u012A" => "I"
# Ō => O
"\u014C" => "O"
# Ū => U
"\u016A" => "U"
# ā => a
"\u0101" => "a"
# ē => e
"\u0113" => "e"
# ī => i
"\u012B" => "i"
# ō => o
"\u014D" => "o"
# ū => u
"\u016B" => "u"
1 change: 1 addition & 0 deletions .docker/config/solr/config-set/nouns_et.txt
@@ -0,0 +1 @@

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Empty file—verify intent and populate if needed.

The file is created as part of Estonian linguistic resources for Solr 9 language support but contains no content. Clarify whether this is:

  1. A placeholder that should be populated with Estonian nouns for text analysis, or
  2. An optional fallback file that should include a clarifying comment explaining its empty state.

Given that related linguistic files (stopwords_et.txt, synonyms_et.txt) are being added alongside schema field types for Estonian processing, this file likely needs content to be functional.


🏁 Script executed:

#!/bin/bash
# Check content and size of nouns_et.txt
echo "=== Checking nouns_et.txt ===" 
if [ -f ".docker/config/solr/config-set/nouns_et.txt" ]; then
  echo "File exists"
  echo "File size: $(wc -c < .docker/config/solr/config-set/nouns_et.txt) bytes"
  echo "Line count: $(wc -l < .docker/config/solr/config-set/nouns_et.txt) lines"
  echo "Content (hex dump):"
  od -c .docker/config/solr/config-set/nouns_et.txt | head -5
else
  echo "File does not exist"
fi

# Check related Estonian language files
echo -e "\n=== Checking related Estonian files ==="
for file in stopwords_et.txt synonyms_et.txt; do
  path=".docker/config/solr/config-set/$file"
  if [ -f "$path" ]; then
    echo "File: $file - Size: $(wc -c < $path) bytes, Lines: $(wc -l < $path) lines"
    echo "First 3 lines:"
    head -3 "$path" || true
  else
    echo "$file: Not found"
  fi
done

# Search for references to nouns_et in configuration
echo -e "\n=== Searching for references to nouns_et ==="
rg -i "nouns_et" .docker/config/solr/config-set/ -A2 -B2 || echo "No references found"

# Check schema for Estonian language configuration
echo -e "\n=== Checking schema for Estonian references ==="
if [ -f ".docker/config/solr/config-set/managed-schema.xml" ]; then
  rg -i "et|estonia" .docker/config/solr/config-set/managed-schema.xml -C2 || echo "No Estonian references found"
else
  find .docker/config/solr/config-set/ -name "*.xml" | head -5 | xargs grep -i "estonia\|nouns" || true
fi

# List all files in the config-set directory to understand structure
echo -e "\n=== Files in config-set directory ==="
find .docker/config/solr/config-set/ -type f | sort

Length of output: 13810


Populate nouns_et.txt with Estonian nouns: the file is actively used in Solr text analysis.

The file is currently empty (1 byte) but is referenced in schema_extra_types.xml by the DictionaryCompoundWordTokenFilterFactory filter, which uses it as the dictionary for compound word decomposition. With no entries the filter has nothing to match, so Estonian compound words will most likely pass through without being split into their parts, and searches for a part of a compound may miss documents.

Related Estonian files (stopwords_et.txt, synonyms_et.txt) are properly populated. Populate nouns_et.txt with a comprehensive list of Estonian nouns to enable proper compound word analysis.
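
To illustrate why the dictionary contents matter, here is a simplified brute-force sketch of dictionary-based compound decomposition. It is not Lucene's implementation: the function name decompose, the parameter names, and their default values are assumptions chosen for illustration, and the Estonian words are only examples.

```python
def decompose(token, dictionary, min_word_size=5,
              min_subword_size=2, max_subword_size=15):
    """Return dictionary words found inside token, in order of position.

    Simplified sketch: every substring whose length lies between
    min_subword_size and max_subword_size is looked up in the dictionary.
    """
    if len(token) < min_word_size:
        return []  # short tokens are not treated as compounds
    lowered = token.lower()
    found = []
    for start in range(len(lowered) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            part = lowered[start:start + size]
            if part in dictionary:
                found.append(part)
    return found


if __name__ == "__main__":
    # "raamatukogu" (library) contains "raamat" (book) and "kogu" (collection).
    print(decompose("raamatukogu", {"raamat", "kogu"}))
    # With an empty dictionary, as in the current nouns_et.txt, nothing is found.
    print(decompose("raamatukogu", set()))
```

With a populated dictionary the compound yields its parts as extra terms; with an empty one the token is left as a single term.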

1 change: 1 addition & 0 deletions .docker/config/solr/config-set/protwords_et.txt
@@ -0,0 +1 @@

79 changes: 42 additions & 37 deletions .docker/config/solr/config-set/schema.xml
@@ -30,7 +30,7 @@
It should be kept correct and concise, usable out-of-the-box.

For more information, on how to customize this file, please see
-http://wiki.apache.org/solr/SchemaXml
+https://solr.apache.org/guide/solr/latest/indexing-guide/schema-elements.html

PERFORMANCE NOTE: this schema includes many optional features and should not
be used for benchmarking. To improve performance one could
@@ -49,7 +49,7 @@
that avoids logging every request
-->

-<schema name="drupal-4.3.5-solr-8.x-1" version="1.6">
+<schema name="drupal-4.3.10-solr-9.x-1" version="1.6">
<!-- attribute "name" is the name of this schema and is only used for display purposes.
version="x.y" is Solr's version number for the schema syntax and
semantics. It should not normally be changed by applications.
@@ -122,7 +122,9 @@
<!-- points to the root document of a block of nested documents. Required for nested
document support, may be removed otherwise
-->
-<field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>
+<field name="_root_" type="string" indexed="true" stored="true" docValues="false" />
+<fieldType name="_nest_path_" class="solr.NestPathField" />
+<field name="_nest_path_" type="_nest_path_" />

<!-- Only remove the "id" field if you have a very good reason to. While not strictly
required, it is highly recommended. A <uniqueKey> is present in almost all Solr
@@ -156,7 +158,7 @@

<!-- Currently the suggester context filter query (suggest.cfq) accesses the tags using the stored values, neither the indexed terms nor the docValues.
Therefore the dynamicField sm_* isn't suitable at the moment -->
-<field name="sm_context_tags" type="string" indexed="true" stored="true" multiValued="true" docValues="false"/>
+<field name="sm_context_tags" type="strings" indexed="true" stored="true" docValues="false"/>

<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
@@ -170,59 +172,59 @@
the last letter is 's' for single valued, 'm' for multi-valued -->

<!-- We use plong for integer since 64 bit ints are now common in PHP. -->
-<dynamicField name="is_*" type="plong" indexed="true" stored="false" multiValued="false" docValues="true" termVectors="true"/>
-<dynamicField name="im_*" type="plong" indexed="true" stored="false" multiValued="true" docValues="true" termVectors="true"/>
+<dynamicField name="is_*" type="plong" indexed="true" stored="false" docValues="true" termVectors="true"/>
+<dynamicField name="im_*" type="plongs" indexed="true" stored="false" docValues="true" termVectors="true"/>
<!-- List of floats can be saved in a regular float field -->
-<dynamicField name="fs_*" type="pfloat" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="fm_*" type="pfloat" indexed="true" stored="false" multiValued="true" docValues="true"/>
+<dynamicField name="fs_*" type="pfloat" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="fm_*" type="pfloats" indexed="true" stored="false" docValues="true"/>
<!-- List of doubles can be saved in a regular double field -->
-<dynamicField name="ps_*" type="pdouble" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="pm_*" type="pdouble" indexed="true" stored="false" multiValued="true" docValues="true"/>
+<dynamicField name="ps_*" type="pdouble" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="pm_*" type="pdoubles" indexed="true" stored="false" docValues="true"/>
<!-- List of booleans can be saved in a regular boolean field -->
-<dynamicField name="bm_*" type="boolean" indexed="true" stored="false" multiValued="true" docValues="true" termVectors="true"/>
-<dynamicField name="bs_*" type="boolean" indexed="true" stored="false" multiValued="false" docValues="true" termVectors="true"/>
+<dynamicField name="bm_*" type="booleans" indexed="true" stored="false" docValues="true" termVectors="true"/>
+<dynamicField name="bs_*" type="boolean" indexed="true" stored="false" docValues="true" termVectors="true"/>
<!-- Regular text (without processing) can be stored in a string field-->
-<dynamicField name="ss_*" type="string" indexed="true" stored="false" multiValued="false" docValues="true" termVectors="true"/>
+<dynamicField name="ss_*" type="string" indexed="true" stored="false" docValues="true" termVectors="true"/>
<!-- For field types using SORTED_SET, multiple identical entries are collapsed into a single value.
Thus if I insert values 4, 5, 2, 4, 1, my return will be 1, 2, 4, 5 when enabling docValues.
If you need to preserve the order and duplicate entries, consider to store the values as zm_* (twice). -->
-<dynamicField name="sm_*" type="string" indexed="true" stored="false" multiValued="true" docValues="true" termVectors="true"/>
+<dynamicField name="sm_*" type="strings" indexed="true" stored="false" docValues="true" termVectors="true"/>
<!-- Special-purpose text fields -->
<dynamicField name="tws_*" type="text_ws" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="twm_*" type="text_ws" indexed="true" stored="true" multiValued="true"/>

-<dynamicField name="ds_*" type="pdate" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="dm_*" type="pdate" indexed="true" stored="false" multiValued="true" docValues="true"/>
+<dynamicField name="ds_*" type="pdate" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="dm_*" type="pdates" indexed="true" stored="false" docValues="true"/>
<!-- This field is used to store date ranges -->
-<dynamicField name="drs_*" type="date_range" indexed="true" stored="true" multiValued="false"/>
-<dynamicField name="drm_*" type="date_range" indexed="true" stored="true" multiValued="true"/>
+<dynamicField name="drs_*" type="date_range" indexed="true" stored="true"/>
+<dynamicField name="drm_*" type="date_ranges" indexed="true" stored="true"/>
<!-- Trie fields are deprecated. Point fields solve all needs. But we keep the dedicated field names for backward compatibility. -->
-<dynamicField name="its_*" type="plong" indexed="true" stored="false" multiValued="false" docValues="true" termVectors="true"/>
-<dynamicField name="itm_*" type="plong" indexed="true" stored="false" multiValued="true" docValues="true" termVectors="true"/>
-<dynamicField name="fts_*" type="pfloat" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="ftm_*" type="pfloat" indexed="true" stored="false" multiValued="true" docValues="true"/>
-<dynamicField name="pts_*" type="pdouble" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="ptm_*" type="pdouble" indexed="true" stored="false" multiValued="true" docValues="true"/>
+<dynamicField name="its_*" type="plong" indexed="true" stored="false" docValues="true" termVectors="true"/>
+<dynamicField name="itm_*" type="plongs" indexed="true" stored="false" docValues="true" termVectors="true"/>
+<dynamicField name="fts_*" type="pfloat" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="ftm_*" type="pfloats" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="pts_*" type="pdouble" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="ptm_*" type="pdoubles" indexed="true" stored="false" docValues="true"/>
<!-- Binary fields can be populated using base64 encoded data. Useful e.g. for embedding
a small image in a search result using the data URI scheme -->
-<dynamicField name="xs_*" type="binary" indexed="false" stored="true" multiValued="false"/>
-<dynamicField name="xm_*" type="binary" indexed="false" stored="true" multiValued="true"/>
+<dynamicField name="xs_*" type="binary" indexed="false" stored="true" multiValued="false"/>
+<dynamicField name="xm_*" type="binary" indexed="false" stored="true" multiValued="true"/>
<!-- Trie fields are deprecated. Point fields solve all needs. But we keep the dedicated field names for backward compatibility. -->
-<dynamicField name="dds_*" type="pdate" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="ddm_*" type="pdate" indexed="true" stored="false" multiValued="true" docValues="true"/>
+<dynamicField name="dds_*" type="pdate" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="ddm_*" type="pdates" indexed="true" stored="false" docValues="true"/>
<!-- In case a 32 bit int is really needed, we provide these fields. 'h' is mnemonic for 'half word', i.e. 32 bit on 64 arch -->
-<dynamicField name="hs_*" type="pint" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="hm_*" type="pint" indexed="true" stored="false" multiValued="true" docValues="true"/>
+<dynamicField name="hs_*" type="pint" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="hm_*" type="pints" indexed="true" stored="false" docValues="true"/>
<!-- Trie fields are deprecated. Point fields solve all needs. But we keep the dedicated field names for backward compatibility. -->
-<dynamicField name="hts_*" type="pint" indexed="true" stored="false" multiValued="false" docValues="true"/>
-<dynamicField name="htm_*" type="pint" indexed="true" stored="false" multiValued="true" docValues="true"/>
+<dynamicField name="hts_*" type="pint" indexed="true" stored="false" docValues="true"/>
+<dynamicField name="htm_*" type="pints" indexed="true" stored="false" docValues="true"/>

<!-- Unindexed string fields that can be used to store values that won't be searchable but have docValues -->
-<dynamicField name="zdvs_*" type="string" indexed="false" stored="true" multiValued="false" docValues="true"/>
-<dynamicField name="zdvm_*" type="string" indexed="false" stored="true" multiValued="true" docValues="true"/>
+<dynamicField name="zdvs_*" type="string" indexed="false" stored="true" docValues="true"/>
+<dynamicField name="zdvm_*" type="strings" indexed="false" stored="true" docValues="true"/>
<!-- Unindexed string fields that can be used to store values that won't be searchable -->
-<dynamicField name="zs_*" type="string" indexed="false" stored="true" multiValued="false"/>
-<dynamicField name="zm_*" type="string" indexed="false" stored="true" multiValued="true"/>
+<dynamicField name="zs_*" type="string" indexed="false" stored="true"/>
+<dynamicField name="zm_*" type="strings" indexed="false" stored="true"/>

<!-- Fields for location searches.
http://wiki.apache.org/solr/SpatialSearch#geodist_-_The_distance_function -->
@@ -267,9 +269,11 @@
single-valued and either required or have a default value.
-->
<fieldType name="string" class="solr.StrField"/>
+<fieldType name="strings" class="solr.StrField" multiValued="true"/>

<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField"/>
+<fieldType name="booleans" class="solr.BoolField" multiValued="true"/>

<!-- sortMissingLast and sortMissingFirst attributes are optional attributes are
currently supported on types that are sorted internally as strings
@@ -334,6 +338,7 @@

<!-- A date range field -->
<fieldType name="date_range" class="solr.DateRangeField"/>
+<fieldType name="date_ranges" class="solr.DateRangeField" multiValued="true"/>

<!--Binary data type. The data should be sent/retrieved in as Base64 encoded Strings -->
<fieldType name="binary" class="solr.BinaryField"/>
@@ -372,7 +377,7 @@
-->

<!-- A text field that only splits on whitespace for exact matching of words -->
-<fieldType name="text_ws" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
+<fieldType name="text_ws" class="solr.TextField" omitNorms="true" positionIncrementGap="100" storeOffsetsWithPositions="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
7 changes: 7 additions & 0 deletions .docker/config/solr/config-set/schema_extra_fields.xml
@@ -78,6 +78,13 @@
<dynamicField name="tus_X3b_en_*" type="text_unstemmed_en" stored="true" indexed="true" multiValued="false" termVectors="true" omitNorms="false" />
<dynamicField name="tum_X3b_en_*" type="text_unstemmed_en" stored="true" indexed="true" multiValued="true" termVectors="true" omitNorms="false" />
<dynamicField name="sort_X3b_en_*" type="collated_en" stored="false" indexed="false" docValues="true" useDocValuesAsStored="false" />
+<dynamicField name="ts_X3b_et_*" type="text_et" stored="true" indexed="true" multiValued="false" termVectors="true" omitNorms="false" />
+<dynamicField name="tm_X3b_et_*" type="text_et" stored="true" indexed="true" multiValued="true" termVectors="true" omitNorms="false" />
+<dynamicField name="tos_X3b_et_*" type="text_et" stored="true" indexed="true" multiValued="false" termVectors="true" omitNorms="true" />
+<dynamicField name="tom_X3b_et_*" type="text_et" stored="true" indexed="true" multiValued="true" termVectors="true" omitNorms="true" />
+<dynamicField name="tus_X3b_et_*" type="text_unstemmed_et" stored="true" indexed="true" multiValued="false" termVectors="true" omitNorms="false" />
+<dynamicField name="tum_X3b_et_*" type="text_unstemmed_et" stored="true" indexed="true" multiValued="true" termVectors="true" omitNorms="false" />
+<dynamicField name="sort_X3b_et_*" type="collated_et" stored="false" indexed="false" docValues="true" useDocValuesAsStored="false" />
<dynamicField name="ts_X3b_fi_*" type="text_fi" stored="true" indexed="true" multiValued="false" termVectors="true" omitNorms="false" />
<dynamicField name="tm_X3b_fi_*" type="text_fi" stored="true" indexed="true" multiValued="true" termVectors="true" omitNorms="false" />
<dynamicField name="tos_X3b_fi_*" type="text_fi" stored="true" indexed="true" multiValued="false" termVectors="true" omitNorms="true" />
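
The schema.xml changes in this PR move multiValued from individual dynamicField entries onto plural field types (strings, booleans, date_ranges, and so on), while schema_extra_fields.xml still sets it per field. The sketch below shows one way to check that both styles resolve to the same effective value, assuming a field-level attribute overrides the field type's default. The embedded schema fragment, the solr.LongPointField class for plong/plongs, and the helper name effective_multivalued are illustrative assumptions, not content of this PR.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; the plong/plongs definitions are assumed.
SCHEMA = """
<schema name="example" version="1.6">
  <fieldType name="plong" class="solr.LongPointField"/>
  <fieldType name="plongs" class="solr.LongPointField" multiValued="true"/>
  <dynamicField name="is_*" type="plong" indexed="true" stored="false" docValues="true"/>
  <dynamicField name="im_*" type="plongs" indexed="true" stored="false" docValues="true"/>
  <dynamicField name="twm_*" type="text_ws" indexed="true" stored="true" multiValued="true"/>
</schema>
"""


def effective_multivalued(schema_xml):
    """Map each dynamicField pattern to its effective multiValued flag."""
    root = ET.fromstring(schema_xml.strip())
    # Default per field type; an absent attribute is treated as "false".
    type_defaults = {
        field_type.get("name"): field_type.get("multiValued", "false")
        for field_type in root.iter("fieldType")
    }
    result = {}
    for field in root.iter("dynamicField"):
        inherited = type_defaults.get(field.get("type"), "false")
        # An explicit attribute on the field wins over the type default.
        result[field.get("name")] = field.get("multiValued", inherited) == "true"
    return result


if __name__ == "__main__":
    for pattern, multi in effective_multivalued(SCHEMA).items():
        print(f"{pattern}: multiValued={multi}")
```

Comparing this mapping before and after the change is a quick way to confirm that no dynamic field silently switched between single-valued and multi-valued.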