Skip to content

Fix CI: build mecab-ko as CMake ExternalProject and fix tokenizer bug#6

Open
ji-heo111 wants to merge 6 commits into
groonga:mainfrom
ji-heo111:fix-ci
Open

Fix CI: build mecab-ko as CMake ExternalProject and fix tokenizer bug#6
ji-heo111 wants to merge 6 commits into
groonga:mainfrom
ji-heo111:fix-ci

Conversation

@ji-heo111
Copy link
Copy Markdown
Contributor

@ji-heo111 ji-heo111 commented Apr 9, 2026

Summary

  • Add support for building mecab-ko and mecab-ko-dic from source via ExternalProject using the MECAB_KO_PRIVATE_PREFIX CMake option
  • Preserve the existing system-installed mecab-ko detection as a fallback
  • Link as a static library (libmecab.a) in bundled builds to eliminate external runtime dependencies
  • Automatically pass the bundled mecabrc path via the --rcfile option in bundled builds

Changes

CMakeLists.txt

  • When MECAB_KO_PRIVATE_PREFIX is defined, build mecab-ko and mecab-ko-dic from source using ExternalProject_Add
  • mecab-ko: built with --enable-static --disable-shared --with-pic for static linking
  • mecab-ko-dic: built after mecab-ko, with dicdir path written into mecabrc
  • When MECAB_KO_PRIVATE_PREFIX is not defined, retain the existing mecab-config based detection logic

tokenizers/CMakeLists.txt

  • Add dependency on mecab_ko_dic_external for bundled builds
  • Add compile definitions GRN_WITH_BUNDLED_MECAB_KO and GRN_BUNDLED_MECAB_KO_RC_PATH
  • Link Threads library (required for static mecab linking)

tokenizers/mecab_ko.c

  • Expand argv array size from 3 to 5
  • When GRN_WITH_BUNDLED_MECAB_KO is defined, pass --rcfile with the bundled mecabrc path

@kou
Copy link
Copy Markdown
Contributor

kou commented Apr 9, 2026

Oh, you also try fixing the CI. I'll merge #5 so that you don't need to re-implement it.

@kou
Copy link
Copy Markdown
Contributor

kou commented Apr 9, 2026

I've merged #5. Could you rebase on main?

Comment thread .github/workflows/test.yaml Outdated
Comment on lines +40 to +44
key: test-ubuntu-ccache-${{ hashFiles('tokenizers/**') }}
restore-keys: cmake-ubuntu-ccache-
key: test-ubuntu-ccache-${{ hashFiles('CMakeLists.txt', 'tokenizers/**') }}
restore-keys: test-ubuntu-ccache-
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's split the ccache part: #7

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you revert needless changes?

How about creating a separated job for bundled mecab-ko? We can use strategy for it.
https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax#jobsjob_idstrategy

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou
I've reverted the unnecessary test code. I'll set up a separate job for bundled mecab-ko using a matrix strategy to test across multiple OS environments in a follow-up PR.

Comment thread CMakeLists.txt Outdated
Comment on lines +55 to +64
set(MECAB_KO_INCLUDE_DIRS "${MECAB_KO_PRIVATE_PREFIX}/include")
if(APPLE)
set(MECAB_KO_LIBRARY "${MECAB_KO_PRIVATE_PREFIX}/lib/libmecab.dylib")
elseif(WIN32)
set(MECAB_KO_LIBRARY "${MECAB_KO_PRIVATE_PREFIX}/lib/libmecab.dll")
else()
set(MECAB_KO_LIBRARY "${MECAB_KO_PRIVATE_PREFIX}/lib/libmecab.so")
endif()
set(CMAKE_INSTALL_RPATH "${MECAB_KO_PRIVATE_PREFIX}/lib")
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH FALSE)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using static library instead of shared library? Then we don't need to install mecab-ko separately. TokenMecab does so and specify --rcfile explicitly: https://github.com/groonga/groonga/blob/cc4ae7e3ecc241428922a40edb104a2be7b20ea8/plugins/tokenizers/mecab.c#L526-L557

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou
Thanks for the suggestion. I've updated the implementation to build mecab-ko as a static library.

@kou
Copy link
Copy Markdown
Contributor

kou commented Apr 10, 2026

My suggestions: ji-heo111#3

Comment thread tokenizers/CMakeLists.txt
Comment on lines +29 to +39
if(DEFINED MECAB_KO_PRIVATE_PREFIX)
add_dependencies(groonga_tokenizer_mecab_ko mecab_ko_dic_external)
target_compile_definitions(
groonga_tokenizer_mecab_ko
PRIVATE
GRN_WITH_BUNDLED_MECAB_KO
GRN_BUNDLED_MECAB_KO_RC_PATH="${MECAB_KO_RC_PATH}"
)
find_package(Threads REQUIRED)
target_link_libraries(groonga_tokenizer_mecab_ko PRIVATE Threads::Threads)
endif()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ji-heo111#3 added libmecab_ko CMake target and it has all needed options. So we can use libmecab_ko for both system mecab-ko and bundled mecab-ko.

Comment thread tokenizers/mecab_ko.c
argv[argc++] = "-F%m\n";
argv[argc++] = "-E\n";
}
#ifdef GRN_WITH_BUNDLED_MECAB_KO
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using GRN_TOKENIZER_MECAB_KO prefix instead of GRN prefix?

Comment thread tokenizers/mecab_ko.c
}
#ifdef GRN_WITH_BUNDLED_MECAB_KO
argv[argc++] = "--rcfile";
argv[argc++] = GRN_BUNDLED_MECAB_KO_RC_PATH;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Comment thread CMakeLists.txt
REQUIRED
NO_DEFAULT_PATH
)
if(DEFINED MECAB_KO_PRIVATE_PREFIX)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about falling back to bundled mecab-ko automatically when system mecab-ko isn't found?

We can use -DMECAB_KO_CONFIG=nonexistent to disable system mecab-ko explicitly.

ji-heo111#3 does so.

Comment thread CMakeLists.txt
https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
PREFIX ${CMAKE_BINARY_DIR}/mecab-ko
CONFIGURE_COMMAND
<SOURCE_DIR>/configure --prefix=${MECAB_KO_PRIVATE_PREFIX}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use --prefix=<INSTALL_DIR> instead of ${MECAB_KO_PRIVATE_PREFIX}. If we use static linking, we don't need to install mecab-ko.

ji-heo111#3 does so.

Comment thread CMakeLists.txt
INSTALL_COMMAND make install
BUILD_IN_SOURCE FALSE
BUILD_BYPRODUCTS ${MECAB_KO_LIBRARY}
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to install license files of mecab-ko when we use bundled mecab-ko.

ji-heo111#3 does so.

Comment thread CMakeLists.txt
Comment on lines +81 to +83
CONFIGURE_COMMAND
sh -c
"cd <SOURCE_DIR> && ./autogen.sh && ./configure --prefix=${MECAB_KO_PRIVATE_PREFIX} --with-mecab-config=${MECAB_KO_PRIVATE_PREFIX}/bin/mecab-config --with-dicdir=${MECAB_KO_DIC_DIR}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use PATCH_COMMAND for autogen.sh.

Let's install mecab-ko-dic files to CMAKE_INSTALL_PREFIX/... not MECAB_KO_PRIVATE_PREFIX/.... In general, users control where they install by CMAKE_INSTALL_PREFIX in CMake.

ji-heo111#3 does so.

Comment thread CMakeLists.txt
Comment on lines +85 to +86
${CMAKE_COMMAND} -E env
PATH=${MECAB_KO_PRIVATE_PREFIX}/bin:$ENV{PATH} make -j${NPROC}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to change PATH here because mecab-ko-dic doesn't use mecab in ${MECAB_KO_PRIVATE_PREFIX}/bin and mecab-dict-index exists in ${MECAB_KO_PRIVATE_PREFIX}/libexec/mecab/.

ji-heo111#3 removes this.

Comment thread CMakeLists.txt
Comment on lines +88 to +90
COMMAND
sed -i "s|^dicdir.*|dicdir = ${MECAB_KO_DIC_DIR}|"
${MECAB_KO_RC_PATH}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use configure_file() instead of sed.

ji-heo111#3 does so.

Comment thread CMakeLists.txt
sed -i "s|^dicdir.*|dicdir = ${MECAB_KO_DIC_DIR}|"
${MECAB_KO_RC_PATH}
BUILD_IN_SOURCE TRUE
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to install license files of mecab-ko-dic when we use bundled mecab-ko-dic.

ji-heo111#3 does so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants