diff --git a/.gitignore b/.gitignore index 1d74e21..dbd5f30 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,2 @@ .vscode/ +python/__pycache__/ diff --git a/README.md b/README.md index c9d8bc2..fe9b5eb 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,11 @@ -# Khiops Native Interface v11.0.0 +# Khiops Native Interface v11.0.1-a.5 This project provides all the basics to use the Khiops Native Interface (KNI): installation and examples. -The purpose of KNI is to allow a deeper integration of Khiops in information systems, by mean of the C programming language, using a shared library (`.dll` in Windows, `.so` in Linux). This relates specially to the problem of model deployment, which otherwise requires the use of input and output data files when using directly the Khiops tool in batch mode. See Khiops Guide for an introduction to dictionary files, dictionaries, database files and deployment. +The purpose of KNI is to allow a deeper integration of Khiops in information systems, by means of the C programming language, using a shared library (`.dll` in Windows, `.so` in Linux). This relates especially to the problem of model deployment, which otherwise requires the use of input and output data files when using directly the Khiops tool in batch mode. See Khiops Guide for an introduction to dictionary files, dictionaries, database files and deployment. -The Khiops deployment API is thus made public through a shared library. Therefore, a Khiops model can be deployed directly from any programming language, such as C, C++, Java, Python, Matlab, etc. This enables real time model deployment without the overhead of temporary data files or launching executables. This is critical for certain applications, such as marketing or targeted advertising on the web.. +The Khiops deployment API is thus made public through a shared library. Therefore, a Khiops model can be deployed directly from any programming language, such as C, C++, Java, Python, Matlab, etc. 
This enables real-time model deployment without the overhead of temporary data files or launching executables. This is critical for certain applications, such as marketing or targeted advertising on the web. All KNI functions are C functions for easy use with other programming languages. They return a positive or zero value in case of success, and a negative error code in case of failure. @@ -14,15 +14,32 @@ See [KhiopsNativeInterface.h](include/KhiopsNativeInterface.h) for a detailed de > [!CAUTION] > The functions are not reentrant (thread-safe): the library can be used simultaneously by several executables, but not simultaneously by several threads in the same executable. +## Table of Contents + +- [KNI installation](#kni-installation) + - [Windows](#windows) + - [Linux](#linux) +- [Application examples](#application-examples) +- [Example with C](#example-with-c) + - [Building the examples](#building-the-examples) + - [Launch](#launch) +- [Example with Java](#example-with-java) + - [Building the examples](#building-the-examples-1) + - [Launch](#launch-1) +- [Example with Python](#example-with-python) + - [Requirements](#requirements) + - [Scripts](#scripts) + - [Launch](#launch-2) + # KNI installation ## Windows -Download [KNI-11.0.0.zip](https://github.com/KhiopsML/khiops/releases/tag/11.0.0/KNI-11.0.0.zip) and extract it to your machine. Set the environment variable `KNI_HOME` to the extracted directory. This variable is used in the following examples. +Download [KNI-11.0.1-a.5.zip](https://github.com/KhiopsML/khiops/releases/tag/11.0.1/KNI-11.0.1-a.5.zip) and extract it to your machine. Set the environment variable `KNI_HOME` to the extracted directory. This variable is used in the following examples. ## Linux -On Linux, go to the [release page](https://github.com/KhiopsML/khiops/releases/tag/11.0.0/) and download the KNI package. The name of the package begins with **kni** and ends with the **code name** of the OS. 
The code name is in the release file of the distribution (here, it is "jammy"): +On Linux, go to the [release page](https://github.com/KhiopsML/khiops/releases/tag/11.0.1/) and download the KNI package. The name of the package begins with **kni** and ends with the **code name** of the OS. The code name is in the release file of the distribution (here, it is "jammy"): ```bash $ cat /etc/os-release PRETTY_NAME="Ubuntu 22.04.4 LTS" @@ -46,7 +63,7 @@ Download the package according to the code name of your OS and install it with ` Application examples are available in this repository. The main branch corresponds to the latest version of KNI. To explore older versions, switch between branches, which are named after their respective versions. -Both examples in C and Java produce a sample binary `KNIRecodeFile`. It recodes an input file to an output file, using a Khiops dictionary from a dictionary file. +Examples in C, Java, and Python demonstrate how to use KNI. The main example, `KNIRecodeFile`, recodes an input file to an output file using a Khiops dictionary from a dictionary file. ```bash KNIRecodeFile [Error file] @@ -55,14 +72,14 @@ KNIRecodeFile [Error f # The error file may be useful for debugging purposes. It is optional and may be empty. ``` -A more complex example (available only in C) is `KNIRecodeMTFiles`, it recodes the input files of multi-table dataset to a single output file. +A more complex example (available in C and Python) is `KNIRecodeMTFiles`, which recodes the input files of a multi-table dataset to a single output file. ```bash KNIRecodeMTFiles -d: [-f: -i: [...] - -s: < file name> ... + -s: ... -x: -o: [-e: ] @@ -71,7 +88,7 @@ KNIRecodeMTFiles # Example with C -The files are located in [cpp directory](cpp/). They allow to build `KNIRecodeFile` and `KNIRecodeMTFiles`. +The files are located in [cpp directory](cpp/). They allow you to build `KNIRecodeFile` and `KNIRecodeMTFiles`. 
## Building the examples @@ -103,12 +120,12 @@ Recode the "Splice Junction" multi-table dataset using the `SNB_SpliceJunction` ```bash KNIRecodeMTFiles -d data/ModelingSpliceJunction.kdic SNB_SpliceJunction \ - -i .data/SpliceJunction.txt 1 -s DNA data/SpliceJunctionDNA.txt 1 -o R_SpliceJunction.txt + -i data/SpliceJunction.txt 1 -s DNA data/SpliceJunctionDNA.txt 1 -o R_SpliceJunction.txt ``` # Example with Java -The files are located in [java directory](java/). They allow to build `KNIRecodeFile.jar`. This example use [JNA](https://github.com/twall/jna#readme) to make calls to KhiopsNativeInterface.so/dll from Java. +The files are located in [java directory](java/). They allow you to build `KNIRecodeFile.jar`. This example uses [JNA](https://github.com/twall/jna#readme) to make calls to KhiopsNativeInterface.so/dll from Java. ## Building the examples @@ -122,7 +139,7 @@ jar cf kni.jar -C java KNI.class -C java KNIRecodeFile.class ## Launch -Recodes the "Iris" dataset from the data directory using the `SNB_Iris` classifier dictionary. +Recode the "Iris" dataset from the data directory using the `SNB_Iris` classifier dictionary. On Linux: @@ -138,3 +155,54 @@ set path=%KNI_HOME%/bin;%path% java -cp kni.jar;jna.jar KNIRecodeFile data/ModelingIris.kdic SNB_Iris ^ data/Iris.txt R_Iris_java.txt ``` + +# Example with Python + +The files are located in [python directory](python/). They use Python's `ctypes` to call the KhiopsNativeInterface shared library directly. + +## Requirements + +- Python 3.6 or later +- The KNI shared library must be installed and accessible (via `KNI_HOME` environment variable or standard system paths) + +## Scripts + +- `KNI.py`: Python wrapper for KhiopsNativeInterface using ctypes +- `KNIRecodeFile.py`: Single-table recoding example +- `KNIRecodeMTFiles.py`: Multi-table recoding example + +## Launch + +Recode the "Iris" dataset from the data directory using the `SNB_Iris` classifier dictionary. 
+ +On Linux: + +```bash +python3 python/KNIRecodeFile.py data/ModelingIris.kdic SNB_Iris \ + data/Iris.txt R_Iris_python.txt +``` + +On Windows: + +```cmd +set path=%KNI_HOME%/bin;%path% +python python\KNIRecodeFile.py data/ModelingIris.kdic SNB_Iris ^ + data/Iris.txt R_Iris_python.txt +``` + +For the multi-table "Splice Junction" example: + +On Linux: + +```bash +python3 python/KNIRecodeMTFiles.py -d data/ModelingSpliceJunction.kdic SNB_SpliceJunction \ + -i data/SpliceJunction.txt 1 -s DNA data/SpliceJunctionDNA.txt 1 -o R_SpliceJunction_python.txt +``` + +On Windows: + +```cmd +set path=%KNI_HOME%/bin;%path% +python python\KNIRecodeMTFiles.py -d data/ModelingSpliceJunction.kdic SNB_SpliceJunction ^ + -i data/SpliceJunction.txt 1 -s DNA data/SpliceJunctionDNA.txt 1 -o R_SpliceJunction_python.txt +``` diff --git a/python/KNI.py b/python/KNI.py new file mode 100755 index 0000000..41924ad --- /dev/null +++ b/python/KNI.py @@ -0,0 +1,588 @@ +# Copyright (c) 2023-2026 Orange. All rights reserved. +# This software is distributed under the BSD 3-Clause-clear License, the text of which is available +# at https://spdx.org/licenses/BSD-3-Clause-Clear.html or see the "LICENSE" file for more details. + +""" +Khiops Native Interface (KNI) Python wrapper using ctypes. + +This module provides a Python interface to the Khiops Native Interface (KNI) C library, +allowing direct deployment of Khiops models without temporary files. 
+""" + +import ctypes +import platform +import sys +import os +from pathlib import Path + + +class KNIError(Exception): + """Exception raised for KNI errors.""" + + def __init__(self, message, error_code=None): + super().__init__(message) + self.error_code = error_code + + +class KNI: + """Python wrapper for Khiops Native Interface using ctypes.""" + + # Error codes + KNI_OK = 0 + KNI_ErrorRunningFunction = -1 + KNI_ErrorDictionaryFileName = -2 + KNI_ErrorDictionaryMissingFile = -3 + KNI_ErrorDictionaryFileFormat = -4 + KNI_ErrorDictionaryName = -5 + KNI_ErrorMissingDictionary = -6 + KNI_ErrorTooManyStreams = -7 + KNI_ErrorStreamHeaderLine = -8 + KNI_ErrorFieldSeparator = -9 + KNI_ErrorStreamHandle = -10 + KNI_ErrorStreamOpened = -11 + KNI_ErrorStreamNotOpened = -12 + KNI_ErrorStreamInputRecord = -13 + KNI_ErrorStreamInputRead = -14 + KNI_ErrorStreamOutputRecord = -15 + KNI_ErrorMissingSecondaryHeader = -16 + KNI_ErrorMissingExternalTable = -17 + KNI_ErrorDataRoot = -18 + KNI_ErrorDataPath = -19 + KNI_ErrorDataTableFile = -20 + KNI_ErrorLoadDataTable = -21 + KNI_ErrorMemoryOverflow = -22 + KNI_ErrorStreamOpening = -23 + KNI_ErrorStreamOpeningNotFinished = -24 + KNI_ErrorLogFile = -25 + + # Constants + KNI_MaxStreamNumber = 512 + KNI_DefaultMaxStreamMemory = 100 + KNI_MaxPathNameLength = 1024 + KNI_MaxDictionaryNameLength = 128 + KNI_MaxRecordLength = 8 * 1024 * 1024 # 8 MB + + def __init__(self, library_path=None): + """ + Initialize the KNI wrapper. + + Args: + library_path: Optional path to the KNI shared library. + If None, attempts to locate it automatically. 
+ """ + self._lib = self._load_library(library_path) + self._setup_functions() + + def _load_library(self, library_path): + """Load the KNI shared library.""" + if library_path: + return ctypes.CDLL(str(library_path)) + + # Try to locate library automatically + system = platform.system() + if system == "Windows": + lib_name = "KhiopsNativeInterface.dll" + elif system == "Linux": + lib_name = "libKhiopsNativeInterface.so" + elif system == "Darwin": # macOS + lib_name = "libKhiopsNativeInterface.dylib" + else: + raise RuntimeError(f"Unsupported platform: {system}") + + # Try different search strategies + try: + # First, try loading directly (will use system paths) + return ctypes.CDLL(lib_name) + except OSError: + # Try to find in KNI_HOME environment variable + kni_home = os.environ.get("KNI_HOME") + if kni_home: + lib_path = os.path.join(kni_home, lib_name) + if os.path.exists(lib_path): + return ctypes.CDLL(lib_path) + + raise RuntimeError( + f"Could not find {lib_name}. " + "Please set KNI_HOME environment variable or provide library_path." 
+ ) + + def _setup_functions(self): + """Setup function signatures for KNI library functions.""" + # KNIGetVersion + self._lib.KNIGetVersion.argtypes = [] + self._lib.KNIGetVersion.restype = ctypes.c_int + + # KNIGetFullVersion + self._lib.KNIGetFullVersion.argtypes = [] + self._lib.KNIGetFullVersion.restype = ctypes.c_char_p + + # KNISetLogFileName + self._lib.KNISetLogFileName.argtypes = [ctypes.c_char_p] + self._lib.KNISetLogFileName.restype = ctypes.c_int + + # KNIOpenStream + self._lib.KNIOpenStream.argtypes = [ + ctypes.c_char_p, # sDictionaryFileName + ctypes.c_char_p, # sDictionaryName + ctypes.c_char_p, # sStreamHeaderLine + ctypes.c_char, # cFieldSeparator + ] + self._lib.KNIOpenStream.restype = ctypes.c_int + + # KNICloseStream + self._lib.KNICloseStream.argtypes = [ctypes.c_int] + self._lib.KNICloseStream.restype = ctypes.c_int + + # KNIRecodeStreamRecord + self._lib.KNIRecodeStreamRecord.argtypes = [ + ctypes.c_int, # hStream + ctypes.c_char_p, # sInputRecord + ctypes.c_char_p, # sOutputRecord + ctypes.c_int, # nOutputMaxLength + ] + self._lib.KNIRecodeStreamRecord.restype = ctypes.c_int + + # Multi-table functions + # KNISetSecondaryHeaderLine + self._lib.KNISetSecondaryHeaderLine.argtypes = [ + ctypes.c_int, # hStream + ctypes.c_char_p, # sDataPath + ctypes.c_char_p, # sStreamSecondaryHeaderLine + ] + self._lib.KNISetSecondaryHeaderLine.restype = ctypes.c_int + + # KNISetExternalTable + self._lib.KNISetExternalTable.argtypes = [ + ctypes.c_int, # hStream + ctypes.c_char_p, # sDataRoot + ctypes.c_char_p, # sDataPath + ctypes.c_char_p, # sDataTableFileName + ] + self._lib.KNISetExternalTable.restype = ctypes.c_int + + # KNIFinishOpeningStream + self._lib.KNIFinishOpeningStream.argtypes = [ctypes.c_int] + self._lib.KNIFinishOpeningStream.restype = ctypes.c_int + + # KNISetSecondaryInputRecord + self._lib.KNISetSecondaryInputRecord.argtypes = [ + ctypes.c_int, # hStream + ctypes.c_char_p, # sDataPath + ctypes.c_char_p, # sStreamSecondaryInputRecord + ] + 
self._lib.KNISetSecondaryInputRecord.restype = ctypes.c_int + + # Advanced parameters + # KNIGetStreamMaxMemory + self._lib.KNIGetStreamMaxMemory.argtypes = [] + self._lib.KNIGetStreamMaxMemory.restype = ctypes.c_int + + # KNISetStreamMaxMemory + self._lib.KNISetStreamMaxMemory.argtypes = [ctypes.c_int] + self._lib.KNISetStreamMaxMemory.restype = ctypes.c_int + + def get_version(self): + """Get KNI version as integer (10*major + minor).""" + return self._lib.KNIGetVersion() + + def get_full_version(self): + """Get KNI full version string.""" + return self._lib.KNIGetFullVersion().decode("utf-8") + + def set_log_file_name(self, log_file_name): + """ + Set the log file name for error messages. + + Args: + log_file_name: Path to log file (str or bytes, empty string for no logging) + + Raises: + KNIError: If setting log file fails + TypeError: If log_file_name is not str or bytes + """ + if isinstance(log_file_name, str): + log_file_name_bytes = log_file_name.encode("utf-8") + elif isinstance(log_file_name, bytes): + log_file_name_bytes = log_file_name + else: + raise TypeError( + f"log_file_name must be str or bytes, not {type(log_file_name).__name__}" + ) + ret_code = self._lib.KNISetLogFileName(log_file_name_bytes) + if ret_code != self.KNI_OK: + raise KNIError( + f"Failed to set log file: {self.get_error_message(ret_code)}", + ret_code, + ) + + def open_stream( + self, dictionary_file_name, dictionary_name, header_line, field_separator="\t" + ): + """ + Open a KNI stream for recoding. 
+ + Args: + dictionary_file_name: Path to the dictionary file (str or bytes) + dictionary_name: Name of the dictionary to use (str or bytes) + header_line: Header line with field names (str or bytes) + field_separator: Character used to separate fields (str or bytes, default: tab) + + Returns: + Stream handle (positive integer) + + Raises: + KNIError: If opening stream fails + TypeError: If arguments have invalid types + """ + # Type checking and conversion + if isinstance(dictionary_file_name, str): + dictionary_file_name_bytes = dictionary_file_name.encode("utf-8") + elif isinstance(dictionary_file_name, bytes): + dictionary_file_name_bytes = dictionary_file_name + else: + raise TypeError( + f"dictionary_file_name must be str or bytes, not {type(dictionary_file_name).__name__}" + ) + if isinstance(dictionary_name, str): + dictionary_name_bytes = dictionary_name.encode("utf-8") + elif isinstance(dictionary_name, bytes): + dictionary_name_bytes = dictionary_name + else: + raise TypeError( + f"dictionary_name must be str or bytes, not {type(dictionary_name).__name__}" + ) + if isinstance(header_line, str): + header_line_bytes = header_line.encode("utf-8") + elif isinstance(header_line, bytes): + header_line_bytes = header_line + else: + raise TypeError( + f"header_line must be str or bytes, not {type(header_line).__name__}" + ) + + # Convert field_separator to a single byte + if isinstance(field_separator, str): + field_separator_byte = field_separator.encode("utf-8")[0] + elif isinstance(field_separator, bytes): + field_separator_byte = field_separator[0] + else: + raise TypeError( + f"field_separator must be str or bytes, not {type(field_separator).__name__}" + ) + + stream_handle = self._lib.KNIOpenStream( + dictionary_file_name_bytes, + dictionary_name_bytes, + header_line_bytes, + field_separator_byte, + ) + if stream_handle < 0: + raise KNIError( + f"Failed to open stream: {self.get_error_message(stream_handle)}", + stream_handle, + ) + return stream_handle + 
+ def close_stream(self, stream_handle): + """ + Close a KNI stream. + + Args: + stream_handle: Handle returned by open_stream + + Raises: + KNIError: If closing stream fails + TypeError: If stream_handle is not int + """ + if not isinstance(stream_handle, int): + raise TypeError( + f"stream_handle must be int, not {type(stream_handle).__name__}" + ) + ret_code = self._lib.KNICloseStream(stream_handle) + if ret_code != self.KNI_OK: + raise KNIError( + f"Failed to close stream: {self.get_error_message(ret_code)}", + ret_code, + ) + + def recode_stream_record(self, stream_handle, input_record, max_output_length=None): + """ + Recode an input record using the stream's dictionary. + + Args: + stream_handle: Handle returned by open_stream + input_record: Input record string or bytes + max_output_length: Maximum output buffer size (default: KNI_MaxRecordLength) + + Returns: + Recoded output string + + Raises: + KNIError: If recoding fails + TypeError: If arguments have invalid types + """ + if not isinstance(stream_handle, int): + raise TypeError( + f"stream_handle must be int, not {type(stream_handle).__name__}" + ) + if isinstance(input_record, str): + input_record_bytes = input_record.encode("utf-8") + elif isinstance(input_record, bytes): + input_record_bytes = input_record + else: + raise TypeError( + f"input_record must be str or bytes, not {type(input_record).__name__}" + ) + if max_output_length is None: + max_output_length = self.KNI_MaxRecordLength + elif not isinstance(max_output_length, int): + raise TypeError( + f"max_output_length must be int or None, not {type(max_output_length).__name__}" + ) + + output_buffer = ctypes.create_string_buffer(max_output_length) + ret_code = self._lib.KNIRecodeStreamRecord( + stream_handle, + input_record_bytes, + output_buffer, + max_output_length, + ) + + if ret_code != self.KNI_OK: + raise KNIError( + f"Failed to recode record: {self.get_error_message(ret_code)}", + ret_code, + ) + return 
output_buffer.value.decode("utf-8") + + def set_secondary_header_line(self, stream_handle, data_path, header_line): + """ + Set the header line of a secondary table (multi-table only). + + Args: + stream_handle: Handle returned by open_stream + data_path: Data path identifying the secondary table (str or bytes) + header_line: Header line with field names (str or bytes) + + Raises: + KNIError: If setting secondary header fails + TypeError: If arguments have invalid types + """ + if not isinstance(stream_handle, int): + raise TypeError( + f"stream_handle must be int, not {type(stream_handle).__name__}" + ) + if isinstance(data_path, str): + data_path_bytes = data_path.encode("utf-8") + elif isinstance(data_path, bytes): + data_path_bytes = data_path + else: + raise TypeError( + f"data_path must be str or bytes, not {type(data_path).__name__}" + ) + if isinstance(header_line, str): + header_line_bytes = header_line.encode("utf-8") + elif isinstance(header_line, bytes): + header_line_bytes = header_line + else: + raise TypeError( + f"header_line must be str or bytes, not {type(header_line).__name__}" + ) + ret_code = self._lib.KNISetSecondaryHeaderLine( + stream_handle, data_path_bytes, header_line_bytes + ) + if ret_code != self.KNI_OK: + raise KNIError( + f"Failed to set secondary header line: {self.get_error_message(ret_code)}", + ret_code, + ) + + def set_external_table( + self, stream_handle, data_root, data_path, data_table_file_name + ): + """ + Set the name of a data file for an external table (multi-table only). 
+ + Args: + stream_handle: Handle returned by open_stream + data_root: Root dictionary of the external table (str or bytes) + data_path: Data path for secondary external tables (str or bytes, empty for root) + data_table_file_name: Path to the external table data file (str or bytes) + + Raises: + KNIError: If setting external table fails + TypeError: If arguments have invalid types + """ + if not isinstance(stream_handle, int): + raise TypeError( + f"stream_handle must be int, not {type(stream_handle).__name__}" + ) + if isinstance(data_root, str): + data_root_bytes = data_root.encode("utf-8") + elif isinstance(data_root, bytes): + data_root_bytes = data_root + else: + raise TypeError( + f"data_root must be str or bytes, not {type(data_root).__name__}" + ) + if isinstance(data_path, str): + data_path_bytes = data_path.encode("utf-8") + elif isinstance(data_path, bytes): + data_path_bytes = data_path + else: + raise TypeError( + f"data_path must be str or bytes, not {type(data_path).__name__}" + ) + if isinstance(data_table_file_name, str): + data_table_file_name_bytes = data_table_file_name.encode("utf-8") + elif isinstance(data_table_file_name, bytes): + data_table_file_name_bytes = data_table_file_name + else: + raise TypeError( + f"data_table_file_name must be str or bytes, not {type(data_table_file_name).__name__}" + ) + ret_code = self._lib.KNISetExternalTable( + stream_handle, + data_root_bytes, + data_path_bytes, + data_table_file_name_bytes, + ) + if ret_code != self.KNI_OK: + raise KNIError( + f"Failed to set external table: {self.get_error_message(ret_code)}", + ret_code, + ) + + def finish_opening_stream(self, stream_handle): + """ + Finish opening a stream (multi-table only). + + Must be called after all secondary headers and external tables are set. 
+ + Args: + stream_handle: Handle returned by open_stream + + Raises: + KNIError: If finishing opening stream fails + TypeError: If stream_handle is not int + """ + if not isinstance(stream_handle, int): + raise TypeError( + f"stream_handle must be int, not {type(stream_handle).__name__}" + ) + ret_code = self._lib.KNIFinishOpeningStream(stream_handle) + if ret_code != self.KNI_OK: + raise KNIError( + f"Failed to finish opening stream: {self.get_error_message(ret_code)}", + ret_code, + ) + + def set_secondary_input_record(self, stream_handle, data_path, input_record): + """ + Set a secondary input record for multi-table recoding. + + All secondary records must be set before recoding the primary record. + + Args: + stream_handle: Handle returned by open_stream + data_path: Data path identifying the secondary table (str or bytes) + input_record: Secondary input record string or bytes + + Raises: + KNIError: If setting secondary input record fails + TypeError: If arguments have invalid types + """ + if not isinstance(stream_handle, int): + raise TypeError( + f"stream_handle must be int, not {type(stream_handle).__name__}" + ) + if isinstance(data_path, str): + data_path_bytes = data_path.encode("utf-8") + elif isinstance(data_path, bytes): + data_path_bytes = data_path + else: + raise TypeError( + f"data_path must be str or bytes, not {type(data_path).__name__}" + ) + if isinstance(input_record, str): + input_record_bytes = input_record.encode("utf-8") + elif isinstance(input_record, bytes): + input_record_bytes = input_record + else: + raise TypeError( + f"input_record must be str or bytes, not {type(input_record).__name__}" + ) + ret_code = self._lib.KNISetSecondaryInputRecord( + stream_handle, data_path_bytes, input_record_bytes + ) + if ret_code != self.KNI_OK: + raise KNIError( + f"Failed to set secondary input record: {self.get_error_message(ret_code)}", + ret_code, + ) + + def get_stream_max_memory(self): + """ + Get the maximum amount of memory (in MB) for 
stream opening. + + Returns: + Maximum memory in MB + """ + return self._lib.KNIGetStreamMaxMemory() + + def set_stream_max_memory(self, max_mb): + """ + Set the maximum amount of memory (in MB) for stream opening. + + Args: + max_mb: Maximum memory in MB + + Returns: + Accepted value (bounded by system limits) + """ + if not isinstance(max_mb, int): + raise TypeError(f"max_mb must be int, not {type(max_mb).__name__}") + return self._lib.KNISetStreamMaxMemory(max_mb) + + @staticmethod + def get_error_message(error_code): + """ + Get a human-readable error message for an error code. + + Args: + error_code: KNI error code + + Returns: + Error message string + """ + if not isinstance(error_code, int): + raise TypeError(f"error_code must be int, not {type(error_code).__name__}") + error_messages = { + KNI.KNI_OK: "Success", + KNI.KNI_ErrorRunningFunction: "Another KNI function is currently running", + KNI.KNI_ErrorDictionaryFileName: "Bad dictionary file name", + KNI.KNI_ErrorDictionaryMissingFile: "Dictionary file does not exist", + KNI.KNI_ErrorDictionaryFileFormat: "Bad dictionary format", + KNI.KNI_ErrorDictionaryName: "Bad dictionary name", + KNI.KNI_ErrorMissingDictionary: "Dictionary not found in dictionary file", + KNI.KNI_ErrorTooManyStreams: "Too many streams opened", + KNI.KNI_ErrorStreamHeaderLine: "Bad stream header line", + KNI.KNI_ErrorFieldSeparator: "Bad field separator", + KNI.KNI_ErrorStreamHandle: "Bad stream handle", + KNI.KNI_ErrorStreamOpened: "Stream already opened", + KNI.KNI_ErrorStreamNotOpened: "Stream not opened", + KNI.KNI_ErrorStreamInputRecord: "Bad input record", + KNI.KNI_ErrorStreamInputRead: "Problem reading input record", + KNI.KNI_ErrorStreamOutputRecord: "Output record too long", + KNI.KNI_ErrorMissingSecondaryHeader: "Missing secondary table header", + KNI.KNI_ErrorMissingExternalTable: "Missing external table", + KNI.KNI_ErrorDataRoot: "Bad data root", + KNI.KNI_ErrorDataPath: "Bad data path", + KNI.KNI_ErrorDataTableFile: "Bad 
data table file", + KNI.KNI_ErrorLoadDataTable: "Problem loading external data tables", + KNI.KNI_ErrorMemoryOverflow: "Memory overflow", + KNI.KNI_ErrorStreamOpening: "Stream could not be opened", + KNI.KNI_ErrorStreamOpeningNotFinished: "Multi-table stream opening not finished", + KNI.KNI_ErrorLogFile: "Bad error file", + } + return error_messages.get(error_code, f"Unknown error code: {error_code}") diff --git a/python/KNIRecodeFile.py b/python/KNIRecodeFile.py new file mode 100755 index 0000000..ddcb704 --- /dev/null +++ b/python/KNIRecodeFile.py @@ -0,0 +1,147 @@ +#!/usr/bin/env python3 +# Copyright (c) 2023-2026 Orange. All rights reserved. +# This software is distributed under the BSD 3-Clause-clear License, the text of which is available +# at https://spdx.org/licenses/BSD-3-Clause-Clear.html or see the "LICENSE" file for more details. + +""" +KNIRecodeFile: Recode an input file to an output file using a Khiops dictionary. + +This script demonstrates the use of the Khiops Native Interface (KNI) from Python +to deploy a Khiops model for real-time scoring without temporary files. +""" + +import sys +import argparse +from KNI import KNI, KNIError + + +def recode_file( + dictionary_file_name, + dictionary_name, + input_file_name, + output_file_name, + error_file_name="", +): + """ + Recode an input file to an output file using a Khiops dictionary. 
+
+    Args:
+        dictionary_file_name: Path to the dictionary file
+        dictionary_name: Name of the dictionary to use
+        input_file_name: Path to input file (must have header line)
+        output_file_name: Path to output file
+        error_file_name: Optional path to error log file (empty for no logging)
+
+    Raises:
+        KNIError: If any KNI operation fails
+        FileNotFoundError: If input file is not found
+        ValueError: If input file is empty or invalid
+    """
+    # Initialize KNI
+    kni = KNI()
+
+    # Set error log file
+    if error_file_name:
+        kni.set_log_file_name(error_file_name)
+
+    print(f"\nRecode records of {input_file_name} to {output_file_name}")
+
+    # Open input and output files
+    with open(input_file_name, "r", encoding="utf-8") as input_file, open(
+        output_file_name, "w", encoding="utf-8"
+    ) as output_file:
+
+        # Read header line (strip only the line terminator)
+        header_line = input_file.readline().rstrip("\r\n")
+        if not header_line:
+            raise ValueError("Empty input file")
+
+        # Open KNI stream
+        stream_handle = kni.open_stream(
+            dictionary_file_name, dictionary_name, header_line, "\t"
+        )
+
+        try:
+            # Process all records
+            record_number = 0
+            for line_number, line in enumerate(input_file, start=2):
+                # Remove only the line terminator, so trailing empty fields are kept
+                input_record = line.rstrip("\r\n")
+
+                # Skip empty lines
+                if not input_record:
+                    continue
+
+                # Recode the record
+                output_record = kni.recode_stream_record(stream_handle, input_record)
+
+                # Write output record
+                output_file.write(f"{output_record}\n")
+                record_number += 1
+        finally:
+            # Close stream
+            kni.close_stream(stream_handle)
+
+    print(f"{record_number} records recoded")
+
+
+def main():
+    """Main entry point for command-line execution."""
+    parser = argparse.ArgumentParser(
+        description="Recode an input file to an output file using a Khiops dictionary.",
+        epilog="The input file must have a header line, describing the structure of all its instances. "
+        "The input and output files have a tabular format. 
" + "The error file may be useful for debugging purposes.", + ) + + parser.add_argument( + "dictionary_file", + help="Path to the dictionary file", + ) + parser.add_argument( + "dictionary_name", + help="Name of the dictionary to use", + ) + parser.add_argument( + "input_file", + help="Path to input file (must have header line)", + ) + parser.add_argument( + "output_file", + help="Path to output file", + ) + parser.add_argument( + "error_file", + nargs="?", + default="", + help="Optional path to error log file (empty for no logging)", + ) + + args = parser.parse_args() + + # Execute recoding + try: + recode_file( + args.dictionary_file, + args.dictionary_name, + args.input_file, + args.output_file, + args.error_file, + ) + return 0 + except KNIError as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + except FileNotFoundError as e: + print(f"Error: File not found: {e.filename}", file=sys.stderr) + return 1 + except ValueError as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + except Exception as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/python/KNIRecodeMTFiles.py b/python/KNIRecodeMTFiles.py new file mode 100755 index 0000000..b7748dd --- /dev/null +++ b/python/KNIRecodeMTFiles.py @@ -0,0 +1,273 @@ +#!/usr/bin/env python3 +# Copyright (c) 2023-2026 Orange. All rights reserved. +# This software is distributed under the BSD 3-Clause-clear License, the text of which is available +# at https://spdx.org/licenses/BSD-3-Clause-Clear.html or see the "LICENSE" file for more details. + +""" +KNIRecodeMTFiles: Recode multi-table input files using a Khiops dictionary. + +This script demonstrates multi-table recoding with the Khiops Native Interface (KNI) +from Python. It supports secondary tables and external tables. 
+"""
+
+import sys
+import argparse
+from KNI import KNI, KNIError
+
+
+def recode_mt_files(
+    dictionary_file,
+    dictionary_name,
+    input_specs,
+    secondary_specs,
+    external_specs,
+    output_file,
+    field_separator="\t",
+    error_file="",
+    max_memory=None,
+):
+    """
+    Recode multi-table input files to a single output file.
+
+    Args:
+        dictionary_file: Path to the dictionary file
+        dictionary_name: Name of the dictionary to use
+        input_specs: Dict with 'file': input file path, 'keys': list of key column indices (1-based)
+        secondary_specs: List of dicts with 'path': data path, 'file': file path, 'keys': key indices
+        external_specs: List of dicts with 'root': data root, 'path': data path, 'file': file path
+        output_file: Path to output file
+        field_separator: Character to separate fields
+        error_file: Optional path to error log file
+        max_memory: Optional maximum memory in MB
+
+    Raises:
+        KNIError: If any KNI operation fails
+        FileNotFoundError: If input file is not found
+    """
+    # Initialize KNI
+    kni = KNI()
+
+    # Set error log file
+    if error_file:
+        kni.set_log_file_name(error_file)
+
+    # Set max memory if specified
+    if max_memory:
+        actual_memory = kni.set_stream_max_memory(max_memory)
+        print(f"Stream max memory set to {actual_memory} MB")
+
+    print(f"\nRecode multi-table data to {output_file}")
+
+    # Read the header line of the main table (strip only the line terminator)
+    input_file_path = input_specs["file"]
+    with open(input_file_path, "r", encoding="utf-8") as f:
+        main_header = f.readline().rstrip("\r\n")
+
+    # Open stream with main table header
+    stream_handle = kni.open_stream(
+        dictionary_file, dictionary_name, main_header, field_separator
+    )
+
+    try:
+        # Set secondary table headers
+        secondary_files = {}
+        for spec in secondary_specs:
+            data_path = spec["path"]
+            sec_file = spec["file"]
+
+            with open(sec_file, "r", encoding="utf-8") as f:
+                sec_header = f.readline().rstrip("\r\n")
+
+            kni.set_secondary_header_line(stream_handle, data_path, sec_header)
+
+            # Store opened file for reading records
+ secondary_files[data_path] = { + "file": open(sec_file, "r", encoding="utf-8"), + "keys": spec["keys"], + "records": {}, + } + # Skip header + secondary_files[data_path]["file"].readline() + + # Set external tables + for spec in external_specs: + kni.set_external_table( + stream_handle, spec["root"], spec.get("path", ""), spec["file"] + ) + + # Finish opening stream (required for multi-table) + kni.finish_opening_stream(stream_handle) + + # Load all secondary records into memory indexed by key + for data_path, sec_info in secondary_files.items(): + print(f"Loading secondary table: {data_path}") + for line in sec_info["file"]: + # Strip only the line ending: a bare rstrip() would also remove + # trailing separators and drop empty trailing fields + record = line.rstrip("\r\n") + if not record: + continue + + # Extract key from record + fields = record.split(field_separator) + key_values = [fields[idx - 1] for idx in sec_info["keys"]] + key = tuple(key_values) + + # Store record by key (support multiple records per key) + if key not in sec_info["records"]: + sec_info["records"][key] = [] + sec_info["records"][key].append(record) + + sec_info["file"].close() + + # Process main table records + record_number = 0 + with open(input_file_path, "r", encoding="utf-8") as main_file, open( + output_file, "w", encoding="utf-8" + ) as out_file: + + for line_number, line in enumerate(main_file, start=1): + # Skip header + if line_number == 1: + continue + + # Get main record (strip only the line ending) + main_record = line.rstrip("\r\n") + if not main_record: + continue + + # Extract key from main record + main_fields = main_record.split(field_separator) + main_key = tuple([main_fields[idx - 1] for idx in input_specs["keys"]]) + + # Set all secondary records matching the main record key + for data_path, sec_info in secondary_files.items(): + matching_records = sec_info["records"].get(main_key, []) + for sec_record in matching_records: + kni.set_secondary_input_record( + stream_handle, data_path, sec_record + ) + + # Recode the main record + output_record = kni.recode_stream_record(stream_handle, main_record) +
out_file.write(f"{output_record}\n") + record_number += 1 + + print(f"{record_number} records recoded") + finally: + # Close any open secondary files + for sec_info in secondary_files.values(): + if not sec_info["file"].closed: + sec_info["file"].close() + + # Close stream + kni.close_stream(stream_handle) + + +def main(): + """Main entry point for command-line execution.""" + parser = argparse.ArgumentParser( + description="Recode multi-table input files using a Khiops dictionary.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Example: + KNIRecodeMTFiles -d data/ModelingSpliceJunction.kdic SNB_SpliceJunction \\ + -i data/SpliceJunction.txt 1 \\ + -s DNA data/SpliceJunctionDNA.txt 1 \\ + -o R_SpliceJunction.txt + """, + ) + + parser.add_argument( + "-d", + "--dictionary", + nargs=2, + required=True, + metavar=("FILE", "NAME"), + help="Dictionary file and dictionary name", + ) + parser.add_argument( + "-f", + "--field-separator", + default="\t", + help="Field separator character (default: tab)", + ) + parser.add_argument( + "-i", + "--input", + nargs="+", + required=True, + metavar="ARG", + help="Input file name followed by key column indices (1-based): FILE KEY...", + ) + parser.add_argument( + "-s", + "--secondary", + action="append", + nargs="+", + metavar="ARG", + help="Secondary data path, file name, and key indices: PATH FILE KEY...", + ) + parser.add_argument( + "-x", + "--external", + action="append", + nargs=3, + metavar=("ROOT", "PATH", "FILE"), + help="External data root, path, and file name: ROOT PATH FILE", + ) + parser.add_argument("-o", "--output", required=True, help="Output file name") + parser.add_argument( + "-e", "--error-file", default="", help="Error log file name (optional)" + ) + parser.add_argument("-m", "--max-memory", type=int, help="Maximum memory in MB") + + args = parser.parse_args() + + # Parse input specification + input_file = args.input[0] + key_indices = [int(k) for k in args.input[1:]] + input_specs = 
{"file": input_file, "keys": key_indices} + + # Parse secondary specifications + secondary_specs = [] + if args.secondary: + for sec in args.secondary: + data_path = sec[0] + sec_file = sec[1] + sec_keys = [int(k) for k in sec[2:]] + secondary_specs.append( + {"path": data_path, "file": sec_file, "keys": sec_keys} + ) + + # Parse external specifications + external_specs = [] + if args.external: + for ext in args.external: + external_specs.append({"root": ext[0], "path": ext[1], "file": ext[2]}) + + # Execute recoding + try: + recode_mt_files( + args.dictionary[0], + args.dictionary[1], + input_specs, + secondary_specs, + external_specs, + args.output, + args.field_separator, + args.error_file, + args.max_memory, + ) + return 0 + except KNIError as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + except FileNotFoundError as e: + print(f"Error: File not found: {e.filename}", file=sys.stderr) + return 1 + except Exception as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + + +if __name__ == "__main__": + sys.exit(main())