| 1 | +# PyPI Integration |
| 2 | + |
| 3 | +CORE is available as a Python package for direct integration into your own pipelines and tooling. |
| 4 | + |
| 5 | +```bash |
| 6 | +pip install cdisc-rules-engine |
| 7 | +``` |
| 8 | + |
| 9 | +This installs the engine underlying the CLI and executable, but **does not include `core.py`** or the CLI entrypoints. If you need the full CLI, use the [executable or source code](quick-start.md) instead. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## What You'll Need |
| 14 | + |
| 15 | +Installing the package alone is not enough to run validations. You also need: |
| 16 | + |
| 17 | +1. **The rules cache** — download the contents of `resources/cache/` from the [repository](https://github.com/cdisc-org/cdisc-rules-engine) and store them somewhere in your project. Keep this in sync with the package version you're using. |
| 18 | +2. **A CDISC Library API key** — required for controlled terminology and library metadata. See [update-cache](cli-reference.md#updating-the-cache-update-cache) for how to obtain one. |
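
Before loading anything, it can help to confirm that the cache directory you copied actually exists and contains files. The following is a minimal sketch using only the standard library; `check_cache_dir` is a hypothetical helper name, not part of `cdisc-rules-engine`, and the path you pass is whatever location you chose in your project.

```python
from pathlib import Path


def check_cache_dir(path_to_rules_cache):
    """Return the sorted file names found directly inside the cache directory.

    Hypothetical helper (not part of cdisc-rules-engine). Raises if the
    directory is missing or empty so problems surface before validation starts.
    """
    cache_path = Path(path_to_rules_cache)
    if not cache_path.is_dir():
        raise FileNotFoundError(f"Rules cache directory not found: {cache_path}")
    # Only look at files directly inside the directory, not subfolders
    files = sorted(p.name for p in cache_path.iterdir() if p.is_file())
    if not files:
        raise ValueError(f"Rules cache directory is empty: {cache_path}")
    return files
```

Calling it once at startup gives a clear error message instead of a silently empty rule set later on.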
| 19 | + |
| 20 | +The package also bundles the USDM and Dataset-JSON schemas, which are available when you use the dataset reader classes in `cdisc_rules_engine/services/data_readers` or the metadata readers in `cdisc_rules_engine/services`. |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +## Choosing an Approach |
| 25 | + |
| 26 | +| | Option A: Business Rules Engine | Option B: RulesEngine Class | |
| 27 | +| -------------- | ------------------------------- | -------------------------------------- | |
| 28 | +| **Interface** | Low-level, rule-by-rule | High-level, dataset-oriented | |
| 29 | +| **Data input** | pandas DataFrame | XPT or other file-based datasets | |
| 30 | +| **Setup** | Minimal | More configuration required | |
| 31 | +| **Best for** | Simple in-memory validation | Full multi-domain validation pipelines | |
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## Loading the Rules Cache |
| 36 | + |
| 37 | +Both options start by loading the cache: |
| 38 | + |
| 39 | +```python |
| 40 | +import os |
| 41 | +import pathlib |
| 42 | +import pickle |
| 43 | +from multiprocessing.managers import SyncManager |
| 44 | +from cdisc_rules_engine.services.cache import InMemoryCacheService |
| 45 | + |
| 46 | +class CacheManager(SyncManager): |
| 47 | +    pass |
| 48 | + |
| 49 | +CacheManager.register("InMemoryCacheService", InMemoryCacheService) |
| 50 | + |
| 51 | +def load_rules_cache(path_to_rules_cache): |
| 52 | +    cache_path = pathlib.Path(path_to_rules_cache) |
| 53 | +    manager = CacheManager() |
| 54 | +    manager.start() |
| 55 | +    cache = manager.InMemoryCacheService() |
| 56 | + |
| 57 | +    files = next(os.walk(cache_path), (None, None, []))[2] |
| 58 | +    for fname in files: |
| 59 | +        with open(cache_path / fname, "rb") as f: |
| 60 | +            cache.add_all(pickle.load(f)) |
| 61 | + |
| 62 | +    return cache |
| 63 | +``` |
| 64 | + |
| 65 | +Retrieve rules for a standard and version: |
| 66 | + |
| 67 | +```python |
| 68 | +from cdisc_rules_engine.utilities.utils import get_rules_cache_key |
| 69 | + |
| 70 | +cache = load_rules_cache("path/to/rules/cache") |
| 71 | +# Note: version uses dashes, not dots |
| 72 | +rules = cache.get_all_by_prefix(get_rules_cache_key("sdtmig", "3-4")) |
| 73 | +``` |
| 74 | + |
| 75 | +Each rule is a dict with keys: `core_id`, `domains`, `author`, `reference`, `sensitivity`, `executability`, `description`, `authorities`, `standards`, `classes`, `rule_type`, `conditions`, `actions`, `datasets`, `output_variables`. |
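
Since each rule is a plain dict, ordinary Python is enough to inspect what was loaded. The sketch below tallies rules by `rule_type`. The three sample dicts are made-up stand-ins carrying only two of the keys listed above, and `count_by_rule_type` is a hypothetical helper rather than anything the package provides.

```python
from collections import Counter


def count_by_rule_type(rules):
    """Tally rule dicts by their `rule_type` key (hypothetical helper)."""
    return Counter(rule.get("rule_type", "unknown") for rule in rules)


# Made-up stand-ins for the dicts returned by cache.get_all_by_prefix(...)
sample_rules = [
    {"core_id": "EXAMPLE-1", "rule_type": "Record Data"},
    {"core_id": "EXAMPLE-2", "rule_type": "Record Data"},
    {"core_id": "EXAMPLE-3", "rule_type": "Dataset Metadata Check"},
]

counts = count_by_rule_type(sample_rules)
for rule_type, n in counts.most_common():
    print(f"{rule_type}: {n}")
```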
| 76 | + |
| 77 | +If you have rules in raw CDISC metadata format, convert them first: |
| 78 | + |
| 79 | +```python |
| 80 | +from cdisc_rules_engine.models.rule import Rule |
| 81 | + |
| 82 | +rule_dict = Rule.from_cdisc_metadata(rule_metadata) |
| 83 | +rule_obj = Rule(rule_dict) |
| 84 | +``` |
| 85 | + |
| 86 | +--- |
| 87 | + |
| 88 | +## Option A: Business Rules Engine |
| 89 | + |
| 90 | +Minimal setup — good for validating a single domain against an in-memory DataFrame. |
| 91 | + |
| 92 | +### Prepare Your Data |
| 93 | + |
| 94 | +```python |
| 95 | +from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset |
| 96 | +from cdisc_rules_engine.models.dataset_variable import DatasetVariable |
| 97 | +from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata |
| 98 | + |
| 99 | +pandas_dataset = PandasDataset(data=df)  # df: your pandas DataFrame for this domain |
| 100 | +dataset_metadata = SDTMDatasetMetadata( |
| 101 | +    name="AE", |
| 102 | +    label="Adverse Events", |
| 103 | +    first_record=df.iloc[0].to_dict() if not df.empty else None, |
| 104 | +) |
| 105 | +dataset_variable = DatasetVariable( |
| 106 | +    pandas_dataset, |
| 107 | +    column_prefix_map={"--": dataset_metadata.domain}, |
| 108 | +) |
| 109 | +``` |
| 110 | + |
| 111 | +`DatasetVariable` accepts additional optional arguments for richer validation: |
| 112 | + |
| 113 | +```python |
| 114 | +dataset_variable = DatasetVariable( |
| 115 | +    pandas_dataset, |
| 116 | +    column_prefix_map={"--": dataset_metadata.domain}, |
| 117 | +    value_level_metadata=value_level_metadata, |
| 118 | +    column_codelist_map=variable_codelist_map, |
| 119 | +    codelist_term_maps=codelist_term_maps, |
| 120 | +) |
| 121 | +``` |
| 122 | + |
| 123 | +### Run Rules |
| 124 | + |
| 125 | +```python |
| 126 | +from business_rules.engine import run |
| 127 | +from cdisc_rules_engine.models.actions import COREActions |
| 128 | + |
| 129 | +ae_rules = [ |
| 130 | +    r for r in rules |
| 131 | +    if "AE" in r.get("domains", {}).get("Include", []) |
| 132 | +    or "ALL" in r.get("domains", {}).get("Include", []) |
| 133 | +] |
| 134 | + |
| 135 | +all_results = [] |
| 136 | +for rule in ae_rules: |
| 137 | +    results = [] |
| 138 | +    core_actions = COREActions( |
| 139 | +        output_container=results, |
| 140 | +        variable=dataset_variable, |
| 141 | +        dataset_metadata=dataset_metadata, |
| 142 | +        rule=rule, |
| 143 | +        value_level_metadata=None, |
| 144 | +    ) |
| 145 | +    try: |
| 146 | +        was_triggered = run(rule=rule, defined_variables=dataset_variable, defined_actions=core_actions) |
| 147 | +        if was_triggered: |
| 148 | +            all_results.extend(results) |
| 149 | +    except Exception as e: |
| 150 | +        print(f"Error in {rule.get('core_id')}: {e}") |
| 151 | +``` |
| 152 | + |
| 153 | +`was_triggered` is `True` if issues were found. Each result in `all_results` looks like: |
| 154 | + |
| 155 | +```python |
| 156 | +{ |
| 157 | +    'executionStatus': 'success', |
| 158 | +    'domain': 'AE', |
| 159 | +    'variables': ['AESLIFE'], |
| 160 | +    'message': 'AESLIFE is completed, but not equal to "N" or "Y"', |
| 161 | +    'errors': [{'value': {'AESLIFE': 'Maybe'}, 'row': 1}] |
| 162 | +} |
| 163 | +``` |
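
Once results are collected, a quick per-variable tally is often more useful than the raw list. The sketch below assumes every result has the shape shown above (a `variables` list and an `errors` list). `count_errors_by_variable` is a hypothetical helper, and the sample data simply repeats the example result.

```python
from collections import Counter


def count_errors_by_variable(results):
    """Count reported error rows per variable across result dicts.

    Assumes each result carries a `variables` list and an `errors` list.
    Hypothetical helper, not part of cdisc-rules-engine.
    """
    counts = Counter()
    for result in results:
        n_errors = len(result.get("errors", []))
        # Attribute the result's error rows to every variable it names
        for variable in result.get("variables", []):
            counts[variable] += n_errors
    return counts


example_results = [
    {
        "executionStatus": "success",
        "domain": "AE",
        "variables": ["AESLIFE"],
        "message": 'AESLIFE is completed, but not equal to "N" or "Y"',
        "errors": [{"value": {"AESLIFE": "Maybe"}, "row": 1}],
    }
]

summary = count_errors_by_variable(example_results)
```

A result that names several variables counts its error rows once per variable, so treat the totals as a rough guide rather than an exact row count.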
| 164 | + |
| 165 | +--- |
| 166 | + |
| 167 | +## Option B: RulesEngine Class |
| 168 | + |
| 169 | +More setup, but handles dataset reading, preprocessing, and multi-domain validation. The source code in `cdisc_rules_engine/rules_engine.py` and the existing CLI implementation in `core.py` are the best reference for wiring this together — the initializer arguments map closely to the CLI flags documented in the [CLI Reference](cli-reference.md). |
| 170 | + |
| 171 | +### Step 1: Prepare Dataset Metadata |
| 172 | + |
| 173 | +```python |
| 174 | +import os |
| 175 | +import pyreadstat |
| 176 | +from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata |
| 177 | + |
| 178 | +def create_dataset_metadata(file_path): |
| 179 | +    data, meta = pyreadstat.read_xport(file_path) |
| 180 | +    first_record = data.iloc[0].to_dict() if not data.empty else None |
| 181 | +    return SDTMDatasetMetadata( |
| 182 | +        name=os.path.basename(file_path).split('.')[0].upper(), |
| 183 | +        label=meta.file_label if hasattr(meta, 'file_label') else "", |
| 184 | +        filename=os.path.basename(file_path), |
| 185 | +        full_path=file_path, |
| 186 | +        file_size=os.path.getsize(file_path), |
| 187 | +        record_count=len(data), |
| 188 | +        first_record=first_record, |
| 189 | +    ) |
| 190 | + |
| 191 | +datasets = [ |
| 192 | +    create_dataset_metadata(os.path.join(directory, f))  # directory: folder holding your .xpt files |
| 193 | +    for f in os.listdir(directory) |
| 194 | +    if f.lower().endswith('.xpt') |
| 195 | +] |
| 196 | +``` |
| 197 | + |
| 198 | +You don't need to manually create `PandasDataset` or `DatasetVariable` objects for Option B — the engine handles this internally. |
| 199 | + |
| 200 | +### Step 2: Initialize Library Metadata |
| 201 | + |
| 202 | +```python |
| 203 | +from cdisc_rules_engine.models.library_metadata_container import LibraryMetadataContainer |
| 204 | +from cdisc_rules_engine.utilities.utils import ( |
| 205 | +    get_library_variables_metadata_cache_key, |
| 206 | +    get_model_details_cache_key_from_ig, |
| 207 | +    get_standard_details_cache_key, |
| 208 | +    get_variable_codelist_map_cache_key, |
| 209 | +) |
| 210 | + |
| 211 | +standard = "sdtmig" |
| 212 | +standard_version = "3-4" |
| 213 | +standard_substandard = None |
| 214 | + |
| 215 | +standard_metadata = cache.get(get_standard_details_cache_key(standard, standard_version, standard_substandard)) |
| 216 | +model_metadata = cache.get(get_model_details_cache_key_from_ig(standard_metadata)) if standard_metadata else {} |
| 217 | + |
| 218 | +ct_packages = ["sdtmct-2021-12-17"] # replace with your CT package versions |
| 219 | +ct_package_metadata = {pkg: cache.get(pkg) for pkg in ct_packages} |
| 220 | + |
| 221 | +library_metadata = LibraryMetadataContainer( |
| 222 | +    standard_metadata=standard_metadata, |
| 223 | +    model_metadata=model_metadata, |
| 224 | +    variables_metadata=cache.get(get_library_variables_metadata_cache_key(standard, standard_version, standard_substandard)), |
| 225 | +    variable_codelist_map=cache.get(get_variable_codelist_map_cache_key(standard, standard_version, standard_substandard)), |
| 226 | +    ct_package_metadata=ct_package_metadata, |
| 227 | +) |
| 228 | +``` |
| 229 | + |
| 230 | +### Step 3: Initialize Data Service |
| 231 | + |
| 232 | +```python |
| 233 | +from cdisc_rules_engine.config import config as default_config |
| 234 | +from cdisc_rules_engine.services.data_services import DataServiceFactory |
| 235 | + |
| 236 | +max_dataset_size = max(datasets, key=lambda x: x.file_size).file_size |
| 237 | +# Set max_dataset_size=0 to force Dask processing for all datasets |
| 238 | + |
| 239 | +data_service_factory = DataServiceFactory( |
| 240 | +    config=default_config, |
| 241 | +    cache_service=cache, |
| 242 | +    standard=standard, |
| 243 | +    standard_version=standard_version, |
| 244 | +    standard_substandard=standard_substandard, |
| 245 | +    library_metadata=library_metadata, |
| 246 | +    max_dataset_size=max_dataset_size, |
| 247 | +) |
| 248 | +dataset_paths = [d.full_path for d in datasets]  # file paths recorded in Step 1 |
| 249 | +data_service = data_service_factory.get_data_service(dataset_paths) |
| 250 | +``` |
| 251 | + |
| 252 | +### Step 4: Initialize Rules Engine |
| 253 | + |
| 254 | +```python |
| 255 | +from cdisc_rules_engine.rules_engine import RulesEngine |
| 256 | + |
| 257 | +rules_engine = RulesEngine( |
| 258 | +    cache=cache, |
| 259 | +    data_service=data_service, |
| 260 | +    config_obj=default_config, |
| 261 | +    external_dictionaries=None, |
| 262 | +    standard=standard, |
| 263 | +    standard_version=standard_version, |
| 264 | +    standard_substandard=standard_substandard, |
| 265 | +    library_metadata=library_metadata, |
| 266 | +    max_dataset_size=max_dataset_size, |
| 267 | +    dataset_paths=dataset_paths, |
| 268 | +    ct_packages=ct_packages, |
| 269 | +    define_xml_path="path/to/define.xml",  # optional |
| 270 | +    validate_xml=False, |
| 271 | +) |
| 272 | +``` |
| 273 | + |
| 274 | +### Step 5: Run Validation |
| 275 | + |
| 276 | +Note the `ConditionCompositeFactory` conversion step — this is required before passing rules to `validate_single_rule`: |
| 277 | + |
| 278 | +```python |
| 279 | +import time |
| 280 | +from cdisc_rules_engine.models.rule_conditions import ConditionCompositeFactory |
| 281 | +from cdisc_rules_engine.models.rule_validation_result import RuleValidationResult |
| 282 | + |
| 283 | +start_time = time.time() |
| 284 | +validation_results = [] |
| 285 | + |
| 286 | +for rule in rules: |
| 287 | +    try: |
| 288 | +        if isinstance(rule["conditions"], dict): |
| 289 | +            rule["conditions"] = ConditionCompositeFactory.get_condition_composite(rule["conditions"]) |
| 290 | +        results = rules_engine.validate_single_rule(rule, datasets) |
| 291 | +        flattened = [r for domain_results in results.values() for r in domain_results] |
| 292 | +        validation_results.append(RuleValidationResult(rule, flattened)) |
| 293 | +    except Exception as e: |
| 294 | +        print(f"Error validating rule {rule.get('core_id')}: {e}") |
| 295 | + |
| 296 | +elapsed_time = time.time() - start_time |
| 297 | +``` |
| 298 | + |
| 299 | +### Step 6: Generate Report |
| 300 | + |
| 301 | +Simple text output: |
| 302 | + |
| 303 | +```python |
| 304 | +import json |
| 305 | + |
| 306 | +with open("validation_results.txt", "w") as f: |
| 307 | +    for result in validation_results: |
| 308 | +        rule_id = result.rule.get("core_id", "Unknown") |
| 309 | +        f.write(f"Rule: {rule_id}\n") |
| 310 | +        if hasattr(result, 'violations') and result.violations: |
| 311 | +            f.write(f"Found {len(result.violations)} violations\n") |
| 312 | +            for violation in result.violations: |
| 313 | +                f.write(f"  - {json.dumps(violation, default=str)}\n") |
| 314 | +        else: |
| 315 | +            f.write("  No violations found\n") |
| 316 | +        f.write("\n") |
| 317 | +``` |
| 318 | + |
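| 319 | +For structured output, use `ReportFactory`. The snippet assumes `ReportFactory` is already imported and that `args` holds the validation arguments the CLI builds in `core.py`; this guide defines neither: |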
| 320 | + |
| 321 | +```python |
| 322 | +reporting_factory = ReportFactory( |
| 323 | +    datasets=datasets, |
| 324 | +    validation_results=validation_results, |
| 325 | +    elapsed_time=elapsed_time, |
| 326 | +    args=args, |
| 327 | +    data_service=data_service, |
| 328 | +) |
| 329 | +reporting_services = reporting_factory.get_report_services() |
| 330 | +``` |
| 331 | + |
| 332 | +--- |
| 333 | + |
| 334 | +## Notes |
| 335 | + |
| 336 | +**Cache key format** — always use dashes in version strings (`3-4`, not `3.4`). |
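
If version strings reach your code in dotted form (from a config file, say), normalizing them in one place avoids cache misses. The helper below is a small sketch; `to_cache_version` is a hypothetical name and not something the package provides.

```python
def to_cache_version(version):
    """Convert a dotted version such as "3.4" to the dashed form "3-4".

    Hypothetical convenience helper; already-dashed input is returned unchanged.
    """
    return str(version).strip().replace(".", "-")


standard_version = to_cache_version("3.4")
```

Pass the result wherever this guide uses `standard_version`, so the same dashed form is used for every cache lookup.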
| 337 | + |
| 338 | +**`column_prefix_map`** — maps the `--` variable prefix to the dataset domain (e.g. `{"--": "AE"}`), resolving placeholders like `--SEQ` → `AESEQ`. |
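
To make the placeholder resolution concrete, here is a pure-Python illustration of the effect. It only mimics the behaviour described above and is not the engine's own implementation; `resolve_placeholder` is a hypothetical name.

```python
def resolve_placeholder(variable, column_prefix_map):
    """Replace a leading placeholder prefix (e.g. "--") with its mapped value.

    Illustration only: mirrors the described effect of column_prefix_map,
    not the engine's internal logic.
    """
    for prefix, replacement in column_prefix_map.items():
        if variable.startswith(prefix):
            return replacement + variable[len(prefix):]
    # Names without a placeholder prefix pass through unchanged
    return variable


resolved = resolve_placeholder("--SEQ", {"--": "AE"})
print(resolved)  # AESEQ
```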
| 339 | + |
| 340 | +**External dictionaries** — pass an `ExternalDictionariesContainer` to `RulesEngine` if validating rules that require MedDRA, WHODrug, LOINC, UNII, MedRT, or SNOMED. See the [External Dictionary Reference](https://cdisc-org.github.io/conformance-rules-editor/#/exdictionary). |
| 341 | + |
| 342 | +**Dask** — set `max_dataset_size=0` when initializing `DataServiceFactory` to force Dask processing for all datasets. |
| 343 | + |
| 344 | +**Windows compatibility** — add `freeze_support()` for multiprocessing: |
| 345 | + |
| 346 | +```python |
| 347 | +from multiprocessing import freeze_support |
| 348 | + |
| 349 | +if __name__ == "__main__": |
| 350 | +    freeze_support() |
| 351 | +    main()  # your own entry point; not defined in this guide |
| 352 | +``` |
| 353 | + |
| 354 | +--- |
| 355 | + |
| 356 | +## Troubleshooting |
| 357 | + |
| 358 | +- Ensure the DataFrame contains all required columns for the rules being run |
| 359 | +- `column_prefix_map` must correctly map `"--"` to the domain (e.g. `{"--": "AE"}`) |
| 360 | +- The dataset object must be a `PandasDataset` instance, not a raw pandas DataFrame |
| 361 | +- `full_path` must be set in `SDTMDatasetMetadata` when using the `RulesEngine` approach |
| 362 | +- The rule's `domains.Include` must match your dataset's domain |
| 363 | +- `standard_version` format must be consistent throughout (`3-4`, not `3.4`) |
| 364 | +- CT package metadata must be present in the cache if validating against controlled terminology |
| 365 | +- When using `define.xml`, the file must be named `define.xml` and the path must be valid |
| 366 | +- If using external dictionaries, verify all file paths are correct and accessible |
| 367 | +- Don't forget the `ConditionCompositeFactory` conversion before calling `validate_single_rule` (Option B) |
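
Some of these checks can be automated before a run. As one example, the sketch below covers the `define.xml` item using only the standard library. `check_define_xml_path` is a hypothetical pre-flight helper, not part of `cdisc-rules-engine`.

```python
from pathlib import Path


def check_define_xml_path(define_xml_path):
    """Return a list of problems with a define.xml path (empty if none found).

    Hypothetical pre-flight helper, not part of cdisc-rules-engine.
    """
    problems = []
    path = Path(define_xml_path)
    if path.name != "define.xml":
        problems.append(f"file must be named define.xml, got {path.name!r}")
    if not path.is_file():
        problems.append(f"file does not exist: {path}")
    return problems
```

An empty list means both conditions hold; otherwise each entry describes one problem to fix before passing the path as `define_xml_path`.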