Commit ecb58e7

authored
readme overhaul (#1731)
* readme overhaul
* prettier
* moved logo
* pic
* prettier
* fixes
* remove -vo
* lint
1 parent 98677bb commit ecb58e7

16 files changed

Lines changed: 1481 additions & 1542 deletions

PYPI.md

Lines changed: 0 additions & 779 deletions
This file was deleted.

README.md

Lines changed: 11 additions & 745 deletions
Large diffs are not rendered by default.

docs/.nojekyll

Whitespace-only changes.

docs/PYPI.md

Lines changed: 367 additions & 0 deletions
@@ -0,0 +1,367 @@
# PyPI Integration

CORE is available as a Python package for direct integration into your own pipelines and tooling.

```bash
pip install cdisc-rules-engine
```

This installs the engine underlying the CLI and executable, but **does not include `core.py`** or the CLI entrypoints. If you need the full CLI, use the [executable or source code](quick-start.md) instead.

---

## What You'll Need

Installing the package alone is not enough to run validations. You also need:

1. **The rules cache**: download the contents of `resources/cache/` from the [repository](https://github.com/cdisc-org/cdisc-rules-engine) and store them somewhere in your project. Keep this in sync with the package version you're using.
2. **A CDISC Library API key**: required for controlled terminology and library metadata. See [update-cache](cli-reference.md#updating-the-cache-update-cache) for how to obtain one.

The package also includes the USDM and Dataset-JSON schemas, available if you use the dataset reader classes in `cdisc_rules_engine/services/data_readers` or the metadata readers in `cdisc_rules_engine/services`.

---

## Choosing an Approach

|                | Option A: Business Rules Engine | Option B: RulesEngine Class            |
| -------------- | ------------------------------- | -------------------------------------- |
| **Interface**  | Low-level, rule-by-rule         | High-level, dataset-oriented           |
| **Data input** | pandas DataFrame                | XPT or other file-based datasets       |
| **Setup**      | Minimal                         | More configuration required            |
| **Best for**   | Simple in-memory validation     | Full multi-domain validation pipelines |

---

## Loading the Rules Cache

Both options start by loading the cache:

```python
import os
import pathlib
import pickle
from multiprocessing.managers import SyncManager
from cdisc_rules_engine.services.cache import InMemoryCacheService

class CacheManager(SyncManager):
    pass

CacheManager.register("InMemoryCacheService", InMemoryCacheService)

def load_rules_cache(path_to_rules_cache):
    cache_path = pathlib.Path(path_to_rules_cache)
    manager = CacheManager()
    manager.start()
    cache = manager.InMemoryCacheService()

    # Load every pickle file found directly inside the cache directory
    files = next(os.walk(cache_path), (None, None, []))[2]
    for fname in files:
        with open(cache_path / fname, "rb") as f:
            cache.add_all(pickle.load(f))

    return cache
```
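Because `os.walk` yields nothing for a path that does not exist, `load_rules_cache` returns an empty cache rather than failing when the path is wrong. A minimal pre-flight check, using only the standard library, might look like the sketch below. `list_cache_files` is a hypothetical helper name and is not part of the package.

```python
import pathlib


def list_cache_files(path_to_rules_cache):
    """Return the regular files directly inside the cache directory, sorted by name.

    Hypothetical helper: raises instead of silently returning nothing.
    """
    cache_path = pathlib.Path(path_to_rules_cache)
    if not cache_path.is_dir():
        raise FileNotFoundError(f"Rules cache directory not found: {cache_path}")
    # Only files at the top level are considered, matching load_rules_cache above
    files = sorted(p for p in cache_path.iterdir() if p.is_file())
    if not files:
        raise ValueError(f"Rules cache directory is empty: {cache_path}")
    return files
```

Calling it before `load_rules_cache` turns a misconfigured path into an immediate, readable error.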

Retrieve rules for a standard and version:

```python
from cdisc_rules_engine.utilities.utils import get_rules_cache_key

cache = load_rules_cache("path/to/rules/cache")
# Note: version uses dashes, not dots
rules = cache.get_all_by_prefix(get_rules_cache_key("sdtmig", "3-4"))
```
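If version strings reach your code in dotted form (for example from user input), normalizing them once avoids cache misses. The following is a small sketch; `to_cache_version` is a hypothetical helper, not a function provided by the package.

```python
def to_cache_version(version):
    """Convert a dotted version such as "3.4" to the dashed form "3-4".

    Hypothetical helper; values already in dashed form pass through unchanged.
    """
    return str(version).strip().replace(".", "-")
```

The result can then be passed wherever this guide uses a version string such as `"3-4"`.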

Each rule is a dict with keys: `core_id`, `domains`, `author`, `reference`, `sensitivity`, `executability`, `description`, `authorities`, `standards`, `classes`, `rule_type`, `conditions`, `actions`, `datasets`, `output_variables`.
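When rules come from more than one source, a quick structural check can catch incomplete dicts before they reach the engine. The sketch below uses the key list above; `EXPECTED_RULE_KEYS` and `missing_rule_keys` are hypothetical names introduced only for illustration.

```python
# Keys listed in this guide for a cached rule dict
EXPECTED_RULE_KEYS = {
    "core_id", "domains", "author", "reference", "sensitivity",
    "executability", "description", "authorities", "standards",
    "classes", "rule_type", "conditions", "actions", "datasets",
    "output_variables",
}


def missing_rule_keys(rule):
    """Return the expected keys that are absent from a rule dict, sorted by name."""
    return sorted(EXPECTED_RULE_KEYS - set(rule))
```

An empty list means the rule has every key named above; it says nothing about whether the values are valid.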

If you have rules in raw CDISC metadata format, convert them first:

```python
from cdisc_rules_engine.models.rule import Rule

rule_dict = Rule.from_cdisc_metadata(rule_metadata)
rule_obj = Rule(rule_dict)
```

---

## Option A: Business Rules Engine

Minimal setup, good for validating a single domain against an in-memory DataFrame.

### Prepare Your Data

```python
from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset
from cdisc_rules_engine.models.dataset_variable import DatasetVariable
from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata

# df is your pandas DataFrame holding the AE records
pandas_dataset = PandasDataset(data=df)
dataset_metadata = SDTMDatasetMetadata(
    name="AE",
    label="Adverse Events",
    first_record=df.iloc[0].to_dict() if not df.empty else None
)
dataset_variable = DatasetVariable(
    pandas_dataset,
    column_prefix_map={"--": dataset_metadata.domain},
)
```

`DatasetVariable` accepts additional optional arguments for richer validation:

```python
dataset_variable = DatasetVariable(
    pandas_dataset,
    column_prefix_map={"--": dataset_metadata.domain},
    value_level_metadata=value_level_metadata,
    column_codelist_map=variable_codelist_map,
    codelist_term_maps=codelist_term_maps,
)
```

### Run Rules

```python
from business_rules.engine import run
from cdisc_rules_engine.models.actions import COREActions

ae_rules = [
    r for r in rules
    if "AE" in r.get("domains", {}).get("Include", [])
    or "ALL" in r.get("domains", {}).get("Include", [])
]

all_results = []
for rule in ae_rules:
    results = []
    core_actions = COREActions(
        output_container=results,
        variable=dataset_variable,
        dataset_metadata=dataset_metadata,
        rule=rule,
        value_level_metadata=None,
    )
    try:
        was_triggered = run(rule=rule, defined_variables=dataset_variable, defined_actions=core_actions)
        if was_triggered:
            all_results.extend(results)
    except Exception as e:
        print(f"Error in {rule.get('core_id')}: {e}")
```

`was_triggered` is `True` if issues were found. Each result in `all_results` looks like:

```python
{
    'executionStatus': 'success',
    'domain': 'AE',
    'variables': ['AESLIFE'],
    'message': 'AESLIFE is completed, but not equal to "N" or "Y"',
    'errors': [{'value': {'AESLIFE': 'Maybe'}, 'row': 1}]
}
```
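Since each result is a plain dict, results can be summarized without any engine-specific tooling. The sketch below assumes results have the shape shown above (`variables` and `errors` keys); `count_errors_by_variable` is a hypothetical helper name.

```python
from collections import Counter


def count_errors_by_variable(all_results):
    """Count reported error rows per variable across result dicts."""
    counts = Counter()
    for result in all_results:
        n_errors = len(result.get("errors", []))
        # A result may name several variables; each is credited with its errors
        for variable in result.get("variables", []):
            counts[variable] += n_errors
    return counts


# Sample input copied from the result shape shown in this guide
sample_results = [
    {
        "executionStatus": "success",
        "domain": "AE",
        "variables": ["AESLIFE"],
        "message": 'AESLIFE is completed, but not equal to "N" or "Y"',
        "errors": [{"value": {"AESLIFE": "Maybe"}, "row": 1}],
    }
]
print(count_errors_by_variable(sample_results))
```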

---

## Option B: RulesEngine Class

More setup, but handles dataset reading, preprocessing, and multi-domain validation. The source code in `cdisc_rules_engine/rules_engine.py` and the existing CLI implementation in `core.py` are the best reference for wiring this together; the initializer arguments map closely to the CLI flags documented in the [CLI Reference](cli-reference.md).

### Step 1: Prepare Dataset Metadata

```python
import os
import pyreadstat
from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata

def create_dataset_metadata(file_path):
    data, meta = pyreadstat.read_xport(file_path)
    first_record = data.iloc[0].to_dict() if not data.empty else None
    return SDTMDatasetMetadata(
        name=os.path.basename(file_path).split('.')[0].upper(),
        label=meta.file_label if hasattr(meta, 'file_label') else "",
        filename=os.path.basename(file_path),
        full_path=file_path,
        file_size=os.path.getsize(file_path),
        record_count=len(data),
        first_record=first_record,
    )

directory = "path/to/datasets"  # folder containing your .xpt files
datasets = [
    create_dataset_metadata(os.path.join(directory, f))
    for f in os.listdir(directory)
    if f.lower().endswith('.xpt')
]
```

You don't need to manually create `PandasDataset` or `DatasetVariable` objects for Option B; the engine handles this internally.
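The file discovery and naming logic in Step 1 can be tried without any XPT reader installed. The sketch below reproduces only that part with `pathlib`; `find_xpt_files` and `dataset_name` are hypothetical helper names. Unlike the `os.listdir` comprehension, it also skips directories whose names happen to end in `.xpt`.

```python
import pathlib


def find_xpt_files(directory):
    """Return sorted paths of files ending in .xpt (case-insensitive) in `directory`."""
    return sorted(
        str(p)
        for p in pathlib.Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".xpt"
    )


def dataset_name(file_path):
    """Derive the dataset name as in Step 1: text before the first dot, upper-cased."""
    return pathlib.Path(file_path).name.split(".")[0].upper()
```

Each returned path can be passed to `create_dataset_metadata`.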

### Step 2: Initialize Library Metadata

```python
from cdisc_rules_engine.models.library_metadata_container import LibraryMetadataContainer
from cdisc_rules_engine.utilities.utils import (
    get_library_variables_metadata_cache_key,
    get_model_details_cache_key_from_ig,
    get_standard_details_cache_key,
    get_variable_codelist_map_cache_key,
)

standard = "sdtmig"
standard_version = "3-4"
standard_substandard = None

standard_metadata = cache.get(get_standard_details_cache_key(standard, standard_version, standard_substandard))
model_metadata = cache.get(get_model_details_cache_key_from_ig(standard_metadata)) if standard_metadata else {}

ct_packages = ["sdtmct-2021-12-17"]  # replace with your CT package versions
ct_package_metadata = {pkg: cache.get(pkg) for pkg in ct_packages}

library_metadata = LibraryMetadataContainer(
    standard_metadata=standard_metadata,
    model_metadata=model_metadata,
    variables_metadata=cache.get(get_library_variables_metadata_cache_key(standard, standard_version, standard_substandard)),
    variable_codelist_map=cache.get(get_variable_codelist_map_cache_key(standard, standard_version, standard_substandard)),
    ct_package_metadata=ct_package_metadata,
)
```

### Step 3: Initialize Data Service

```python
from cdisc_rules_engine.config import config as default_config
from cdisc_rules_engine.services.data_services import DataServiceFactory

# Paths of the datasets prepared in Step 1 (full_path is set on each metadata object)
dataset_paths = [ds.full_path for ds in datasets]

max_dataset_size = max(datasets, key=lambda x: x.file_size).file_size
# Set max_dataset_size=0 to force Dask processing for all datasets

data_service_factory = DataServiceFactory(
    config=default_config,
    cache_service=cache,
    standard=standard,
    standard_version=standard_version,
    standard_substandard=standard_substandard,
    library_metadata=library_metadata,
    max_dataset_size=max_dataset_size,
)

data_service = data_service_factory.get_data_service(dataset_paths)
```

### Step 4: Initialize Rules Engine

```python
from cdisc_rules_engine.rules_engine import RulesEngine

rules_engine = RulesEngine(
    cache=cache,
    data_service=data_service,
    config_obj=default_config,
    external_dictionaries=None,
    standard=standard,
    standard_version=standard_version,
    standard_substandard=None,
    library_metadata=library_metadata,
    max_dataset_size=max_dataset_size,
    dataset_paths=dataset_paths,
    ct_packages=ct_packages,
    define_xml_path="path/to/define.xml",  # optional
    validate_xml=False,
)
```

### Step 5: Run Validation

Note the `ConditionCompositeFactory` conversion step, which is required before passing rules to `validate_single_rule`:

```python
import time
from cdisc_rules_engine.models.rule_conditions import ConditionCompositeFactory
from cdisc_rules_engine.models.rule_validation_result import RuleValidationResult

start_time = time.time()
validation_results = []

for rule in rules:
    try:
        if isinstance(rule["conditions"], dict):
            rule["conditions"] = ConditionCompositeFactory.get_condition_composite(rule["conditions"])
        results = rules_engine.validate_single_rule(rule, datasets)
        flattened = [r for domain_results in results.values() for r in domain_results]
        validation_results.append(RuleValidationResult(rule, flattened))
    except Exception as e:
        print(f"Error validating rule {rule.get('core_id')}: {e}")

elapsed_time = time.time() - start_time
```
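The flattening comprehension in the loop implies that `validate_single_rule` returns a mapping whose values are lists of per-dataset results. Under that assumption, the step can be illustrated in isolation with plain dicts. `flatten_domain_results` is a hypothetical name and the sample data is made up for the example.

```python
from itertools import chain


def flatten_domain_results(results_by_domain):
    """Flatten {key: [result, ...]} into one list, preserving insertion order."""
    return list(chain.from_iterable(results_by_domain.values()))


# Made-up stand-in for the mapping returned per rule
sample = {
    "AE": [{"domain": "AE", "errors": []}],
    "DM": [{"domain": "DM", "errors": []}, {"domain": "DM", "errors": [{"row": 3}]}],
}
flat = flatten_domain_results(sample)
print(len(flat))
```

This is equivalent to the nested comprehension used in the loop; either form can be used.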

### Step 6: Generate Report

Simple text output:

```python
import json

with open("validation_results.txt", "w") as f:
    for result in validation_results:
        rule_id = result.rule.get("core_id", "Unknown")
        f.write(f"Rule: {rule_id}\n")
        if hasattr(result, 'violations') and result.violations:
            f.write(f"Found {len(result.violations)} violations\n")
            for violation in result.violations:
                f.write(f"  - {json.dumps(violation, default=str)}\n")
        else:
            f.write("  No violations found\n")
        f.write("\n")
```
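
For structured output, use `ReportFactory`:

```python
# Import path as used by the CLI scripts; verify it against your installed version
from cdisc_rules_engine.services.reporting import ReportFactory

# `args` is not defined in this guide: it is the validation arguments object
# that the CLI builds (see core.py for how it is populated)
reporting_factory = ReportFactory(
    datasets=datasets,
    validation_results=validation_results,
    elapsed_time=elapsed_time,
    args=args,
    data_service=data_service,
)
reporting_services = reporting_factory.get_report_services()
```

---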

## Notes

**Cache key format**: always use dashes in version strings (`3-4`, not `3.4`).

**`column_prefix_map`**: maps the `--` variable prefix to the dataset domain (e.g. `{"--": "AE"}`), resolving placeholders like `--SEQ` to `AESEQ`.
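To make the placeholder substitution concrete, the sketch below shows the idea in plain Python. It is an illustration only and not the engine's implementation; `resolve_variable_name` is a hypothetical function.

```python
def resolve_variable_name(name, column_prefix_map):
    """Replace a leading placeholder prefix (e.g. "--") with its mapped value.

    Illustration only: names without a mapped prefix are returned unchanged.
    """
    for placeholder, replacement in column_prefix_map.items():
        if name.startswith(placeholder):
            return replacement + name[len(placeholder):]
    return name
```

With `{"--": "AE"}`, the name `--SEQ` resolves to `AESEQ`, while a name such as `USUBJID` is left as-is.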

**External dictionaries**: pass an `ExternalDictionariesContainer` to `RulesEngine` if validating rules that require MedDRA, WHODrug, LOINC, UNII, MedRT, or SNOMED. See the [External Dictionary Reference](https://cdisc-org.github.io/conformance-rules-editor/#/exdictionary).

**Dask**: set `max_dataset_size=0` when initializing `DataServiceFactory` to force Dask processing for all datasets.

**Windows compatibility**: add `freeze_support()` for multiprocessing:

```python
from multiprocessing import freeze_support

if __name__ == "__main__":
    freeze_support()
    main()
```

---

## Troubleshooting

- Ensure the DataFrame contains all required columns for the rules being run
- `column_prefix_map` must correctly map `"--"` to the domain (e.g. `{"--": "AE"}`)
- The dataset object must be a `PandasDataset` instance, not a raw pandas DataFrame
- `full_path` must be set in `SDTMDatasetMetadata` when using the `RulesEngine` approach
- The rule's `domains.Include` must match your dataset's domain
- `standard_version` format must be consistent throughout (`3-4`, not `3.4`)
- CT package metadata must be present in the cache if validating against controlled terminology
- When using `define.xml`, the file must be named `define.xml` and the path must be valid
- If using external dictionaries, verify all file paths are correct and accessible
- Don't forget the `ConditionCompositeFactory` conversion before calling `validate_single_rule` (Option B)
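When no rules appear to fire, a first diagnostic is to check which rules apply to a domain at all. The sketch below mirrors the `domains.Include` filter from Option A but tolerates rules whose `domains` value is missing or `None`. `rules_for_domain` is a hypothetical helper and the sample rules are made up.

```python
def rules_for_domain(rules, domain):
    """Return rules whose domains.Include lists `domain` or "ALL"."""
    selected = []
    for rule in rules:
        # Treat a missing or None "domains"/"Include" entry as an empty list
        include = (rule.get("domains") or {}).get("Include") or []
        if domain in include or "ALL" in include:
            selected.append(rule)
    return selected


# Made-up rule dicts for illustration
sample_rules = [
    {"core_id": "R1", "domains": {"Include": ["AE"]}},
    {"core_id": "R2", "domains": {"Include": ["ALL"]}},
    {"core_id": "R3", "domains": {"Include": ["DM"]}},
    {"core_id": "R4", "domains": None},
]
print([r["core_id"] for r in rules_for_domain(sample_rules, "AE")])
```

An empty result for your domain suggests that the cached rules, not your data, are the place to look.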

docs/README.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
<p align="center">
  <a href="https://www.cdisc.org">
    <img src="./CORE_logo_sm.png" alt="CORE Logo">
  </a>
</p>

[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120)
[![PyPI](https://img.shields.io/pypi/v/cdisc-rules-engine.svg)](https://pypi.org/project/cdisc-rules-engine)
[![Docker](https://img.shields.io/docker/v/cdiscdocker/cdisc-rules-engine?label=docker)](https://hub.docker.com/r/cdiscdocker/cdisc-rules-engine)

# CDISC Rules Engine (CORE)

Open source offering of the CDISC Conformance Rules Engine, a tool for validating clinical trial data against CDISC data standards. CORE validates study data structure and conformance against both published CDISC conformance rules for the various CDISC standards and custom rules authored in the CORE rule format.
