Relax dependency constraints#1713
Conversation
14ffff8 to
a646ccf
Compare
|
@filippsatverily nice work, this looks good-- can you please resolve the merge conflicts so we can get this merged. |
Moves dependency constraints to pyproject.toml. Makes requirements.txt a lockfile.
Fixes an incompatibility caused by click 8.3.0, which passes the default value as-is.
Fixes an incompatibility caused by pyreadstat 1.2.9, which changed original_variable_type from 'NULL' to None
Works around an behavior change in jsonpath-ng 1.8.0 where Child.str gets wrapped in parenthesis.
Fixes tokenization errors when using dask 2024.8.1+. Starting with this version, dask enforces that tokens remain stable across pickle round-trips (dask/dask#11320). Capturing self in a lambda fails this check because instance objects can have non-deterministic pickle representations. Since calculate_variable_value_length is already a static method, replacing self with the class name is enough to remove the capture.
Fixes an import error caused by dask 2024.12.1, which removed the legacy dask.dataframe.dd submodule (dask/dask#11604). Changes the import to `import dask.dataframe as dd`, consistent with every other file in the codebase.
Dask 2025.4.0 optimizes multiple DataFrames together, which exposes division mismatches when assigning a pandas Series to a dask DataFrame column. The old reset_index/set_index workaround no longer avoids this. Replacing it with compute-assign-rewrap via dd.from_pandas, which builds a clean expression graph. This is safe because __getitem__ already computes the DataFrame to produce the Series being assigned.
Fixes a unit test to support pandas 2.2.0+. The pandas release fixes a sorting bug with pandas-dev/pandas#54611. This commit changes the expected results accordingly. Also fixes a merge type mismatch introduced by upstream cdisc-org#1709: the codelist metadata side was cast to StringDtype but the evaluation dataset side was not. With pandas 2.2.0, empty columns infer as float64, and merging float64 with string is rejected. Casting both sides to string before the merge resolves this.
43a93e8 to
4e9f24e
Compare
@SFJohnson24 Please see updated PR. I added a couple small changes to accommodate the code churn over the last 2 months. |
There was a problem hiding this comment.
This PR correctly updates dependencies and and resolves several other issues
Edit: I reverted this PR after merging & running our test suite with the changes. It appears to have broken several USDM as well as two rules in Dask. I am still sorting the details for both and will report back as I parse the details.
-
With Dask, this appears to be related to CORE-000334 and CORE-000355 both reporting
Cannot mask with non-boolean array containing NA / NaN valuesin get_error_rows() which I suspect is a downstream result of the Dask indexing changes.
pdf[key] = value.valuesmay be the issue as .values is stripping the index from the series, making it a raw numpy array. This is then assigned positionally, giving a length mismatch causing downstream NaN indexing. I think adding
'pdf[key] = value.reindex(pdf.index)' will resolve this. I made a small push (meant to push to my own branch that fetched your fork but accidentally pushed to your fork, apologies for this) and made another pull request from that to rerun the test suite: #1755 -
there is a slight linter issue
-
1 unit test is failing: test_dataset_metadata_define_dataset_builder as the result needs the location sort that was added to expected.
-
can you update the docs as
pip install -e .; pip install --group devis the command for windows powershell, your command is for mac/linux; we would like both listed in the docs -
as for the USDM issues that arose: https://github.com/cdisc-org/cdisc-rules-engine/actions/runs/27025025445?pr=1755 you can scroll to the bottom and download the differences.
you can parse the logs-- DDF00175/CORE-00085 in the report has attributes: "", empty as well as
I suspect the changes to __get_full_path are producing subtly different content_path strings that _traverse_path in the USDM data service
|
@filippsatverily i tweaked your branch on my own to get immediate feedback from our test suite. #1757 |
Summary: expand dependency constraints and turn requirements.txt from a requirements file into a lockfile. This will allow the PyPi version of CORE to be used as a library in a larger project where other dependency constraints exist.
List of changes:
pyproject.tomldefaults that clash withrequiredto support a behavior change/fix in click 8.3.0DatasetXPTMetadataReader.readto support a behavior change in pyreadstat 1.2.9USDMDataService.__get_full_pathto support a behavior change in jsonpath-ng 1.8.0selfcapture inContentsDefineVLMDatasetBuilderto support dask 2024.8.1__setitem__reindexing inDaskDatasetto fix errors surfaced in dask 2025.4.0test_dataset_metadata_define_dataset_builderto support a sort behavior fix in pandas 2.2.0Tested scenarios: