Skip to content

Fix library agnosticism in random locations#679

Closed
tylerriccio33 wants to merge 7 commits into
posit-dev:mainfrom
tylerriccio33:lib-agnostic
Closed

Fix library agnosticism in random locations#679
tylerriccio33 wants to merge 7 commits into
posit-dev:mainfrom
tylerriccio33:lib-agnostic

Conversation

@tylerriccio33
Copy link
Copy Markdown
Contributor

This is meant to fix #675 in GT and posit-dev/pointblank#132 in pointblank.

I'm brute forcing it since I see no other way. I'm replacing the single dispatch paradigm (in effected places) with an attempt to import the library in question.

FYI I'm 99% sure this would be solved by Narwhals if you wanted to go that route down the line.

@tylerriccio33
Copy link
Copy Markdown
Contributor Author

Actually, can I get another set of eyes on this issue? The scope all of the sudden seems massive. I thought it was contained to a few accidental instance checks on single dispatch methods, but it appears to be instance checking a TON of functions.

If I'm scoping this correctly, I think there are 3 options:

  1. Doing a naive check like 'pandas' in str(type(df)): ...
  2. Doing a try-except import check like try: import('pandas') ...
  3. Adopting narwhals

I'm currently unaware of a way to do an instance check on a class you do not have available in your process. Happy to help out w/any of these, I would be aided by this being solved.

I'd be even happier if you were to tell me I'm wrong and the problem is far more contained :)

@machow
Copy link
Copy Markdown
Collaborator

machow commented May 6, 2025

Hey, thanks for taking a look at this, and raising the original issue. If you have the chance, do you mind taking a look at this explanation, and letting me know if it does what you need?

The tl;dr is that you should be able to pass a polars series to the val functions, but we should better document this 😅

#675 (comment)

edit: in case it's useful, the approach we're using in Great Tables comes from https://github.com/machow/databackend, which has some explanation around how we're able to use singledispatch without requiring pandas/polars imports

@jrycw
Copy link
Copy Markdown
Collaborator

jrycw commented May 7, 2025

@tylerriccio33, You might find our Bring Your Own DataFrame blog post helpful—feel free to check it out.

@tylerriccio33
Copy link
Copy Markdown
Contributor Author

tylerriccio33 commented May 19, 2025

I'm not the expert in how single dispatch decorator works in python, but it appears to be running an instance check when to_frame is called, which I don't think is intended. Even when i add code to convert the the list to a polars series, then pass it to to_frame, it does an instance check and fails since pandas isn't available.

I feel I may be not understanding a key concept here, and therefore can't fix this. I'd really like to fix this bug, I'm running into some compatibility issues at work; it's sometimes difficult to load prod containers w/pandas due to other compatibility concerns (arrow/numpy). Let me know if i'm understanding this correctly.

Pushed a new pr to show the test I'm using, which may also be the problem?

@jrycw
Copy link
Copy Markdown
Collaborator

jrycw commented May 24, 2025

I came across a PyCon 2025 talk on YouTube that briefly yet clearly outlines four different ways to choose actions based on instance types.

@tylerriccio33
Copy link
Copy Markdown
Contributor Author

The error is happening before the function code is called, it's happening during the evaluation process in singledispatch. In AbstractBackend __subclasshook__ method, I pass a series but the only cls._backend is a Pandas series, so it cannot import the module. This feels very specific to the data backend implementation so I'm having a hard time debugging where the backend class attribute is set...

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2025

Codecov Report

Attention: Patch coverage is 81.81818% with 4 lines in your changes missing coverage. Please review.

Project coverage is 91.05%. Comparing base (4a11db5) to head (eff1363).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
great_tables/_databackend.py 86.66% 2 Missing ⚠️
great_tables/_formats_vals.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #679      +/-   ##
==========================================
- Coverage   91.18%   91.05%   -0.13%     
==========================================
  Files          47       47              
  Lines        5465     5480      +15     
==========================================
+ Hits         4983     4990       +7     
- Misses        482      490       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tylerriccio33 tylerriccio33 marked this pull request as ready for review June 1, 2025 18:12
@tylerriccio33
Copy link
Copy Markdown
Contributor Author

tylerriccio33 commented Jun 1, 2025

This pr is imperfect but working. I found it wasn't working because Polar's Series object isn't defined or available at the top level of the module. Instead, it's down a few levels. To get around this I added the full path (series.series.Series) to the backend.

class PlSeries(AbstractBackend): _backends: list[tuple[str, str]] = [("polars", "series.series.Series")]

I then added additional logic to allow for a class name not defined at the top level:

`class AbstractBackend(metaclass=_AbstractBackendMeta):
@classmethod
def init_subclass(cls):
if not hasattr(cls, "_backends"):
cls._backends = []

@classmethod
def __subclasshook__(cls, subclass):
    for mod_name, cls_name in cls._backends:
        try:
            parent_candidate = _load_class(mod_name, cls_name)
            if issubclass(subclass, parent_candidate):
                return True
        except (ModuleNotFoundError, ImportError):
            continue

    return NotImplemented`

And the new load_class function which is the workhorse:
`def _load_class(mod_name: str, cls_name: str) -> type:
# follow the path of mods if it is one
listified_mod_path: list[str] = cls_name.split(".")

sub_module_path: list[str] = [mod_name] + listified_mod_path[:-1]

cur_mod: ModuleType = importlib.import_module(sub_module_path[0])
_cls_name: str = listified_mod_path[-1]

if len(sub_module_path) == 1:  # no submodules, just the module itself
    return getattr(cur_mod, _cls_name)

# now, we must increment the module paths to import the final class
remaining_mods: list[str] = sub_module_path[1:]
for mod in remaining_mods:
    cur_mod: ModuleType = getattr(cur_mod, mod)

return getattr(cur_mod, _cls_name)`

Again, not the most elegant but it appears to be effective. I've struggled to wrap my head around some aspects of how the backend works, so it's possible I'm missing finer points.

Also, strictly for development purposes, I redefined some dev deps in the pyproject.toml file for uv to find, and added uv.lock to the gitignore. Feel free to reject that, just helped me code faster!

@machow
Copy link
Copy Markdown
Collaborator

machow commented Jun 2, 2025

I've added a test in a PR to confirm that pandas does not get imported when the vals functions run. More context in the original issue. Do you mind taking a look, and letting me know if more tests / other testing behaviors would be useful?

#675 (comment)

@machow
Copy link
Copy Markdown
Collaborator

machow commented Jun 5, 2025

I'm going to close (see comment in #691) -- but happy to tee up anything useful in an issue

@machow machow closed this Jun 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pandas is a ghost requirement?

3 participants