
[ENH] feat: add flag_outliers function for outlier detection #1602

Open
Psycoder0611 wants to merge 2 commits into pyjanitor-devs:dev from Psycoder0611:feature/flag-outliers

@Psycoder0611

Summary

Adds a new flag_outliers() function to pyjanitor that detects and flags
outlier values in a numeric DataFrame column.

Motivation

Data pipelines frequently need to identify anomalous values before
aggregation or modeling. This function provides a clean, chainable
pandas method for outlier flagging during ETL workflows.

Changes

  • Added janitor/functions/flag_outliers.py with full implementation
  • Registered flag_outliers in janitor/functions/__init__.py
  • Added tests/functions/test_flag_outliers.py with 7 passing tests

Features

  • Supports IQR method (default, threshold=1.5)
  • Supports Z-score method (threshold configurable)
  • Chainable as a DataFrame method via pandas-flavor
  • Custom output column naming
  • Does not mutate the original DataFrame
  • Full input validation with meaningful error messages
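
The two detection rules listed above can be sketched in plain pandas as follows. This is not the PR's implementation: the name `flag_outliers_sketch`, its parameter names, and the default output column name `<column>_outlier` are assumptions made for illustration, and the pandas-flavor registration and most input validation are left out.

```python
from __future__ import annotations

import pandas as pd


def flag_outliers_sketch(
    df: pd.DataFrame,
    column: str,
    method: str = "iqr",
    threshold: float = 1.5,
    flag_column: str | None = None,  # hypothetical parameter name
) -> pd.DataFrame:
    """Return a copy of df with a boolean outlier flag column (illustrative only)."""
    values = df[column]
    if method == "iqr":
        # Tukey-style fences: threshold * IQR below Q1 or above Q3.
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        spread = q3 - q1
        flags = (values < q1 - threshold * spread) | (values > q3 + threshold * spread)
    elif method == "zscore":
        # Series.std() uses the sample standard deviation (ddof=1) by default.
        z = (values - values.mean()) / values.std()
        flags = z.abs() > threshold
    else:
        raise ValueError(f"Invalid method: {method!r}")
    out = df.copy()  # leave the caller's DataFrame untouched
    out[flag_column or f"{column}_outlier"] = flags
    return out


df = pd.DataFrame({"x": [1, 2, 3, 4, 100]})
flagged = flag_outliers_sketch(df, "x")  # adds a boolean "x_outlier" column
```

With only five rows, the sample Z-score of the value 100 is about 1.79, so it is flagged at a threshold of 1.5 but not at 3.0.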

Tests

All 7 tests pass covering:

  • IQR outlier detection
  • Z-score outlier detection
  • No outliers case
  • Custom flag column name
  • Immutability of original DataFrame
  • Invalid method error handling
  • Non-numeric column error handling

- Implements flag_outliers() as a pandas DataFrame method
- Supports IQR and Z-score detection methods
- Adds boolean flag column to indicate outlier rows
- Includes comprehensive unit tests (7 tests, all passing)
- Follows existing pyjanitor code style and conventions
@codecov

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 85.65%. Comparing base (901f4b3) to head (a6bac54).
⚠️ Report is 183 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1602      +/-   ##
==========================================
- Coverage   87.56%   85.65%   -1.92%     
==========================================
  Files          95      126      +31     
  Lines        6819     9932    +3113     
==========================================
+ Hits         5971     8507    +2536     
- Misses        848     1425     +577     

@ericmjl
Member

ericmjl commented May 30, 2026

Hey @Psycoder0611, thanks for this contribution! Adding outlier detection to pyjanitor is a great fit for the library. The overall structure follows our conventions well — chainable API, pandas_flavor registration, immutability, good test coverage. Nice work.

I have a few suggestions organized by priority.

Must fix

  1. Use the existing check_column utility instead of rolling your own. The _check_column helper in flag_outliers.py is a duplicate of janitor.utils.check_column — same logic, same error messages. flag_nulls.py already imports it with from janitor.utils import check_column. Do the same here and delete the private helper.

  2. The shared threshold default is a footgun for Z-score users. The default threshold=1.5 makes sense for IQR but is misleading for Z-score, where 3.0 is standard. A user calling df.flag_outliers("col", method="zscore") gets a much more aggressive filter than they'd expect: on roughly normal data, |z| > 1.5 flags about 13% of rows, versus about 0.3% at 3.0. Consider either: (a) making the default method-specific, or (b) emitting a warning when method="zscore" and threshold wasn't explicitly provided.

  3. Missing from __future__ import annotations. Other function modules in pyjanitor include this at the top. Add it for consistency.
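
Both options in item 2 depend on knowing whether the caller passed a threshold at all, which a `None` sentinel makes possible. The sketch below shows one way each option might look. The helper names and the `DEFAULT_THRESHOLDS` mapping are hypothetical and do not come from the PR or from pyjanitor.

```python
from __future__ import annotations

import warnings

# Conventional cutoffs: 1.5 x IQR and 3 standard deviations.
DEFAULT_THRESHOLDS = {"iqr": 1.5, "zscore": 3.0}


def resolve_threshold_per_method(method: str, threshold: float | None = None) -> float:
    """Option (a): None means 'use the conventional default for this method'."""
    if method not in DEFAULT_THRESHOLDS:
        raise ValueError(f"Invalid method: {method!r}")
    return DEFAULT_THRESHOLDS[method] if threshold is None else threshold


def resolve_threshold_with_warning(method: str, threshold: float | None = None) -> float:
    """Option (b): keep 1.5 as the shared default, but warn Z-score users."""
    if threshold is not None:
        return threshold
    if method == "zscore":
        warnings.warn(
            "threshold was not provided; the default 1.5 is designed for IQR, "
            "and 3.0 is typical for Z-score.",
            UserWarning,
            stacklevel=2,
        )
    return 1.5
```

Option (a) changes the public signature (the default becomes `None`), while option (b) keeps the numeric result identical to the current behavior and only adds a warning.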

Should fix

  1. Tests should use match= in pytest.raises. Bare pytest.raises(ValueError) only checks that some ValueError was raised — it could be the wrong one. Use pytest.raises(ValueError, match="Invalid method") etc. to verify the correct error path.

  2. No test for NaN handling. IQR quantiles and Z-score mean/std have specific behavior with NaN values. This should be tested and the behavior documented in the docstring.

  3. test_non_method_functional is a misleading name. It tests calling flag_outliers(df, ...) as a standalone function. Something like test_standalone_function_call would be clearer.

  4. Missing Z-score example in the docstring. Only the IQR method is demonstrated. Adding a Z-score example would help users.
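
Items 1 and 2 above might look roughly like this in a test module. The function `iqr_flags` is a hypothetical stand-in so that the example runs on its own; the real tests would call `flag_outliers`. The "Invalid method" message text is taken from the `match=` example in item 1. Leaving NaN rows unflagged is simply what plain pandas comparisons produce here, not necessarily the behavior the PR should adopt.

```python
import pandas as pd
import pytest


def iqr_flags(values: pd.Series, method: str = "iqr", threshold: float = 1.5) -> pd.Series:
    """Hypothetical stand-in for the function under test."""
    if method != "iqr":
        raise ValueError(f"Invalid method: {method!r}")
    # Series.quantile() ignores NaN when computing Q1 and Q3.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    spread = q3 - q1
    # Comparisons against NaN evaluate to False, so NaN rows are never flagged.
    return (values < q1 - threshold * spread) | (values > q3 + threshold * spread)


def test_invalid_method_message():
    # match= is checked with re.search, so a distinctive substring is enough.
    with pytest.raises(ValueError, match="Invalid method"):
        iqr_flags(pd.Series([1.0, 2.0]), method="median")


def test_nan_rows_are_not_flagged():
    values = pd.Series([1.0, 2.0, float("nan"), 3.0, 4.0, 100.0])
    flags = iqr_flags(values)
    assert flags.tolist() == [False, False, False, False, False, True]
```

Whichever NaN behavior the function settles on (flag as False, flag as missing, or raise), pinning it in a test like the second one keeps the docstring and the implementation in agreement.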

Using AI to address these

If you use an AI coding agent, here's a prompt you can copy-paste to work through these review comments:

I need to address code review feedback on my flag_outliers PR in pyjanitor. Please make the following changes to janitor/functions/flag_outliers.py and tests/functions/test_flag_outliers.py:

  1. Remove the _check_column helper function and replace its usage with from janitor.utils import check_column. Follow the pattern in janitor/functions/flag_nulls.py.
  2. Add from __future__ import annotations at the top of flag_outliers.py.
  3. When method="zscore" and the user did not explicitly pass a threshold, emit a UserWarning (via warnings.warn, not raise) suggesting that the default 1.5 is designed for IQR and that 3.0 is typical for Z-score.
  4. In all tests that use pytest.raises, add the match= parameter to verify the correct error message is raised.
  5. Add a test for NaN handling — when the column contains NaN values, the function should still produce correct boolean flags and not crash.
  6. Rename test_non_method_functional to test_standalone_function_call.
  7. Add a Z-score usage example to the docstring of flag_outliers.

Read janitor/functions/flag_nulls.py and janitor/utils.py first to match existing codebase patterns.

Looking forward to the next revision!
