fix(chart): word cloud secondary sort prevents Druid TopN optimization when sort_by_metric enabled#39073
Conversation
Code Review Agent Run #e84fc6Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
|
Hey @bdonovan1, Thank you so much for the complete issue description with steps to repro + a fix with test coverage! I just approved CI here, but note you need a rebase. This makes sense to me, but I'm also tagging @michael-s-molina, @villebro, @rusackas and @betodealmeida to see if they have any concerns. Technically this could change existing charts for users that have a row limit smaller than the amount of data (and multiple dimensions with the same metric result), but I think it's ok. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #39073 +/- ##
=======================================
Coverage 64.47% 64.47%
=======================================
Files 2558 2558
Lines 133241 133241
Branches 30955 30955
=======================================
+ Hits 85901 85902 +1
+ Misses 45849 45848 -1
Partials 1491 1491
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
131beef to
573c520
Compare
Code Review Agent Run #42a88fActionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
alexandrusoare
left a comment
There was a problem hiding this comment.
This LGTM but waiting on the other reviewers mentioned to see if they have any concerns
| orderby.push([metric, false]); | ||
| } | ||
| if (series) { | ||
| } else if (series) { |
There was a problem hiding this comment.
Simple change, but I don't know the side effects of this change.
There was a problem hiding this comment.
Possible side-effects:
Unstable ordering among ties (same metric value) If multiple “series” values have the exact same metric value, then without the secondary series ASC tiebreaker, the ordering between those tied rows becomes database/engine dependent (and may vary between runs).
LIMIT / row limit can return different members of a “tied” group If the query has a row limit (Top-N) and there are ties at the cutoff boundary, the previous secondary sort gave a consistent way to decide which tied rows make it into the result set. Without it, you may get different tied rows.
I just want to make sure we don't get another issue about ties not being sorted as desired.
There was a problem hiding this comment.
yeah that's what I thought too, so I think it's important to get thoughts from others. The fact is with the additional sorting, while it was consistent, users had no control over it I think? Like unsure if you can choose to be asc, desc, etc.
Do we need to look into make this configurable? One, the other, or both?
There was a problem hiding this comment.
Hi @rusackas @alexandrusoare and @Vitor-Avila , thanks for taking a look at this! It sounds like there are concerns about consistently handling tie-breakers when two words have the same popularity, which is valid. I am fine with making this configurable with a second checkbox if necessary. Just one thing I would like to understand:
What is the use case for sorting only by series? Based on my own observations, this will return the top N unique words alphabetically, regardless of popularity in the dataset. That is the current default behavior, but it isn't really a "word cloud" in the traditional sense. In high cardinality datasets, this returns long-tail results that are a bit arbitrary and nonsensical. E.g. in our use case, we are generating a word cloud of most popular terms that end users typed into a particular search box. Sort-only-by-series gives us the few random users who typed something like !!, !., !.., etc (all with a count of 1), rather than actual popular search terms.
I almost feel like it should ALWAYS sort by metric, and sort by series should be optional for handling tie-breakers when two words have the exact same count. But I am happy to go either route:
- Add a second checkbox, so you can independently decide to sort by metric and/or series.
- Keep one checkbox, but change the meaning / description. Default behavior is to only sort by metric (DESC). But if you check the box, it will sort by metric (DESC), series (ASC).
Of those two choices (1) is more backwards-compatible, but (2) is more intuitive. Please advise on which choice we should make.
There was a problem hiding this comment.
I totally agree with you, not sure what's the purpose of sorting only by dimension, but can see users wanting to either doing only metric, or metric + dimension.
@yousoph @kasiazjc some product input would be greatly helpful here :) What are your thoughts on a sort control for Word Cloud? Currently, it always sort by dimension, but there's a checkbox to also sort by metric. The purpose of this PR is to be able to sort only for metric (as sorting by both is inefficient for some DBs) while keeping backwards compatibility. Some options:
- Keep the "Sort by Metric" checkbox in there, add a new "Sort by Series" checkbox that's enabled by default (for backwards compatibility).
- Have a dropdown selector to choose between sorting options (Similar to the X-Axis sort, or the un-aggregated table "Ordering" dropdown).
THank you!
There was a problem hiding this comment.
I think option 1 makes sense to me (keep "Sort by "Metric", add "Sort by Series")
But one change from what you said Vitor - I think it's enabled by default for existing charts but ideally we could not check it for new Word Cloud charts going forward
(The "Ordering" dropdown in tables is multi select with both asc and desc options that you can select both on, that seems like it could be confusing here... and separately may be worth looking into on those charts, but that's a different problem :) )
There was a problem hiding this comment.
thanks, @yousoph!
@bdonovan1 do you think it would be possible to add a second checkbox, and have "Sort by Series" enabled for existing charts, but not checked for new charts? I think that might need to require a DB migration.
Thank you!
There was a problem hiding this comment.
Thanks for the discussion and quick consensus everyone!
I have updated the approach to add a separate, independent checkbox for "sort by series". This allows users to consistently handle tiebreakers (with a performance cost on some DBs like Druid). The default for this new checkbox is False - new charts will need to explicitly check the box if they want alphabetic tiebreakers.
However, as suggested by @Vitor-Avila , I also added a DB migration so existing charts will not be affected. If you apply the DB upgrade, it will automatically set this value to True for existing word cloud charts. So, existing users will still get the alphabetic tiebreakers without having to modify their charts.
There was a problem hiding this comment.
Pull request overview
This PR fixes Word Cloud query generation so that enabling sort_by_metric no longer adds a secondary series sort, which can prevent Apache Druid from using its native TopN optimization and cause severe performance regressions on high-cardinality dimensions.
Changes:
- Make
seriesordering mutually exclusive withsort_by_metricin the Word CloudbuildQueryimplementation. - Add unit tests to cover
orderbybehavior for bothsort_by_metric=trueandsort_by_metric=false.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
superset-frontend/plugins/plugin-chart-word-cloud/src/plugin/buildQuery.ts |
Updates orderby construction to avoid multi-column ordering when metric sorting is enabled. |
superset-frontend/plugins/plugin-chart-word-cloud/test/buildQuery.test.ts |
Adds test coverage validating the resulting orderby output in both sorting modes. |
|
The change modifies the ordering logic in buildQuery.ts to make it mutually exclusive: if sort_by_metric is true, it orders by the metric descending; otherwise, if series is present, it orders by series ascending. Previously, both could apply simultaneously. Side effects include altered query results when both conditions are met—now only the primary sort (metric) takes precedence, potentially changing data display order in word clouds if users expect secondary sorting by series. superset-frontend/plugins/plugin-chart-word-cloud/src/plugin/buildQuery.ts |
573c520 to
427d94b
Compare
Code Review Agent Run #2b28a2Actionable Suggestions - 0Additional Suggestions - 1
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
| params = json.loads(slc.params or "{}") | ||
| slc.params = json.dumps(upgrade_params(params)) |
There was a problem hiding this comment.
Suggestion: The migration upgrade path assumes every params value is valid JSON object data. If any word_cloud slice has malformed JSON or a non-object JSON payload, json.loads/dictionary mutation will raise and abort the migration for the whole release. Guard parsing and skip invalid rows so one bad record does not break migration execution. [possible bug]
Severity Level: Critical 🚨
- ❌ Database upgrade can fail on single corrupt word_cloud slice.
- ⚠️ Prevents deploying code changes relying on this migration.
- ⚠️ Inconsistent with other migrations that guard slice JSON parsing.| params = json.loads(slc.params or "{}") | |
| slc.params = json.dumps(upgrade_params(params)) | |
| try: | |
| params = json.loads(slc.params or "{}") | |
| if not isinstance(params, dict): | |
| continue | |
| slc.params = json.dumps(upgrade_params(params)) | |
| except (json.JSONDecodeError, TypeError): | |
| continue |
Steps of Reproduction ✅
1. Start Superset against a database where at least one `word_cloud` chart row in
`slices.params` is malformed JSON or a non-object value (for example, `params='\"just a
string\"'` or `params='not-json'`) in the `slices` table; these rows are what `Slice`
models in
`superset/migrations/versions/2026-04-13_19-28_fd0c8583b46d_add_sort_by_series_to_word_cloud_charts.py:40-45`
read.
2. Run the database upgrade (e.g. `superset db upgrade` / Alembic `upgrade head`), which
executes the `upgrade()` function at
`superset/migrations/versions/2026-04-13_19-28_fd0c8583b46d_add_sort_by_series_to_word_cloud_charts.py:58-67`.
3. Inside `upgrade()`, the loop over
`paginated_update(session.query(Slice).filter(Slice.viz_type == "word_cloud"))` at lines
62-64 yields the malformed slice, and line 65 executes `params = json.loads(slc.params or
"{}")` without any `try/except`.
4. For a non-JSON value this raises `json.JSONDecodeError`, or for a non-dict JSON (e.g.
string/array) `upgrade_params(params)` at lines 47-50/66 will hit
`params["sort_by_series"] = True` and raise `TypeError` ("list indices must be integers or
slices, not str"), propagating out of the loop and causing the whole migration (and thus
the overall upgrade command) to fail instead of skipping just the bad slice; this risk is
reinforced by older migrations like
`2017-01-24_12-31_db0c65b146bd_update_slice_model_json.py:6-13` and
`2018-04-10_11-19_bf706ae5eb46_cal_heatmap_metric_to_metrics.py:5-14`, which already wrap
similar `json.loads(slc.params or "{}")` calls in `try/except` to tolerate bad slice data.Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** superset/migrations/versions/2026-04-13_19-28_fd0c8583b46d_add_sort_by_series_to_word_cloud_charts.py
**Line:** 65:66
**Comment:**
*Possible Bug: The migration upgrade path assumes every `params` value is valid JSON object data. If any `word_cloud` slice has malformed JSON or a non-object JSON payload, `json.loads`/dictionary mutation will raise and abort the migration for the whole release. Guard parsing and skip invalid rows so one bad record does not break migration execution.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.There was a problem hiding this comment.
@betodealmeida do you know if we need this? I imagine we have schema validation to prevent malformed charts from being created but not 100% sure
There was a problem hiding this comment.
I don't think we need it, but it wouldn't hurt. What would be better is to define these columns as JSON, like we've been doing for newer models, since then you get the validation for free.
For this PR I think it's fine, it's very unlikely that this would be needed.
| params = json.loads(slc.params or "{}") | ||
| slc.params = json.dumps(downgrade_params(params)) |
There was a problem hiding this comment.
Suggestion: The downgrade path has the same unguarded JSON parsing/object assumption, so a single invalid params value can raise and stop downgrade entirely. Add safe parsing with type checks to keep downgrade resilient across corrupted rows. [possible bug]
Severity Level: Critical 🚨
- ❌ Database downgrade can be blocked by one bad word_cloud slice.
- ⚠️ Makes rollbacks brittle in presence of corrupted chart params.
- ⚠️ Diverges from established resilient patterns in similar migrations.| params = json.loads(slc.params or "{}") | |
| slc.params = json.dumps(downgrade_params(params)) | |
| try: | |
| params = json.loads(slc.params or "{}") | |
| if not isinstance(params, dict): | |
| continue | |
| slc.params = json.dumps(downgrade_params(params)) | |
| except (json.JSONDecodeError, TypeError): | |
| continue |
Steps of Reproduction ✅
1. Use a database where at least one `word_cloud` slice row has malformed or non-object
JSON in `slices.params` (same concrete shape as in the upgrade case, e.g.
`params='not-json'` or `params='\"string\"'`) referenced by the `Slice` model at
`superset/migrations/versions/2026-04-13_19-28_fd0c8583b46d_add_sort_by_series_to_word_cloud_charts.py:40-45`.
2. Run a database downgrade that crosses this revision (e.g. `superset db downgrade` /
Alembic `downgrade`), which calls the `downgrade()` function at
`superset/migrations/versions/2026-04-13_19-28_fd0c8583b46d_add_sort_by_series_to_word_cloud_charts.py:70-79`.
3. In `downgrade()`, the loop over
`paginated_update(session.query(Slice).filter(Slice.viz_type == "word_cloud"))` at lines
74-76 yields the malformed slice, and line 77 runs `params = json.loads(slc.params or
"{}")` with no error handling.
4. A malformed `params` value triggers `json.JSONDecodeError` at line 77, or a non-dict
JSON causes `downgrade_params(params)` at lines 53-55/78 to call
`params.pop("sort_by_series", None)` and raise `AttributeError` for non-mapping types,
bubbling up and aborting the entire downgrade operation; other slice-migration files like
`2024-03-01_10-47_be1b217cd8cd_big_number_kpi_single_metric.py:14-27,37-50` and
`2025-04-13_22-10_378cecfdba9f_merge_x_axis_sort_series_with_x_axis_.py:7-27,37-55`
demonstrate that migrations touching `Slice.params` typically wrap
`json.loads(slc.params)` in `try/except` specifically to avoid this class of failure.Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** superset/migrations/versions/2026-04-13_19-28_fd0c8583b46d_add_sort_by_series_to_word_cloud_charts.py
**Line:** 77:78
**Comment:**
*Possible Bug: The downgrade path has the same unguarded JSON parsing/object assumption, so a single invalid `params` value can raise and stop downgrade entirely. Add safe parsing with type checks to keep downgrade resilient across corrupted rows.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.There was a problem hiding this comment.
the two comments are related: addressing one covers the other
| 'Include a secondary sort by series name (ascending). ' | ||
| + 'This produces consistent ordering when multiple series have the ' | ||
| + 'same metric value, but may reduce query performance on some databases.', |
There was a problem hiding this comment.
Here we're calling it a "secondary" sorting rule, but the controls actually allow independent selection, right? So someone would be able to enable just it, making it "primary"?
There was a problem hiding this comment.
Good point, maybe we can remove "secondary" from this wording.
|
@bdonovan1 this LGTM! Left a minor comment and approved CI. Will reach out to a codeowner too since it's needed here |
Vitor-Avila
left a comment
There was a problem hiding this comment.
Left a minor suggestion on display text. Otherwise LGTM
|
@bdonovan1 thanks for dialing this in! Ready to click the merge button, if you're willing to fix |
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Thanks everyone! I addressed the two comments in the review, and I believe I also fixed the code style, which was getting caught by Prettier and failing the pre-commit check. If all checks pass, then I think we are ready to merge! |
Code Review Agent Run #0ffd47Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
…n when sort_by_metric enabled When sort_by_metric is true, buildQuery unconditionally appended a secondary ORDER BY series ASC alongside the metric sort. On Druid, any multi-column ORDER BY prevents the native TopN query optimization, forcing a full GroupBy scan that can cause dramatic slowdowns and timeouts on high-cardinality dimensions. Make the series sort mutually exclusive with sort_by_metric, and add test coverage for both sort behaviors. Fixes apache#39072
Make sort_by_metric and sort_by_series independent controls instead of the previous behavior where series sort was unconditionally appended alongside the metric sort. This allows users to choose metric-only sorting (enabling Druid TopN optimization) while optionally adding a secondary series sort for tie-breaking. Changes: - buildQuery.ts: sort_by_metric and sort_by_series are now independent - controlPanel.tsx: add sort_by_series checkbox (default: false) - buildQuery.test.ts: cover all four sort combinations - DB migration: set sort_by_series=true for existing word_cloud charts to preserve backward compatibility Fixes apache#39072
Remove "secondary" from the description since sort_by_series and sort_by_metric are independent controls — either can be used alone. Clarify that the tiebreaker behavior only applies when both are enabled. Fixes prettier formatting (+ at end of line, not beginning).
Guard against malformed or non-dict params in word_cloud slices during upgrade/downgrade. Skips invalid rows instead of aborting the entire migration.
864a07d to
2e1f747
Compare

SUMMARY
When
sort_by_metricis enabled on a Word Cloud chart,buildQuery.tsunconditionally appended a secondaryORDER BY [series, ASC]alongside the metric sort. On Apache Druid, any multi-columnORDER BYprevents the native TopN query optimization, forcing a full GroupBy scan over the entire dataset before applyingLIMIT. On high-cardinality dimensions this can cause dramatic query slowdowns and timeouts.Root cause — in
buildQuery.ts, the series sort was always appended whenserieswas set, regardless ofsort_by_metric:Fix — add an independent
sort_by_seriescontrol so users can choose whether to include the secondary dimension sort:This gives users full control over sorting behavior:
Changes:
buildQuery.ts:sort_by_metricandsort_by_seriesare now independent controlscontrolPanel.tsx: new "Sort by series" checkbox (default:falsefor new charts)buildQuery.test.ts: test coverage for all four sort combinationssort_by_series=truefor all existingword_cloudcharts to preserve backward compatibilityFixes #39072
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
Not applicable — no visual change to the word cloud rendering. The new "Sort by series" checkbox appears in the Query section of the word cloud control panel, below "Sort by metric".
TESTING INSTRUCTIONS
ORDER BY <metric> DESCORDER BY <metric> DESC, <series> ASCUnit tests:
Migration test:
ADDITIONAL INFORMATION
Runtime estimate: Migration processes only
word_cloudcharts (typically a small number). Tested against 6 charts in < 1 second. No downtime required.