Check sqllogictests for any dangling config settings(#17914)#20474
Check sqllogictests for any dangling config settings(#17914)#20474cj-zhukov wants to merge 3 commits intoapache:mainfrom
Conversation
High-Level OverviewThis PR introduces a validation script to prevent dangling configuration settings in sqllogictest ( This PR Adds:
Current Limitations:
Follow-Up Improvements:
|
|
@Jefffrey Since you originally opened this issue, could you please take a look at this PR? |
| # Any configuration set with: | ||
| # set datafusion.<config> = true | ||
| # must be reset in the same file with: | ||
| # set datafusion.<config> = false |
There was a problem hiding this comment.
What if the setting's default is true and an .slt test needed to set it to false but forgot to unset it ?
It would be better to use the reset datafusion.***; command instead
There was a problem hiding this comment.
Good point - using RESET is semantically more correct since it restores the default value rather than assuming it is false. I'm going to update the script to accept RESET datafusion.<config> as a valid restoration mechanism.
ci/scripts/check_slt_configs.sh
Outdated
| return | ||
| fi | ||
|
|
||
| echo "" |
There was a problem hiding this comment.
This looks strange.
Why not log the error from line 74 here and exit ?
There was a problem hiding this comment.
I agree with you - let's do it
ci/scripts/check_slt_configs.sh
Outdated
| # Process each match line-by-line | ||
| while IFS= read -r match; do | ||
| line_number=$(echo "$match" | cut -d: -f1) | ||
| line_content=$(echo "$match" | cut -d: -f2-) |
There was a problem hiding this comment.
nit: You could avoid starting a new process by using Bash-isms like:
line_number=${match%%:*}
line_content=${match#*:}
| set datafusion.execution.collect_statistics = false; | ||
|
|
||
| statement ok | ||
| set datafusion.execution.parquet.pushdown_filters = false; |
There was a problem hiding this comment.
Thank you for catching this - I missed that duplicate configuration in the push_down_filter.slt
| datafusion.runtime.temp_directory | ||
|
|
||
| statement ok | ||
| set datafusion.catalog.information_schema = false |
There was a problem hiding this comment.
| set datafusion.catalog.information_schema = false | |
| set datafusion.catalog.information_schema = false; |
There was a problem hiding this comment.
I noticed that other SET statements in this file do not use semicolons.
Should we standardize on including ;, or keep the existing style for consistency?
.github/workflows/rust.yml
Outdated
| needs: linux-build-lib | ||
| runs-on: ubuntu-latest | ||
| container: | ||
| image: amd64/rust |
There was a problem hiding this comment.
Any reason to use a container ?
The CI job just runs a Bash shell script
There was a problem hiding this comment.
That's correct - no reason for a container. I'll run it directly on ubuntu-latest instead of using the Rust container so it runs earlier and fails fast.
| fi | ||
|
|
||
| matches=$(grep -En \ | ||
| 'set[[:space:]]+datafusion\.[a-zA-Z0-9_.]+[[:space:]]*=[[:space:]]*true' \ |
There was a problem hiding this comment.
set could also be SET
See https://github.com/cj-zhukov/datafusion/blob/fe921dca542e84a48bc02e3c409ce7ac45686283/datafusion/sqllogictest/test_files/set_variable.slt#L23 for such example
There was a problem hiding this comment.
Good catch - I missed that SET can appear in uppercase.
I’ll update the script to perform case-insensitive matching so both set and SET are handled correctly.
| drop table t3 | ||
|
|
||
| statement ok | ||
| set datafusion.catalog.default_catalog = my_catalog; |
There was a problem hiding this comment.
The catalog and the schema below are never reset.
This is an example of non-boolean settings.
There was a problem hiding this comment.
Thank you for pointing this out. You’re right - this is currently a limitation. The script only checks boolean settings enabled via = true and does not detect non-boolean configuration changes.
At this stage, I’d prefer to keep the script simple rather than significantly increasing its complexity in Bash. If we decide to expand the scope to cover all SET datafusion.* mutations (including non-boolean values), I think it would be better to address that in a follow-up by implementing a more robust solution in Rust (e.g., using nom). That would allow us to properly track configuration changes and ensure they are restored in a more reliable way. I'm ready to work on this.
Happy to hear your thoughts on whether we should keep this PR focused or broaden its scope.
There was a problem hiding this comment.
Agreed! Keep it simple for the first version!
Maybe add a comment/TODO to the new Bash script about this use case.
ci/scripts/check_slt_configs.sh
Outdated
|
|
||
| # Check if reset exists anywhere in file | ||
| if ! grep -Eq \ | ||
| "set[[:space:]]+$config[[:space:]]*=[[:space:]]*false" \ |
There was a problem hiding this comment.
| "set[[:space:]]+$config[[:space:]]*=[[:space:]]*false" \ | |
| "set[[:space:]]+${config}[[:space:]]*=[[:space:]]*false" \ |
| DROP TABLE t0; | ||
|
|
||
| statement ok | ||
| set datafusion.explain.logical_plan_only = false; |
There was a problem hiding this comment.
Jefffrey
left a comment
There was a problem hiding this comment.
Is there a way to do this check in the SLT runner test code itself? e.g. after it runs an SLT file it checks if the config state is dirty and reports it via a warning in test output? I'm not particularly sure of adding more and more bash scripts like this
Thank you - that makes sense. I agree that enforcing this at the SLT runner level would be a cleaner and more robust solution than adding additional Bash-based checks in CI. I initially considered implementing this directly in Rust, but the shell script felt like a simpler and less invasive first step. That said, I understand the concern about accumulating more CI scripting logic, and I agree that handling this inside the SLT runner is architecturally the better approach. We can mark this PR as draft (without closing the root issue). I'm going to work on a follow-up PR that implements the check directly in the SLT runner - e.g., verifying after each SLT file execution that the configuration state matches the default and reporting it in the test output. That should give us a more complete and future-proof solution. Let me know if that sounds like a good direction, and I’ll proceed accordingly. |
|
That sounds good |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes ##17914. ## Rationale for this change In a previous PR #20474, I added a bash script that parsed the `SLT` files and checked whether any DataFusion configuration options were modified without being reset. While that approach worked, it relied on external scripting and additional parsing logic. This PR introduces a simpler and more direct solution implemented in Rust. At the end of each `SLT` test file execution, the current configuration is compared with the default configuration using a `Drop` implementation. If any configuration values were modified and not restored, a warning is printed. This approach is easier to maintain and keeps the validation logic within the Rust codebase rather than relying on an external bash script. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Capture the default DataFusion configuration when the `SLT` runner is initialized. - Implement `Drop` for the DataFusion SLT engine. - When an `SLT` file finishes executing, compare the current configuration with the default configuration. - If differences are detected, print a warning showing which configuration options were modified. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? This behavior is exercised by the existing `SLT` test suite. The configuration check runs automatically when each `SLT` file completes execution. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No. This change only affects internal `SLT` test infrastructure and does not modify any public APIs. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
|
I'm closing this PR as all the work was completed and merged in #20838 |
…pache#20838) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes #apache#17914. ## Rationale for this change In a previous PR apache#20474, I added a bash script that parsed the `SLT` files and checked whether any DataFusion configuration options were modified without being reset. While that approach worked, it relied on external scripting and additional parsing logic. This PR introduces a simpler and more direct solution implemented in Rust. At the end of each `SLT` test file execution, the current configuration is compared with the default configuration using a `Drop` implementation. If any configuration values were modified and not restored, a warning is printed. This approach is easier to maintain and keeps the validation logic within the Rust codebase rather than relying on an external bash script. <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - Capture the default DataFusion configuration when the `SLT` runner is initialized. - Implement `Drop` for the DataFusion SLT engine. - When an `SLT` file finishes executing, compare the current configuration with the default configuration. - If differences are detected, print a warning showing which configuration options were modified. <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? This behavior is exercised by the existing `SLT` test suite. The configuration check runs automatically when each `SLT` file completes execution. <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No. This change only affects internal `SLT` test infrastructure and does not modify any public APIs. <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Which issue does this PR close?
Rationale for this change
This PR introduces a validation script to prevent dangling configuration settings in sqllogictest (
.slt) files.What changes are included in this PR?
ci/scripts/check_slt_configs.shthat scans all.sltfiles and detects DataFusion configuration options that are set to:set datafusion.<config> = truebut are never reset back to:set datafusion.<config> = false.sltfiles where dangling boolean configs were found.All identified configs have been explicitly reset to false to ensure files restore the default session state.
sqllogictest-config-checkthat runs the validation script and fails the workflow if any abandoned boolean configuration is detected.Are these changes tested?
Yes, these changes are tested.
Are there any user-facing changes?
No, there are no user-facing changes.