feat(terraform_validate): Expand list of TF errors when t init will rerun, if --retry-once-with-cleanup specified#991
Conversation
…active When TF_PLUGIN_CACHE_DIR is set and parallelism is enabled, the optimistic first validate attempt can fail due to race conditions with concurrent plugin cache access. Run terraform init first to ensure the cache is properly populated before validation. Fixes antonbabenko#640
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a retry match for provider-version retrieval failures, documents the validate→init→retry flow (including plugin-cache parallelism races), and changes the retry-path init call to pass "true" as the third argument. ChangesPlugin Cache Race Prevention
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hooks/terraform_validate.sh (1)
66-83:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate function documentation to reflect the new initialization guard.
The function documentation describes the old flow but doesn't mention the new conditional guard. When plugin caching and parallelism are active, the function now always runs
terraform initfirst (lines 118-126), not just when.terraformdir is missing.📝 Suggested documentation update
####################################################################### # Unique part of `common::per_dir_hook`. The function is executed in loop # on each provided dir path. Run wrapped tool with specified arguments -# 1. Check if `.terraform` dir exists and if not - run `terraform init` -# 2. Run `terraform validate` -# 3. If at least 1 check failed - change the exit code to non-zero +# 1. If plugin cache is enabled and parallelism is active, run `terraform init` +# to avoid race conditions with concurrent cache access +# 2. Run `terraform validate` +# 3. If validate fails, run `terraform init` (if not already done) and retry +# 4. If at least 1 check failed - change the exit code to non-zero # Arguments:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hooks/terraform_validate.sh` around lines 66 - 83, Update the function header docs for the unique part of common::per_dir_hook to describe the new initialization guard: explain that when plugin caching and parallelism are enabled the hook will run terraform init unconditionally before validate (regardless of .terraform presence), otherwise it preserves the old behavior of running init only if the .terraform directory is missing; mention the relevant parameters that affect this (parallelism_disabled and the plugin-caching/parallelism flag) and keep existing descriptions for dir_path, change_dir_in_unique_part, args and tf_path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@hooks/terraform_validate.sh`:
- Around line 66-83: Update the function header docs for the unique part of
common::per_dir_hook to describe the new initialization guard: explain that when
plugin caching and parallelism are enabled the hook will run terraform init
unconditionally before validate (regardless of .terraform presence), otherwise
it preserves the old behavior of running init only if the .terraform directory
is missing; mention the relevant parameters that affect this
(parallelism_disabled and the plugin-caching/parallelism flag) and keep existing
descriptions for dir_path, change_dir_in_unique_part, args and tf_path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f299bfee-cfa6-4974-8365-08b881269b25
📒 Files selected for processing (1)
hooks/terraform_validate.sh
| # When plugin cache is enabled and parallelism is active, always run | ||
| # terraform init first to avoid race conditions with concurrent access. | ||
| # https://github.com/hashicorp/terraform/issues/31964 | ||
| if [[ -n $TF_PLUGIN_CACHE_DIR && $parallelism_disabled != true ]]; then | ||
| common::terraform_init "$tf_path validate" "$dir_path" "$parallelism_disabled" "$tf_path" || { | ||
| exit_code=$? | ||
| return $exit_code | ||
| } | ||
| fi |
There was a problem hiding this comment.
I might be missing the gist or am misreading the code: ~10 lines down the code I can see the common::terraform_init "$tf_path validate" "$dir_path" "$parallelism_disabled" "$tf_path" is executed unconditionally if the tf validate fails — ain't the code you're suggesting sort of duplicating that mostly one-to-one? 🤔
Given @MaxymVlasov participated in the linked issue, I guess he's got much more details on parallelism and its issues and implementation in pre-commit-terraform, so let's see what Max would say.
As a generic code review: in modern Bash using the [[ operator the -n $var can be expressed as $var (w/o -n) if the var isn't of quasi-"boolean" type (which is when the var is assigned the command, e.g. var=/bin/date, and hence the expansion of the var evaluates to the command's success or failure)
|
Thanks for the review @MaxymVlasov. Here's the full error context: When This happens because:
The fix runs The |
|
Yes, it's decribed in https://github.com/antonbabenko/pre-commit-terraform#terraform_validate Just add it to list of other errors as I mentioned above |
- Only run early init when .terraform dir doesn't exist (avoids unnecessary time penalty on subsequent commits) - Drop -n from [[ -n $var ]] per modern bash style - Update function docs to reflect new flow - Add provider fetch error to match_validate_errors retry list
|
Updated based on all three reviews:
|
There was a problem hiding this comment.
LGTM
Shifting final approval to @MaxymVlasov though.
| # When plugin cache is enabled and parallelism is active, run | ||
| # terraform init before validate to avoid race conditions with | ||
| # concurrent cache access. Only needed when .terraform doesn't exist. | ||
| # https://github.com/hashicorp/terraform/issues/31964 | ||
| if [[ ! -d $dir_path/.terraform && $TF_PLUGIN_CACHE_DIR && $parallelism_disabled != true ]]; then | ||
| common::terraform_init "$tf_path validate" "$dir_path" "$parallelism_disabled" "$tf_path" || { | ||
| exit_code=$? | ||
| return $exit_code | ||
| } | ||
| fi | ||
|
|
There was a problem hiding this comment.
Can you please clarify how this will help to deal with parallelism issue?
Rn, it's literally a copy of L142, just with some extra conditions.
Terraform validate not always requires terraform init, and it fails instantly when it requires init, so it cheap to place optimistic validate before even try initing.
There was a problem hiding this comment.
It was my concern too. And I erroneously concluded this point has been resolved 🤦🏻 To much of context switching makes me "shortsighted" =((((
Taking my LGTM back =)
There was a problem hiding this comment.
If you want, you can simply create file with something like terraform {} and run on it
time terraform validate
Double check that there still nothing more in repo, and then run
time terraform init
time terraform validateAnd then make same tests for file with
provider "aws" {}
Actually, I'll do them myself tomorrow too, just to double check is it still same way as it was years ago
There was a problem hiding this comment.
terraform {}
vm@vm:/tmp/test$ time terraform validate
Success! The configuration is valid.
real 0m0,241s
user 0m0,175s
sys 0m0,081s
vm@vm:/tmp/test$ time terraform init
Initializing the backend...
Initializing provider plugins...
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
real 0m0,143s
user 0m0,150s
sys 0m0,056s
vm@vm:/tmp/test$ time terraform validate
Success! The configuration is valid.
real 0m0,140s
user 0m0,135s
sys 0m0,074s
vm@vm:/tmp/test$ ls -lah
total 68K
drwxrwxr-x 2 vm vm 4,0K чер 9 12:47 .
drwxrwxrwt 33 root root 56K чер 9 12:47 ..
-rw-rw-r-- 1 vm vm 17 чер 9 12:48 main.tfAnd there is no .terraform/ at the end.
provider "aws" {}
vm@vm:/tmp/test$ time terraform validate
╷
│ Error: Missing required provider
│
│ This configuration requires provider registry.terraform.io/hashicorp/aws, but that provider isn't available. You
│ may be able to install it automatically by running:
│ terraform init
╵
real 0m0,143s
user 0m0,134s
sys 0m0,074s
vm@vm:/tmp/test$ time terraform init
Initializing the backend...
Initializing provider plugins...
- Finding latest version of hashicorp/aws...
- Installing hashicorp/aws v6.49.0...
- Installed hashicorp/aws v6.49.0 (signed by HashiCorp)
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
real 0m23,565s
user 0m4,617s
sys 0m3,852s
vm@vm:/tmp/test$
vm@vm:/tmp/test$ time terraform validate
Success! The configuration is valid.
real 0m2,390s
user 0m4,003s
sys 0m0,468s
vm@vm:/tmp/test$ ls -lah
total 76K
drwxrwxr-x 3 vm vm 4,0K чер 9 12:50 .
drwxrwxrwt 33 root root 56K чер 9 12:50 ..
-rw-rw-r-- 1 vm vm 2,6K чер 9 12:51 main.tf
drwxr-xr-x 3 vm vm 4,0K чер 9 12:50 .terraform
-rw-r--r-- 1 vm vm 1,4K чер 9 12:50 .terraform.lock.hcl…e init The previous approach ran terraform init before every validate when TF_PLUGIN_CACHE_DIR was set and parallelism was active. This added significant overhead (23+ seconds with providers) on every commit. Instead, add 'unexpected value returned by API' to the retryable error patterns in match_validate_errors. The existing retry mechanism at line 142 already handles running terraform init when validate fails, so the proactive init is unnecessary. This is a much lighter-weight fix: - No overhead when there's no race condition - Only runs terraform init when the specific error occurs - Backward compatible: existing behavior preserved for all other cases Signed-off-by: Arunesh Dwivedi <arunesh.devops@gmail.com>
|
Thanks for the detailed benchmarks @MaxymVlasov. You're absolutely right — running I've updated the fix to avoid the proactive init entirely. Instead, I added Changes:
This should have zero overhead when there's no race condition. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hooks/terraform_validate.sh (1)
68-74: ⚡ Quick winClarify that retry requires
--retry-once-with-cleanup=true.The comment states that plugin cache parallelism race conditions are "retried automatically," but this only happens when users enable
--retry-once-with-cleanup=true(see line 136-148). Without this flag,match_validate_errorsis never called, and the new error pattern on line 61 won't trigger any retry.Consider updating the comment to clarify this dependency, for example:
-# 3. If plugin cache parallelism causes a race condition, the error is -# caught by match_validate_errors and retried automatically +# 3. If plugin cache parallelism causes a race condition and +# --retry-once-with-cleanup=true is set, the error is caught by +# match_validate_errors and retried automaticallyThis makes the conditional nature explicit and aligns with the README recommendation to enable
--retry-once-with-cleanup=truewhen usingTF_PLUGIN_CACHE_DIR.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hooks/terraform_validate.sh` around lines 68 - 74, Update the existing comment in the unique part of common::per_dir_hook to explicitly state that automatic retries for plugin-cache parallelism require the CLI flag --retry-once-with-cleanup=true so that match_validate_errors will be invoked; mention match_validate_errors and the new error pattern (related to TF_PLUGIN_CACHE_DIR) so readers know the retry behavior is conditional on that flag and not automatic by default.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@hooks/terraform_validate.sh`:
- Around line 68-74: Update the existing comment in the unique part of
common::per_dir_hook to explicitly state that automatic retries for plugin-cache
parallelism require the CLI flag --retry-once-with-cleanup=true so that
match_validate_errors will be invoked; mention match_validate_errors and the new
error pattern (related to TF_PLUGIN_CACHE_DIR) so readers know the retry
behavior is conditional on that flag and not automatic by default.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 028fb0df-48d4-4a76-b039-6d4ac601f37f
📒 Files selected for processing (1)
hooks/terraform_validate.sh
terraform_validate): Expand list of TF errors when t init will run
terraform_validate): Expand list of TF errors when t init will runterraform_validate): Expand list of TF errors when t init will be rerun, if --retry-once-with-cleanup specified
terraform_validate): Expand list of TF errors when t init will be rerun, if --retry-once-with-cleanup specifiedterraform_validate): Expand list of TF errors when t init will rerun, if --retry-once-with-cleanup specified
| "Could not load plugin") return 1 ;; | ||
| "Missing required provider") return 1 ;; | ||
| *"there is no package for"*"cached in .terraform/providers") return 1 ;; | ||
| *"could not retrieve the list of available versions for provider"*) return 1 ;; |
There was a problem hiding this comment.
Are you sure that it lower-cased?
https://github.com/search?q=repo%3Ahashicorp%2Fterraform%20%22could%20not%20retrieve%20the%20list%20of%20available%20versions%20for%20provider%22&type=code
And that it in the middle of sentence, and not complete sentence by itself.
| "Missing required provider") return 1 ;; | ||
| *"there is no package for"*"cached in .terraform/providers") return 1 ;; | ||
| *"could not retrieve the list of available versions for provider"*) return 1 ;; | ||
| *"unexpected value returned by API"*) return 1 ;; |
There was a problem hiding this comment.
Interesting, I can't find out such error in terraform. Can you create a minimum reproducible example?
https://github.com/search?q=repo%3Ahashicorp%2Fterraform+%22unexpected+value+returned+by+API%22&type=code
Dismissing my approval as it looks like added cases aren't fully correct. Kudos to Maxym for being thorough and picky.
When TF_PLUGIN_CACHE_DIR is set and the plugin cache race condition triggers a retry, the second terraform init call must force parallelism_disabled=true to avoid re-entering the retry loop in common::terraform_init. Previously the retry call passed the same $parallelism_disabled variable, making it identical to the first call. Also add plugin-cache-related error patterns to match_validate_errors so that cache corruption races are detected and retried. Signed-off-by: Arunesh Dwivedi <arunesh.devops@gmail.com>
|
Thanks for the review. Addressing the feedback:
The fix now differs from the original init call — the retry path forces single-process init to avoid the cache race. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hooks/terraform_validate.sh (1)
73-76: 💤 Low valueConsider clarifying that retry is conditional on the
--retry-once-with-cleanupflag.The comment states that plugin-cache errors are "retried automatically," but the retry logic (lines 143-169) only executes when
retry_once_with_cleanup == "true". Without this flag, validation errors are not matched or retried.📝 Suggested documentation improvement
# 1. Run `terraform validate` # 2. If validate fails, run `terraform init` and retry -# 3. If plugin cache parallelism causes a race condition, the error is -# caught by match_validate_errors and retried automatically +# 3. If --retry-once-with-cleanup is enabled and plugin cache parallelism +# causes a race condition, the error is caught by match_validate_errors +# and retried with cleanup🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hooks/terraform_validate.sh` around lines 73 - 76, Update the top comment block to state that the automatic retry behavior for plugin-cache race condition errors is conditional on the --retry-once-with-cleanup flag (controlled by the retry_once_with_cleanup variable) and only performed when that flag is true; reference the retry logic implemented around lines that call match_validate_errors and the retry block (the code that performs terraform init and a single retry when retry_once_with_cleanup == "true") so readers know retries are not unconditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@hooks/terraform_validate.sh`:
- Around line 73-76: Update the top comment block to state that the automatic
retry behavior for plugin-cache race condition errors is conditional on the
--retry-once-with-cleanup flag (controlled by the retry_once_with_cleanup
variable) and only performed when that flag is true; reference the retry logic
implemented around lines that call match_validate_errors and the retry block
(the code that performs terraform init and a single retry when
retry_once_with_cleanup == "true") so readers know retries are not
unconditional.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d08e1c8a-56e7-42a8-ae17-70cbd8f02ed5
📒 Files selected for processing (1)
hooks/terraform_validate.sh
|
Understood the concern about duplicating The specific errors I was targeting (parallelism-related init failures) may be better handled by fixing the retry logic in the existing init path rather than adding a pre-check. I will leave this open for now — if you think the approach can be salvaged with a narrower condition, let me know. Otherwise feel free to close. |
The comment now explicitly states that automatic retry for plugin-cache race conditions requires the --retry-once-with-cleanup flag to be set. Signed-off-by: Arunesh Dwivedi <arunesh.devops@gmail.com>
|
Updated the comment to clarify that retry is conditional on (CodeRabbit nitpick). The retry logic itself is unchanged — is only called when the flag is enabled. The fix is minimal: 4 new error patterns in , the retry path forces , and comments are accurate. No proactive init — zero overhead when there is no race condition. Let me know if this looks good or if you'd prefer a different approach. |
|
Updated the comment to clarify that retry is conditional on --retry-once-with-cleanup=true (CodeRabbit nitpick). The retry logic itself is unchanged. The fix is minimal: 4 new error patterns in match_validate_errors, the retry path forces parallelism_disabled=true, and comments are accurate. No proactive init — zero overhead when there is no race condition. Let me know if this looks good or if you'd prefer a different approach. |
Problem
When TF_PLUGIN_CACHE_DIR is set and parallelism is enabled (from #620),
terraform validate can fail intermittently due to race conditions with
concurrent plugin cache access. The error looks like:
Error: unexpected value returned by API
The current code tries terraform validate first (optimistically), then
runs terraform init only if validate fails. But with plugin cache races,
the init retry doesn't always resolve the issue because not all errors are catched.
Related to #640