
test: log ns deletion timestamp in e2e to debug resource cleanup flakiness#1085

Closed
jwtty wants to merge 6 commits into Azure:main from jwtty:fix-e2e

Conversation

@jwtty
Contributor

@jwtty jwtty commented Mar 18, 2025

Description of your changes

Resource cleanup relies on garbage collection of child resources when their owner (AppliedWork) is deleted. The controllers return without waiting for the cleanup to fully complete. Our e2e test is flaky when checking that the resources are completely deleted, as the current 10s wait time is not enough. Sample failures:

https://github.com/Azure/fleet/actions/runs/13911855578/job/38927583698
https://github.com/Azure/fleet/actions/runs/13840189622/job/38725462003?pr=1079

This PR logs the ns deletion timestamp to help debug.

Fixes #

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

Comment thread on test/e2e/utils_test.go (outdated)

  workResourcesRemovedActual := workNamespaceRemovedFromClusterActual(memberCluster)
- Eventually(workResourcesRemovedActual, eventuallyDuration, eventuallyInterval).Should(Succeed(), "Failed to remove work resources from member cluster %s", memberCluster.ClusterName)
+ Eventually(workResourcesRemovedActual, 3*eventuallyDuration, eventuallyInterval).Should(Succeed(), "Failed to remove work resources from member cluster %s", memberCluster.ClusterName)
Contributor

This seems to be a regression in the cleanup. We should not blindly relax the timeout; instead, we should fix the root cause, which is slow GC in the member agent.

@ryanzhang-oss
Contributor

another case
https://github.com/Azure/fleet/actions/runs/13930432233/job/38996353144?pr=1086

probably easy to repro locally, need to see what actually happened on the GC

@jwtty jwtty force-pushed the fix-e2e branch 3 times, most recently from a6d0e64 to 0e03798 Compare March 25, 2025 21:03
@jwtty
Contributor Author

jwtty commented Mar 25, 2025

another case https://github.com/Azure/fleet/actions/runs/13930432233/job/38996353144?pr=1086

probably easy to repro locally, need to see what actually happened on the GC

This one should be related to the bug fixed in #1096: the test didn't delete the namespace on the hub cluster at all. The timeout in this PR, by contrast, happens when we do try to delete the namespace but time out waiting for GC.

@jwtty jwtty changed the title test: increase e2e wait time for resource cleanup test: long ns deletion timestamp in e2e to debug resource cleanup flakiness Mar 25, 2025
@jwtty jwtty changed the title test: long ns deletion timestamp in e2e to debug resource cleanup flakiness test: log ns deletion timestamp in e2e to debug resource cleanup flakiness Mar 25, 2025
@kaito-pr-agent

kaito-pr-agent Bot commented Mar 27, 2025

Title

(Describe updated until commit ada18dc)

test: log ns deletion timestamp in e2e to debug resource cleanup flakiness




PR Type

Enhancement


Description

  • Increased e2e wait time for resource cleanup.

  • Added logging of namespace deletion timestamp for debugging.

  • Enhanced resource listing and status reporting during namespace removal.

  • Fixed linting issues.


Changes walkthrough 📝

Relevant files — Enhancement

test/e2e/actuals_test.go — Add namespace deletion timestamp and resource status logging (+45/-4)

  • Imported necessary packages.
  • Added timestamp logging for namespace deletion.
  • Implemented detailed resource listing during namespace removal.
  • Updated function to include resource status summary.

@kaito-pr-agent

kaito-pr-agent Bot commented Mar 27, 2025

PR Reviewer Guide 🔍

(Review updated until commit ada18dc)

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

500 - Partially compliant

Compliant requirements:

• Add logging for namespace deletion timestamp

Non-compliant requirements:

• Ensure e2e tests pass

Requires further human verification:

• None

Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ No major issues detected

@kaito-pr-agent

Persistent review updated to latest commit e23caca

@kaito-pr-agent

Persistent review updated to latest commit ada18dc

@michaelawyu
Contributor

Hi Wantong! I am closing this PR as part of the CNCF repo migration process; please consider moving (re-creating) this PR in the new repo once the sync PR is merged. If there's any question/concern, please let me know. Thanks 🙏

@jwtty jwtty deleted the fix-e2e branch August 27, 2025 22:44