Skip to content

backend: handle & log pulp publishing errors#4292

Open
praiskup wants to merge 5 commits into
fedora-copr:mainfrom
praiskup:praiskup-handle-publishing-errors
Open

backend: handle & log pulp publishing errors#4292
praiskup wants to merge 5 commits into
fedora-copr:mainfrom
praiskup:praiskup-handle-publishing-errors

Conversation

@praiskup
Copy link
Copy Markdown
Member

Relates: #4290

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the create_publication method to return a response object and adds error logging in both pulp.py and storage.py. However, the review feedback points out that the added if not response.ok checks are unreachable because the underlying SafeRequest raises a RequestError for non-200 status codes. The reviewer suggests using try...except blocks to handle these errors and recommends logging response.text instead of response.json() to avoid potential decoding issues when the error response is not valid JSON.

Comment thread backend/copr_backend/pulp.py Outdated
Comment thread backend/copr_backend/storage.py Outdated
Comment thread backend/copr_backend/pulp.py Outdated
FrostyX
FrostyX previously approved these changes Apr 29, 2026
@FrostyX FrostyX dismissed their stale review April 30, 2026 08:03

Hmm, reading the Gemini comments now, and may have a point

praiskup added a commit to praiskup/copr that referenced this pull request May 13, 2026
Some callers of create_publication() only checked whether the POST
request succeeded, but not whether the asynchronous Pulp task actually
completed.  Add a publish() method that creates the publication and
waits for the task, so failures are detected and logged consistently.

Relates: fedora-copr#4290
Closes: fedora-copr#4292
@praiskup praiskup force-pushed the praiskup-handle-publishing-errors branch from d71e32c to 87b39e0 Compare May 13, 2026 18:44
@praiskup praiskup requested a review from FrostyX May 13, 2026 18:45
@praiskup
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Pulp backend to improve asynchronous task management, notably by updating wait_for_finished_task to use monotonic time and introducing a publish helper method. The changes also include updating storage.py to better handle distribution and publication tasks. Reviewers identified several high-severity issues, including potential KeyError crashes when accessing task IDs directly, a regression in init_project where a necessary task wait was removed, and unreachable error-handling logic in pulp.py caused by the exception-raising behavior of SafeRequest.

distribution_name,
repository_href,
)
task = response.json()["task"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Accessing ["task"] directly is risky here. If create_distribution returns a 201 Created status (which can happen depending on Pulp's configuration or version), the "task" key will be missing, causing a KeyError. Since this is inside a try...except RequestError block, a KeyError will not be caught and will crash the backend process.

Suggested change
task = response.json()["task"]
task = response.json().get("task")
if task and not self.client.wait_for_finished_task(
task, f"create distribution {distribution_name}",
):
return False

# Wait until we can get the publication
if task := response.json().get("task"):
self.client.wait_for_finished_task(task)
response = self.client.get_publication(repository_href)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This change introduces a regression. The previous implementation waited for the task returned by create_repository (at line 312) before proceeding to fetch the publication. By removing that wait, get_publication might be called before the repository is fully initialized in Pulp, potentially leading to empty results or errors.

Suggested change
response = self.client.get_publication(repository_href)
if self.devel:
if task := response.json().get("task"):
self.client.wait_for_finished_task(task, f"create repository {repository_name}")
response = self.client.get_publication(repository_href)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are still waiting, a few lines ago (actually a bug-fix, since we were not always waiting), or what am I missing gemini?

Comment on lines +390 to +393
if not response.ok:
self.log.error("Failed to update distribution %s: %s",
distribution, response.text)
return False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the repository rules, SafeRequest (used in self.send) raises a RequestError for non-200 status codes. Therefore, this if not response.ok check is unreachable for error responses. If the intention is to return False instead of allowing the exception to propagate, you should wrap the self.send call in a try...except RequestError block. Note that RequestError would need to be imported from copr_common.request.

References
  1. When using SafeRequest, an explicit call to response.raise_for_status() is not needed as SafeRequest raises a RequestError for non-200 status codes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably right; I think I just copy-pasted dead code.

Create a publication and wait for the task to finish.
"""
response = self.create_publication(repository)
task = response.json()["task"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This access is unsafe. If create_publication returns a 201 Created (or any 2xx status other than 202), the response might not contain a "task" key, leading to a KeyError. It is safer to check the status code or use .get("task").

Suggested change
task = response.json()["task"]
task = response.json().get("task")
if task and not self.wait_for_finished_task(
task, f"publish {repository}",
):
return False

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns 202.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both create_publication and publish? I'd prefer to have just one. And if we need to wait on some places and not on some other places, I'd rather control it using wait=True or wait=False parameter.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I in general hate large methods, please double check (no big deal, I can merge those into one).

Copy link
Copy Markdown
Member

@FrostyX FrostyX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to be negative, but to be honest, I am not happy about the proposed changes.

My first problem is that I don't know what we are trying to solve in this PR. The referenced issue #4290 was IIRC caused by us accidentally removing the Pulp repository+distribution for that project, which was solved in PR #4293 and PR #4289 added logs which would help us understand that problem immediately. I consider that problem solved.

I don't see any other referenced issues, so I am a bit lost what we are doing, which leads to the feeling of "if it works, don't touch it".

We have two valid issues #4313 and #4317, and it seems we are trying to solve them here?

Create a publication and wait for the task to finish.
"""
response = self.create_publication(repository)
task = response.json()["task"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both create_publication and publish? I'd prefer to have just one. And if we need to wait on some places and not on some other places, I'd rather control it using wait=True or wait=False parameter.

response = self.send("PATCH", url, data)
if not response.ok:
self.log.error("Failed to update distribution %s: %s",
distribution, response.text)
Copy link
Copy Markdown
Member

@FrostyX FrostyX May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had two reasons why I didn't do the error handling in pulp.py methods:

  • Sometimes it depends on the caller if non-200 is really an error or not. For example when creating a distribution and it already exists:
    try:
    response = self.client.create_distribution(
    distribution_name,
    repository_href,
    )
    except RequestError as ex:
    if "This field must be unique" not in ex.response.text:
    self.log.error(
    "Failed to create a Pulp distribution %s because of %s",
    public_distribution_name,
    ex.response.text,
    )
    return False
  • I wanted the pulp.py to be the thinnest possible layer over the Pulp API, so that we can easily reason about it with Pulp folks.

But maybe it makes sense to move the error handling here, I need to think about it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a copy-paste from elsewhere, I need to take one more look, too.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I just moved that no-op if not response.ok: check. While we really should be handling RequestError (it’s a pain to duplicate that pattern, nah), my main goal was to address the 202 cases. It felt inconsistent to ignore the initial request error in one place while handling the background task error elsewhere.

self.log.info("Pulp: updating distribution %s", distribution)
return self.send("PATCH", url, data)
response = self.send("PATCH", url, data)
if not response.ok:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this ever happen? We are using SafeRequest so if response is not ok then it shoulud raise exception:

if response.status_code >= 500:
# Server error. Hopefully this is only temporary problem, we wan't
# to re-try, and wait till the server works again.
raise RequestRetryError(
"Request server error on {}: {} {}".format(
url, response.status_code, response.reason))
if response.status_code == 408:
# 408 Request Timeout - We couldn't send the request within
# a reasonable time-frame, so let's retry.
raise RequestRetryError(
f"Request Timeout {response.status_code}: {response.reason}: {url}")
if response.status_code >= 400:
# Client error. The mistake is on our side, it doesn't make sense
# to continue with retries.
msg = "Request client error on {}: {} {}".format(
url, response.status_code, response.reason)
raise RequestError(msg, response)

There is not an explicit:

if not response.ok:
    raise ...

but those status code ranges should cover all not ok cases, no?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, similar feedback by Gemini above.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking once more, and I'm tempted to create another commit removing a dozen of checks for response.ok :D I'd prefer having a brief 1:1 discussing what is going on here.

response = self.wait_for_finished_task(task)
data = response.json()
if response.ok and data["state"] == "completed":
self.log.info("Successfully created Pulp publication of repository %s", repository)
Copy link
Copy Markdown
Member

@FrostyX FrostyX May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will start the thread right here, but it concerns multiple other places.
This PR seems to remove a lot of log lines (both info and error). I don't like it at all because I've already debugged many Pulp-related problems and the logging was either helpful or I added new logging that would help to debug the same issue in the future. This feels like a big regression.

Edit: Okay they were just moved to wait_for_finished_task, but I still don't know if it is worth the gamble.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I just wanted to de-duplicate the waiting pattern, including logging success/failure cases.

Not sure if gambling ...? Perhaps we can freeze here until we have the PULP tests in TF?

@praiskup
Copy link
Copy Markdown
Member Author

praiskup commented May 14, 2026

I don't want to be negative, but to be honest, I am not happy about the proposed changes.

Thank you for the feedback! ❤️
This was a chain of thoughts, going to try to explain the bugs/inconsistencies/duplicities I was trying to fix. Let's find some common ground, and perhaps split the PR.

@praiskup
Copy link
Copy Markdown
Member Author

Promised motivation for commits:

  1. backend: wait for Pulp publication tasks to finish -- one? of the callers didn't wait, and it was a bug imvho (namely forking)
  2. backend: handle async Pulp distribution updates (202 responses) - is a bugfix, I noticed 202 and 201 are valid outputs of this API call
  3. backend: consolidate Pulp task error handling in wait_for_finished_task - deduplication mostly, but not yet correct (some info message is redundant) .. and that commit needs a fix (json response should be printed out in failure case) .. also, there's a bugfix - we should always wait for the create_distribution task to avoid races...
  4. should be squashed with 3. :-( while 3. should be split (bugfix vs. deduplication)

praiskup added a commit to praiskup/copr that referenced this pull request May 21, 2026
Some callers of create_publication() only checked whether the POST
request succeeded, but not whether the asynchronous Pulp task actually
completed.  Add a publish() method that creates the publication and
waits for the task, so failures are detected and logged consistently.

Relates: fedora-copr#4290
Closes: fedora-copr#4292
@praiskup praiskup force-pushed the praiskup-handle-publishing-errors branch from b129e2e to 9f249db Compare May 21, 2026 06:25
@praiskup
Copy link
Copy Markdown
Member Author

Fixed conflict with 460e725. PTAL

@praiskup
Copy link
Copy Markdown
Member Author

/packit test

praiskup added 5 commits June 1, 2026 10:14
Some callers of create_publication() only checked whether the POST
request succeeded, but not whether the asynchronous Pulp task actually
completed.  Add a publish() method that creates the publication and
waits for the task, so failures are detected and logged consistently.

Relates: fedora-copr#4290
Closes: fedora-copr#4292
The update_distribution() PATCH request can return either 200
(synchronous) or 202 (asynchronous task).  Handle the 202 case by
waiting for the task to finish, and move error handling into the
method so callers just check the bool return value.
Several callers of wait_for_finished_task() duplicated the same pattern:
check response.ok, parse JSON, compare state to "completed", log error.
This made it easy to miss failures or log them inconsistently (some
callers didn't check the task result at all, just the initial POST).

Move the success/failure check into wait_for_finished_task() itself so
it returns the task data on success or None on failure, with consistent
error logging including a caller-provided description.  Also switch from
time.time() to time.monotonic(), and retry on transient status-polling
failures instead of giving up immediately.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Each caller logged its own "Successfully ..." message after
wait_for_finished_task returned.  Since wait_for_finished_task already
has the description parameter and knows whether the task succeeded,
log it there once instead of repeating it in every caller.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
@praiskup praiskup force-pushed the praiskup-handle-publishing-errors branch from 9f249db to 6651aca Compare June 1, 2026 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants