backend: handle & log pulp publishing errors#4292
Conversation
There was a problem hiding this comment.
Code Review
This pull request modifies the create_publication method to return a response object and adds error logging in both pulp.py and storage.py. However, the review feedback points out that the added if not response.ok checks are unreachable because the underlying SafeRequest raises a RequestError for non-200 status codes. The reviewer suggests using try...except blocks to handle these errors and recommends logging response.text instead of response.json() to avoid potential decoding issues when the error response is not valid JSON.
Hmm, reading the Gemini comments now, and may have a point
Some callers of create_publication() only checked whether the POST request succeeded, but not whether the asynchronous Pulp task actually completed. Add a publish() method that creates the publication and waits for the task, so failures are detected and logged consistently. Relates: fedora-copr#4290 Closes: fedora-copr#4292
d71e32c to
87b39e0
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the Pulp backend to improve asynchronous task management, notably by updating wait_for_finished_task to use monotonic time and introducing a publish helper method. The changes also include updating storage.py to better handle distribution and publication tasks. Reviewers identified several high-severity issues, including potential KeyError crashes when accessing task IDs directly, a regression in init_project where a necessary task wait was removed, and unreachable error-handling logic in pulp.py caused by the exception-raising behavior of SafeRequest.
| distribution_name, | ||
| repository_href, | ||
| ) | ||
| task = response.json()["task"] |
There was a problem hiding this comment.
Accessing ["task"] directly is risky here. If create_distribution returns a 201 Created status (which can happen depending on Pulp's configuration or version), the "task" key will be missing, causing a KeyError. Since this is inside a try...except RequestError block, a KeyError will not be caught and will crash the backend process.
| task = response.json()["task"] | |
| task = response.json().get("task") | |
| if task and not self.client.wait_for_finished_task( | |
| task, f"create distribution {distribution_name}", | |
| ): | |
| return False |
| # Wait until we can get the publication | ||
| if task := response.json().get("task"): | ||
| self.client.wait_for_finished_task(task) | ||
| response = self.client.get_publication(repository_href) |
There was a problem hiding this comment.
This change introduces a regression. The previous implementation waited for the task returned by create_repository (at line 312) before proceeding to fetch the publication. By removing that wait, get_publication might be called before the repository is fully initialized in Pulp, potentially leading to empty results or errors.
| response = self.client.get_publication(repository_href) | |
| if self.devel: | |
| if task := response.json().get("task"): | |
| self.client.wait_for_finished_task(task, f"create repository {repository_name}") | |
| response = self.client.get_publication(repository_href) |
There was a problem hiding this comment.
we are still waiting, a few lines ago (actually a bug-fix, since we were not always waiting), or what am I missing gemini?
| if not response.ok: | ||
| self.log.error("Failed to update distribution %s: %s", | ||
| distribution, response.text) | ||
| return False |
There was a problem hiding this comment.
According to the repository rules, SafeRequest (used in self.send) raises a RequestError for non-200 status codes. Therefore, this if not response.ok check is unreachable for error responses. If the intention is to return False instead of allowing the exception to propagate, you should wrap the self.send call in a try...except RequestError block. Note that RequestError would need to be imported from copr_common.request.
References
- When using SafeRequest, an explicit call to response.raise_for_status() is not needed as SafeRequest raises a RequestError for non-200 status codes.
There was a problem hiding this comment.
This is probably right; I think I just copy-pasted dead code.
| Create a publication and wait for the task to finish. | ||
| """ | ||
| response = self.create_publication(repository) | ||
| task = response.json()["task"] |
There was a problem hiding this comment.
This access is unsafe. If create_publication returns a 201 Created (or any 2xx status other than 202), the response might not contain a "task" key, leading to a KeyError. It is safer to check the status code or use .get("task").
| task = response.json()["task"] | |
| task = response.json().get("task") | |
| if task and not self.wait_for_finished_task( | |
| task, f"publish {repository}", | |
| ): | |
| return False |
There was a problem hiding this comment.
Do we need both create_publication and publish? I'd prefer to have just one. And if we need to wait on some places and not on some other places, I'd rather control it using wait=True or wait=False parameter.
There was a problem hiding this comment.
I in general hate large methods, please double check (no big deal, I can merge those into one).
FrostyX
left a comment
There was a problem hiding this comment.
I don't want to be negative, but to be honest, I am not happy about the proposed changes.
My first problem is that I don't know what we are trying to solve in this PR. The referenced issue #4290 was IIRC caused by us accidentally removing the Pulp repository+distribution for that project, which was solved in PR #4293 and PR #4289 added logs which would help us understand that problem immediately. I consider that problem solved.
I don't see any other referenced issues, so I am a bit lost what we are doing, which leads to the feeling of "if it works, don't touch it".
We have two valid issues #4313 and #4317, and it seems we are trying to solve them here?
| Create a publication and wait for the task to finish. | ||
| """ | ||
| response = self.create_publication(repository) | ||
| task = response.json()["task"] |
There was a problem hiding this comment.
Do we need both create_publication and publish? I'd prefer to have just one. And if we need to wait on some places and not on some other places, I'd rather control it using wait=True or wait=False parameter.
| response = self.send("PATCH", url, data) | ||
| if not response.ok: | ||
| self.log.error("Failed to update distribution %s: %s", | ||
| distribution, response.text) |
There was a problem hiding this comment.
I originally had two reasons why I didn't do the error handling in pulp.py methods:
- Sometimes it depends on the caller if non-200 is really an error or not. For example when creating a distribution and it already exists:
copr/backend/copr_backend/storage.py
Lines 335 to 347 in 55e4faa
- I wanted the
pulp.pyto be the thinnest possible layer over the Pulp API, so that we can easily reason about it with Pulp folks.
But maybe it makes sense to move the error handling here, I need to think about it.
There was a problem hiding this comment.
This is just a copy-paste from elsewhere, I need to take one more look, too.
There was a problem hiding this comment.
Ok, I just moved that no-op if not response.ok: check. While we really should be handling RequestError (it’s a pain to duplicate that pattern, nah), my main goal was to address the 202 cases. It felt inconsistent to ignore the initial request error in one place while handling the background task error elsewhere.
| self.log.info("Pulp: updating distribution %s", distribution) | ||
| return self.send("PATCH", url, data) | ||
| response = self.send("PATCH", url, data) | ||
| if not response.ok: |
There was a problem hiding this comment.
Will this ever happen? We are using SafeRequest so if response is not ok then it shoulud raise exception:
copr/common/copr_common/request.py
Lines 102 to 120 in 55e4faa
There is not an explicit:
if not response.ok:
raise ...but those status code ranges should cover all not ok cases, no?
There was a problem hiding this comment.
Yeah, similar feedback by Gemini above.
There was a problem hiding this comment.
Thinking once more, and I'm tempted to create another commit removing a dozen of checks for response.ok :D I'd prefer having a brief 1:1 discussing what is going on here.
| response = self.wait_for_finished_task(task) | ||
| data = response.json() | ||
| if response.ok and data["state"] == "completed": | ||
| self.log.info("Successfully created Pulp publication of repository %s", repository) |
There was a problem hiding this comment.
I will start the thread right here, but it concerns multiple other places.
This PR seems to remove a lot of log lines (both info and error). I don't like it at all because I've already debugged many Pulp-related problems and the logging was either helpful or I added new logging that would help to debug the same issue in the future. This feels like a big regression.
Edit: Okay they were just moved to wait_for_finished_task, but I still don't know if it is worth the gamble.
There was a problem hiding this comment.
Yeah, I just wanted to de-duplicate the waiting pattern, including logging success/failure cases.
Not sure if gambling ...? Perhaps we can freeze here until we have the PULP tests in TF?
Thank you for the feedback! ❤️ |
|
Promised motivation for commits:
|
Some callers of create_publication() only checked whether the POST request succeeded, but not whether the asynchronous Pulp task actually completed. Add a publish() method that creates the publication and waits for the task, so failures are detected and logged consistently. Relates: fedora-copr#4290 Closes: fedora-copr#4292
b129e2e to
9f249db
Compare
|
Fixed conflict with 460e725. PTAL |
|
/packit test |
Some callers of create_publication() only checked whether the POST request succeeded, but not whether the asynchronous Pulp task actually completed. Add a publish() method that creates the publication and waits for the task, so failures are detected and logged consistently. Relates: fedora-copr#4290 Closes: fedora-copr#4292
The update_distribution() PATCH request can return either 200 (synchronous) or 202 (asynchronous task). Handle the 202 case by waiting for the task to finish, and move error handling into the method so callers just check the bool return value.
Several callers of wait_for_finished_task() duplicated the same pattern: check response.ok, parse JSON, compare state to "completed", log error. This made it easy to miss failures or log them inconsistently (some callers didn't check the task result at all, just the initial POST). Move the success/failure check into wait_for_finished_task() itself so it returns the task data on success or None on failure, with consistent error logging including a caller-provided description. Also switch from time.time() to time.monotonic(), and retry on transient status-polling failures instead of giving up immediately. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Each caller logged its own "Successfully ..." message after wait_for_finished_task returned. Since wait_for_finished_task already has the description parameter and knows whether the task succeeded, log it there once instead of repeating it in every caller. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
9f249db to
6651aca
Compare
Relates: #4290