Improve C++ error handling, resource cleanup, and API return checks#1089
rgsl888prabhu wants to merge 7 commits into NVIDIA:main
Conversation
- Add fread return value checks and dimension validation in presolve binary reader
- Wrap CUDA kernel launches and graph API calls with RAFT_CUDA_TRY in feasibility_jump
- Wrap cudaMemcpy calls with RAFT_CUDA_TRY in optimization_problem tests
- Add cudaGetDevice/cudaGetDeviceProperties return checks in version_info
- Use PID-based shared memory segment names in gRPC server
- Use RAII (unique_ptr) for FILE handle in MPS parser to prevent fd leak on error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
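One change in the list above, PID-based shared memory segment names, can be sketched as below. This is a minimal illustration of the naming scheme, not the actual gRPC server code; the helper name and the `cuopt_shm` base string are hypothetical.

```cpp
#include <string>
#include <unistd.h>  // getpid()

// Hypothetical helper: scope a POSIX shared-memory segment name to the
// current process, so two server instances on one host cannot collide on
// the same segment. The resulting name would be passed to shm_open().
std::string pid_scoped_shm_name(const std::string& base)
{
  // POSIX shm names must begin with exactly one leading slash.
  return "/" + base + "_" + std::to_string(::getpid());
}
```

Because the PID is unique per running process, stale segments from a crashed instance also stop shadowing a fresh server's segment.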
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough: Replaced manual FILE* handling with RAII in the MPS parser; removed LP problem serialization/deserialization methods from dual_simplex; made POSIX shared‑memory names PID‑scoped and adjusted …
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✅ Passed checks (2 passed)
Warning: Review ran into problems. 🔥 Git: Failed to clone repository.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (1)
756-805: ⚠️ Potential issue | 🟠 Major — CUDA error handling is incomplete in `run_step_device`.
Lines 709, 722, 730, 741, and 747 contain unchecked CUDA calls that violate the requirement for error checking on every kernel launch. Specifically:
- Line 709: `cudaStreamBeginCapture`
- Lines 722 & 730: `cudaLaunchCooperativeKernel` (binary/non-binary branches)
- Lines 741 & 747: `cudaLaunchCooperativeKernel` (within the `FJ_DEBUG_LOAD_BALANCING` block)

Suggested patch
```diff
- if (use_graph) { cudaStreamBeginCapture(climber_stream, cudaStreamCaptureModeThreadLocal); }
+ if (use_graph) {
+   RAFT_CUDA_TRY(cudaStreamBeginCapture(climber_stream, cudaStreamCaptureModeThreadLocal));
+ }
@@
-  cudaLaunchCooperativeKernel(
+  RAFT_CUDA_TRY(cudaLaunchCooperativeKernel(
     (void*)compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, true>,
     grid_resetmoves_bin,
     blocks_resetmoves_bin,
     reset_moves_args,
     0,
-    climber_stream);
+    climber_stream));
@@
-  cudaLaunchCooperativeKernel(
+  RAFT_CUDA_TRY(cudaLaunchCooperativeKernel(
     (void*)compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, false>,
     grid_resetmoves,
     blocks_resetmoves,
     reset_moves_args,
     0,
-    climber_stream);
+    climber_stream));
@@
-  cudaLaunchCooperativeKernel((void*)compute_mtm_moves_kernel<i_t, f_t>,
+  RAFT_CUDA_TRY(cudaLaunchCooperativeKernel((void*)compute_mtm_moves_kernel<i_t, f_t>,
     grid_resetmoves_bin,
     blocks_resetmoves_bin,
     reset_moves_args,
     0,
-    climber_stream);
-  cudaLaunchCooperativeKernel((void*)load_balancing_sanity_checks<i_t, f_t>,
+    climber_stream));
+  RAFT_CUDA_TRY(cudaLaunchCooperativeKernel((void*)load_balancing_sanity_checks<i_t, f_t>,
     512,
     128,
     kernel_args,
     0,
-    climber_stream);
+    climber_stream));
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 756 - 805, The run_step_device function has unchecked CUDA calls—specifically cudaStreamBeginCapture and several cudaLaunchCooperativeKernel invocations in the binary/non-binary branches and inside the FJ_DEBUG_LOAD_BALANCING block—so add proper error checking around each of these calls (use RAFT_CUDA_TRY or equivalent error-check macro) and ensure any failure is handled consistently (log and abort/return) just like the surrounding kernel launches (e.g., wraps around cudaStreamBeginCapture, the cooperative launches that invoke kernels such as handle_local_minimum_kernel<i_t,f_t> and other cooperative kernel calls in the binary/non-binary paths and the FJ_DEBUG_LOAD_BALANCING block) so every CUDA API and kernel launch in run_step_device is checked.
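The `RAFT_CUDA_TRY` pattern the review keeps asking for boils down to: evaluate the call once, compare the returned status against success, and throw with source-location context on failure. The stand-in below uses a plain `int` status instead of `cudaError_t` so it compiles without the CUDA toolkit; the macro and stub names are illustrative, and the real RAFT macro additionally calls `cudaGetErrorString` for the message.

```cpp
#include <sstream>
#include <stdexcept>

// Simplified stand-in for an error-check macro such as RAFT_CUDA_TRY:
// evaluate the call exactly once, throw with file/line context on failure.
#define CHECK_STATUS(call)                                    \
  do {                                                        \
    int status_ = (call);                                     \
    if (status_ != 0) {                                       \
      std::ostringstream oss_;                                \
      oss_ << "call failed with status " << status_ << " at " \
           << __FILE__ << ":" << __LINE__;                    \
      throw std::runtime_error(oss_.str());                   \
    }                                                         \
  } while (0)

// Stubs emulating a CUDA API that returns a status code.
int fake_launch_ok() { return 0; }
int fake_launch_bad() { return 9; }  // some nonzero cudaError_t-like value

bool try_bad_launch()
{
  try {
    CHECK_STATUS(fake_launch_bad());
    return false;  // not reached: the macro throws on nonzero status
  } catch (const std::runtime_error&) {
    return true;   // the failure was surfaced instead of silently ignored
  }
}
```

The point of wrapping every launch is exactly what `try_bad_launch` shows: an unchecked nonzero status disappears, a wrapped one becomes a diagnosable exception.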
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/dual_simplex/presolve.hpp`:
- Around line 76-84: The current code silently proceeds when FILE* fid =
fopen(path.c_str(), "r") returns null and manually calls fclose(fid) inside the
check_read lambda, which can leak descriptors on exceptions; change this to fail
fast by checking fid immediately after fopen and throwing a clear
std::runtime_error if fopen fails for path, and replace manual fclose usage with
an RAII wrapper (e.g., std::unique_ptr<FILE, decltype(&fclose)> or a small
FileHandle class) so that the file is always closed on scope exit; update the
check_read lambda (and any code paths that currently call fclose) to rely on the
RAII handle instead of calling fclose directly.
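The RAII wrapper the comment asks for can be as small as a `std::unique_ptr` with `fclose` as its deleter. The sketch below is a generic illustration of that shape, not the exact presolve.hpp code; `open_file` and the `check_read` helper signature are assumptions.

```cpp
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

using file_ptr = std::unique_ptr<std::FILE, int (*)(std::FILE*)>;

// Fail fast on a bad path; afterwards the handle closes itself on every
// exit path, including exceptions thrown by later read checks.
file_ptr open_file(const std::string& path)
{
  file_ptr fid(std::fopen(path.c_str(), "r"), &std::fclose);
  if (!fid) { throw std::runtime_error("failed to open " + path); }
  return fid;
}

// A check_read-style helper: a read must consume exactly `expected` items.
void check_read(std::size_t got, std::size_t expected, const char* what)
{
  if (got != expected) {
    throw std::runtime_error(std::string("short read for ") + what);
  }
}
```

With this shape, `check_read` no longer needs to call `fclose` itself before throwing: unwinding past the `file_ptr` closes the descriptor.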
In `@cpp/src/utilities/version_info.cpp`:
- Around line 169-177: The call to cudaRuntimeGetVersion(&version) in function
that queries CUDA info is missing error checking, causing an
uninitialized/incorrect version to be logged if it fails; update the code near
cudaRuntimeGetVersion to check its return value (similar to how cudaGetDevice
and cudaGetDeviceProperties are checked), log a warning via CUOPT_LOG_WARN with
context (e.g., "Failed to query CUDA runtime version") and return early if the
call is not cudaSuccess so version isn't used, referencing
cudaRuntimeGetVersion, cudaGetDevice, cudaGetDeviceProperties, device_id, and
device_prop to locate the change.
---
Outside diff comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 756-805: The run_step_device function has unchecked CUDA
calls—specifically cudaStreamBeginCapture and several
cudaLaunchCooperativeKernel invocations in the binary/non-binary branches and
inside the FJ_DEBUG_LOAD_BALANCING block—so add proper error checking around
each of these calls (use RAFT_CUDA_TRY or equivalent error-check macro) and
ensure any failure is handled consistently (log and abort/return) just like the
surrounding kernel launches (e.g., wraps around cudaStreamBeginCapture, the
cooperative launches that invoke kernels such as
handle_local_minimum_kernel<i_t,f_t> and other cooperative kernel calls in the
binary/non-binary paths and the FJ_DEBUG_LOAD_BALANCING block) so every CUDA API
and kernel launch in run_step_device is checked.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5d18dbbf-bd69-4d1c-8ed7-7457c556083e
📒 Files selected for processing (8)
- cpp/libmps_parser/src/mps_parser.cpp
- cpp/src/dual_simplex/presolve.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
- cpp/src/utilities/version_info.cpp
- cpp/tests/linear_programming/unit_tests/optimization_problem_test.cu
…g CUDA checks
- Use unique_ptr for FILE handle in presolve binary reader (RAII)
- Throw on fopen failure instead of silent skip
- Add CSC matrix validation (col_start monotonicity, row index bounds)
- Check cudaRuntimeGetVersion return value in version_info
- Wrap remaining cudaStreamBeginCapture and cudaLaunchCooperativeKernel calls with RAFT_CUDA_TRY in feasibility_jump

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/dual_simplex/presolve.hpp`:
- Around line 93-97: The deserialization currently coerces any bit pattern into
obj_scale and objective_is_integral; change the read-and-assign logic around
obj_scale and objective_is_integral (the block that calls check_read for
obj_constant, obj_scale, is_integral) to validate exact on-disk encodings:
accept obj_scale only if it equals 1.0f or -1.0f (reject otherwise) and accept
is_integral only if it equals 0 or 1 (treat 1 as true, 0 as false; reject other
values); on invalid values return a read/format error (or throw) instead of
silently coercing, keeping use of check_read for the fread checks and preserving
obj_constant handling.
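The exact-encoding validation described above amounts to whitelisting the only bit patterns the writer can produce. A minimal sketch, with field names following the comment and the rest illustrative:

```cpp
#include <stdexcept>

// Accept only encodings the serializer can emit: obj_scale is +/-1.0 and
// the on-disk integrality flag is exactly 0 or 1. Anything else indicates
// a corrupt or mismatched file, so reject instead of silently coercing.
float validate_obj_scale(float obj_scale)
{
  if (obj_scale != 1.0f && obj_scale != -1.0f) {
    throw std::runtime_error("invalid obj_scale in serialized problem");
  }
  return obj_scale;
}

bool validate_is_integral(int is_integral)
{
  if (is_integral != 0 && is_integral != 1) {
    throw std::runtime_error("invalid is_integral flag in serialized problem");
  }
  return is_integral == 1;
}
```

Rejecting rather than coercing matters because `fread` will happily fill these fields from any bit pattern; only the validation step distinguishes a valid file from garbage that happens to be the right length.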
- Around line 86-129: The file-reader currently only limits element counts
(max_dim, nnz) but still allows huge memory allocations via many large arrays;
before calling any resize() or fread() for objective, rhs, lower, upper,
A.col_start, A.i, or A.x, compute the total byte footprint required (e.g.
total_bytes = num_cols*sizeof(f_t) /*objective*/ + num_rows*sizeof(f_t) /*rhs*/
+ num_cols*sizeof(f_t) /*lower*/ + num_cols*sizeof(f_t) /*upper*/ +
(num_cols+1)*sizeof(i_t) /*col_start*/ + nnz*sizeof(i_t) /*A.i*/ +
nnz*sizeof(f_t) /*A.x*/) and compare it against a safe cap (e.g. a MAX_BYTES
constant like 1<<30 or derived from max_dim and available memory); if
total_bytes exceeds the cap, throw std::runtime_error and do not perform any
resize/fread. Ensure this check happens immediately after computing nnz (after
reading A.col_start and before any resize/fread on
objective/rhs/lower/upper/A.i/A.x) and keep existing per-field
monotonicity/index checks (A.col_start, A.i bounds) intact.
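The total-footprint check can be computed before any `resize()`. The sketch below mirrors the fields listed in the comment; the 2 GiB cap matches the later commit message, the element sizes and function names are illustrative, and `uint64_t` arithmetic keeps the sum from overflowing for any plausible 32-bit dimensions.

```cpp
#include <cstdint>
#include <stdexcept>

constexpr std::uint64_t kMaxBytes = std::uint64_t{2} << 30;  // 2 GiB cap

// Sum the bytes every array in the serialized problem would need.
std::uint64_t serialized_bytes(std::uint64_t num_rows,
                               std::uint64_t num_cols,
                               std::uint64_t nnz,
                               std::uint64_t sizeof_f,  // sizeof(f_t)
                               std::uint64_t sizeof_i)  // sizeof(i_t)
{
  return num_cols * sizeof_f        // objective
       + num_rows * sizeof_f        // rhs
       + num_cols * sizeof_f * 2    // lower + upper
       + (num_cols + 1) * sizeof_i  // A.col_start
       + nnz * sizeof_i             // A.i
       + nnz * sizeof_f;            // A.x
}

// Call immediately after nnz is known, before any resize()/fread().
void check_footprint(std::uint64_t num_rows, std::uint64_t num_cols, std::uint64_t nnz)
{
  if (serialized_bytes(num_rows, num_cols, nnz, 8, 4) > kMaxBytes) {
    throw std::runtime_error("serialized problem exceeds memory cap");
  }
}
```

Checking the aggregate footprint closes the gap left by per-dimension caps: each of `num_rows`, `num_cols`, and `nnz` can individually pass `max_dim` while their combined allocations still exhaust memory.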
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 811-815: The cudaGraph_t 'graph' is not destroyed if
RAFT_CUDA_TRY(cudaGraphInstantiate(&graph_instance, graph)) throws, leaking the
captured graph; wrap the instantiate call in a try/catch (or equivalent RAFT
error-handling block) so that on any exception you call cudaGraphDestroy(graph)
before rethrowing, only set graph_created = true after successful instantiation,
and ensure the same fix is applied to the identical pattern in
ping_pong_graph.cu; reference the symbols cudaStreamEndCapture,
cudaGraphInstantiate, cudaGraphDestroy, graph_instance, graph, climber_stream,
RAFT_CUDA_TRY, and graph_created when making the change.
In `@cpp/src/utilities/version_info.cpp`:
- Around line 169-177: The function currently returns early when cudaGetDevice
or cudaGetDeviceProperties fail, which prevents the subsequent general cuOpt
version/build and CPU logs from running; remove those early returns and instead
set a local flag (e.g., bool has_gpu = false) or guard variables (using
device_id/device_prop) so GPU-specific preparation and the GPU log emission are
skipped but the function continues to emit the general version/build and CPU
logs; specifically, modify the cudaGetDevice and cudaGetDeviceProperties checks
(the blocks calling CUOPT_LOG_WARN) to clear/leave has_gpu false and continue,
and wrap only the GPU info preparation and the GPU logging call(s) in an if
(has_gpu) block so non-GPU logging always executes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3d3ff6de-c56a-4aa2-af69-97f0a661d5e1
📒 Files selected for processing (3)
- cpp/src/dual_simplex/presolve.hpp
- cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
- cpp/src/utilities/version_info.cpp
… GPU flag
- Add 2 GiB total serialized byte cap before allocations in presolve reader
- Validate obj_scale (must be 1.0 or -1.0) and is_integral (must be 0 or 1)
- Wrap cudaGraphInstantiate in try/catch to ensure graph cleanup on failure
- Use has_gpu flag instead of early return so version/CPU info always logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (2)
724-824: ⚠️ Potential issue | 🟠 Major — Add CUDA error checking to all kernel launches in `load_balancing_score_update()`.
Lines 603, 615, 623–624, and 628 in `load_balancing_score_update()` issue raw `cudaLaunchCooperativeKernel` and `<<<>>>` kernel launches without `RAFT_CUDA_TRY` or `RAFT_CHECK_CUDA`. Since this function is invoked at line 721 when `use_load_balancing` is true, launch failures in this path will not be caught and reported, creating an inconsistency with the hardened non-load-balanced path. Wrap all four kernel launches with appropriate error checking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 724 - 824, The kernel launches inside load_balancing_score_update are missing RAFT_CUDA_TRY/RAFT_CHECK_CUDA wrappers; wrap each raw cudaLaunchCooperativeKernel and cudaLaunchKernel call in that function with RAFT_CUDA_TRY (or use RAFT_CHECK_CUDA where appropriate after stream ops) so failures are reported; specifically update the launches that invoke compute_mtm_moves_kernel<i_t,f_t,...>, update_lift_moves_kernel<i_t,f_t>, update_breakthrough_moves_kernel<i_t,f_t> and the other raw launches in the load_balancing_score_update flow (e.g., select_variable_kernel, handle_local_minimum_kernel, update_assignment_kernel, update_changed_constraints_kernel) to use RAFT_CUDA_TRY/RAFT_CHECK_CUDA consistent with the non-load-balanced path.
709-820: ⚠️ Potential issue | 🟠 Major — Wrap stream capture region in exception-safe guard to ensure `cudaStreamEndCapture` always executes.
Any `RAFT_CUDA_TRY` failure inside the loop (lines 715–803) exits without calling `cudaStreamEndCapture()` at line 811. Per CUDA documentation, a stream remains invalid after failed capture until `cudaStreamEndCapture()` completes, leaving `climber_stream` unusable for the rest of the solve. Wrap the entire capture block (lines 710–810) in a try/catch that unconditionally ends capture, destroys any returned graph, and rethrows the exception.
The existing try/catch around `cudaGraphInstantiate` (lines 812–818) only handles failures during instantiation, not during the capture itself.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 709 - 820, The cuda stream capture started by cudaStreamBeginCapture(climber_stream) must be wrapped in an exception-safe block so cudaStreamEndCapture(climber_stream, &graph) is always called even if any RAFT_CUDA_TRY inside the capture (e.g., launches in the loop using compute_mtm_moves_kernel, update_lift_moves_kernel, select_variable_kernel, etc.) throws; modify the code around the begin capture to use a try/catch/finally pattern that on any exception calls cudaStreamEndCapture, destroys any non-null graph via cudaGraphDestroy(graph), then rethrows; preserve the existing cudaGraphInstantiate/graph_instance logic but ensure graph is cleaned up and graph_created set only after successful instantiate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 724-824: The kernel launches inside load_balancing_score_update
are missing RAFT_CUDA_TRY/RAFT_CHECK_CUDA wrappers; wrap each raw
cudaLaunchCooperativeKernel and cudaLaunchKernel call in that function with
RAFT_CUDA_TRY (or use RAFT_CHECK_CUDA where appropriate after stream ops) so
failures are reported; specifically update the launches that invoke
compute_mtm_moves_kernel<i_t,f_t,...>, update_lift_moves_kernel<i_t,f_t>,
update_breakthrough_moves_kernel<i_t,f_t> and the other raw launches in the
load_balancing_score_update flow (e.g., select_variable_kernel,
handle_local_minimum_kernel, update_assignment_kernel,
update_changed_constraints_kernel) to use RAFT_CUDA_TRY/RAFT_CHECK_CUDA
consistent with the non-load-balanced path.
- Around line 709-820: The cuda stream capture started by
cudaStreamBeginCapture(climber_stream) must be wrapped in an exception-safe
block so cudaStreamEndCapture(climber_stream, &graph) is always called even if
any RAFT_CUDA_TRY inside the capture (e.g., launches in the loop using
compute_mtm_moves_kernel, update_lift_moves_kernel, select_variable_kernel,
etc.) throws; modify the code around the begin capture to use a
try/catch/finally pattern that on any exception calls cudaStreamEndCapture,
destroys any non-null graph via cudaGraphDestroy(graph), then rethrows; preserve
the existing cudaGraphInstantiate/graph_instance logic but ensure graph is
cleaned up and graph_created set only after successful instantiate.
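The "always end capture" requirement above is a classic scope-guard problem. Stripped of the CUDA calls, the shape of the fix looks like the sketch below, where the guard's lambda stands in for `cudaStreamEndCapture` plus `cudaGraphDestroy`; all names here are illustrative, not taken from feasibility_jump.cu.

```cpp
#include <functional>
#include <stdexcept>
#include <utility>

// Generic scope guard: runs its cleanup on every exit path, including
// exceptions, unless explicitly dismissed after success.
class scope_guard {
 public:
  explicit scope_guard(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~scope_guard() { if (fn_) fn_(); }
  void dismiss() { fn_ = nullptr; }
 private:
  std::function<void()> fn_;
};

// Emulates the capture region: begin capture, maybe fail mid-capture.
// In the real code the guard would call cudaStreamEndCapture and, if a
// graph was returned, cudaGraphDestroy, before the exception propagates.
bool capture_with_guard(bool fail_mid_capture, int& end_calls)
{
  scope_guard end_capture([&] { ++end_calls; });  // always "end capture"
  if (fail_mid_capture) {
    throw std::runtime_error("kernel launch failed during capture");
  }
  return true;  // the guard also runs on this normal exit
}
```

On the failing path the guard still fires, so the stream is released from capture mode and stays usable for the remainder of the solve, which is exactly what the raw early-return version loses.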
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9646f09b-95a1-4e2c-a65b-2047d379eafd
📒 Files selected for processing (3)
cpp/src/dual_simplex/presolve.hppcpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cucpp/src/utilities/version_info.cpp
chris-maes
left a comment
Let's just delete read_problem in presolve.hpp. I think it was only used for debugging.
@@ -73,33 +74,84 @@ struct lp_problem_t {
void read_problem(const std::string& path)
This is not used anywhere, we can delete this. We go through mps reader and writer.
check_read(fread(A.x.data(), sizeof(f_t), nnz, fid.get()), nnz, "A.x");
}

void write_mps(const std::string& path) const
Do you mean write_mps function ?
│ write_mps (lines 74-127) │ Used behind #ifdef debug flags in branch_and_bound.cpp — not dead, but debug-only │
But there is │ write_problem (lines 53-72) │ Dead — defined but never called anywhere in the codebase │
Yes, this is dead code that we can delete
Hmm, @chris-maes mentioned this is used for debugging. So I am leaving it as it is, but removing write_problem.
@rg20 @chris-maes May I get another round of review on this PR?
Summary
Improve error handling and resource management across C++/CUDA code: validate fread returns and dimensions in presolve binary reader, wrap unchecked CUDA API calls with RAFT_CUDA_TRY, use RAII for FILE handles in MPS parser, add PID-based shared memory names in gRPC server, and check CUDA device query returns in version_info.
Testing
No new tests added. Existing unit and integration tests cover the modified code paths.
Documentation
No documentation changes needed.