[SPARK-50520][PySpark] Respect timeout in df.rdd.countApprox() by rishav23 · Pull Request #56060 · apache/spark

rishav23 · 2026-05-22T10:07:01Z

What changes were proposed in this pull request?

PySpark's approximate RDD actions currently call getFinalValue() on the PartialResult returned by Spark's approximate job APIs. getFinalValue() blocks until the job finishes, so APIs such as countApprox(timeout=...) wait for full job completion instead of returning when the timeout expires.

This PR changes PySpark to read PartialResult.initialValue() instead, which already holds the timeout-aware approximation computed by ApproximateActionListener.awaitResult().

Additionally, regression tests were added to validate:

timeout-aware approximate behavior
exact results when computation completes successfully
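
To make the difference between the two accessors concrete, here is a minimal pure-Python sketch of the semantics described above. It is a toy model, not Spark code: `PartialResultModel`, `count_approx_model`, and their parameters are hypothetical names introduced for illustration. The timeout here is in seconds for simplicity, and the model returns only the running count, whereas Spark's evaluators return a bounded estimate with confidence limits.

```python
import threading
import time


class PartialResultModel:
    """Toy stand-in for a partial result; NOT Spark's PartialResult class."""

    def __init__(self):
        self.initial_value = None          # snapshot taken when the wait ends
        self._final_value = None
        self._done = threading.Event()

    def get_final_value(self):
        self._done.wait()                  # blocks until the whole job finishes
        return self._final_value


def count_approx_model(partition_sizes, seconds_per_partition, timeout):
    """Count partitions in a background thread; snapshot the count at `timeout` seconds."""
    result = PartialResultModel()
    lock = threading.Lock()
    state = {"count": 0}

    def job():
        for size in partition_sizes:
            time.sleep(seconds_per_partition)   # simulated per-partition work
            with lock:
                state["count"] += size
        result._final_value = state["count"]
        result._done.set()

    threading.Thread(target=job, daemon=True).start()
    result._done.wait(timeout)             # returns early if the job finishes first
    with lock:
        result.initial_value = state["count"]
    return result


sizes = [10, 10, 10, 10]

# Slow job: the timeout expires long before all partitions are counted.
start = time.monotonic()
slow = count_approx_model(sizes, seconds_per_partition=0.25, timeout=0.1)
elapsed = time.monotonic() - start

# Fast job: the work finishes well inside the timeout.
fast = count_approx_model(sizes, seconds_per_partition=0.0, timeout=5.0)

print(slow.initial_value, round(elapsed, 2))   # partial count, back near the timeout
print(slow.get_final_value())                  # blocks for the rest of the job
print(fast.initial_value)                      # job beat the timeout, so exact
```

Reading `initial_value` returns control at roughly the timeout, while `get_final_value()` waits for every partition. That waiting is the blocking behavior this PR removes from the PySpark code path. The fast case corresponds to the second regression scenario listed above: when the job completes in time, the initial value is already exact.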

Why are the changes needed?

Spark approximate actions are designed to return partial results after the specified timeout. Scala APIs correctly expose this behavior through PartialResult, but PySpark currently forces blocking completion by calling getFinalValue(). As a result, PySpark countApprox() ignores timeout semantics and waits for full completion.

Does this PR introduce any user-facing change?

Yes, PySpark approximate RDD actions now correctly respect timeout semantics and return timeout-aware approximate results instead of blocking until full completion.

How was this patch tested?

Reproduced the issue locally using large RDDs
Verified timeout behavior before and after the fix
Added regression tests in python/pyspark/tests/test_rdd.py
Ran: python/run-tests.py --testnames pyspark.tests.test_rdd

Was this patch authored or co-authored using generative AI tooling?

No

rishav23 added 2 commits May 22, 2026 15:16

[SPARK-50520][PySpark] Respect timeout semantics in approximate RDD a…

4984467

…ctions

[SPARK-50520][PySpark] Trigger CI

c49d15e

rishav23 wants to merge 2 commits into apache:master from rishav23:fix-spark-50520-countapprox-timeout
