Query execution time difference between deltatable QueryBuilder and using DataFusion directly. #1140

@debajyoti-truefoundry

Describe the bug
What happened:
There are two ways of querying a Delta Table using DataFusion.

  1. Using DataFusion directly.
  2. Using the Query Builder from Delta.

Versions:
deltalake==1.0.2
datafusion==47.0.0

Script (perf_diff.py):
import time
from datafusion import SessionContext
from deltalake import DeltaTable, QueryBuilder

dt = DeltaTable("./delta_traces_3/otel_traces")
sql = """
SELECT
  *
FROM tbl
WHERE
  ("MlRepoId" = 1089)
  AND ("TracingProjectId" = '222fde49-1f7a-4752-8ec1-06bcdbf570c5')
  AND ("TraceId" = '8728990bd3d11fa91a688e9d9964bca1')
  AND ("SpanId" = '82c0a65e80000450')
"""

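# Path 1: register the table with the Delta QueryBuilder and time the query.
qb = QueryBuilder().register("tbl", dt)
start = time.monotonic()
table = qb.execute(sql).read_all()
print("Delta QueryBuilder: ", time.monotonic() - start)

# Path 2: register the same table with a DataFusion SessionContext and time the same query.
ctx = SessionContext()
ctx.register_table_provider("tbl", dt)
start = time.monotonic()
arrow_list = ctx.sql(sql).collect()
print("DataFusion: ", time.monotonic() - start)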

Output:
$ python perf_diff.py
Delta QueryBuilder:  1.2023070430004736
DataFusion:  136.57222191900019

As the output above shows, the same query takes about 1.2 s through the Delta QueryBuilder but about 136.6 s through DataFusion directly, a difference of more than 100x.
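
A single timed run per path can be skewed by one-off effects such as cold caches, so repeating each measurement may help confirm the gap. The following is only a minimal sketch using the standard library: `time_runs` and `summarize` are hypothetical helper names, not part of deltalake or datafusion, and it assumes the query objects in the script can be reused across runs, which is not verified here.

```python
import statistics
import time
from typing import Callable, List


def time_runs(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Call fn `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # result is discarded; only the elapsed time is recorded
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label: str, durations: List[float]) -> str:
    """Format the fastest and median duration for one query path."""
    return (
        f"{label}: min={min(durations):.3f}s "
        f"median={statistics.median(durations):.3f}s over {len(durations)} runs"
    )


if __name__ == "__main__":
    # Stand-in workload. In perf_diff.py this would be, for example,
    # lambda: qb.execute(sql).read_all() or lambda: ctx.sql(sql).collect()
    print(summarize("example", time_runs(lambda: sum(range(100_000)))))
```

If the minimum and median for the DataFusion path both stay far above those of the QueryBuilder path, the difference is unlikely to be a warm-up artifact.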

To Reproduce
Steps to reproduce the behavior: run the script above (perf_diff.py) against the Delta table with the package versions listed.

Expected behavior
I was expecting a near-identical execution time.

Additional context
delta-io/delta-rs#3517 (comment)

Labels: bug
