Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitmodules
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
[submodule "df-autohint-runner/dsb"]
path = df-autohint-runner/dsb
url = https://github.com/microsoft/dsb.git
15 changes: 11 additions & 4 deletions autohint-implementations/df-execution-impl/src/visitor.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,10 @@ use hint_engine::{JoinAlgorithm, LeafNode, PlanNodeMetadata};

use datafusion::arrow::datatypes::Schema;
use datafusion::execution::context::SessionContext;
use datafusion::physical_plan::ExecutionPlan;
use datafusion::physical_plan::joins::{HashJoinExec, NestedLoopJoinExec};
use datafusion::physical_plan::test::exec::MockExec;
use datafusion::physical_plan::{ExecutionPlanVisitor, accept};
use datafusion::physical_plan::ExecutionPlan;
use datafusion::physical_plan::{accept, ExecutionPlanVisitor};
use std::collections::HashSet;

use std::sync::Arc;
Expand Down Expand Up @@ -139,8 +139,15 @@ impl ExecutionPlanVisitor for DFExecutionPlanVisitor {
let file_name = plan
.as_any()
.downcast_ref::<datafusion::datasource::memory::DataSourceExec>()
.unwrap()
.data_source()
.unwrap();

// NOTE: Here the data_source is a MemorySourceConfig instead of FileScanConfig
let file_name = file_name.data_source();
println!("\nfile_name");
println!("{:?}", file_name);

// NOTE: This unwrap() panicked
let file_name = file_name
.as_any()
.downcast_ref::<datafusion::datasource::physical_plan::FileScanConfig>()
.unwrap()
Expand Down
4 changes: 4 additions & 0 deletions df-autohint-runner/.gitignore
Original file line number Diff line number Diff line change
@@ -1,8 +1,12 @@
**target/
**data/
output
output/tpch/
output/job/
output/dsb/
input/job/queries
input/dsb/queries
**Cargo.lock
**__pycache__

*.rs.bk
67 changes: 63 additions & 4 deletions df-autohint-runner/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ export PGPASSWORD=<postgres password>
TPC-H tables and queries will be generated under `input/tpch/data` and `input/tpch/queries`.
TPC-H tables and schema will be loaded into Postgres `tpch` database.

## SQL to Datafusion to Hint Runner
### SQL to Datafusion to Hint Runner
run:
```
./scripts/tpch-df-logical-plan.sh
Expand All @@ -28,7 +28,7 @@ This will load the schema and query into datafusion, run the hint engine over th
The original SQL with dialect translation and produced SQL with hints will be dumped into
`output/tpch/original-queries` and `output/tpch/{logical, logical-optimized, physical-optimized}-queries` directories.

## SQL with Hints Postgres Runner
### SQL with Hints Postgres Runner
run:
```
export PGPASSWORD=<postgres password>
Expand All @@ -53,7 +53,7 @@ export PGPASSWORD=<postgres password>
job-fkindex.sh
```

## SQL to Datafusion to Hint Runner
### SQL to Datafusion to Hint Runner
run:
```
./scripts/job-df-logical-plan.sh
Expand All @@ -64,7 +64,7 @@ This will load the schema and query into datafusion, run the hint engine over th
The original SQL with dialect translation and produced SQL with hints will be dumped into
`output/job/original-queries` and `output/job/{logical, logical-optimized, physical-optimized}-queries` directories.

## SQL with Hints Postgres Runner
### SQL with Hints Postgres Runner
run:
```
export PGPASSWORD=<postgres password>
Expand All @@ -74,6 +74,65 @@ This will run explain analyze or explain on the original queries from `output/jo
and optimized queries from `output/job/logical-queries`, `output/job/logical-optimized-queries`, and `output/job/physical-optimized-queries` and dump their runtime performance and explain analyze plans or just explain plans into `output/job/postgres-results-{unhinted, logical, logical-optimized, physical-optimized}` respectively.
Please notice that any other processes containing name 'postgres', 'pgbench', or 'psql' could be killed during benchmark run, and postgres server will be restarted.


## DSB Setup
To init submodule `https://github.com/microsoft/dsb`, run
```
git submodule init
```

Then to build the `dsdgen` binary, edit `dsb/code/tools/Makefile.suite` `OS = ...` field if needed, then
```
cd dsb/code/tools
make
cd -
```
should build the `dsdgen` binary in the `dsb/code/tools` dir.

Then run the data generation and load python scripts, first install `psycopg2-binary`
```
pip3 install psycopg2-binary
```
Then run
```
python3 ./scripts/dsb-gen.py
python3 ./scripts/dsb-load.py
```
This should load dsb data into Postgres `dsb` database.

To create the reference indexes for DSB, run
```
python3 ./scripts/dsb-index.py
```

To generate the workload queries, adjust the config json `./scripts/dsb_workload_config.json`
and run
```
./scripts/dsb-workload-gen.sh
```
This should generate all queries under `input/dsb/queries`

### SQL to Datafusion to Hint Runner
run:
```
./scripts/dsb-df-logical-plan.sh
./scripts/dsb-df-logical-optimized-plan.sh
./scripts/dsb-df-execution-plan.sh
```
This will load the schema and query into datafusion, run the hint engine over the unoptimized logical plan, optimized logical plan, and optimized physical plan respectively, and dump the hinted SQL for postgres.
The original SQL with dialect translation and produced SQL with hints will be dumped into
`output/dsb/original-queries` and `output/dsb/{logical, logical-optimized, physical-optimized}-queries` directories.

### SQL with Hints Postgres Runner
run:
```
export PGPASSWORD=<postgres password>
./scripts/dsb-run-all.sh --single-core|--multi-core [--timeout-second 10] [--explain-only]
```
This will run explain analyze or explain on the original queries from `output/job/original-queries`
and optimized queries from `output/dsb/logical-queries`, `output/dsb/logical-optimized-queries`, and `output/dsb/physical-optimized-queries` and dump their runtime performance and explain analyze plans or just explain plans into `output/dsb/postgres-results-{unhinted, logical, logical-optimized, physical-optimized}` respectively.
Please notice that any other processes containing name 'postgres', 'pgbench', or 'psql' could be killed during benchmark run, and postgres server will be restarted.

---

## Complete Benchmark Analysis Pipeline (`run_benchmark_analysis.sh`)
Expand Down
4 changes: 4 additions & 0 deletions df-autohint-runner/datafusion-logical-runner/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ use std::{fs, path::Path, sync::Arc};
pub enum Benchmark {
TPCH,
JOB,
DSB,
}

fn observer(_plan: &LogicalPlan, _rule: &dyn OptimizerRule) {
Expand Down Expand Up @@ -122,6 +123,9 @@ pub async fn setup_database(
Benchmark::JOB => {
run_csv_ddl_file(&ctx, &schema_dir.as_ref().join("df-schema.sql"), data_dir).await?;
}
Benchmark::DSB => {
run_csv_ddl_file(&ctx, &schema_dir.as_ref().join("df-schema.sql"), data_dir).await?;
}
}
Ok(ctx)
}
Expand Down
7 changes: 4 additions & 3 deletions df-autohint-runner/datafusion-logical-runner/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ use datafusion::sql::sqlparser::dialect::PostgreSqlDialect;
use indicatif::{ProgressBar, ProgressStyle};
use std::{fs, path::PathBuf};

use datafusion_logical_runner::{Benchmark, df_sql_to_logical, is_sql_file, plan, setup_database};
use datafusion_logical_runner::{df_sql_to_logical, is_sql_file, plan, setup_database, Benchmark};

use df_execution_impl::DFExecutionPlanVisitor;
use std::sync::Arc;
Expand All @@ -13,8 +13,8 @@ use datafusion::physical_plan::displayable;
use df_logical_impl::DFLogicalPlanVisitor;
use hint_engine::{engine::Engine, engine::HintEngineConfig, hints::PGHintType};

use tracing::{error, info};
use tracing_subscriber::{EnvFilter, fmt};
use tracing::{debug, error, info, warn};
use tracing_subscriber::{fmt, EnvFilter};

#[derive(Parser, Debug)]
#[command(name = "df-autohint-runner", version, about = "Run tools")]
Expand Down Expand Up @@ -48,6 +48,7 @@ async fn main() -> Result<()> {
let benchmark = match args.benchmark.to_lowercase().as_str() {
"job" => Benchmark::JOB,
"tpch" => Benchmark::TPCH,
"dsb" => Benchmark::DSB,
other => panic!("Unknown benchmark: {other}"),
};

Expand Down
1 change: 1 addition & 0 deletions df-autohint-runner/dsb
Submodule dsb added at ec9a15
Loading