diff --git a/.gitignore b/.gitignore
index 0a567a6c11..79391abf68 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,3 +22,4 @@ remorph_transpile/
.databricks-login.json
.mypy_cache
.env
+.cursor/rules/profiler-fetchresult-connectors.mdc
diff --git a/docs/lakebridge/docs/assessment/profiler/index.mdx b/docs/lakebridge/docs/assessment/profiler/index.mdx
index d0e13b7f15..c484ba45b0 100644
--- a/docs/lakebridge/docs/assessment/profiler/index.mdx
+++ b/docs/lakebridge/docs/assessment/profiler/index.mdx
@@ -33,6 +33,7 @@ Key capabilities:

| Source Platform | Configuration Status |
|:---------------:|:-------------------:|
| Azure Synapse | ✅ |
+| Amazon Redshift | ✅ |

## Configure Profiler

@@ -88,3 +89,5 @@ Each execution will create a timestamped snapshot of your source environment.

Visualize your profiler results as a Lakeview dashboard deployed directly to your Databricks workspace.
See the full guide: [Profiler Summary Dashboard](./dashboards).
+
+For **Amazon Redshift**, dashboard creation is limited to uploading the profiler extract; see [Amazon Redshift Profiler Details](./redshift).
diff --git a/docs/lakebridge/docs/assessment/profiler/redshift.mdx b/docs/lakebridge/docs/assessment/profiler/redshift.mdx
new file mode 100644
index 0000000000..39fa2e0dee
--- /dev/null
+++ b/docs/lakebridge/docs/assessment/profiler/redshift.mdx
@@ -0,0 +1,159 @@
---
sidebar_position: 2
title: Amazon Redshift Profiler Details
---
import Admonition from '@theme/Admonition';

# Amazon Redshift Profiler Details

- [Prerequisites](#prerequisites)
- [Configure Connection to Redshift](#configure-connection-to-redshift)
- [Run the profiler](#run-the-profiler)
- [Profiler output and dashboards](#profiler-output-and-dashboards)

## Prerequisites

### 1. Environment

- **Lakebridge CLI** installed and configured for your Databricks workspace (same as for other profiler sources).

### 2. Choose the Redshift deployment variant

The profiler ships **three extract pipelines**; pick the one that matches your Redshift instance:

| Variant | Use when |
|---------|----------|
| **serverless** | Amazon Redshift Serverless |
| **provisioned** | Single-AZ provisioned cluster |
| **provisioned_multi_az** | Multi-AZ provisioned cluster |

When you run `execute-database-profiler`, the CLI prompts you to select this variant so that Lakebridge loads the correct SQL pipeline under `resources/assessments/redshift/<variant>/`.

### 3. Network connectivity

The machine running the profiler must be able to reach the Redshift cluster **endpoint** (hostname) on the cluster port (default **5439**), subject to your security groups, VPC, and routing rules.

### 4. Authentication

During configuration you choose an **authentication method** and where secrets are read from (**local**, **env**, or **file**):

| Authentication method | Typical use |
|-----------------------|-------------|
| **database_password** | Native database user and password |
| **temporary_credentials_db_user** | Temporary credentials via AWS (`GetClusterCredentials`-style flows); the wizard collects the DB user for the credential exchange (often `awsuser` for the master-user path) |
| **temporary_credentials_iam** | IAM-authenticated temporary credentials |
| **federated_user** | Federated identity mapped through AWS → Redshift |

For the IAM-oriented methods you typically need:

- **AWS credentials** available to the process (for example `AWS_PROFILE`, access-key environment variables, or instance-profile credentials).
- **IAM permissions** allowing the Amazon Redshift credential APIs appropriate for your setup (for example `redshift:GetClusterCredentials` where applicable).

Use **`local`** to store plaintext values in `~/.databricks/labs/lakebridge/.credentials.yml`, **`env`** to substitute values from environment variables (with fallback), or **`file`** to reuse an existing credential file that already contains valid Redshift entries.

### 5. Database privileges

The profiler connects as your configured user and runs read-only extracts. The pipeline includes **prepare** steps that create a helper view **`query_view`** in the database (via `CREATE OR REPLACE VIEW`). The connecting user therefore needs permission to:

- **Create (and replace) views** in the target database used for profiling.
- **Select** from the Amazon Redshift system relations used by the extracts (see below).

**Provisioned clusters** (`provisioned` / `provisioned_multi_az`): objects referenced by the bundled SQL include, among others:

- `stl_query`, `stl_query_metrics`
- `stv_node_storage_capacity`, `stv_partitions`
- `sys_external_query_detail`

**Serverless** (`serverless`): examples include:

- `sys_query_history`, `sys_query_detail`
- `sys_external_query_detail`, `sys_serverless_usage`

The exact objects accessed depend on your Redshift edition; consult the Amazon Redshift documentation and grant the **minimum read** access consistent with those views/tables.

:::tip
If you cannot grant broad catalog access, narrow the grants to the relations used in the YAML pipeline for your variant under `resources/assessments/redshift/<variant>/` in the Lakebridge package.
:::
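
If you provision a dedicated user for profiling, the Redshift SQL below is a minimal sketch of the grants described above; the user name and password are placeholders, and it assumes `query_view` is created in the `public` schema of the profiled database:

```sql
-- Placeholder name and password; adapt to your security standards.
CREATE USER profiler_reader PASSWORD '<StrongPassw0rd>';

-- Non-superusers see only their own rows in system tables and views by
-- default; this lets the profiler read all activity in the stl_*/stv_*/sys_*
-- relations listed above.
ALTER USER profiler_reader SYSLOG ACCESS UNRESTRICTED;

-- Lets the prepare step CREATE OR REPLACE the helper view query_view.
GRANT CREATE, USAGE ON SCHEMA public TO profiler_reader;
```

Without `SYSLOG ACCESS UNRESTRICTED`, the activity extracts would capture only the profiling user's own queries, which would make the assessment far less useful.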

## Configure Connection to Redshift

```bash
databricks labs lakebridge configure-database-profiler
```

Select **redshift** when prompted for the source system. The wizard asks for the authentication method, the credential source (`local` | `env` | `file`), and connection details; for password authentication, for example:

- Redshift cluster **endpoint** (host)
- **Port** (default 5439)
- **Database** name
- **User** and **password**

For the temporary-credential and federated IAM paths, expect prompts for the **DB user** used with `GetClusterCredentials` (default suggestion: `awsuser`) and, optionally, an **AWS profile** name.

An example transcript (values are illustrative):

```console
databricks labs lakebridge configure-database-profiler

Please select the source system you want to configure
[0] synapse
[1] redshift
Enter a number between 0 and 1: 1

Redshift authentication: database_password, temporary_credentials_db_user,
temporary_credentials_iam, or federated_user.

Authentication method
[0] database_password
[1] temporary_credentials_db_user
...
Credential source (local | env | file)
[0] local
[1] env
[2] file
...

Enter the Redshift cluster endpoint (host): mycluster.abc123.us-east-1.redshift.amazonaws.com
Enter the port details (default: 5439): 5439
Enter the database name: dev
Enter the user details: profiler_reader
Enter the password details: ********

Do you want to test the connection to redshift? [y/n]: y
```

## Run the profiler

After configuration (and an optional connection test), review the available options:

```bash
databricks labs lakebridge execute-database-profiler --help
```

Run the profiler with interactive source selection:

```bash
databricks labs lakebridge execute-database-profiler
```

When **redshift** is selected, the CLI prompts for the **Redshift variant**: `serverless`, `provisioned`, or `provisioned_multi_az`.

You can pass the source explicitly where supported:

```bash
databricks labs lakebridge execute-database-profiler --source-tech redshift
```

Execution will:

1. Load `pipeline_config.yml` for the chosen variant.
2. Run the **prepare** steps on the cluster (including creating or updating **`query_view`**).
3. Run the SQL extracts and persist the results into **`profiler_extract.db`** (DuckDB) under the configured extract folder.

## Profiler output and dashboards

:::warning Attention
For **Redshift**, `create-profiler-dashboard` uploads the **`profiler_extract.db`** extract to Unity Catalog Volume storage **only**. It **does not** deploy the Synapse-style Lakeview profiler summary dashboard for any Redshift variant (`serverless`, `provisioned`, or `provisioned_multi_az`). Plan to analyze the DuckDB extract locally or with your own tooling until dashboard support is added.
:::

[Back to Configure Profiler](../#configure-profiler)
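
The extract is a regular DuckDB database file, so any DuckDB client can open it. As a quick local check (a minimal sketch; the table names depend on the variant and pipeline), open it with the DuckDB CLI via `duckdb profiler_extract.db` and run:

```sql
-- List the result tables written by the extract pipeline.
SHOW TABLES;

-- Then inspect any listed table (placeholder name; substitute one
-- returned by SHOW TABLES):
-- SELECT * FROM <table_name> LIMIT 10;
```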