diff --git a/databricks_job_executor/README.md b/databricks_job_executor/README.md
index 3790e64..e84973c 100644
--- a/databricks_job_executor/README.md
+++ b/databricks_job_executor/README.md
@@ -16,7 +16,7 @@ A Streamlit application for executing and monitoring Databricks migration jobs.
 - Python 3.8+
 - Streamlit
 - Databricks workspace access
-- Databricks personal access token
+- Databricks service principal with OAuth M2M credentials (client ID and client secret)
 
 ### Installation
 
@@ -28,14 +28,16 @@ pip install -r requirements.txt
 2. Set environment variables:
 ```bash
 export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
-export DATABRICKS_TOKEN="your-personal-access-token"
+export DATABRICKS_CLIENT_ID="your-client-id"
+export DATABRICKS_CLIENT_SECRET="your-client-secret"
 export DATABRICKS_JOB_ID="123456" # Optional: specific job ID to run
 ```
 
 Or create a `.env` file:
 ```
 DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
-DATABRICKS_TOKEN=your-personal-access-token
+DATABRICKS_CLIENT_ID=your-client-id
+DATABRICKS_CLIENT_SECRET=your-client-secret
 DATABRICKS_JOB_ID=123456
 ```
 
@@ -86,10 +88,15 @@ This application can be deployed to Databricks using Databricks Asset Bundles.
 
 The application requires the following environment variables:
 
-- **DATABRICKS_HOST** (required): Your Databricks workspace URL (e.g., `https://your-workspace.cloud.databricks.com`)
-- **DATABRICKS_TOKEN** (required): Your Databricks personal access token
+- **DATABRICKS_HOST** (required for local): Your Databricks workspace URL (e.g., `https://your-workspace.cloud.databricks.com`)
+- **DATABRICKS_CLIENT_ID** (required for local): Your service principal client ID
+- **DATABRICKS_CLIENT_SECRET** (required for local): Your service principal client secret
 - **DATABRICKS_JOB_ID** (required): The specific job ID to run
 
+**Authentication Methods:**
+- **Local Development**: Uses OAuth M2M (service principal) with `DATABRICKS_CLIENT_ID` and `DATABRICKS_CLIENT_SECRET`
+- **Databricks Runtime**: Automatically uses built-in authentication (no credentials needed)
+
 These credentials are read from environment variables at startup. The connection status is displayed in the sidebar.
 
 ## Usage
@@ -104,7 +111,7 @@ These credentials are read from environment variables at startup. The connection
 
 ## Security Note
 
-Never commit your `DATABRICKS_TOKEN` to version control. Always use environment variables or secure credential management systems.
+Never commit your `DATABRICKS_CLIENT_SECRET` to version control. Always use environment variables or secure credential management systems (e.g., Databricks Secrets).
 
 ### Setting Environment Variables and Secrets on Databricks
 
@@ -126,33 +133,26 @@ When deploying and running the Streamlit app on Databricks, you can configure th
      # MY_CUSTOM_VAR: "value"
    ```
 
-2. **Databricks Widgets (for `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `DATABRICKS_JOB_ID`)**:
-    When you launch a Databricks App, you can pass parameters as widgets. The Streamlit app is configured to read `databricks_host`, `databricks_token`, and `databricks_job_id` from these widgets if they are present.
+2. **Databricks App Configuration**:
+    When deploying to Databricks as an app, authentication is handled automatically using the Databricks runtime's built-in authentication. No explicit credentials (client ID/secret) are needed when running on Databricks.
 
-    To set widgets when launching the app:
-    * Go to your Databricks workspace.
-    * Navigate to "Apps" (or the equivalent section where deployed apps are listed).
-    * Select your deployed app (e.g., `databricks-job-executor-streamlit`).
-    * Click "Launch" or "Run App".
-    * In the launch dialog, you may find options to set parameters. If not directly available, you might need to configure them in the `databricks.yml` or rely on secrets.
-        * `databricks_host`: `https://your-workspace.cloud.databricks.com`
-        * `databricks_token`: `dapixxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx` (your personal access token)
-        * `databricks_job_id`: `123456` (the ID of the job you want to execute)
+    For local development configuration, you can optionally use Databricks Widgets to pass `databricks_host`, `databricks_client_id`, `databricks_client_secret`, and `databricks_job_id` if needed.
 
-3. **Databricks Secrets (for `DATABRICKS_TOKEN`)**:
-    For enhanced security, it is recommended to store your `DATABRICKS_TOKEN` in Databricks Secrets. The application will attempt to retrieve the token from a secret scope if it's not provided via environment variables or widgets.
+3. **Databricks Secrets (for Local Development)**:
+    For enhanced security during local development, you can store your OAuth credentials in Databricks Secrets and retrieve them programmatically.
 
     To set up Databricks Secrets:
     * **Create a Secret Scope**:
         ```bash
-        databricks secrets create-scope --scope databricks-token-scope
+        databricks secrets create-scope --scope oauth-credentials
         ```
         (You might need to configure ACLs for this scope to allow users/groups to read it.)
- * **Put the Secret**: + * **Put the Secrets**: ```bash - databricks secrets put --scope databricks-token-scope --key databricks-token-key + databricks secrets put --scope oauth-credentials --key client-id + databricks secrets put --scope oauth-credentials --key client-secret ``` - When prompted, paste your Databricks personal access token. + When prompted, enter your service principal credentials. - The application will then automatically attempt to retrieve the token using `dbutils.secrets.get("databricks-token-scope", "databricks-token-key")` when running in the Databricks environment. + **Note**: When running on Databricks as an app, the runtime automatically handles authentication, so explicit credential storage is not required. diff --git a/databricks_job_executor/streamlit_app/components/ui/initializers.py b/databricks_job_executor/streamlit_app/components/ui/initializers.py index 37a6854..a39983a 100644 --- a/databricks_job_executor/streamlit_app/components/ui/initializers.py +++ b/databricks_job_executor/streamlit_app/components/ui/initializers.py @@ -24,7 +24,8 @@ def configure_page(bundle_environment: str = 'dev'): def initialize_config_state(db_env: dict): """Initialize configuration state from environment variables or Databricks environment.""" st.session_state.databricks_host = db_env.get('host', '') - st.session_state.databricks_token = db_env.get('token', '') + st.session_state.databricks_client_id = db_env.get('client_id', '') + st.session_state.databricks_client_secret = db_env.get('client_secret', '') st.session_state.bundle_environment = db_env.get('bundle_environment', 'dev') job_id_str = os.getenv('DATABRICKS_JOB_ID') # Still allow .env override diff --git a/databricks_job_executor/streamlit_app/components/ui/renders.py b/databricks_job_executor/streamlit_app/components/ui/renders.py index 321bb80..03c34e8 100644 --- a/databricks_job_executor/streamlit_app/components/ui/renders.py +++ 
b/databricks_job_executor/streamlit_app/components/ui/renders.py @@ -7,59 +7,107 @@ from streamlit_app.utils.databricks_env import validate_connection +def _get_session_config(): + """Extract connection configuration from session state.""" + return { + 'host': st.session_state.get('databricks_host', ''), + 'client_id': st.session_state.get('databricks_client_id', ''), + 'client_secret': st.session_state.get('databricks_client_secret', ''), + 'is_runtime': st.session_state.get('databricks_env', {}).get('is_databricks_runtime', False), + 'job_id': st.session_state.get('databricks_job_id'), + } + + +def _render_connection_status_runtime(job_id): + """Render connection status for Databricks runtime environment.""" + is_valid, error_msg = validate_connection() + if is_valid: + st.success("✅ Connected to Databricks") + st.info("**Environment:** Databricks Runtime") + if job_id: + st.info(f"**Job ID:**\n`{job_id}`") + else: + st.warning("⚠️ No Job ID configured") + else: + st.error("❌ Connection Failed") + st.error(f"**Error:** {error_msg}") + + +def _render_connection_status_local(host, client_id, client_secret, job_id): + """Render connection status for local development environment.""" + if host and client_id and client_secret: + is_valid, error_msg = validate_connection(host, client_id, client_secret) + if is_valid: + st.success("✅ Connected to Databricks") + st.info(f"**Workspace:**\n{host}") + if job_id: + st.info(f"**Job ID:**\n`{job_id}`") + else: + st.warning("⚠️ No Job ID configured") + else: + st.error("❌ Connection Failed") + st.error(f"**Error:** {error_msg}") + else: + st.warning("⚠️ Configuration Missing") + missing = [] + if not host: + missing.append("`DATABRICKS_HOST`") + if not client_id: + missing.append("`DATABRICKS_CLIENT_ID`") + if not client_secret: + missing.append("`DATABRICKS_CLIENT_SECRET`") + if not job_id: + missing.append("`DATABRICKS_JOB_ID`") + st.markdown("Please set the following environment variables:\n- " + "\n- ".join(missing)) + + 
+def _render_about_section(is_runtime): + """Render the About section in the sidebar.""" + st.markdown("### ℹ️ About") + st.markdown(""" + **Data Migration Accelerator** + + This tool helps you: + - Execute the configured migration job + - Monitor job runs and progress + - View job logs and diagnostics + - Cancel running jobs if needed + """) + + if is_runtime: + st.markdown(""" + **Deployed in Databricks Runtime** + - Authentication: Automatic + - Configure `DATABRICKS_JOB_ID` to set default job + """) + else: + st.markdown(""" + **Local Development Configuration:** + - `DATABRICKS_HOST` - Workspace URL + - `DATABRICKS_CLIENT_ID` - Service principal client ID + - `DATABRICKS_CLIENT_SECRET` - Service principal client secret + - `DATABRICKS_JOB_ID` - Job ID to run + """) + + def render_sidebar(): """Render the sidebar with connection status.""" + config = _get_session_config() + with st.sidebar: st.markdown("## ⚙️ Configuration") - st.markdown("### Connection Status") - host = st.session_state.get('databricks_host', '') - token = st.session_state.get('databricks_token', '') - - job_id = st.session_state.get('databricks_job_id') - - if host and token: - is_valid, error_msg = validate_connection(host, token) - if is_valid: - st.success("✅ Connected to Databricks") - st.info(f"**Workspace:**\n{host}") - if job_id: - st.info(f"**Job ID:**\n`{job_id}`") - else: - st.warning("⚠️ No Job ID configured") - else: - st.error("❌ Connection Failed") - st.error(f"**Error:** {error_msg}") + if config['is_runtime']: + _render_connection_status_runtime(config['job_id']) else: - st.warning("⚠️ Configuration Missing") - missing = [] - if not host: - missing.append("`DATABRICKS_HOST`") - if not token: - missing.append("`DATABRICKS_TOKEN`") - if not job_id: - missing.append("`DATABRICKS_JOB_ID`") - st.markdown(f"Please set the following environment variables:\n- " + "\n- ".join(missing)) + _render_connection_status_local( + config['host'], config['client_id'], + config['client_secret'], 
config['job_id'] + ) st.divider() - - st.markdown("### ℹ️ About") - st.markdown(""" - **Data Migration Accelerator** - - This tool helps you: - - Execute the configured migration job - - Monitor job runs and progress - - View job logs and diagnostics - - Cancel running jobs if needed - - **Configuration:** - Set via environment variables: - - `DATABRICKS_HOST` - Workspace URL - - `DATABRICKS_TOKEN` - Access token - - `DATABRICKS_JOB_ID` - Job ID to run - """) + _render_about_section(config['is_runtime']) def render_header(): @@ -82,35 +130,52 @@ def render_header(): """, unsafe_allow_html=True) -def render_main_content(): - """Render the main content area of the application.""" - render_sidebar() - render_header() - - host = st.session_state.get('databricks_host', '') - token = st.session_state.get('databricks_token', '') +def _check_connection_and_render_errors(config) -> bool: + """Check connection and render appropriate error messages. Returns True if connected.""" + if config['is_runtime']: + is_valid, error_msg = validate_connection() + if not is_valid: + st.error("❌ **Connection Failed**") + st.error(f"Unable to connect to Databricks: {error_msg}") + return False + return True - if not host or not token: + if not all([config['host'], config['client_id'], config['client_secret']]): st.error("⚠️ **Configuration Required**") st.markdown(""" - Please set the following environment variables before running the application: + Please set the following environment variables: - - `DATABRICKS_HOST` - Your Databricks workspace URL (e.g., `https://your-workspace.cloud.databricks.com`) - - `DATABRICKS_TOKEN` - Your Databricks personal access token + - `DATABRICKS_HOST` - Your Databricks workspace URL + - `DATABRICKS_CLIENT_ID` - Your service principal client ID + - `DATABRICKS_CLIENT_SECRET` - Your service principal client secret You can set these in your environment or in a `.env` file. 
""") - return + return False - is_valid, error_msg = validate_connection(host, token) + is_valid, error_msg = validate_connection( + config['host'], config['client_id'], config['client_secret'] + ) if not is_valid: - st.error(f"❌ **Connection Failed**") + st.error("❌ **Connection Failed**") st.error(f"Unable to connect to Databricks: {error_msg}") - st.info("Please check your `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables.") + st.info("Please verify your environment variables are correct.") + return False + + return True + + +def render_main_content(): + """Render the main content area of the application.""" + render_sidebar() + render_header() + + config = _get_session_config() + + if not _check_connection_and_render_errors(config): return - job_interface = JobInterface() - job_interface.render() + JobInterface().render() def render_footer(): @@ -122,4 +187,3 @@ def render_footer(): "", unsafe_allow_html=True ) - diff --git a/databricks_job_executor/streamlit_app/utils/databricks_env.py b/databricks_job_executor/streamlit_app/utils/databricks_env.py index 603ccd9..54efae4 100644 --- a/databricks_job_executor/streamlit_app/utils/databricks_env.py +++ b/databricks_job_executor/streamlit_app/utils/databricks_env.py @@ -1,8 +1,12 @@ """ Databricks environment utilities for job executor. + +Provides authentication and connection management for both: +- Local development (service principal with client_id/client_secret) +- Databricks runtime (automatic built-in authentication) """ import os -from typing import Dict, Any, Optional +from typing import Dict, Any, Tuple try: from databricks.sdk.runtime import dbutils @@ -10,79 +14,93 @@ dbutils = None -def initialize_databricks_environment() -> Dict[str, Any]: - """ - Initialize Databricks environment configuration. 
+def is_databricks_runtime() -> bool: + """Check if running in Databricks runtime environment.""" + return dbutils is not None and 'DATABRICKS_RUNTIME_VERSION' in os.environ - Returns: - Dict containing environment configuration - """ + +def initialize_databricks_environment() -> Dict[str, Any]: + """Initialize and return Databricks environment configuration.""" return { 'host': os.getenv('DATABRICKS_HOST', ''), - 'token': os.getenv('DATABRICKS_TOKEN', ''), - 'is_databricks_runtime': _is_databricks_runtime(), + 'client_id': os.getenv('DATABRICKS_CLIENT_ID', ''), + 'client_secret': os.getenv('DATABRICKS_CLIENT_SECRET', ''), + 'is_databricks_runtime': is_databricks_runtime(), 'workspace_id': os.getenv('DATABRICKS_WORKSPACE_ID', ''), 'bundle_environment': os.getenv('DATABRICKS_BUNDLE_ENV', 'dev'), } -def _is_databricks_runtime() -> bool: - """Check if running in Databricks runtime environment.""" - return dbutils is not None and 'DATABRICKS_RUNTIME_VERSION' in os.environ +def _create_runtime_client(): + """Create WorkspaceClient using Databricks runtime's built-in auth.""" + from databricks.sdk import WorkspaceClient + return WorkspaceClient() -def get_databricks_client(host: str, token: str): - """ - Get Databricks client for API calls. +def _create_service_principal_client(host: str, client_id: str, client_secret: str): + """Create WorkspaceClient using service principal OAuth credentials.""" + from databricks.sdk import WorkspaceClient + return WorkspaceClient(host=host, client_id=client_id, client_secret=client_secret) - Args: - host: Databricks workspace URL - token: Access token +def get_databricks_client(host: str = "", client_id: str = "", client_secret: str = ""): + """ + Get Databricks WorkspaceClient with appropriate authentication. + + In Databricks runtime: Uses automatic authentication (no credentials needed). + Locally: Uses service principal OAuth (requires all three parameters). 
+ Returns: - Databricks client instance + WorkspaceClient instance or None if unavailable. """ try: - from databricks.sdk import WorkspaceClient - if _is_databricks_runtime(): - # In Databricks runtime, try to get host and token from widgets or secrets - host = dbutils.widgets.get("databricks_host") if dbutils.widgets.get("databricks_host") else host - token = dbutils.widgets.get("databricks_token") if dbutils.widgets.get("databricks_token") else token - if not token and dbutils.secrets.get("databricks-token-scope", "databricks-token-key"): - token = dbutils.secrets.get("databricks-token-scope", "databricks-token-key") - return WorkspaceClient(host=host, token=token) + if is_databricks_runtime(): + return _create_runtime_client() + + if host and client_id and client_secret: + return _create_service_principal_client(host, client_id, client_secret) + + return None except ImportError: - # Fallback for environments without databricks-sdk return None except Exception as e: - print(f"Error getting Databricks client: {e}") + print(f"Error creating Databricks client: {e}") return None -def validate_connection(host: str, token: str) -> tuple[bool, str]: - """ - Validate Databricks connection. +def _test_connection(client) -> Tuple[bool, str]: + """Test if client can successfully authenticate.""" + try: + client.current_user.me() + return True, "" + except Exception as e: + return False, f"Connection failed: {str(e)}" - Args: - host: Databricks workspace URL - token: Access token +def validate_connection(host: str = "", client_id: str = "", client_secret: str = "") -> Tuple[bool, str]: + """ + Validate Databricks connection. + Returns: - Tuple of (is_valid, error_message) + Tuple of (is_valid, error_message). 
""" - if not host or not token: - return False, "Host and token are required" + if is_databricks_runtime(): + client = get_databricks_client() + if not client: + return False, "Databricks SDK not available" + return _test_connection(client) + + if not all([host, client_id, client_secret]): + return False, "Host, client_id, and client_secret are required" - if not host.startswith('https://'): - return False, "Host must start with https://" - try: - client = get_databricks_client(host, token) - if client: - # Test connection by trying to get current user - client.current_user.me() - return True, "" - else: - return False, "Databricks SDK not available" - except Exception as e: - return False, f"Connection failed: {str(e)}" \ No newline at end of file + + client = get_databricks_client(host, client_id, client_secret) + if not client: + return False, "Failed to create Databricks client" + + return _test_connection(client) + + +# Keep backward compatibility with internal function name +_is_databricks_runtime = is_databricks_runtime \ No newline at end of file diff --git a/databricks_job_executor/streamlit_app/utils/job_manager.py b/databricks_job_executor/streamlit_app/utils/job_manager.py index fe2ce50..808777f 100644 --- a/databricks_job_executor/streamlit_app/utils/job_manager.py +++ b/databricks_job_executor/streamlit_app/utils/job_manager.py @@ -21,12 +21,9 @@ def __init__(self): def _update_client(self): """Update the Databricks client with current configuration.""" host = st.session_state.get('databricks_host', '') - token = st.session_state.get('databricks_token', '') - - if host and token: - self.client = get_databricks_client(host, token) - else: - self.client = None + client_id = st.session_state.get('databricks_client_id', '') + client_secret = st.session_state.get('databricks_client_secret', '') + self.client = get_databricks_client(host, client_id, client_secret) def _ensure_client(self) -> bool: """Ensure client is available and valid.""" @@ -484,21 
+481,25 @@ def _get_cluster_id_from_job_cluster_key(self, run_info, job_cluster_key: str) - # Global job manager instance _job_manager = None _last_host = None -_last_token = None +_last_client_id = None +_last_client_secret = None def get_job_manager() -> JobManager: """Get the global job manager instance, recreating if connection changed.""" - global _job_manager, _last_host, _last_token + global _job_manager, _last_host, _last_client_id, _last_client_secret current_host = st.session_state.get('databricks_host', '') - current_token = st.session_state.get('databricks_token', '') + current_client_id = st.session_state.get('databricks_client_id', '') + current_client_secret = st.session_state.get('databricks_client_secret', '') if (_job_manager is None or _last_host != current_host or - _last_token != current_token): + _last_client_id != current_client_id or + _last_client_secret != current_client_secret): _job_manager = JobManager() _last_host = current_host - _last_token = current_token + _last_client_id = current_client_id + _last_client_secret = current_client_secret return _job_manager \ No newline at end of file diff --git a/env.example b/env.example index 4f80fa0..2d63cc7 100644 --- a/env.example +++ b/env.example @@ -1,27 +1,28 @@ -# Snowflake Connection Credentials +# Migration Accelerator Environment Configuration # Copy this file to .env and fill in your actual credentials +# ============================================================================== +# SNOWFLAKE CONNECTION +# ============================================================================== + # Account identifier (e.g., xy12345.us-east-1 or xy12345) -SNOWFLAKE_ACCOUNT=NQHYCCK-OH54539 +SNOWFLAKE_ACCOUNT=your_account_identifier # User credentials SNOWFLAKE_USER=your_username - -# Password authentication -# SNOWFLAKE_PASSWORD=your_password +SNOWFLAKE_PASSWORD=your_password # Database and schema context (REQUIRED - no defaults) SNOWFLAKE_DATABASE=LVDMS SNOWFLAKE_SCHEMA=LVDMS -# Warehouse 
(optional - only needed if you want to specify a warehouse) -# Leave this commented out if you don't have a warehouse or want to use default -# SNOWFLAKE_WAREHOUSE=COMPUTE_WH +# Warehouse (optional - defaults to COMPUTE_WH) +SNOWFLAKE_WAREHOUSE=COMPUTE_WH -# Role (optional) -SNOWFLAKE_ROLE=your_role +# Role (optional - defaults to SYSADMIN) +SNOWFLAKE_ROLE=SYSADMIN -# Region (optional - for Snowpark, if your account requires explicit region) +# Region (optional - only if your account requires explicit region) # SNOWFLAKE_REGION=us-east-1 # ============================================================================== # DATABRICKS CONNECTION (OAuth M2M - Service Principal) diff --git a/src/artifact_translation_package/config/ddl_config.py b/src/artifact_translation_package/config/ddl_config.py index bbab319..ffd1b63 100644 --- a/src/artifact_translation_package/config/ddl_config.py +++ b/src/artifact_translation_package/config/ddl_config.py @@ -8,7 +8,6 @@ @dataclass class LLMConfig: provider: str - model: str api_key: Optional[str] = None temperature: float = 0.7 max_tokens: Optional[int] = None @@ -22,174 +21,157 @@ def __post_init__(self): class DDLConfig: DEFAULT_CONFIG = { - "environment": LangGraphConfig.ENVIRONMENT.value, - "debug": LangGraphConfig.DDL_DEBUG.value, + "environment": os.getenv("ENVIRONMENT", LangGraphConfig.ENVIRONMENT.value), + "debug": os.getenv("DDL_DEBUG", str(LangGraphConfig.DDL_DEBUG.value)).lower() == "true", "llms": { "smart_router": { "provider": "databricks", - "model": "databricks-llama-4-maverick", - "temperature": 0.1, - "max_tokens": 2000, + "temperature": float(os.getenv("DDL_TEMPERATURE", LangGraphConfig.DDL_TEMPERATURE.value)), + "max_tokens": int(os.getenv("DDL_MAX_TOKENS", LangGraphConfig.DDL_MAX_TOKENS.value)), "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "database_translator": { "provider": "databricks", - "model": 
"databricks-llama-4-maverick", - "temperature": 0.2, - "max_tokens": 4000, + "temperature": float(os.getenv("DDL_TEMPERATURE", LangGraphConfig.DDL_TEMPERATURE.value)), + "max_tokens": int(os.getenv("DDL_MAX_TOKENS", LangGraphConfig.DDL_MAX_TOKENS.value)), "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "schemas_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", - "temperature": 0.2, - "max_tokens": 4000, + "temperature": float(os.getenv("DDL_TEMPERATURE", LangGraphConfig.DDL_TEMPERATURE.value)), + "max_tokens": int(os.getenv("DDL_MAX_TOKENS", LangGraphConfig.DDL_MAX_TOKENS.value)), "additional_params": { "endpoint": LangGraphConfig.DBX_ENDPOINT.value } }, "tables_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "views_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "stages_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "streams_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "pipes_translator": { "provider": "databricks", - "model": 
"databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "roles_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "grants_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "tags_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "comments_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "masking_policies_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "udfs_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "procedures_translator": { "provider": "databricks", - "model": 
"databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "evaluator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.1, "max_tokens": 2000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } }, "external_locations_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { - "endpoint": LangGraphConfig.DBX_ENDPOINT.value + "endpoint": os.getenv("DBX_ENDPOINT", LangGraphConfig.DBX_ENDPOINT.value) } } }, "processing": { - "batch_size": LangGraphConfig.DDL_BATCH_SIZE.value, - "max_concurrent_batches": LangGraphConfig.DDL_MAX_CONCURRENT.value, - "timeout_seconds": LangGraphConfig.DDL_TIMEOUT.value, + "batch_size": int(os.getenv("DDL_BATCH_SIZE", LangGraphConfig.DDL_BATCH_SIZE.value)), + "max_concurrent_batches": int(os.getenv("DDL_MAX_CONCURRENT", LangGraphConfig.DDL_MAX_CONCURRENT.value)), + "timeout_seconds": int(os.getenv("DDL_TIMEOUT", LangGraphConfig.DDL_TIMEOUT.value)), "evaluation_batch_size": 5 # Number of SQL statements per LLM evaluation call }, "output": { - "format": LangGraphConfig.DDL_OUTPUT_FORMAT.value, - "include_metadata": LangGraphConfig.DDL_INCLUDE_METADATA.value, - "compress_output": LangGraphConfig.DDL_COMPRESS_OUTPUT.value, - "base_dir": LangGraphConfig.DDL_OUTPUT_DIR.value, + "format": os.getenv("DDL_OUTPUT_FORMAT", LangGraphConfig.DDL_OUTPUT_FORMAT.value), + "include_metadata": os.getenv("DDL_INCLUDE_METADATA", str(LangGraphConfig.DDL_INCLUDE_METADATA.value)).lower() == "true", + "compress_output": os.getenv("DDL_COMPRESS_OUTPUT", str(LangGraphConfig.DDL_COMPRESS_OUTPUT.value)).lower() == "true", + "base_dir": os.getenv("DDL_OUTPUT_DIR", 
LangGraphConfig.DDL_OUTPUT_DIR.value), "timestamp_format": "%Y%m%d_%H%M%S" }, "validation": { @@ -207,13 +189,13 @@ class DDLConfig: "batch_size": 10 }, "langsmith": { - "tracing": LangGraphConfig.LANGSMITH_TRACING.value, - "project": LangGraphConfig.LANGSMITH_PROJECT.value, + "tracing": os.getenv("LANGSMITH_TRACING", str(LangGraphConfig.LANGSMITH_TRACING.value)).lower() == "true", + "project": os.getenv("LANGSMITH_PROJECT", LangGraphConfig.LANGSMITH_PROJECT.value), "endpoint": None, # Will be loaded from secrets "api_key": None # Will be loaded from secrets }, "lakebase": { - "database": LangGraphConfig.LAKEBASE_DATABASE.value, + "database": os.getenv("LAKEBASE_DATABASE", LangGraphConfig.LAKEBASE_DATABASE.value), "host": None, # Will be loaded from secrets "user": None, # Will be loaded from secrets "password": None # Will be loaded from secrets diff --git a/src/artifact_translation_package/evaluation/model_benchmark.py b/src/artifact_translation_package/evaluation/model_benchmark.py index 48f5d12..991f91f 100644 --- a/src/artifact_translation_package/evaluation/model_benchmark.py +++ b/src/artifact_translation_package/evaluation/model_benchmark.py @@ -54,7 +54,6 @@ def run_translation(self, batches: List[ArtifactBatch], model_config: ModelConfi "llms": { f"{self.artifact_type}_translator": { "provider": "databricks", - "model": model_config.endpoint, "temperature": model_config.temperature, "max_tokens": model_config.max_tokens, "additional_params": { diff --git a/translation_graph/config/ddl_config.py b/translation_graph/config/ddl_config.py index 72cc930..ef8cbef 100644 --- a/translation_graph/config/ddl_config.py +++ b/translation_graph/config/ddl_config.py @@ -13,7 +13,6 @@ @dataclass class LLMConfig: provider: str - model: str api_key: Optional[str] = None temperature: float = 0.7 max_tokens: Optional[int] = None @@ -32,7 +31,6 @@ class DDLConfig: "llms": { "smart_router": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 
0.1, "max_tokens": 2000, "additional_params": { @@ -41,7 +39,6 @@ class DDLConfig: }, "database_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -50,7 +47,6 @@ class DDLConfig: }, "schemas_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -59,7 +55,6 @@ class DDLConfig: }, "tables_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -68,7 +63,6 @@ class DDLConfig: }, "views_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -77,7 +71,6 @@ class DDLConfig: }, "stages_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -86,7 +79,6 @@ class DDLConfig: }, "streams_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -95,7 +87,6 @@ class DDLConfig: }, "pipes_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -104,7 +95,6 @@ class DDLConfig: }, "roles_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -113,7 +103,6 @@ class DDLConfig: }, "grants_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -122,7 +111,6 @@ class DDLConfig: }, "tags_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -131,7 +119,6 @@ class 
DDLConfig: }, "comments_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -140,7 +127,6 @@ class DDLConfig: }, "masking_policies_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -149,7 +135,6 @@ class DDLConfig: }, "udfs_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -158,7 +143,6 @@ class DDLConfig: }, "procedures_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -167,7 +151,6 @@ class DDLConfig: }, "file_formats_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -176,7 +159,6 @@ class DDLConfig: }, "external_locations_translator": { "provider": "databricks", - "model": "databricks-llama-4-maverick", "temperature": 0.2, "max_tokens": 4000, "additional_params": { @@ -185,7 +167,6 @@ class DDLConfig: }, "evaluator": { "provider": "databricks", - "model": "databricks-meta-llama-3-1-8b-instruct", "temperature": 0.1, "max_tokens": 1000, "additional_params": {