
add AWS Glue → Databricks PySpark transpiler#2395

Open
dmux wants to merge 3 commits into databrickslabs:main from dmux:feat/glue-to-databricks-transpiler

Conversation

@dmux dmux commented Apr 27, 2026

Changes

What does this PR do?

Introduces a libcst-based AST transpiler that converts AWS Glue PySpark
scripts to Databricks-native PySpark code, accessible via
--transpiler-config-path glue in the CLI.

The transpiler rewrites Glue-specific patterns statically — no runtime Spark
session required — and emits structured WARNING-severity TranspileError
entries for constructs that need manual review, rather than failing silently.

Relevant implementation details

  • transpiler/glue/glue_transformer.py — GlueTransformer / _GlueVisitor
    (CSTTransformer): handles import rewriting, context setup collapse,
    getResolvedOptions, catalog reads/writes (S3 + JDBC), ApplyMapping.apply,
    and Job boilerplate removal.
    _BindingCollector runs a read-only first pass so all variable→role mappings
    (SparkContext, GlueContext, SparkSession, Job) are fully resolved
    before transformation begins, avoiding ordering-dependent bugs.
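The two-pass design can be sketched as follows. This is a minimal, hypothetical illustration using the stdlib ast module rather than libcst (the PR uses libcst because it preserves comments and whitespace); the class name, role labels, and recognised constructors are illustrative, not the PR's actual implementation:

```python
import ast

class BindingCollector(ast.NodeVisitor):
    """Read-only first pass: map variable names to their Glue 'role'."""

    # Illustrative constructor-to-role table; the real transformer tracks
    # SparkContext, GlueContext, SparkSession, and Job bindings.
    ROLE_CALLS = {"SparkContext": "spark_context", "GlueContext": "glue_context"}

    def __init__(self):
        self.roles = {}

    def visit_Assign(self, node):
        # Record assignments like `sc = SparkContext()` wherever they appear,
        # so the later transform pass never depends on statement order.
        if isinstance(node.value, ast.Call) and isinstance(node.value.func, ast.Name):
            role = self.ROLE_CALLS.get(node.value.func.id)
            if role and len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
                self.roles[node.targets[0].id] = role
        self.generic_visit(node)

source = """
sc = SparkContext()
glueContext = GlueContext(sc)
"""
collector = BindingCollector()
collector.visit(ast.parse(source))
print(collector.roles)  # {'sc': 'spark_context', 'glueContext': 'glue_context'}
```

The transform pass then consults `collector.roles` instead of re-deriving bindings on the fly, which is what makes the rewrite order-independent.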

  • transpiler/glue/glue_engine.py — GlueEngine implements the
    TranspileEngine contract; validates generated code with ast.parse() and
    converts transformer warnings into TranspileError(WARNING) entries in
    TranspileResult.
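The validation step amounts to a try/except around ast.parse. A minimal sketch (the function name and warning format are assumptions, not the PR's code):

```python
import ast

def validate_output(code: str) -> list[str]:
    """Parse generated code and surface syntax errors as warning strings
    instead of raising, mirroring the WARNING-not-failure behaviour above."""
    warnings = []
    try:
        ast.parse(code)
    except SyntaxError as exc:
        warnings.append(
            f"generated code is not valid Python: {exc.msg} (line {exc.lineno})"
        )
    return warnings

print(validate_output("x = spark.read.table('t')"))  # []
print(validate_output("def broken(:"))               # one warning entry
```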

  • cli.py — GlueEngine registered in _BUILTIN_ENGINES["glue"],
    bypassing the LSP config-file requirement so the engine works without a
    config file on disk.

  • pyproject.toml — libcst>=1.4.0,<2 added as a runtime dependency.

Transformations supported:

| Glue construct | Databricks output |
| --- | --- |
| `awsglue.*` imports | Removed; `pyspark` equivalents injected after the last existing import |
| `SparkContext` / `GlueContext` setup | Collapsed to `SparkSession.builder.getOrCreate()` |
| `getResolvedOptions` | `argparse` (default) or `dbutils.widgets` via the `args-style` option |
| `create_dynamic_frame.from_catalog` | `spark.read.table(...)` with optional Unity Catalog prefix |
| `create_dynamic_frame.from_options` (S3) | `spark.read.format(...).load(...)` with `s3a://`/`s3n://` normalisation |
| `create_dynamic_frame.from_options` (JDBC) | `spark.read.format('jdbc').option(...).load()` |
| `write_dynamic_frame.from_options` (S3) | `df.write.format(...).partitionBy(...).save(...)` |
| `write_dynamic_frame.from_options` (JDBC) | `df.write.format('jdbc').option(...).save()` |
| `ApplyMapping.apply` | `withColumnRenamed` + `withColumn(col(...).cast(...))` chains |
| `Job.init` / `Job.commit` | Removed |
| `toDF()` / `DynamicFrame.fromDF()` | Unwrapped to the inner DataFrame expression |
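The ApplyMapping row can be illustrated with a small sketch. Glue's mappings are (source, source_type, target, target_type) tuples; a name change becomes a `withColumnRenamed` step and a type change becomes a `col(...).cast(...)` step. The function and variable names here are illustrative, not the PR's implementation, and target_type is assumed to already be a Spark SQL type (the Glue-to-Spark type mapping is a separate concern):

```python
def rewrite_apply_mapping(df_expr: str, mappings) -> str:
    """Emit the rewritten chained-call expression for ApplyMapping.apply."""
    expr = df_expr
    for source, source_type, target, target_type in mappings:
        if source != target:
            # Name changed: rename first so the cast targets the new name.
            expr = f'{expr}.withColumnRenamed("{source}", "{target}")'
        if source_type != target_type:
            expr = f'{expr}.withColumn("{target}", col("{target}").cast("{target_type}"))'
    return expr

print(rewrite_apply_mapping("df", [("id", "string", "user_id", "bigint")]))
# df.withColumnRenamed("id", "user_id").withColumn("user_id", col("user_id").cast("bigint"))
```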

Engine options (via transpiler_options in config):

  • catalog — prepends Unity Catalog prefix to all table references (catalog.database.table)
  • args-style — "argparse" (default) or "dbutils" for dbutils.widgets.text() blocks
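To make the two args-style modes concrete, here is a hypothetical sketch of the code a `getResolvedOptions(sys.argv, [...])` call could be replaced with under each mode. The emitter function and generated variable names are illustrative assumptions, not the PR's actual output:

```python
def emit_args_block(option_names, style="argparse"):
    """Generate the replacement block for getResolvedOptions (sketch)."""
    if style == "argparse":
        lines = ["import argparse", "parser = argparse.ArgumentParser()"]
        lines += [f'parser.add_argument("--{name}")' for name in option_names]
        lines.append("args = vars(parser.parse_args())")
    else:  # "dbutils": notebook-style widgets instead of CLI arguments
        lines = [f'dbutils.widgets.text("{name}", "")' for name in option_names]
        lines.append(
            "args = {"
            + ", ".join(f'"{n}": dbutils.widgets.get("{n}")' for n in option_names)
            + "}"
        )
    return "\n".join(lines)

print(emit_args_block(["source_db"], style="dbutils"))
```

Either way the script keeps reading `args["source_db"]` unchanged, which is why downstream code needs no edits.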

Type mapping (Glue → Spark SQL): long→bigint, short→smallint,
byte→tinyint, bool→boolean, char/varchar→string, decimal(p,s) preserved as-is.

Caveats/things to watch out for when reviewing:

  • from_catalog and from_options calls with dynamic arguments
    (e.g. database=args["source_db"]) cannot be statically resolved; the call
    is kept verbatim and a WARNING is emitted directing the user to manual
    conversion.

  • 14 Glue transforms (ResolveChoice, DropNullFields, FillMissingValues,
    Relationalize, RenameField, SelectFromCollection, SplitFields,
    SplitRows, Unbox, and others) emit a WARNING and are left unchanged.
    Only ApplyMapping.apply is fully rewritten.

  • The transpiler only handles .py files; is_supported_file rejects
    notebooks and SQL files.

  • The engine is stateless between files (catalog and args-style are set once in
    initialize), which matches the TranspileEngine contract.

Linked issues

Resolves #..

Functionality

  • added new CLI capability: databricks labs lakebridge transpile --transpiler-config-path glue --source-dialect glue
  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs lakebridge ...

Tests

  • added unit tests
  • added integration tests
  • manually tested

tests/unit/transpiler/test_glue_transformer.py — 40 inline tests covering
all rewrite paths: import removal, context collapse, getResolvedOptions
(argparse + dbutils modes), S3/JDBC reads and writes, ApplyMapping with
rename/cast/type-map, job boilerplate removal, unsupported transform warnings,
comment and whitespace preservation, s3a:// and s3n:// path normalisation,
and the full _map_glue_type parametrised suite (17 type pairs).

tests/unit/transpiler/test_glue_engine.py — 15 direct tests (engine
contract, catalog/args-style option propagation, ast.parse output
validation via monkeypatch) + 14 parametrised fixture tests.

tests/resources/functional/glue/ — 15 input/expected fixture pairs across
8 categories: args, boilerplate, context, e2e, imports, reads,
transforms, writes.

Introduces a libcst-based AST transpiler that converts AWS Glue PySpark
scripts to Databricks-native PySpark code.

Transformations supported:
- awsglue.* import removal with pyspark equivalent injection
- SparkContext/GlueContext bootstrap → SparkSession.builder.getOrCreate()
- getResolvedOptions → argparse (default) or dbutils.widgets via args-style option
- create_dynamic_frame.from_catalog → spark.read.table(), with optional
  Unity Catalog 3-level namespace via the catalog option
- create_dynamic_frame.from_options → spark.read.format().load() for S3
  and spark.read.format('jdbc').option(...).load() for JDBC
- write_dynamic_frame.from_options → df.write.format().save() for S3/JDBC
- ApplyMapping.apply → withColumnRenamed/withColumn/col().cast() chains
- Job.init / Job.commit boilerplate removal
- 14 unsupported transforms emit actionable warnings for manual review

Architecture: two-pass CST (BindingCollector read-only pass resolves all
variable roles before GlueVisitor transforms), ensuring correctness regardless
of statement order. Generated code is validated with ast.parse(); syntax errors
in output are reported as warnings rather than silent failures.

GlueEngine registered as the built-in "glue" engine; invoke via
--transpiler-config-path glue.
@dmux dmux requested a review from a team as a code owner April 27, 2026 22:51
@sundarshankar89
Collaborator

Thanks for your PR, currently pyspark is not an intended target we would like to support.

cc: @gueniai and @asnare

@dmux
Author

dmux commented May 1, 2026

> Thanks for your PR, currently pyspark is not an intended target we would like to support.
>
> cc: @gueniai and @asnare

My main goal with this PR was to streamline and accelerate migrations from AWS Glue to Databricks, as PySpark support is a key factor for teams moving between these ecosystems.

I completely understand the decision regarding the roadmap for lakebridge. That said, are there any other projects within Databricks Labs where PySpark contributions or migration-focused tooling are currently a priority? I’d be happy to redirect my efforts there.

