
add AWS Glue → Databricks PySpark transpiler#2395

Open
dmux wants to merge 3 commits into databrickslabs:main from dmux:feat/glue-to-databricks-transpiler

Conversation

@dmux dmux commented Apr 27, 2026

Changes

What does this PR do?

Introduces a libcst-based AST transpiler that converts AWS Glue PySpark
scripts to Databricks-native PySpark code, accessible via
--transpiler-config-path glue in the CLI.

The transpiler rewrites Glue-specific patterns statically — no runtime Spark
session required — and emits structured WARNING-severity TranspileError
entries for constructs that need manual review, rather than failing silently.

Relevant implementation details

  • transpiler/glue/glue_transformer.py — GlueTransformer / _GlueVisitor
    (CSTTransformer): handles import rewriting, context setup collapse,
    getResolvedOptions, catalog reads/writes (S3 + JDBC), ApplyMapping.apply,
    and Job boilerplate removal.
    _BindingCollector runs a read-only first pass so all variable→role mappings
    (SparkContext, GlueContext, SparkSession, Job) are fully resolved
    before transformation begins, avoiding ordering-dependent bugs.
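The two-pass design can be sketched as follows. This is a minimal, hypothetical illustration using the stdlib ast module rather than libcst (the PR uses libcst because it preserves comments and whitespace); the class name, role labels, and recognised constructors are illustrative, not the PR's actual implementation:

```python
import ast

class BindingCollector(ast.NodeVisitor):
    """Read-only first pass: map variable names to their Glue 'role'."""

    # Illustrative constructor-to-role table; the real transformer tracks
    # SparkContext, GlueContext, SparkSession, and Job bindings.
    ROLE_CALLS = {"SparkContext": "spark_context", "GlueContext": "glue_context"}

    def __init__(self):
        self.roles = {}

    def visit_Assign(self, node):
        # Record assignments like `sc = SparkContext()` wherever they appear,
        # so the later transform pass never depends on statement order.
        if isinstance(node.value, ast.Call) and isinstance(node.value.func, ast.Name):
            role = self.ROLE_CALLS.get(node.value.func.id)
            if role and len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
                self.roles[node.targets[0].id] = role
        self.generic_visit(node)

source = """
sc = SparkContext()
glueContext = GlueContext(sc)
"""
collector = BindingCollector()
collector.visit(ast.parse(source))
print(collector.roles)  # {'sc': 'spark_context', 'glueContext': 'glue_context'}
```

The transform pass then consults `collector.roles` instead of re-deriving bindings on the fly, which is what makes the rewrite order-independent.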

  • transpiler/glue/glue_engine.py — GlueEngine implements the
    TranspileEngine contract; validates generated code with ast.parse() and
    converts transformer warnings into TranspileError(WARNING) entries in
    TranspileResult.
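The validation step amounts to a try/except around ast.parse. A minimal sketch (the function name and warning format are assumptions, not the PR's code):

```python
import ast

def validate_output(code: str) -> list[str]:
    """Parse generated code and surface syntax errors as warning strings
    instead of raising, mirroring the WARNING-not-failure behaviour above."""
    warnings = []
    try:
        ast.parse(code)
    except SyntaxError as exc:
        warnings.append(
            f"generated code is not valid Python: {exc.msg} (line {exc.lineno})"
        )
    return warnings

print(validate_output("x = spark.read.table('t')"))  # []
print(validate_output("def broken(:"))               # one warning entry
```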

  • cli.py — GlueEngine registered in _BUILTIN_ENGINES["glue"],
    bypassing the LSP config-file requirement so the engine works without a
    config file on disk.

  • pyproject.toml — libcst>=1.4.0,<2 added as a runtime dependency.

Transformations supported:

| Glue construct | Databricks output |
| --- | --- |
| `awsglue.*` imports | Removed; `pyspark` equivalents injected after the last existing import |
| `SparkContext` / `GlueContext` setup | Collapsed to `SparkSession.builder.getOrCreate()` |
| `getResolvedOptions` | `argparse` (default) or `dbutils.widgets` via the `args-style` option |
| `create_dynamic_frame.from_catalog` | `spark.read.table(...)` with optional Unity Catalog prefix |
| `create_dynamic_frame.from_options` (S3) | `spark.read.format(...).load(...)` with `s3a://`/`s3n://` normalisation |
| `create_dynamic_frame.from_options` (JDBC) | `spark.read.format('jdbc').option(...).load()` |
| `write_dynamic_frame.from_options` (S3) | `df.write.format(...).partitionBy(...).save(...)` |
| `write_dynamic_frame.from_options` (JDBC) | `df.write.format('jdbc').option(...).save()` |
| `ApplyMapping.apply` | `withColumnRenamed` + `withColumn(col(...).cast(...))` chains |
| `Job.init` / `Job.commit` | Removed |
| `toDF()` / `DynamicFrame.fromDF()` | Unwrapped to the inner DataFrame expression |
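The ApplyMapping row can be illustrated with a small sketch. Glue's mappings are (source, source_type, target, target_type) tuples; a name change becomes a `withColumnRenamed` step and a type change becomes a `col(...).cast(...)` step. The function and variable names here are illustrative, not the PR's implementation, and target_type is assumed to already be a Spark SQL type (the Glue-to-Spark type mapping is a separate concern):

```python
def rewrite_apply_mapping(df_expr: str, mappings) -> str:
    """Emit the rewritten chained-call expression for ApplyMapping.apply."""
    expr = df_expr
    for source, source_type, target, target_type in mappings:
        if source != target:
            # Name changed: rename first so the cast targets the new name.
            expr = f'{expr}.withColumnRenamed("{source}", "{target}")'
        if source_type != target_type:
            expr = f'{expr}.withColumn("{target}", col("{target}").cast("{target_type}"))'
    return expr

print(rewrite_apply_mapping("df", [("id", "string", "user_id", "bigint")]))
# df.withColumnRenamed("id", "user_id").withColumn("user_id", col("user_id").cast("bigint"))
```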

Engine options (via transpiler_options in config):

  • catalog — prepends Unity Catalog prefix to all table references (catalog.database.table)
  • args-style — "argparse" (default) or "dbutils" for dbutils.widgets.text() blocks
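To make the two args-style modes concrete, here is a hypothetical sketch of the code a `getResolvedOptions(sys.argv, [...])` call could be replaced with under each mode. The emitter function and generated variable names are illustrative assumptions, not the PR's actual output:

```python
def emit_args_block(option_names, style="argparse"):
    """Generate the replacement block for getResolvedOptions (sketch)."""
    if style == "argparse":
        lines = ["import argparse", "parser = argparse.ArgumentParser()"]
        lines += [f'parser.add_argument("--{name}")' for name in option_names]
        lines.append("args = vars(parser.parse_args())")
    else:  # "dbutils": notebook-style widgets instead of CLI arguments
        lines = [f'dbutils.widgets.text("{name}", "")' for name in option_names]
        lines.append(
            "args = {"
            + ", ".join(f'"{n}": dbutils.widgets.get("{n}")' for n in option_names)
            + "}"
        )
    return "\n".join(lines)

print(emit_args_block(["source_db"], style="dbutils"))
```

Either way the script keeps reading `args["source_db"]` unchanged, which is why downstream code needs no edits.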

Type mapping (Glue → Spark SQL): long→bigint, short→smallint,
byte→tinyint, bool→boolean, char/varchar→string, decimal(p,s) preserved as-is.

Caveats/things to watch out for when reviewing:

  • from_catalog and from_options calls with dynamic arguments
    (e.g. database=args["source_db"]) cannot be statically resolved; the call
    is kept verbatim and a WARNING is emitted directing the user to manual
    conversion.

  • 14 Glue transforms (ResolveChoice, DropNullFields, FillMissingValues,
    Relationalize, RenameField, SelectFromCollection, SplitFields,
    SplitRows, Unbox, and others) emit a WARNING and are left unchanged.
    Only ApplyMapping.apply is fully rewritten.

  • The transpiler only handles .py files; is_supported_file rejects
    notebooks and SQL files.

  • The engine is stateless between files (catalog and args-style are set once in
    initialize), which matches the TranspileEngine contract.

Linked issues

Resolves #..

Functionality

  • added new CLI capability: databricks labs lakebridge transpile --transpiler-config-path glue --source-dialect glue
  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs lakebridge ...

Tests

  • added unit tests
  • added integration tests
  • manually tested

tests/unit/transpiler/test_glue_transformer.py — 40 inline tests covering
all rewrite paths: import removal, context collapse, getResolvedOptions
(argparse + dbutils modes), S3/JDBC reads and writes, ApplyMapping with
rename/cast/type-map, job boilerplate removal, unsupported transform warnings,
comment and whitespace preservation, s3a:// and s3n:// path normalisation,
and the full _map_glue_type parametrised suite (17 type pairs).

tests/unit/transpiler/test_glue_engine.py — 15 direct tests (engine
contract, catalog/args-style option propagation, ast.parse output
validation via monkeypatch) + 14 parametrised fixture tests.

tests/resources/functional/glue/ — 15 input/expected fixture pairs across
8 categories: args, boilerplate, context, e2e, imports, reads,
transforms, writes.

Introduces a libcst-based AST transpiler that converts AWS Glue PySpark
scripts to Databricks-native PySpark code.

Transformations supported:
- awsglue.* import removal with pyspark equivalent injection
- SparkContext/GlueContext bootstrap → SparkSession.builder.getOrCreate()
- getResolvedOptions → argparse (default) or dbutils.widgets via args-style option
- create_dynamic_frame.from_catalog → spark.read.table(), with optional
  Unity Catalog 3-level namespace via the catalog option
- create_dynamic_frame.from_options → spark.read.format().load() for S3
  and spark.read.format('jdbc').option(...).load() for JDBC
- write_dynamic_frame.from_options → df.write.format().save() for S3/JDBC
- ApplyMapping.apply → withColumnRenamed/withColumn/col().cast() chains
- Job.init / Job.commit boilerplate removal
- 14 unsupported transforms emit actionable warnings for manual review

Architecture: two-pass CST (BindingCollector read-only pass resolves all
variable roles before GlueVisitor transforms), ensuring correctness regardless
of statement order. Generated code is validated with ast.parse(); syntax errors
in output are reported as warnings rather than silent failures.

GlueEngine registered as the built-in "glue" engine; invoke via
--transpiler-config-path glue.
@dmux dmux requested a review from a team as a code owner April 27, 2026 22:51
@sundarshankar89
Collaborator

Thanks for your PR, currently pyspark is not an intended target we would like to support.

cc: @gueniai and @asnare

@dmux
Author

dmux commented May 1, 2026

> Thanks for your PR, currently pyspark is not an intended target we would like to support.
>
> cc: @gueniai and @asnare

My main goal with this PR was to streamline and accelerate migrations from AWS Glue to Databricks, as PySpark support is a key factor for teams moving between these ecosystems.

I completely understand the decision regarding the roadmap for lakebridge. That said, are there any other projects within Databricks Labs where PySpark contributions or migration-focused tooling are currently a priority? I’d be happy to redirect my efforts there.

