add AWS Glue → Databricks PySpark transpiler#2395
dmux wants to merge 3 commits into
Introduces a libcst-based AST transpiler that converts AWS Glue PySpark
scripts to Databricks-native PySpark code.
Transformations supported:
- awsglue.* import removal with pyspark equivalent injection
- SparkContext/GlueContext bootstrap → SparkSession.builder.getOrCreate()
- getResolvedOptions → argparse (default) or dbutils.widgets via args-style option
- create_dynamic_frame.from_catalog → spark.read.table(), with optional
Unity Catalog 3-level namespace via the catalog option
- create_dynamic_frame.from_options → spark.read.format().load() for S3
and spark.read.format('jdbc').option(...).load() for JDBC
- write_dynamic_frame.from_options → df.write.format().save() for S3/JDBC
- ApplyMapping.apply → withColumnRenamed/withColumn/col().cast() chains
- Job.init / Job.commit boilerplate removal
- 14 unsupported transforms emit actionable warnings for manual review
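As a quick illustration, here is a hypothetical input/output pair (invented for this description, not taken from the test fixtures) showing the shape of these rewrites. Both snippets are checked with `ast.parse()`, mirroring the validation step the engine itself performs:

```python
import ast

# Hypothetical Glue input; database/table names are invented.
glue_src = """
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
df = glueContext.create_dynamic_frame.from_catalog(
    database="sales", table_name="orders")
job.commit()
"""

# Roughly the shape of output the rules above would produce
# (argparse mode, no Unity Catalog prefix configured).
databricks_src = """
import argparse
import sys
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--JOB_NAME")
args = vars(parser.parse_args())

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("sales.orders")
"""

# Same syntax check the engine applies to generated code.
ast.parse(glue_src)
ast.parse(databricks_src)
print("both snippets are valid Python")
```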
Architecture: two-pass CST (BindingCollector read-only pass resolves all
variable roles before GlueVisitor transforms), ensuring correctness regardless
of statement order. Generated code is validated with ast.parse(); syntax errors
in output are reported as warnings rather than silent failures.
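The two-pass idea can be sketched with the stdlib `ast` module (the real implementation uses libcst; the class and role names here are illustrative only):

```python
import ast

class BindingCollector(ast.NodeVisitor):
    """Pass 1 (read-only): resolve variable -> role before any rewriting."""
    ROLES = {"SparkContext": "spark_context",
             "GlueContext": "glue_context",
             "Job": "job"}

    def __init__(self):
        self.roles = {}

    def visit_Assign(self, node):
        call = node.value
        if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
            role = self.ROLES.get(call.func.id)
            if role:
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        self.roles[target.id] = role
        self.generic_visit(node)

# Statement order does not matter: the collector sees the whole module
# before pass 2 (the transformer) rewrites anything.
src = "job = Job(gc)\nsc = SparkContext()\ngc = GlueContext(sc)\n"
collector = BindingCollector()
collector.visit(ast.parse(src))
print(collector.roles)
# {'job': 'job', 'sc': 'spark_context', 'gc': 'glue_context'}
```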
GlueEngine registered as the built-in "glue" engine; invoke via
--transpiler-config-path glue.
Author
My main goal with this PR was to streamline and accelerate migrations from AWS Glue to Databricks, as PySpark support is a key factor for teams moving between these ecosystems. I completely understand the decision regarding the roadmap for lakebridge. That said, are there any other projects within Databricks Labs where PySpark contributions or migration-focused tooling are currently a priority? I’d be happy to redirect my efforts there.
What does this PR do?
Introduces a libcst-based AST transpiler that converts AWS Glue PySpark scripts to Databricks-native PySpark code, accessible via `--transpiler-config-path glue` in the CLI. The transpiler rewrites Glue-specific patterns statically (no runtime Spark session required) and emits structured WARNING-severity `TranspileError` entries for constructs that need manual review, rather than failing silently.
Relevant implementation details
- `transpiler/glue/glue_transformer.py`: `GlueTransformer` / `_GlueVisitor` (`CSTTransformer`) handles import rewriting, context-setup collapse, `getResolvedOptions`, catalog reads/writes (S3 + JDBC), `ApplyMapping.apply`, and `Job` boilerplate removal. `_BindingCollector` runs a read-only first pass so all variable→role mappings (`SparkContext`, `GlueContext`, `SparkSession`, `Job`) are fully resolved before transformation begins, avoiding ordering-dependent bugs.
- `transpiler/glue/glue_engine.py`: `GlueEngine` implements the `TranspileEngine` contract; validates generated code with `ast.parse()` and converts transformer warnings into `TranspileError(WARNING)` entries in `TranspileResult`.
- `cli.py`: `GlueEngine` registered in `_BUILTIN_ENGINES["glue"]`, bypassing the LSP config-file requirement so the engine works without a config file on disk.
- `pyproject.toml`: `libcst>=1.4.0,<2` added as a runtime dependency.

Transformations supported:
- `awsglue.*` imports removed, `pyspark` equivalents injected after the last existing import
- `SparkContext`/`GlueContext` setup → `SparkSession.builder.getOrCreate()`
- `getResolvedOptions` → `argparse` (default) or `dbutils.widgets` via the `args-style` option
- `create_dynamic_frame.from_catalog` → `spark.read.table(...)` with optional Unity Catalog prefix
- `create_dynamic_frame.from_options` (S3) → `spark.read.format(...).load(...)` with `s3a://`/`s3n://` normalisation
- `create_dynamic_frame.from_options` (JDBC) → `spark.read.format('jdbc').option(...).load()`
- `write_dynamic_frame.from_options` (S3) → `df.write.format(...).partitionBy(...).save(...)`
- `write_dynamic_frame.from_options` (JDBC) → `df.write.format('jdbc').option(...).save()`
- `ApplyMapping.apply` → `withColumnRenamed` + `withColumn(col(...).cast(...))` chains
- `Job.init`/`Job.commit` and `toDF()`/`DynamicFrame.fromDF()` boilerplate removed

Engine options (via `transpiler_options` in config):

- `catalog`: prepends a Unity Catalog prefix to all table references (`catalog.database.table`)
- `args-style`: `"argparse"` (default) or `"dbutils"` for `dbutils.widgets.text()` blocks
Type mapping (Glue → Spark SQL): `long` → `bigint`, `short` → `smallint`, `byte` → `tinyint`, `bool` → `boolean`, `char`/`varchar` → `string`; `decimal(p,s)` is preserved as-is.

Caveats / things to watch out for when reviewing:

- `from_catalog` and `from_options` calls with dynamic arguments (e.g. `database=args["source_db"]`) cannot be statically resolved; the call is kept verbatim and a `WARNING` is emitted directing the user to manual conversion.
- 14 Glue transforms (`ResolveChoice`, `DropNullFields`, `FillMissingValues`, `Relationalize`, `RenameField`, `SelectFromCollection`, `SplitFields`, `SplitRows`, `Unbox`, and others) emit a `WARNING` and are left unchanged. Only `ApplyMapping.apply` is fully rewritten.
- The transpiler only handles `.py` files; `is_supported_file` rejects notebooks and SQL files.
The engine is stateless between files (`catalog` and `args-style` are set once in `initialize`), which matches the `TranspileEngine` contract.

Linked issues
Resolves #..
Functionality
- `databricks labs lakebridge transpile --transpiler-config-path glue --source-dialect glue`
- `databricks labs lakebridge ...`
- `tests/unit/transpiler/test_glue_transformer.py`: 40 inline tests covering all rewrite paths: import removal, context collapse, `getResolvedOptions` (argparse + dbutils modes), S3/JDBC reads and writes, `ApplyMapping` with rename/cast/type-map, job boilerplate removal, unsupported-transform warnings, comment and whitespace preservation, `s3a://` and `s3n://` path normalisation, and the full `_map_glue_type` parametrised suite (17 type pairs).
- `tests/unit/transpiler/test_glue_engine.py`: 15 direct tests (engine contract, `catalog`/`args-style` option propagation, `ast.parse` output validation via monkeypatch) + 14 parametrised fixture tests.
- `tests/resources/functional/glue/`: 15 input/expected fixture pairs across 8 categories: `args`, `boilerplate`, `context`, `e2e`, `imports`, `reads`, `transforms`, `writes`.