Copilot skill for Fabric Lakehouse #739

---
name: fabric-lakehouse
description: 'Provide definition and context about Fabric Lakehouse and its capabilities for software systems and AI-powered features. Help users design, build, and optimize Lakehouse solutions using best practices.'
metadata:
  author: tedvilutis
  version: "1.0"
---

# Fabric Lakehouse

## Core Concepts

### What is a Lakehouse?

A Lakehouse in Microsoft Fabric is an item that gives users a single place to store both tabular data (tables) and non-tabular data (files). It combines the flexibility of a data lake with the management capabilities of a data warehouse. It provides:

- **Unified storage** in OneLake for structured and unstructured data
- **Delta Lake format** for ACID transactions, versioning, and time travel (see the time-travel sketch after this list)
- **SQL analytics endpoint** for T-SQL queries
- **Semantic model** for Power BI integration
- Support for other table formats such as CSV and Parquet
- Support for any file format
- Tools for table optimization and data management
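
Delta's versioning makes point-in-time reads possible. A minimal time-travel sketch, assuming a Delta table named `silver_customers` already exists in the Lakehouse:

```python
# Inspect the table's version history
spark.sql("DESCRIBE HISTORY silver_customers").show()

# Query the table as of a specific version
df_v0 = spark.sql("SELECT * FROM silver_customers VERSION AS OF 0")

# Query the table as of a timestamp
df_old = spark.sql("SELECT * FROM silver_customers TIMESTAMP AS OF '2024-01-01'")
```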

### Key Components

- **Delta Tables**: Managed tables with ACID compliance and schema enforcement
- **Files**: Unstructured/semi-structured data in the Files section (see the layout sketch after this list)
- **SQL Endpoint**: Auto-generated, read-only SQL interface for querying
- **Shortcuts**: Virtual links to external/internal data without copying
- **Fabric Materialized Views**: Pre-computed tables for fast query performance
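
These components map onto a simple folder layout inside the Lakehouse item. An illustrative sketch (item and table names are placeholders):

```
MyLakehouse/
├── Tables/        # Delta tables: visible to Spark, the SQL endpoint, and the semantic model
│   ├── dim_customer/
│   └── fact_orders/
└── Files/         # arbitrary files: accessible from Spark and pipelines
    └── bronze/
        └── raw_orders.csv
```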

### Tabular data in a Lakehouse

Tabular data is stored as tables under the "Tables" folder. The main format for tables in a Lakehouse is Delta. A Lakehouse can also store tabular data in other formats such as CSV or Parquet, but these formats are only available for Spark querying.
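
To make file-based data queryable beyond Spark, a common step is to load it from the Files section and save it as a Delta table. A minimal sketch (paths and table names are illustrative):

```python
# Load a CSV from the Files section (Spark-only access)
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .load("Files/bronze/raw_orders.csv")

# Save as a Delta table so the SQL endpoint and semantic model can query it
raw_df.write.format("delta").mode("overwrite").saveAsTable("bronze_orders")
```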

For faster data reads with the semantic model, enable V-Order optimization on Delta tables. This presorts data in a way that improves query performance for common access patterns.
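
V-Order can be enabled for the Spark session (shown in the Spark section below) or per table via a Delta table property. A sketch; the property name follows the Fabric documentation and should be verified for your runtime:

```sql
%%sql
-- Assumed table property for per-table V-Order writes; verify against your Fabric runtime
ALTER TABLE silver_customers SET TBLPROPERTIES ("delta.parquet.vorder.enabled" = "true")
```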

The Lakehouse item supports lineage, which allows users to track the origin and transformations of data. Lineage information is automatically captured for tables and files in a Lakehouse, showing how data flows from source to destination. This helps with debugging, auditing, and understanding data dependencies.

### Data Factory Integration

Microsoft Fabric includes Data Factory for ETL/ELT orchestration:

- **180+ connectors** for data sources
- **Copy activity** for data movement
- **Dataflow Gen2** for transformations
- **Notebook activity** for Spark processing
- **Scheduling** and triggers (see the on-demand run sketch after this list)
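
Besides schedules and triggers, a pipeline can be started programmatically. A sketch using the Fabric REST API's on-demand job endpoint; the endpoint shape, IDs, and token handling here are assumptions to verify against the current API reference:

```python
import requests

workspace_id = "<workspace-guid>"     # placeholder
pipeline_id = "<pipeline-item-guid>"  # placeholder
token = "<aad-access-token>"          # placeholder: acquire via Azure AD

# Assumed on-demand job endpoint for pipeline items
resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
    f"/items/{pipeline_id}/jobs/instances?jobType=Pipeline",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()  # a 202 response indicates the run was queued
```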

### Pipeline Activities

| Activity | Description |
|----------|-------------|
| Copy Data | Move data between sources and the Lakehouse |
| Notebook | Execute Spark notebooks |
| Dataflow | Run Dataflow Gen2 transformations |
| Stored Procedure | Execute SQL procedures |
| ForEach | Loop over items |
| If Condition | Conditional branching |
| Get Metadata | Retrieve file/folder metadata |
| Lakehouse Maintenance | Optimize and vacuum Delta tables |

### Orchestration Patterns

```
Pipeline: Daily_ETL_Pipeline
├── Get Metadata (check for new files)
├── ForEach (process each file)
│   ├── Copy Data (bronze layer)
│   └── Notebook (silver transformation)
├── Notebook (gold aggregation)
└── Lakehouse Maintenance (optimize tables)
```

---

### Spark Configuration (Best Practices)

```python
# Enable Fabric optimizations
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
```

### Reading Data

```python
# Read CSV file
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/bronze/data.csv")

# Read JSON file
df = spark.read.format("json").load("Files/bronze/data.json")

# Read Parquet file
df = spark.read.format("parquet").load("Files/bronze/data.parquet")

# Read Delta table
df = spark.read.table("my_delta_table")

# Query a table with Spark SQL
df = spark.sql("SELECT * FROM lakehouse.my_table")
```

### Writing Delta Tables

```python
# Write DataFrame as managed Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("silver_customers")

# Write with partitioning
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .saveAsTable("silver_transactions")

# Append to existing table
df.write.format("delta") \
    .mode("append") \
    .saveAsTable("silver_events")
```

### Delta Table Operations (CRUD)

```python
# UPDATE
spark.sql("""
    UPDATE silver_customers
    SET status = 'active'
    WHERE last_login > '2024-01-01'
""")

# DELETE
spark.sql("""
    DELETE FROM silver_customers
    WHERE is_deleted = true
""")

# MERGE (upsert)
spark.sql("""
    MERGE INTO silver_customers AS target
    USING staging_customers AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```
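
The same upsert can be written with the Delta Lake Python API instead of SQL. A sketch, assuming `staging_df` is a DataFrame holding the incoming rows:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver_customers")

(target.alias("t")
    .merge(staging_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```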

### Schema Definition

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DecimalType

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("amount", DecimalType(18, 2), True),
    StructField("created_at", TimestampType(), True)
])

df = spark.read.format("csv") \
    .schema(schema) \
    .option("header", "true") \
    .load("Files/bronze/customers.csv")
```

### SQL Magic in Notebooks

```sql
%%sql
-- Query Delta table directly
SELECT
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as total_amount
FROM gold_orders
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10
```

### V-Order Optimization

```python
# Enable V-Order for read optimization
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
```

### Table Optimization

```sql
%%sql
-- Optimize table (compact small files)
OPTIMIZE silver_transactions

-- Optimize with Z-ordering on query columns
OPTIMIZE silver_transactions ZORDER BY (customer_id, transaction_date)

-- Vacuum old files (default 7-day retention)
VACUUM silver_transactions

-- Vacuum with custom retention (168 hours = 7 days)
VACUUM silver_transactions RETAIN 168 HOURS
```
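
The same maintenance operations are available through the Delta Lake Python API, which can be easier to parameterize in a notebook:

```python
from delta.tables import DeltaTable

table = DeltaTable.forName(spark, "silver_transactions")

# Compact small files
table.optimize().executeCompaction()

# Or compact with Z-ordering on query columns
table.optimize().executeZOrderBy("customer_id", "transaction_date")

# Remove files outside the retention window (hours)
table.vacuum(168)
```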

### Incremental Load Pattern

```python
from pyspark.sql.functions import col

# Get last processed watermark
last_watermark = spark.sql("""
    SELECT MAX(processed_timestamp) as watermark
    FROM silver_orders
""").collect()[0]["watermark"]

# On the first run the target table is empty, so fall back to an early date
if last_watermark is None:
    last_watermark = "1900-01-01"

# Load only new records
new_records = spark.read.table("bronze_orders") \
    .filter(col("created_at") > last_watermark)

# Merge new records
new_records.createOrReplaceTempView("staging_orders")
spark.sql("""
    MERGE INTO silver_orders AS target
    USING staging_orders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

### SCD Type 2 Pattern

```python
# Close existing records
spark.sql("""
    UPDATE dim_customer
    SET is_current = false, end_date = current_timestamp()
    WHERE customer_id IN (SELECT customer_id FROM staging_customer)
      AND is_current = true
""")

# Insert new versions
spark.sql("""
    INSERT INTO dim_customer
    SELECT
        customer_id,
        name,
        email,
        address,
        current_timestamp() as start_date,
        null as end_date,
        true as is_current
    FROM staging_customer
""")
```
The new fabric-lakehouse skill should be added to the docs/README.skills.md skills index table for discoverability. According to repository conventions, new skills need to be documented in the skills index.