last-updated	2026-05-03

Data Mesh Architecture

A comprehensive guide to implementing domain-oriented, decentralized data architectures across AWS, Azure, and Google Cloud Platform.

Architecture Overview
Architecture Diagram Description
Component Breakdown
AWS Implementation
Azure Implementation
GCP Implementation
Design Considerations
Data as a Product
Federated Governance
Cost Estimation
Production Checklist

Architecture Overview

What is Data Mesh?

Data mesh is a decentralized data architecture that organizes data by business domain:

Domain Ownership: Each business domain owns and manages its own data
Data as a Product: Data is treated as a product with defined quality, SLAs, and discoverability
Self-Serve Data Platform: Infrastructure provides a platform for domains to publish and consume data
Federated Computational Governance: Governance is automated and embedded, not centralized
Decentralized Architecture: No central data team owns all data pipelines

Four Pillars of Data Mesh

Domain-Oriented Ownership: Data is owned by the teams that understand the business domain
Data as a Product: Each data product has an owner, SLAs, documentation, and quality guarantees
Self-Serve Data Infrastructure: A platform that makes it easy for domains to create and manage data products
Federated Computational Governance: Global policies enforced through automation, not manual review

Benefits

Scalability: Data ownership scales with organizational growth
Domain Expertise: Teams closest to the data manage it
Reduced Bottlenecks: No central data team as a bottleneck
Faster Time to Value: Domains can independently create data products
Quality: Domain experts ensure data accuracy and relevance
Innovation: Teams can experiment without cross-team dependencies

Trade-offs

Organizational Change: Requires significant cultural and organizational shift
Duplication: Some data may be duplicated across domains
Standardization: Balancing domain autonomy with interoperability
Skill Requirements: Each domain needs data engineering capabilities
Cost: Decentralized infrastructure can be more expensive than centralized
Governance Complexity: Federated governance is harder to implement than centralized

Use Cases

Large enterprises with multiple business units and data domains
Organizations where a central data team has become a bottleneck
Companies adopting microservices and domain-driven design
Industries with complex data regulations requiring domain accountability
Organizations needing to scale their data analytics capabilities

When to Avoid Data Mesh

Small organizations with a single data team
Simple data landscapes with few domains
Organizations without mature engineering practices
Teams not ready for cultural shift to domain ownership

Architecture Diagram Description

High-Level Architecture

flowchart TB
  subgraph SD[Sales domain]
    SS[Source systems] --> SI[Ingestion] --> SP[Data product:<br/>orders, revenue]
  end
  subgraph PD[Product domain]
    PS[Source systems] --> PI[Ingestion] --> PP[Data product:<br/>catalog, inventory]
  end
  subgraph CD[Customer domain]
    CS[Source systems] --> CI[Ingestion] --> CP[Data product:<br/>customer 360, segments]
  end
  SP --> CAT[Federated catalog<br/>+ governance]
  PP --> CAT
  CP --> CAT
  CAT --> CON[Cross-domain consumers:<br/>analytics, ML, BI]

+------------------+    +------------------+    +------------------+
|   Sales Domain   |    |  Product Domain  |    | Customer Domain  |
|                  |    |                  |    |                  |
| [Source Systems]  |    | [Source Systems]  |    | [Source Systems]  |
|       |          |    |       |          |    |       |          |
| [Ingestion]      |    | [Ingestion]      |    | [Ingestion]      |
|       |          |    |       |          |    |       |          |
| [Data Product]   |    | [Data Product]   |    | [Data Product]   |
|  - Sales Orders  |    |  - Product Cat.  |    |  - Customer 360  |
|  - Revenue       |    |  - Inventory     |    |  - Segments      |
+--------+---------+    +--------+---------+    +--------+---------+
         |                       |                       |
         +----------+------------+-----------+-----------+
                    |                        |
          [Data Product Catalog]    [Self-Serve Platform]
                    |                        |
          [Federated Governance]    [Monitoring / Quality]

Data Product Structure

Each data product consists of:

Input Ports: Data ingestion from source systems
Transformation: Domain-specific business logic
Output Ports: APIs, datasets, events for consumers
Metadata: Schema, documentation, lineage, quality metrics
Observability: Monitoring, alerting, SLA tracking

Component Breakdown

Platform Components

Component	Purpose	Key Features
Data Catalog	Discover and understand data products	Search, metadata, lineage
Data Storage	Persist domain data	Lakehouse, warehouse, object storage
Data Processing	Transform and prepare data	ETL/ELT, streaming, batch
Data Quality	Ensure data meets standards	Validation, profiling, monitoring
Access Control	Manage who can access data	RBAC, column-level security
Data Sharing	Share data across domains	APIs, subscriptions, marketplace

Data Product Components

Component	Purpose	Key Features
Owner	Accountable person/team	Domain expert, product manager
SLA	Quality and availability guarantees	Freshness, completeness, uptime
Schema	Data structure definition	Versioned, backward compatible
Documentation	Usage instructions and context	Examples, business definitions
Access Policies	Who can consume the data	Request-based, self-service

AWS Implementation

Core Services

Component	AWS Service	Purpose
Data Catalog	AWS Glue Data Catalog	Centralized metadata repository
Data Governance	AWS Lake Formation	Fine-grained access control, data sharing
Object Storage	Amazon S3	Data lake storage for data products
Data Processing	AWS Glue, EMR, Step Functions	ETL/ELT pipelines
Data Warehouse	Amazon Redshift	Analytical queries across data products
Streaming	Amazon Kinesis, MSK	Real-time data product ingestion
Data Sharing	Lake Formation, Redshift Data Sharing	Cross-account data sharing
Data Quality	AWS Glue Data Quality	Automated data quality rules

Multi-Account Strategy

+---------------------------+
|   Management Account      |
|   - AWS Organizations     |
|   - Central Governance    |
+---------------------------+
         |
    +----+----+----+
    |         |         |
+---------+ +---------+ +---------+
| Sales   | | Product | | Customer|
| Account | | Account | | Account |
| - S3    | | - S3    | | - S3    |
| - Glue  | | - Glue  | | - Glue  |
| - Lake  | | - Lake  | | - Lake  |
| Format. | | Format. | | Format. |
+---------+ +---------+ +---------+

Implementation Pattern

Each domain gets its own AWS account via AWS Organizations
Lake Formation manages cross-account data sharing
Glue Data Catalog provides federated metadata
S3 buckets store domain data products
IAM policies enforce access control per data product
EventBridge enables cross-domain event notifications

Documentation:

Azure Implementation

Core Services

Component	Azure Service	Purpose
Data Catalog	Microsoft Purview	Unified data governance and cataloging
Data Governance	Purview Data Map	Data lineage, classification, policies
Object Storage	Azure Data Lake Storage Gen2	Scalable data lake storage
Data Processing	Azure Synapse Analytics, Data Factory	ETL/ELT pipelines
Data Warehouse	Synapse Dedicated SQL Pools	Analytical queries
Streaming	Azure Event Hubs, Stream Analytics	Real-time data ingestion
Data Sharing	Azure Data Share	Cross-domain data sharing
Data Quality	Purview Data Quality	Data profiling and quality rules

Subscription Strategy

Each domain gets its own Azure subscription or resource group
Purview provides cross-subscription data catalog and governance
ADLS Gen2 containers organized by data product
Synapse workspaces per domain for independent processing
Azure Policy enforces tagging and compliance standards
Managed identities for cross-domain access

Key Integration Points

Purview scans data sources across domains automatically
Data Share enables snapshot and in-place sharing
Synapse Link provides near-real-time analytics from operational stores
Event Hubs captures domain events for streaming data products

Documentation:

GCP Implementation

Core Services

Component	GCP Service	Purpose
Data Catalog	Dataplex	Unified data management and governance
Data Governance	Dataplex Data Quality, Catalog	Automated governance
Object Storage	Cloud Storage (GCS)	Data lake storage
Data Processing	Dataflow, Dataproc, Cloud Composer	ETL/ELT pipelines
Data Warehouse	BigQuery	Serverless analytics
Streaming	Pub/Sub, Dataflow	Real-time data ingestion
Data Sharing	BigQuery Analytics Hub	Data exchange marketplace
Data Quality	Dataplex Data Quality Tasks	Automated quality checks

Project-Based Strategy

Each domain gets its own GCP project
Dataplex organizes data into lakes, zones, and assets across projects
BigQuery datasets serve as data product output ports
Analytics Hub enables cross-project data sharing without data movement
Cloud IAM with organization-level policies for federated governance
Pub/Sub topics for domain event publishing

Key Integration Points

Dataplex auto-discovers and catalogs data across projects
BigQuery authorized views and datasets for controlled sharing
Analytics Hub listings for marketplace-style data product discovery
Data Catalog tags for classification and business metadata

Documentation:

Design Considerations

Domain Identification

Align with business capabilities: Domains should map to business units or capabilities (sales, marketing, product, customer)
Bounded contexts: Use domain-driven design (DDD) to identify natural boundaries
Source-aligned domains: Own operational data (orders, transactions)
Consumer-aligned domains: Create aggregated data products (analytics, reporting)
Avoid too many domains: Start with 3-5 domains and expand as the organization matures

Data Product Design

Define clear interfaces: Standardize how data products are consumed (SQL, API, files)
Version schemas: Use semantic versioning for breaking changes
Self-describing: Include metadata, documentation, and sample data
Immutable: Append-only data with full history when possible
SLA-driven: Define freshness, completeness, and availability guarantees

Platform Design

Golden paths: Provide templates and tooling for common data product patterns
Self-service provisioning: Domains should be able to create infrastructure without tickets
Observability built-in: Monitoring, alerting, and quality checks as platform features
Security by default: Encryption, access control, and audit logging as defaults

Data as a Product

Product Thinking for Data

Aspect	Traditional Data	Data as a Product
Ownership	Central data team	Domain team
Quality	Best effort	SLA-backed guarantees
Discovery	Ask someone	Self-serve catalog
Documentation	Wiki pages (maybe)	Co-located with data
Consumers	Known, limited	Self-service, scalable
Evolution	Breaking changes	Versioned, backward compatible

Data Product Quality Dimensions

Freshness: How recently was the data updated?
Completeness: Are there missing values or records?
Accuracy: Does the data reflect reality?
Consistency: Is the data consistent across sources?
Uniqueness: Are there duplicate records?
Timeliness: Is data available when needed?

Data Product Metadata

Every data product should include:

Name and description: Clear, business-friendly naming
Owner and contact: Who to reach for questions
Schema: Field definitions, types, and constraints
SLA: Freshness, availability, and quality guarantees
Lineage: Where the data comes from and how it is transformed
Sample data: Examples for quick understanding
Usage examples: SQL queries, API calls, code snippets
Access request process: How to get access

Federated Governance

Governance Model

                    [Global Policies]
                    - Naming standards
                    - Security baseline
                    - Quality thresholds
                    - Interoperability
                          |
         +----------------+----------------+
         |                |                |
   [Domain: Sales]  [Domain: Product] [Domain: Customer]
   - Domain rules   - Domain rules    - Domain rules
   - Local quality  - Local quality   - Local quality
   - Access mgmt    - Access mgmt    - Access mgmt

Global vs Domain Policies

Policy Type	Global (Federated)	Domain (Local)
Naming	Standard prefixes and conventions	Domain-specific terminology
Security	Encryption requirements, PII handling	Access control for specific products
Quality	Minimum freshness and completeness	Domain-specific validation rules
Interoperability	Standard formats (Parquet, Avro)	Internal transformation logic
Compliance	Regulatory requirements (GDPR, HIPAA)	Domain-specific retention policies

Automation

Policy as Code: Define governance rules in version-controlled code
Automated Quality Checks: Run data quality validations on every update
Schema Validation: Enforce schema standards at ingestion time
Access Auditing: Automatically log and review data access patterns
Compliance Scanning: Detect PII and sensitive data automatically

Cost Estimation

AWS (Monthly Estimate per Domain)

Component	Service	Estimated Cost
Storage	S3 (1 TB)	~$23/month
Catalog	Glue Data Catalog	~$1/month
Processing	Glue Jobs (10 DPU-hours/day)	~$130/month
Governance	Lake Formation	Free (included)
Warehouse	Redshift Serverless (moderate)	~$200-500/month
Quality	Glue Data Quality	~$50-100/month
Per Domain Total		~$400-750/month
Platform Total (5 domains)		~$2,000-3,750/month

Azure (Monthly Estimate per Domain)

Component	Service	Estimated Cost
Storage	ADLS Gen2 (1 TB)	~$21/month
Catalog	Purview	~$250-500/month (shared)
Processing	Data Factory + Synapse	~$200-500/month
Sharing	Data Share	~$50-100/month
Per Domain Total		~$500-1,100/month
Platform Total (5 domains)		~$2,500-5,500/month

GCP (Monthly Estimate per Domain)

Component	Service	Estimated Cost
Storage	GCS (1 TB)	~$20/month
Catalog	Dataplex	~$50-100/month
Processing	Dataflow	~$100-300/month
Warehouse	BigQuery (on-demand, moderate)	~$100-400/month
Sharing	Analytics Hub	Free (query costs only)
Per Domain Total		~$270-820/month
Platform Total (5 domains)		~$1,350-4,100/month

Production Checklist

Organization Readiness

Domain boundaries identified and agreed upon
Domain data product owners assigned
Data mesh principles communicated to stakeholders
Organizational support and executive sponsorship secured
Training plan for domain teams created

Platform Infrastructure

Multi-account/project/subscription strategy implemented
Self-serve provisioning templates available
Data catalog deployed and accessible
Data quality framework operational
Monitoring and alerting configured
CI/CD pipelines for data product deployment

Data Products

Data product template and standards defined
Schema registry or catalog configured
At least one data product per domain published
Documentation standards enforced
SLAs defined and monitored
Data product versioning strategy implemented

Governance

Global governance policies documented
Policy-as-code implemented and enforced
Data classification and PII detection automated
Access control model implemented (RBAC/ABAC)
Compliance requirements mapped to controls
Data lineage tracking operational

Operations

Data quality dashboards available
Incident response process for data issues defined
Cross-domain data sharing tested and validated
Performance benchmarks established
Cost allocation per domain configured
Regular data mesh maturity assessments scheduled

Related Guides:

FilesExpand file tree

data-mesh.md

Latest commit

History

data-mesh.md

File metadata and controls

Data Mesh Architecture

Table of Contents

Architecture Overview

What is Data Mesh?

Four Pillars of Data Mesh

Benefits

Trade-offs

Use Cases

When to Avoid Data Mesh

Architecture Diagram Description

High-Level Architecture

Data Product Structure

Component Breakdown

Platform Components

Data Product Components

AWS Implementation

Core Services

Multi-Account Strategy

Implementation Pattern

Azure Implementation

Core Services

Subscription Strategy

Key Integration Points

GCP Implementation

Core Services

Project-Based Strategy

Key Integration Points

Design Considerations

Domain Identification

Data Product Design

Platform Design

Data as a Product

Product Thinking for Data

Data Product Quality Dimensions

Data Product Metadata

Federated Governance

Governance Model

Global vs Domain Policies

Automation

Cost Estimation

AWS (Monthly Estimate per Domain)

Azure (Monthly Estimate per Domain)

GCP (Monthly Estimate per Domain)

Production Checklist

Organization Readiness

Platform Infrastructure

Data Products

Governance

Operations