
Data architectures need organizing principles that help teams understand where data lives, what quality to expect, and how transformations progress; a catchy name for those principles helps too. Without clear structure, data lakes devolve into swamps where nobody knows which datasets are authoritative, what processing has occurred, or whether data is ready for production use.
The pattern emerged from Databricks and the lakehouse movement, but reflects principles that predate its formalization. Data warehouses always had staging areas, integration layers, and presentation layers. ETL pipelines always had extract, transform, and load phases. The medallion architecture packages these ideas with clear naming, explicit quality expectations, and alignment to modern data lake tooling. The result is a framework that’s both familiar and new.
Bronze, silver, and gold are intuitive labels that communicate data maturity without technical jargon. That's the pitch, and I buy some of it, but it runs up against more descriptive terms: raw, standardized, and consumable, respectively. Technical stakeholders understand that gold data is refined and ready for analytics while bronze data is raw and unprocessed. Engineers understand the transformation flow from ingestion through cleansing to aggregation. The architecture provides a shared vocabulary that bridges technical and business perspectives. My advice would be to use those business-friendly synonyms alongside the more technical medallion terms.
The bronze layer stores data exactly as it arrives from source systems. No cleansing, no transformation, no filtering. If the source system sends malformed JSON, the bronze layer contains malformed JSON. If records have missing fields or invalid values, those imperfections persist. If the table column is named GHY_5, then that's what you have. The bronze layer is the system of record for what was actually received. It is completely source-aligned. You would hope the source system is semantically clear, but we all know that's seldom the case.
This immutability is deliberate and valuable. When transformation logic changes or bugs are discovered in downstream processing, you can reprocess from bronze with corrected logic. Without bronze as an immutable foundation, fixing transformation errors requires re-extracting from sources, which might no longer be possible if sources have changed or retention windows have expired.
The bronze layer often stores data in formats optimized for write performance rather than query performance. Append-only Parquet files, JSON, or even the raw formats from source systems work well. The priority is reliable, fast ingestion rather than analytical query patterns. Bronze data might be partitioned by ingestion date to simplify time-based retention policies.
Schema enforcement is minimal or absent in bronze. The layer accepts whatever arrives without validation. This flexibility prevents source system changes from breaking ingestion pipelines. When a source adds a new field, bronze ingestion continues working without schema updates. Downstream layers handle validation and schema enforcement during transformation to silver.
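To make the bronze principles concrete, here is a minimal sketch of schema-free, date-partitioned ingestion. The function name `ingest_to_bronze`, the source name, and the lake layout are my own illustrative assumptions, not part of any particular platform:

```python
from datetime import datetime, timezone
from pathlib import Path


def ingest_to_bronze(raw_payload: bytes, source: str, lake_root: Path) -> Path:
    """Land a raw payload in the bronze layer exactly as received.

    No parsing, no validation: even malformed JSON is written verbatim,
    so bronze remains a faithful system of record for what arrived.
    """
    now = datetime.now(timezone.utc)
    # Partition by ingestion date, which makes time-based retention trivial:
    # deleting old bronze data is just deleting old partitions.
    target_dir = lake_root / "bronze" / source / f"ingest_date={now.date().isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{now.timestamp()}.json"
    target.write_bytes(raw_payload)  # stored as-is, valid or not
    return target
```

Note that the function never inspects the payload; if the source adds a field or sends broken JSON, ingestion still succeeds, and downstream silver processing deals with it.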
Retention policies for bronze data balance storage costs against reprocessing needs. Some organizations keep bronze data indefinitely because storage is cheap and the ability to reprocess from raw data has proven valuable. Others implement time-based retention, deleting bronze data after it has been successfully processed to silver and retention periods expire. The right policy depends on reprocessing requirements and regulatory obligations.
Change data capture flows naturally into bronze layers. CDC streams from operational databases land in bronze as event logs capturing every insert, update, and delete. The bronze layer preserves the complete event history, enabling replay and temporal analysis. Transforming CDC events into current-state tables happens in silver.
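The bronze-to-silver CDC transformation mentioned above amounts to replaying the event log in order. A minimal sketch, assuming events are dicts with hypothetical `op` (I/U/D), `key`, and `row` fields:

```python
def cdc_to_current_state(events):
    """Collapse an ordered CDC event log (bronze) into a
    current-state table (silver), keyed by primary key."""
    state = {}
    for event in events:  # bronze preserves full history; replay in order
        if event["op"] in ("I", "U"):
            state[event["key"]] = event["row"]  # insert or overwrite
        elif event["op"] == "D":
            state.pop(event["key"], None)       # delete if present
    return state
```

Because bronze keeps every event, this replay can be rerun from scratch at any time, which is exactly the reprocessing guarantee the immutable bronze layer exists to provide.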
The silver layer represents validated, cleaned, and conformed data ready for general analytical use. Data quality issues caught during bronze-to-silver transformation are either corrected or rejected. Schema is enforced consistently. Business rules are applied. The result is data that analysts and data scientists can trust for their work.
Data validation during bronze-to-silver transformation identifies and handles quality issues. Records with missing required fields might be filtered out or sent to error tables for investigation. Format inconsistencies are standardized. Referential integrity is validated where possible. Invalid values are either corrected through imputation or marked as invalid. The transformation logic encodes data quality rules explicitly.
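A minimal sketch of that routing logic, with validation rules reduced to required-field checks for illustration (the `_error` column and function name are my own assumptions):

```python
def validate_to_silver(rows, required_fields):
    """Split bronze rows into silver-ready rows and an error table.

    Rows missing required fields are not silently dropped; they are
    routed to an error table with a reason, for later investigation.
    """
    clean, errors = [], []
    for row in rows:
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            errors.append({**row, "_error": f"missing: {missing}"})
        else:
            clean.append(row)
    return clean, errors
```

Real pipelines add format checks, referential-integrity checks, and imputation, but the shape stays the same: every bronze row ends up in exactly one of the two outputs.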
Schema enforcement happens here rather than in bronze. The silver layer defines expected schemas and validates incoming data against them. Schema evolution is managed through versioning and compatibility rules. Adding fields is straightforward, but removing or renaming fields requires careful handling to maintain compatibility with downstream consumers.
Deduplication removes duplicate records that might exist in source systems or occur through ingestion processes. Duplicate detection logic ranges from simple exact-match deduplication to sophisticated fuzzy matching for cases where slight variations indicate the same entity. The silver layer aims to represent each entity once with the most accurate and complete information available.
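At the simple end of that range, exact-match deduplication can be sketched in a few lines. Here "most complete" is approximated, purely for illustration, as the record with the fewest null fields:

```python
def deduplicate(rows, key):
    """Keep one row per key value, preferring the most complete
    record (here: the row with the fewest None fields)."""
    best = {}
    for row in rows:
        k = row[key]
        completeness = sum(v is not None for v in row.values())
        if k not in best or completeness > best[k][0]:
            best[k] = (completeness, row)
    return [row for _, row in best.values()]
```

Fuzzy entity matching replaces the `key` lookup with similarity scoring, but the goal is unchanged: each entity appears once in silver.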
Data type standardization ensures consistency across sources. Dates are converted to standard formats. Numeric fields are cast to appropriate types. Enumerations are mapped to controlled vocabularies. This standardization makes silver data easier to work with downstream because consumers don’t need to handle source-specific formatting quirks.
Slowly changing dimensions are implemented in silver when historical tracking is required. Type 2 SCD patterns that create new rows for each change typically happen during silver processing. The silver layer becomes the system of record for dimensional data with proper historical tracking, while bronze retains the raw change events.
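The Type 2 pattern can be sketched as follows, assuming hypothetical `valid_from`/`valid_to`/`is_current` tracking columns and an `attrs` dict holding the tracked attributes:

```python
def scd2_apply(dimension, changes, today):
    """Apply Type 2 SCD: close the current row and append a new
    one whenever a key's tracked attributes change."""
    current = {r["key"]: r for r in dimension if r["is_current"]}
    for change in changes:
        old = current.get(change["key"])
        if old and old["attrs"] == change["attrs"]:
            continue  # nothing changed; keep the current row open
        if old:
            old["valid_to"] = today   # close out the previous version
            old["is_current"] = False
        dimension.append({
            "key": change["key"], "attrs": change["attrs"],
            "valid_from": today, "valid_to": None, "is_current": True,
        })
    return dimension
```

Each key thus accumulates a row per historical version, while exactly one row per key stays current.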
Incremental processing dominates silver layer pipelines. Rather than reprocessing all bronze data, incremental jobs process only new or changed data since the last run. This makes pipelines efficient and enables near-real-time silver data freshness. Watermarks, timestamps, or change tracking mechanisms coordinate incremental processing.
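The watermark mechanism reduces to a filter plus a high-water-mark update. A minimal sketch, assuming each bronze row carries a hypothetical `ingested_at` timestamp:

```python
def incremental_silver(bronze_rows, last_watermark):
    """Select only bronze rows newer than the stored watermark and
    return the new watermark to persist for the next run."""
    new_rows = [r for r in bronze_rows if r["ingested_at"] > last_watermark]
    # If nothing new arrived, the watermark stays put.
    new_watermark = max((r["ingested_at"] for r in new_rows),
                        default=last_watermark)
    return new_rows, new_watermark
```

The watermark itself must be persisted transactionally with the silver write, otherwise a crash between the two can cause rows to be skipped or double-processed.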
The gold layer contains data organized for specific business use cases. While silver provides cleaned atomic data, gold presents aggregated, joined, and enriched data optimized for consumption by reports, dashboards, and analytical applications. Different gold tables serve different analytical needs, each structured for its particular use case.
Dimensional models often live in the gold layer. Star schemas with fact and dimension tables enable efficient analytical queries. The dimensional modeling happens during silver-to-gold transformation, joining normalized silver tables into denormalized dimensional structures. Business users interact primarily with gold dimensional models rather than silver tables.
Aggregations pre-compute metrics at various grains to accelerate dashboard and report performance. Daily, weekly, and monthly rollups of key metrics prevent repetitive aggregation of raw data. These aggregates trade storage space for query performance, a worthwhile trade-off for frequently accessed metrics.
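As a toy illustration of a gold rollup, here is a daily grain over a hypothetical order fact (field names are assumptions, not a standard schema):

```python
from collections import defaultdict


def daily_rollup(fact_rows):
    """Pre-aggregate order amounts to one row per day (gold grain),
    so dashboards read a small table instead of scanning silver facts."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[row["order_date"]] += row["amount"]
    return [{"order_date": d, "total_amount": t}
            for d, t in sorted(totals.items())]
```

Weekly and monthly grains are the same pattern with a coarser grouping key; each grain gets its own gold table with one clear row meaning.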
Business metrics calculated through complex logic are materialized in gold rather than computed on-demand. Customer lifetime value, churn predictions, product affinity scores, and similar derived metrics are calculated once in gold and then consumed by multiple downstream applications. This ensures consistent metric definitions and eliminates redundant computation.
Feature tables for machine learning are a specialized form of gold data. These tables contain engineered features at appropriate grains for model training and inference. Maintaining features in gold enables consistent feature definitions between training and production, reducing training-serving skew.
Multiple gold layers serving different business domains can coexist. A sales gold layer serves sales analytics with appropriate metrics and dimensions. A marketing gold layer serves marketing use cases with different aggregations and joins. A finance gold layer presents data structured for financial reporting. This domain-specific organization, typically surfaced through BI tools, aligns data presentation with business structure.
The gold layer is where business logic is most explicit. Transformation logic from silver to gold encodes business rules, metric definitions, and domain knowledge. This logic should be documented, tested, and version-controlled rigorously because it directly affects business decisions based on the data.
Data flows progressively through the medallion architecture, with each layer adding value and quality. The pattern is extract, validate, transform, aggregate. Understanding this flow helps design pipelines that fit naturally into the architecture.
Bronze ingestion focuses on reliable extraction from sources with minimal transformation. The goal is getting data into the data lake quickly and durably. Complex transformation logic doesn’t belong here because it risks ingestion failures that prevent data capture. Simplicity and reliability are paramount.
Silver transformation applies data quality rules, schema validation, and cleaning logic. This is where bad data is identified and either corrected or quarantined. Silver transformation is more complex than bronze ingestion but remains focused on cleaning and conforming rather than business logic. The output is clean atomic data.
Gold transformation applies business logic and structures data for consumption. Joins across silver tables, aggregations, metric calculations, and dimensional modeling happen here. Gold transformation is the most business-focused layer, translating technical data structures into business concepts.
Error handling improves at each layer. Bronze captures everything including errors. Silver identifies errors and routes them to error tables or quarantine areas for investigation. Gold should rarely encounter data quality issues because silver has already validated and cleaned the data. This progressive error handling isolates problems at appropriate layers.
Testing strategies differ by layer. Bronze ingestion testing focuses on reliability and completeness – did all data arrive? Silver transformation testing focuses on data quality – are validation rules working correctly? Gold testing focuses on business logic – are metrics calculated correctly? Each layer has distinct testing priorities.
Some organizations extend the medallion pattern with additional layers beyond bronze, silver, and gold. A platinum layer might contain highly curated datasets for executive dashboards. A presentation layer might provide views optimized for specific BI tools. These extensions make sense when clear use cases justify additional layers.
The risk of too many layers is increased complexity without proportional value. Each additional layer adds processing overhead, storage costs, and cognitive load for understanding data flow. The three-layer medallion architecture strikes a balance that works for most organizations. Additional layers should solve specific problems that the three-layer model doesn’t address.
Semantic layers that provide business-friendly abstractions over gold data might be considered a fourth layer or part of the consumption layer depending on how you categorize. These semantic layers translate technical tables and columns into business terms and provide governed access to data. They enhance rather than replace the medallion layers.
The medallion architecture provides vertical structure from raw to refined data. Data products organized by business domain provide horizontal structure. These two organizational principles complement each other. Each domain can have its own bronze-silver-gold progression for domain-specific data.
A customer domain might have bronze customer event data, silver cleaned customer profiles, and gold customer analytics. A product domain has its own bronze-silver-gold layers for product data. Domains are independent in their data processing but can share data through well-defined interfaces at appropriate layers.
This domain-oriented medallion approach aligns with data mesh principles where domains own their data products. Each domain manages its own medallion layers with appropriate governance, quality standards, and transformation logic. Cross-domain data sharing happens through published interfaces rather than direct table access.
The medallion architecture is technology-agnostic, but implementation patterns have emerged for common technology stacks. Understanding these patterns helps translate the conceptual architecture into working systems.
Delta Lake and similar table formats fit naturally into medallion architecture. Bronze tables use append-only Delta tables for efficient ingestion. Silver tables leverage Delta’s ACID transactions and schema evolution. Gold tables benefit from optimization features like Z-ordering. The transaction log enables time travel across all layers.
Databricks popularized medallion architecture and provides native support through notebooks, Delta Live Tables, and Unity Catalog. The patterns are well-documented and tooling is optimized for medallion workflows. Organizations using Databricks find medallion a natural fit.
Cloud data warehouses like Snowflake and BigQuery implement medallion through database schemas or datasets. Bronze, silver, and gold become separate schemas with appropriate permissions and retention policies. The pattern works but lacks some of the optimizations that lakehouse platforms provide for raw data storage.
dbt has become a standard tool for managing silver-to-gold transformations. Models organized by layer (staging, intermediate, marts) map naturally to silver and gold. The declarative SQL approach and testing framework align well with medallion principles. Many organizations use dbt as their primary transformation tool within medallion architectures.
Object storage with orchestrated processing implements medallion through directory structures and batch jobs. Bronze, silver, and gold become S3 prefixes or Azure blob containers. Spark, Flink, or other processing frameworks handle transformations between layers. This approach requires more manual orchestration but provides maximum flexibility.
The medallion architecture naturally aligns with governance policies because different layers have different access requirements. Bronze data might be restricted to data engineering teams. Silver data is accessible to analysts and data scientists. Gold data can be opened to broader business users. This layered access control matches data maturity with appropriate audience.
Personally identifiable information and sensitive data might be masked or encrypted in gold layers while remaining accessible in silver for authorized users. This progressive privacy approach enables broad gold data access while protecting sensitive information. PII masking becomes part of silver-to-gold transformation logic.
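One common way to implement that masking step, sketched here with a stable hash so gold rows remain joinable without exposing raw identifiers (the function name and truncation length are illustrative choices, and a salted or keyed hash would be preferable in practice):

```python
import hashlib


def mask_pii(row, pii_fields):
    """Replace PII fields with a stable, truncated hash during
    silver-to-gold transformation. Same input, same token, so
    joins and group-bys on the masked column still work."""
    masked = dict(row)
    for field in pii_fields:
        if masked.get(field) is not None:
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()
            masked[field] = digest[:12]
    return masked
```

Authorized users query the unmasked silver tables; everyone else sees only the gold tokens.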
Data quality SLAs differ by layer. Bronze has minimal quality guarantees – it’s raw data as received. Silver commits to validated, cleaned data meeting defined quality standards. Gold commits to business-correct data suitable for decision-making. These differentiated SLAs help users understand what quality to expect from each layer.
Lineage tracking through medallion layers provides transparency about data origins and transformations. Tools like Unity Catalog, Alation, or Collibra can track lineage from bronze sources through silver transformations to gold consumption. This lineage is essential for understanding data provenance and diagnosing issues.
Organizations implementing medallion architecture encounter predictable challenges. Learning from common mistakes accelerates successful implementation.
Skipping bronze is tempting when storage costs are a concern or when source data seems clean enough to go directly to silver. This shortcut eliminates the ability to reprocess from raw data when transformation logic changes or errors are discovered. The storage savings rarely justify losing this flexibility. Always maintain bronze as an immutable source of truth.
Over-transforming in bronze defeats the purpose of having raw data. Resist the temptation to do “light” cleaning during ingestion because you know certain transformations are always needed. Once you start transforming in bronze, the line between bronze and silver blurs and you lose the benefits of immutable raw data.
Under-transforming in silver creates gold layers that must handle data quality issues. Silver should provide clean, validated, trusted data. If gold transformations constantly need to handle missing values, invalid data, or inconsistencies, that logic belongs in silver. Gold should focus on business logic, not data cleaning.
Mixing grains in gold tables creates confusion. A gold table should have a clear, consistent grain – one row per customer, one row per daily sales total, one row per transaction. Mixed grains make tables difficult to use correctly and lead to errors in downstream analysis.
Insufficient documentation about what each layer contains and how transformations work makes medallion architectures difficult to maintain and use. Clear documentation of bronze sources, silver validation rules, and gold business logic is essential. This documentation should live with the data, not in separate documentation systems that fall out of date.
The medallion architecture emerged from batch-oriented data lake patterns but adapts to streaming and real-time requirements. The principles remain valid even as processing moves from batch to stream.
Streaming bronze ingestion writes events to bronze as they arrive rather than in batches. Kafka topics, Kinesis streams, or Pub/Sub can feed bronze tables incrementally. The bronze layer still captures raw data exactly as received but does so continuously rather than periodically.
Incremental silver processing can run at high frequency, processing new bronze data every few minutes or seconds. Delta Live Tables and similar frameworks support continuous processing modes where transformations run constantly rather than on schedule. This enables near-real-time silver data without fundamentally changing the architecture.
Gold aggregations might be incrementally maintained rather than recomputed. Streaming aggregations update running totals as new data arrives. This requires more sophisticated transformation logic but enables real-time dashboards on gold data. The medallion layers remain conceptually the same even as update frequencies increase.
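At its core, incremental maintenance means folding each new event into the existing aggregate rather than recomputing it. A deliberately minimal sketch (real streaming engines add windowing, late-data handling, and exactly-once state):

```python
def update_running_totals(totals, event):
    """Fold one new event into a gold running-total aggregate
    instead of recomputing the aggregate from silver."""
    key = event["metric"]
    totals[key] = totals.get(key, 0) + event["value"]
    return totals
```

The gold table's meaning is unchanged; only its refresh mechanism moves from batch recomputation to per-event updates.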
Lambda architectures that maintain separate batch and streaming paths can map both to medallion. The batch path flows through bronze-silver-gold with appropriate latency. The streaming path might bypass bronze or use a different bronze implementation optimized for low latency. Both paths ultimately populate the same silver and gold tables, unifying batch and streaming results.
The medallion architecture provides clear structure for data lakes through three layers with explicit purposes: bronze for raw data, silver for cleaned data, and gold for business-ready aggregates. This structure is simple enough to understand quickly but sophisticated enough to handle real-world complexity.
The architecture’s value lies in providing shared vocabulary and clear expectations. Everyone understands what bronze, silver, and gold mean. Teams know where to find data at different maturity levels. The progressive refinement from raw to refined aligns naturally with how data teams think about transformation pipelines.
Implementing medallion architecture doesn’t require specific technology. The pattern works with data lakes, lakehouses, cloud warehouses, and various processing frameworks. This technology agnosticism makes medallion broadly applicable rather than tied to specific vendor stacks.
The separation of layers enables independent evolution and optimization. Bronze can optimize for ingestion throughput, silver for transformation efficiency, and gold for query performance. Each layer uses appropriate storage formats, partitioning strategies, and retention policies without compromise.
The medallion architecture isn’t revolutionary; it codifies patterns that effective data teams have used for years. Its power comes from making implicit practices explicit and providing clear naming that facilitates communication. In data architecture, clarity and shared understanding often matter more than novelty.