Semantic Models

The word semantic is rather heavily used (often incorrectly) when discussing data models. The word semantic itself is an adjective relating to meaning in language or logic. When we think about its use in data models, we think of the meta-data associated with a particular entity or table. If you are building databases, you are likely already using a semantic model as the meaning is baked into the table names, relationships, descriptions, constraints, cardinality etc.

Here is a really simple example:

CREATE TABLE person (
  id INT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  age INT CHECK (age >= 0),
  country_id INT REFERENCES countries(id)
)

Here we have created a semantic model with multiple dimensions:

  • Structure (columns, types)
  • Constraints (NOT NULL, CHECK, uniqueness)
  • Relationships (foreign keys)
  • Identity (primary keys)
  • Business rules (age can’t be negative)

So, what’s more to know? Well, if we are using just DDL then we are somewhat limited by how much semantic richness we can embedd into our models. Data Contracts get us a step further as we can embedd further dimensions into the yaml contract definitions, as described here:

The best way to think about semantic models is to abstract away from the technical space that answers questions like “what type is this field?” to adding business meaning, by answering questions like “what does this actually mean to our business?” or “how do we calculate quarterly revenue?”. This is where semantic models come in. A semantic model extends basic metadata by adding layers of business meaning, calculation logic, and contextual understanding. It’s metadata that knows not just the structure of your data, but the significance of it. Here is an example using Snowflake Semantic Views:

First we have the usual database DDL:

-- Raw data tables
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  product_id INT,
  order_amount DECIMAL(10,2),
  order_date DATE,
  status VARCHAR(20)
);

CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  customer_name VARCHAR(100),
  country VARCHAR(50),
  segment VARCHAR(20)
);

CREATE TABLE products (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  category VARCHAR(50)
);

That we then augment with a rich semantic layers.

name: "Sales Analytics"
description: "Semantic model for sales and revenue analysis"

tables:
  - name: orders
    description: "Customer orders and transactions"
    base_table: 
      database: sales_db
      schema: public
      table: orders
    dimensions:
      - name: order_date
        synonyms: ["date", "order day", "purchase date"]
        description: "Date when the order was placed"
        data_type: date
      - name: status
        synonyms: ["order status", "state"]
        description: "Current status of the order"
        data_type: varchar
    
  - name: customers
    description: "Customer information"
    base_table:
      database: sales_db
      schema: public
      table: customers
    dimensions:
      - name: customer_name
        synonyms: ["customer", "client name"]
        description: "Name of the customer"
        data_type: varchar
      - name: country
        synonyms: ["location", "region"]
        description: "Customer's country"
        data_type: varchar
      - name: segment
        synonyms: ["customer segment", "customer type"]
        description: "Customer segment classification"
        data_type: varchar

  - name: products
    description: "Product catalog"
    base_table:
      database: sales_db
      schema: public
      table: products
    dimensions:
      - name: category
        synonyms: ["product category", "product type"]
        description: "Product category"
        data_type: varchar

# Define relationships
relationships:
  - name: orders_to_customers
    left_table: orders
    left_column: customer_id
    right_table: customers
    right_column: customer_id
    relationship_type: many_to_one
    
  - name: orders_to_products
    left_table: orders
    left_column: product_id
    right_table: products
    right_column: product_id
    relationship_type: many_to_one

# Define business metrics
metrics:
  - name: revenue
    description: "Total revenue from completed orders"
    synonyms: ["sales", "total sales", "income"]
    definition: SUM(order_amount)
    filters:
      - status = 'completed'
    
  - name: order_count
    description: "Number of orders"
    synonyms: ["number of orders", "order volume"]
    definition: COUNT(order_id)
    
  - name: average_order_value
    description: "Average value per order"
    synonyms: ["AOV", "avg order size"]
    definition: AVG(order_amount)
    filters:
      - status = 'completed'

# Time grains for aggregation
time_dimensions:
  - name: order_date
    time_grains: [day, week, month, quarter, year]

What’s really powerful is that we are now able to introduce other dimensions like synonymns. You sales department may refer to a field as “Order Data” whilst procurement may refer to it as “Purchase Date”. The semantic model caters for this. In addition, we can now encode business-friendly metrics users can immediately recognise, such as “Average value per order” which is defined as the AVG(order_amount) where status = ‘completed’. The semantic model means business users get immediate vaule from the models without having to learn any SQL. Data models are the digital footprint of the business and it’s vital the business can use them.

And all of this is important for the reasons described. But where it becomes vital is how this semantic layer can inform AI. Without semantics or meaning, AI will hallucinate and guess the true meaning of the data model, and be confident and convincing when it does. But without meaning, what insight would any method gain from a column of numbers named MT_SN. Is it Market Sum, Maintenance Snap? With semantics we give the AI the information it needs and we now have meaning embedded, rather than guesswork and pretence.

Discover more from Where Data Engineering Meets Business Strategy

Subscribe now to keep reading and get access to the full archive.

Continue reading