Purpose of Models

Semantic models provide a flexible analytical layer on top of your data warehouse, aiming to create a digital twin of your business processes and concepts. They allow you to define business logic, metrics, and relationships in a centralized, governed place. This practice, often referred to as “shifting left”, involves moving analytical logic out of downstream BI tools or ad-hoc query generation (like that performed by AI agents) and into this upstream modeling layer. By doing so, you ensure logic is versioned, testable, and consistently applied across all uses.

Instead of embedding complex logic in individual queries or reports, or worse, allowing an AI to make assumptions about raw data (where infinite interpretations exist), you define key business concepts like “total revenue” or “active user” once in the model. A key benefit is that if a metric’s definition changes, you update it in one place (the model), and that change automatically propagates to all analyses and tools using the model. This ensures consistency and reliability, dramatically reducing the risk of inconsistent answers or technical debt. For AI agents like Buster, this pre-defined logic is crucial; it allows the AI to leverage your curated business understanding, delivering trustworthy results instead of potentially flawed interpretations based on ambiguous raw data.

Model Structure

Models represent business entities or concepts, typically mapping to a table or view in your database. Each model is defined with a name and can optionally include descriptions, dimensions, measures, metrics, filters, and entities (relationships).

name: orders                # Required - model identifier
description: "..."          # Required - description
# At least one of dimensions, measures, or metrics are required.
dimensions:                 # Optional - Dimension definitions
  - name: order_id
    type: string
    ...
measures:                   # Optional - Measure definitions
  - name: order_amount
    type: number
    agg: sum
    ... 
metrics:                    # Optional - Metric definitions
  - name: average_order_value
    expr: "..."
    ...
filters:                    # Optional - Filter definitions
  - name: last_30_days
    expr: "..."
    ...
relationships:              # Optional - Relationship definitions
  - name: customer
    source_col: customer_id
    ref_col: customer_id
    ...

Required Fields

  • name: A unique identifier for the model, typically matching the table name in your database.
  • description: A clear explanation of what the model represents. This field is mandatory to ensure clarity and context for all users and downstream tools like AI agents.
  • At least one of dimensions, measures, or metrics are required.

Optional Fields

Technically, the attributes below are optional. You can create a model with just the name and description fields, or include any combination of dimensions, measures, metrics, filters, and relationships as needed.

Modeling Approaches

While the structure above defines how to build a model, it’s helpful to consider what you are modeling. There are a few common approaches:

1. Entity-Focused Models

These models represent core business objects or concepts. They often map directly to dimension tables in a traditional data warehouse.

  • Examples: customers, products, employees, stores, invoices, aircraft.
  • Characteristics: Primarily consist of descriptive dimensions (customer name, product category, employee department, store location, invoice status). May also include some measures directly related to the entity (e.g., credit_limit for a customer) or summary/rollup metrics (e.g., total_invoiced_amount on an invoice model).
  • Filters: Often include filters based on the entity’s attributes (e.g., is_premium_customer, product_is_discontinued).
# Example: Entity-Focused Model (Based on Quickstart users.yml)
# models/users.yml
name: users
description: >
  Contains comprehensive information about registered users, including their unique identifiers,
  contact details, registration timestamp, geographical location, and lifetime purchase value.
  This model serves as a central hub for understanding user demographics and engagement.
dimensions:
  - name: id # Note: Quickstart SQL uses user_id, but YML uses id. Relationship refers to 'id' as source_col.
    description: >
      A unique alphanumeric string identifying each user. Essential for joining with other user-related data.
    type: string
    primary_key: true # Assuming this is the primary key
  - name: email
    description: The primary email address associated with the user account, used for communication and login.
    type: string
  - name: user_created_at
    description: >
      The precise date and time (UTC) when the user first registered their account. Useful for cohort analysis and understanding user tenure. Indexed for efficient searching.
    type: timestamp
    searchable: true
  - name: country
    description: >
      The country where the user is located, typically derived from their registration information or IP address. Stored as a two-letter ISO 3166-1 alpha-2 country code (e.g., 'US', 'GB', 'CA').
    type: string
metrics: # Note: Quickstart SQL defines lifetime_value as a column. YML defines it as a metric. Following YML structure.
  - name: lifetime_value
    description: >
      The cumulative total amount spent by the user across all purchases, measured in USD. Represents the total monetary value brought in by the user.
    type: number # Aggregation (e.g., sum) might be needed depending on source data.
  - name: average_order_value_over_50
    description: The average value of orders placed by the user where the order total exceeds $50.
    expr: "AVG(CASE WHEN orders.order_total > 50 THEN orders.order_total ELSE NULL END)" # Assumes relationship 'orders' exists
relationships:
  - name: orders
    source_col: id  # Corresponds to the 'id' dimension in this 'users' model
    ref_col: user_id     # Corresponds to the 'user_id' dimension in the related 'orders' model
    cardinality: one-to-many # One user can have multiple orders
    description: >
      Links this user to their corresponding order events recorded in the 'orders' model. Allows analysis of user purchasing behavior.

# Example: Entity-Focused Model (customers.yml)
# models/customers.yml
name: customers
description: >
  Represents individual customer entities. Contains contact information, registration details,
  location, and derived metrics like lifetime value. This is a central model for
  understanding customer attributes and behavior segments. Essential for joining customer
  data with transactions, support interactions, etc.
dimensions:
  - name: id
    description: >
      The unique identifier for a customer, typically corresponds to the primary key
      in the source customer table. Essential for joining with event models like orders.
    type: string
  - name: name
    description: The full name of the customer. Used for display and personalization.
    type: string
  - name: email
    description: >
      The customer's primary email address. Used for communication, login identification,
      and potentially linking across different systems. Assumed to be unique per customer.
    type: string
  - name: signup_year
    description: >
      The year the customer first registered or created their account. Useful for cohort
      analysis based on tenure (e.g., comparing behavior of 2022 vs 2023 signups).
    type: integer
  - name: country
    description: >
      The country where the customer is located, typically derived from registration
      or geo-IP lookup. Often stored as ISO 3166-1 alpha-2 code (e.g., 'US', 'GB'). Used for
      geographical segmentation and analysis.
    type: string
  - name: user_created_at
    description: >
      The precise date and time (UTC recommended) when the customer account was created.
      Provides granular detail for time-based cohorting or analyzing initial user activity.
    type: timestamp
    searchable: true
metrics:
  - name: lifetime_value
    description: >
      The total cumulative revenue generated from this customer across all their orders.
      Represents the historical monetary value of the customer. Calculation assumes an underlying
      aggregation or pre-calculation in the source data, or relies on a join with the 'orders' model.
    expr: "SUM(orders.amount)" # Needs relationship 'orders'
  - name: average_order_value_over_50
    description: >
      The average value of orders placed by this customer where the individual order amount
      exceeded $50. Requires joining with the 'orders' model via the 'orders' relationship.
    expr: "AVG(CASE WHEN orders.amount > 50 THEN orders.amount ELSE NULL END)" # Needs relationship 'orders'
relationships:
  - name: orders # Link back TO orders model
    source_col: id           # Key in customers (current model)
    ref_col: customer_id    # Key in orders (related model)
    description: >
      Links this customer to all their associated order records in the 'orders' model.
      Enables analysis of purchasing history and behavior for each customer.
    cardinality: one-to-many # One customer can have many orders
    type: LEFT

2. Event-Focused Models

These models represent occurrences or interactions that happen over time. They often map to fact tables or event stream data.

  • Examples: page_views, orders, transactions, login_events, support_tickets, security_incidents.
  • Characteristics: Almost always include a timestamp dimension. They contain dimensions describing the event (e.g., URL for page view, order status, transaction type, user agent for login). They typically contain measures quantifying the event (e.g., order_amount, view_duration_seconds, transaction_value). Summary metrics related to the event itself might be included.
  • Relationships: Event models usually have relationships (relationships) linking back to multiple Entity-Focused models (e.g., an orders event links to customers, products, and employees).
  • Filters: Common filters relate to time (last_7_days, this_quarter) or event attributes (successful_logins, high_priority_tickets).
# Example: Event-Focused Model (Based on Quickstart orders.sql)
# models/orders.yml
name: orders
description: >
  Represents individual customer order transactions. Contains details about each order,
  including identifier, customer linkage, timing, status, and financial value. This is
  an event-based model crucial for sales analysis, revenue tracking, and understanding
  purchase patterns.
dimensions:
  - name: order_id
    description: >
      The unique identifier for each order transaction. Serves as the primary key for
      this model. Used for looking up specific orders or counting unique orders.
    type: string
  - name: customer_id # Changed from user_id for consistency
    description: >
      Foreign key referencing the 'id' in the 'customers' model. Links the order
      back to the specific customer who placed it. Essential for customer-centric analysis.
    type: string
  - name: order_date # Changed from order_timestamp
    description: >
      The date the order was placed. Used for time-series analysis, filtering by date ranges
      (e.g., daily, weekly, monthly revenue), and cohort analysis based on order time.
    type: date
    searchable: true
  - name: status
    description: >
      The current fulfillment or payment status of the order (e.g., 'pending', 'shipped',
      'delivered', 'cancelled'). Used for operational reporting and filtering orders by state.
    type: string
    options: ["pending", "shipped", "delivered", "cancelled"]
  - name: ordered_within_30_days_of_signup # From quickstart logic
    description: >
      A boolean flag indicating if this order was placed within 30 days of the associated
      customer's signup date (`customers.user_created_at`). Useful for analyzing early
      customer purchasing behavior. Requires join to 'customers' model to calculate.
    type: boolean
measures:
  - name: amount # Changed from order_total
    description: >
      The total monetary value of the order, typically in a specific currency (e.g., USD).
      This is the primary value used for revenue calculations (SUM, AVG).
    type: number
  - name: item_count # Added measure
    description: The total number of distinct items included in the order. Used for analyzing order size.
    type: integer
filters:
  - name: orders_last_30_days
    description: Selects orders placed within the last 30 calendar days.
    expr: "orders.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)"
  - name: first_month_orders
    description: >
      Selects orders placed within 30 days of the associated customer's signup date.
      Relies on the pre-calculated 'ordered_within_30_days_of_signup' dimension.
    expr: "orders.ordered_within_30_days_of_signup"
relationships:
  - name: customer # Renamed from 'users'
    source_col: customer_id # Key in orders (current)
    ref_col: id           # Key in customers (related)
    description: >
      Links this order back to the specific customer who placed it, using the 'customers' model.
      Enables retrieving customer details (e.g., name, email, country) for order analysis.
    cardinality: many-to-one
    type: LEFT
  # - name: products # Example linking to a products model
  #   source_col: product_id # Assuming a product_id field in orders
  #   ref_col: id # Assuming id is PK in products
  #   description: Links the order line to the specific product.
  #   cardinality: many-to-one 
  #   type: LEFT

3. Metric-Focused Models

These models center around a core business metric or Key Performance Indicator (KPI). They often correspond to fact tables in a star or snowflake schema and serve as the central point for analyzing a specific metric across various business dimensions.

  • Examples: monthly_revenue, daily_user_activity, campaign_performance_facts.
  • Characteristics: Primarily contain measures that contribute to the core metric(s). The dimensions are typically foreign keys (e.g., date_id, customer_id, product_id, region_id) that link to separate Entity-Focused models. They might also include degenerate dimensions (attributes directly on the fact record).
  • Metrics: The main purpose is often to define one or more significant metrics (e.g., total_revenue, active_users, conversion_rate) calculated from the base measures.
  • Relationships: Heavily rely on entities to connect the foreign key dimensions back to their respective dimension/entity models.
  • Filters: Often include time-based filters (last_quarter, year_to_date) or filters related to the metric calculation itself.
# Example: Metric-Focused Model (Monthly Revenue Fact)
name: monthly_revenue
description: Central fact model for analyzing monthly revenue across different business dimensions.
dimensions:
  # Foreign Keys linking to dimension/entity models
  - name: month_start_date # Links to a date dimension model
    type: date
  - name: region_id # Links to a regions model
    type: string
  - name: customer_segment_id # Links to a customer_segments model
    type: string
  - name: sales_rep_id # Links to a sales_reps model
    type: string
  # You might also have direct measures if the metric uses them
measures:
  - name: revenue_amount
    description: The base revenue amount for a given record.
    type: number
    agg: sum
metrics:
  - name: total_revenue
    description: The total calculated revenue for the selected dimensions.
    expr: "SUM(revenue_amount)" # Metric derived from the measure
filters:
  - name: revenue_last_quarter
    description: Revenue generated in the previous full quarter.
    # Example expr assumes a date dimension model provides quarter info
    expr: "month_start_date >= DATE_TRUNC('QUARTER', DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)) AND month_start_date < DATE_TRUNC('QUARTER', CURRENT_DATE())"
relationships:
  # Define relationships to the dimension/entity models
  - name: date_dimension
    source_col: date_id # Assuming date_id in the date dimension model
    ref_col: month_start_date # Field in this model
    # ... other relationship details
  - name: regions
    source_col: region_id # Key in regions model
    ref_col: region_id # Key in this model
    # ... other relationship details
  - name: customer_segments
    source_col: segment_id # Key in segments model
    ref_col: customer_segment_id # Key in this model
    # ... other relationship details
  - name: sales_reps
    source_col: rep_id # Key in reps model
    ref_col: sales_rep_id # Key in this model
    # ... other relationship details

Best Practices & Combining Approaches:

  • Start with Entities and Events: Most semantic layers are built primarily around Entity-Focused and Event-Focused models, as these directly represent the core nouns and verbs of your business.
  • Use Relationships: Define relationships clearly to connect your models. This is how Buster understands how events relate to entities (e.g., which customer placed which order).
  • Mix and Match: Real-world data often requires a combination. An orders model (Event) will link to customers and products (Entities).
  • Keep Models Focused: Aim for each model to represent a single, well-defined concept (one entity or one type of event).
  • Metric-Focused for Performance: Consider Metric-Focused models if you have very large datasets and need pre-aggregation for performance, or if a specific, central metric links many otherwise disparate entities.

Organizing Your Model Files

Your model definitions must be structured with each model residing in its own separate YAML file. Buster will discover and load models from all .yml files within the specified model paths in your buster.yml.

This approach, organizing models into separate files, often grouped into subdirectories based on business domain (e.g., models/finance/, models/marketing/, models/product/), is essential for improving discoverability and maintainability as your semantic layer grows.

# Example: Models split by domain (Required Structure)
# models/marketing/campaigns.yml
name: marketing_campaigns
  # ... definition ...

# models/sales/orders.yml
name: orders
    # ... definition ...

# models/product/users.yml
name: users
  # ... definition ...

Naming Conventions

  • Use snake_case for model names (e.g., order_items, not OrderItems or order-items).
  • Choose names that reflect the business entity or concept, not necessarily the exact technical table name.
  • Be consistent with naming patterns across your semantic layer.

Best Practices

Clear Descriptions

Always include descriptive text that explains what the model represents in business terms, even if brief.

Start Simple

Begin with a core model definition (just the name) and add dimensions, measures, etc., incrementally as analysis requires.

Align with Data Source

Generally, each model should align conceptually with a database table or view, but it’s not a strict requirement.

Logical Grouping

Use separate files and potentially subdirectories to group related models logically (e.g., by business domain like finance, marketing).

Example

Here’s a more detailed example of a page_views model, demonstrating various optional components:

# models/web_analytics/page_views.yml
name: page_views
description: >
  Captures each instance of a user viewing a page on the website. This event-based model
  is fundamental for web analytics, tracking user engagement, content popularity, and
  navigation paths. Each record represents a single page load event.
dimensions:
    - name: view_id
      description: >
        A unique identifier assigned to each individual page view event. Serves as the
        primary key for this model and is essential for counting distinct views.
      type: string
    - name: view_timestamp
      description: >
        The precise date and time (UTC recommended) when the page view occurred. Critical for
        time-series analysis, sessionization, and understanding user activity patterns over time.
      type: timestamp
      searchable: true # Optimizes queries filtering or sorting by view time.
    - name: url
      description: >
        The full URL of the web page that was viewed. Used to analyze traffic to specific
        pages, content performance, and site navigation.
      type: string
    - name: viewed_within_7_days_of_signup
      description: >
        A boolean flag indicating if this page view occurred within the first 7 days of the
        associated user's registration. Helpful for analyzing early user engagement with the website.
        Typically requires a join to a 'users' model to determine signup date.
      type: boolean
  measures:
    - name: view_duration_seconds
      description: >
        The duration, in seconds, that the user spent on the page during this specific view.
        Note: Accuracy may depend on tracking method (e.g., time between pings or navigation events).
        Used for engagement analysis (e.g., SUM for total time on site, AVG for typical page dwell time).
      type: integer
      # agg: sum # Aggregation is defined by metrics, not directly on measures in spec
  metrics:
    - name: total_page_views
      expr: "COUNT(page_views.view_id)" # Added model prefix for clarity
      description: >
        The total number of page views recorded in the queried context. Represents the overall
        volume of page impressions.
    - name: average_view_duration
      expr: "AVG(page_views.view_duration_seconds)"
      description: >
        The average time, in seconds, users spent on a page per view, calculated across all
        page views in the queried context. Indicates typical engagement duration for a page.
  filters:
    - name: views_last_7_days
      expr: "page_views.view_timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
      description: Selects page view events that occurred within the last 7 calendar days.
  relationships:
    - name: users # Corresponds to a 'users' or 'customers' model
      source_col: user_id # Assumes a user_id FK in page_views
      ref_col: id        # Assumes 'id' is the PK in the related users/customers model
      cardinality: many-to-one # Many page views can belong to one user
      description: >
        Links this page view event to the specific user who performed the action. Enables
        user-centric analysis of web activity.
      type: LEFT
    - name: products # Example: Corresponds to a 'products' model if pages are product-specific
      source_col: product_id # Assumes a product_id FK in page_views (if applicable)
      ref_col: id           # Assumes 'id' is the PK in the related products model
      cardinality: many-to-one # A page view might relate to one product (e.g. product detail page)
      description: >
        Links this page view event to a specific product, if the viewed page is associated
        with one (e.g., a product detail page). Enables analysis of product page engagement.
      type: LEFT