Purpose of Models

Semantic models provide a flexible analytical layer on top of your data warehouse, aiming to create a digital twin of your business processes and concepts. They allow you to define business logic, metrics, and relationships in a centralized, governed place. This practice, often referred to as “shifting left”, involves moving analytical logic out of downstream BI tools or ad-hoc query generation (like that performed by AI agents) and into this upstream modeling layer. By doing so, you ensure logic is versioned, testable, and consistently applied across all uses.

Instead of embedding complex logic in individual queries or reports, or worse, allowing an AI to make assumptions about raw data (where infinite interpretations exist), you define key business concepts like “total revenue” or “active user” once in the model. A key benefit is that if a metric’s definition changes, you update it in one place (the model), and that change automatically propagates to all analyses and tools using the model. This ensures consistency and reliability, dramatically reducing the risk of inconsistent answers or technical debt. For AI agents like Buster, this pre-defined logic is crucial; it allows the AI to leverage your curated business understanding, delivering trustworthy results instead of potentially flawed interpretations based on ambiguous raw data.

Model Structure

Models represent business entities or concepts, typically mapping to a table or view in your database. Each model is defined with a name and can optionally include descriptions, dimensions, measures, metrics, filters, and entities (relationships).

models:
  - name: orders             # Required - model identifier
    description: Order data  # Required - description
    dimensions:              # Optional - data attributes
      # Dimension definitions
    measures:                # Optional - aggregations
      # Measure definitions
    metrics:                 # Optional - key performance indicators
      # Metric definitions
    filters:                 # Optional - predefined query filters
      # Filter definitions
    entities:                # Optional - defines relationships
      # Relationship definitions

Required Fields

  • name: A unique identifier for the model, typically matching the table name in your database.
  • description: A clear explanation of what the model represents. This field is mandatory to ensure clarity and context for all users and downstream tools like AI agents.

Optional Fields

All other fields within a model definition besides name and description are optional. You can create a model with just these two required fields, or include any combination of dimensions, measures, metrics, filters, and entities as needed.

  • dimensions: Attributes used for slicing and dicing data (e.g., user_country, product_category).
  • measures: Aggregations performed on data (e.g., total_revenue, average_order_value).
  • metrics: Key performance indicators derived from measures and dimensions (e.g., conversion_rate).
  • filters: Named, reusable boolean conditions (e.g., is_active, last_30_days) to simplify common query constraints. See Defining Filters.
  • entities: Define relationships between this model and others.

Modeling Approaches

While the structure above defines how to build a model, it’s helpful to consider what you are modeling. There are a few common approaches:

1. Entity-Focused Models

These models represent core business objects or concepts. They often map directly to dimension tables in a traditional data warehouse.

  • Examples: customers, products, employees, stores, invoices, aircraft.
  • Characteristics: Primarily consist of descriptive dimensions (customer name, product category, employee department, store location, invoice status). May also include some measures directly related to the entity (e.g., credit_limit for a customer) or summary/rollup metrics (e.g., total_invoiced_amount on an invoice model).
  • Filters: Often include filters based on the entity’s attributes (e.g., is_premium_customer, product_is_discontinued).
# Example: Entity-Focused Model
models:
  - name: products
    description: Represents individual products sold.
    dimensions:
      - name: product_id
        type: string
        primary_key: true # Indicates this is a primary key
      - name: product_name
        type: string
      - name: category
        type: string
      - name: launch_date
        type: date
      - name: is_discontinued
        type: boolean
    measures:
      - name: unit_cost
        type: number
        agg: avg # Example measure directly on the product
    filters:
      - name: is_discontinued
        description: Products no longer actively sold.
        # expr implicitly uses the is_discontinued dimension
      - name: launched_this_year
        description: Products launched in the current calendar year.
        expr: "EXTRACT(YEAR FROM launch_date) = EXTRACT(YEAR FROM CURRENT_DATE())"

2. Event-Focused Models

These models represent occurrences or interactions that happen over time. They often map to fact tables or event stream data.

  • Examples: page_views, orders, transactions, login_events, support_tickets, security_incidents.
  • Characteristics: Almost always include a timestamp dimension. They contain dimensions describing the event (e.g., URL for page view, order status, transaction type, user agent for login). They typically contain measures quantifying the event (e.g., order_amount, view_duration_seconds, transaction_value). Summary metrics related to the event itself might be included.
  • Relationships: Event models usually have relationships (entities) linking back to multiple Entity-Focused models (e.g., an orders event links to customers, products, and employees).
  • Filters: Common filters relate to time (last_7_days, this_quarter) or event attributes (successful_logins, high_priority_tickets).
# Example: Event-Focused Model (Simplified Order)
models:
  - name: orders
    description: Represents a customer placing an order.
    dimensions:
      - name: order_id
        type: string
        primary_key: true
      - name: order_timestamp
        type: timestamp
      - name: order_status
        type: string # e.g., 'pending', 'shipped', 'delivered'
    measures:
      - name: order_total_amount
        type: number
        agg: sum
      - name: item_quantity
        type: integer
        agg: sum
    metrics:
      - name: average_order_value
        expr: "AVG(order_total_amount)" # Metric summarizing the event itself
    filters:
      - name: shipped_orders
        description: Orders that have been shipped.
        expr: "order_status = 'shipped'"
      - name: orders_last_30_days
        description: Orders placed in the last 30 days.
        expr: "order_timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)"
    entities:
      - name: customers # Relationship to an entity model
        # ... relationship definition
      - name: products # Another relationship
        # ... relationship definition

3. Metric-Focused Models (Fact Models)

These models center around a core business metric or Key Performance Indicator (KPI). They often correspond to fact tables in a star or snowflake schema and serve as the central point for analyzing a specific metric across various business dimensions.

  • Examples: monthly_revenue, daily_user_activity, campaign_performance_facts.
  • Characteristics: Primarily contain measures that contribute to the core metric(s). The dimensions are typically foreign keys (e.g., date_id, customer_id, product_id, region_id) that link to separate Entity-Focused models. They might also include degenerate dimensions (attributes directly on the fact record).
  • Metrics: The main purpose is often to define one or more significant metrics (e.g., total_revenue, active_users, conversion_rate) calculated from the base measures.
  • Relationships: Heavily rely on entities to connect the foreign key dimensions back to their respective dimension/entity models.
  • Filters: Often include time-based filters (last_quarter, year_to_date) or filters related to the metric calculation itself.
# Example: Metric-Focused Model (Monthly Revenue Fact)
models:
  - name: monthly_revenue
    description: Central fact model for analyzing monthly revenue across different business dimensions.
    dimensions:
      # Foreign Keys linking to dimension/entity models
      - name: month_start_date # Links to a date dimension model
        type: date
      - name: region_id # Links to a regions model
        type: string
      - name: customer_segment_id # Links to a customer_segments model
        type: string
      - name: sales_rep_id # Links to a sales_reps model
        type: string
      # You might also have direct measures if the metric uses them
    measures:
      - name: revenue_amount
        description: The base revenue amount for a given record.
        type: number
        agg: sum
    metrics:
      - name: total_revenue
        description: The total calculated revenue for the selected dimensions.
        expr: "SUM(revenue_amount)" # Metric derived from the measure
    filters:
      - name: revenue_last_quarter
        description: Revenue generated in the previous full quarter.
        # Example expr assumes a date dimension model provides quarter info
        expr: "month_start_date >= DATE_TRUNC('QUARTER', DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)) AND month_start_date < DATE_TRUNC('QUARTER', CURRENT_DATE())"
    entities:
      # Define relationships to the dimension/entity models
      - name: date_dimension
        primary_key: date_id # Assuming date_id in the date dimension model
        foreign_key: month_start_date # Field in this model
        # ... other relationship details
      - name: regions
        primary_key: region_id # PK in regions model
        foreign_key: region_id # FK in this model
        # ... other relationship details
      - name: customer_segments
        primary_key: segment_id # PK in segments model
        foreign_key: customer_segment_id # FK in this model
        # ... other relationship details
      - name: sales_reps
        primary_key: rep_id # PK in reps model
        foreign_key: sales_rep_id # FK in this model
        # ... other relationship details

Best Practices & Combining Approaches:

  • Start with Entities and Events: Most semantic layers are built primarily around Entity-Focused and Event-Focused models, as these directly represent the core nouns and verbs of your business.
  • Use Relationships: Define entities (relationships) clearly to connect your models. This is how Buster understands how events relate to entities (e.g., which customer placed which order).
  • Mix and Match: Real-world data often requires a combination. An orders model (Event) will link to customers and products (Entities).
  • Keep Models Focused: Aim for each model to represent a single, well-defined concept (one entity or one type of event).
  • Metric-Focused for Performance: Consider Metric-Focused models if you have very large datasets and need pre-aggregation for performance, or if a specific, central metric links many otherwise disparate entities.

Organizing Your Model Files

You have flexibility in how you structure your model definition files:

  • Single File: You can define all your models within a single YAML file under the top-level models: key.
  • Multiple Files: You can split your model definitions across multiple YAML files. Buster will discover and load models from all .yml files within the specified model paths in your buster.yml.

Recommendation: We recommend organizing models into separate files, potentially grouped into subdirectories based on business domain (e.g., models/finance/, models/marketing/, models/product/). This improves discoverability and maintainability as your semantic layer grows.

# Example: All models in one file (models.yml)
models:
  - name: orders
    # ... definition ...
  - name: customers
    # ... definition ...
  - name: products
    # ... definition ...
# Example: Models split by domain
# models/marketing/campaigns.yml
models:
  - name: marketing_campaigns
    # ... definition ...

# models/sales/orders.yml
models:
  - name: orders
    # ... definition ...

# models/product/users.yml
models:
  - name: users
    # ... definition ...

Naming Conventions

  • Use snake_case for model names (e.g., order_items, not OrderItems or order-items).
  • Choose names that reflect the business entity or concept, not necessarily the exact technical table name.
  • Be consistent with naming patterns across your semantic layer.

Best Practices

Clear Descriptions

Always include descriptive text that explains what the model represents in business terms, even if brief.

Start Simple

Begin with a core model definition (just the name) and add dimensions, measures, etc., incrementally as analysis requires.

Align with Data Source

Generally, each model should align conceptually with a database table or view, but it’s not a strict requirement.

Logical Grouping

Use separate files and potentially subdirectories to group related models logically (e.g., by business domain like finance, marketing).

Example

Here’s a more detailed example of a page_views model, demonstrating various optional components:

# models/web_analytics/page_views.yml
models:
  - name: page_views
    description: Records each time a user views a page on the website. Event-based model.
    dimensions:
      - name: view_id
        description: Unique identifier for the page view event.
        type: string
      - name: view_timestamp
        description: Timestamp when the page view occurred.
        type: timestamp
        searchable: true # Allows searching/filtering on this dimension
      - name: url
        description: The URL of the page viewed.
        type: string
      - name: viewed_within_7_days_of_signup
        description: Indicates if the page view occurred within 7 days of the user signing up.
        type: boolean
    measures:
      - name: view_duration_seconds
        description: How long the user spent on the page in seconds.
        type: integer
        agg: sum # Example specifying aggregation type
    metrics:
      - name: total_page_views
        expr: "COUNT(view_id)"
        description: The total count of page views.
      - name: average_view_duration
        expr: "AVG(view_duration_seconds)" # Assumes view_duration_seconds is a measure or dimension
        description: The average duration of a page view in seconds.
    filters:
      - name: views_last_7_days
        expr: "view_timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
        description: Page views that occurred in the last 7 days.
    entities: # Define relationships to other models
      - name: users # Corresponds to a 'users' model
        primary_key: user_id # Field in the 'users' model
        foreign_key: user_id # Field in this 'page_views' model
        cardinality: many-to-one
        description: The user who performed the page view.
      - name: products # Corresponds to a 'products' model
        primary_key: product_id # Field in the 'products' model
        foreign_key: product_id # Field in this 'page_views' model
        cardinality: many-to-one # Assumes a page view relates to one product max
        description: The product associated with the page view, if any.

Next Steps

After defining your model structure, you’ll need to add the specific components:

  1. Add dimensions and measures
  2. Create metrics
  3. Define filters
  4. Establish relationships (entities)