Semantic models provide a flexible analytical layer on top of your data warehouse, aiming to create a digital twin of your business processes and concepts. They allow you to define business logic, metrics, and relationships in a centralized, governed place. This practice, often referred to as “shifting left”, involves moving analytical logic out of downstream BI tools or ad-hoc query generation (like that performed by AI agents) and into this upstream modeling layer. By doing so, you ensure logic is versioned, testable, and consistently applied across all uses.
Instead of embedding complex logic in individual queries or reports, or worse, allowing an AI to make assumptions about raw data (where infinite interpretations exist), you define key business concepts like “total revenue” or “active user” once in the model. A key benefit is that if a metric’s definition changes, you update it in one place (the model), and that change automatically propagates to all analyses and tools using the model. This ensures consistency and reliability, dramatically reducing the risk of inconsistent answers or technical debt. For AI agents like Buster, this pre-defined logic is crucial; it allows the AI to leverage your curated business understanding, delivering trustworthy results instead of potentially flawed interpretations based on ambiguous raw data.
Models represent business entities or concepts, typically mapping to a table or view in your database. Each model is defined with a name and can optionally include descriptions, dimensions, measures, metrics, filters, and entities (relationships).
name: orders # Required - model identifierdescription: "..." # Required - description# At least one of dimensions, measures, or metrics are required.dimensions: # Optional - Dimension definitions - name: order_id type: string ...measures: # Optional - Measure definitions - name: order_amount type: number agg: sum ...metrics: # Optional - Metric definitions - name: average_order_value expr: "..." ...filters: # Optional - Filter definitions - name: last_30_days expr: "..." ...relationships: # Optional - Relationship definitions - name: customer source_col: customer_id ref_col: customer_id ...
name: A unique identifier for the model, typically matching the table name in your database.
description: A clear explanation of what the model represents. This field is mandatory to ensure clarity and context for all users and downstream tools like AI agents.
At least one of dimensions, measures, or metrics are required.
Technically, the attributes below are optional. You can create a model with just the name and description fields, or include any combination of dimensions, measures, metrics, filters, and relationships as needed.
Characteristics: Primarily consist of descriptive dimensions (customer name, product category, employee department, store location, invoice status). May also include some measures directly related to the entity (e.g., credit_limit for a customer) or summary/rollup metrics (e.g., total_invoiced_amount on an invoice model).
Filters: Often include filters based on the entity’s attributes (e.g., is_premium_customer, product_is_discontinued).
# Example: Entity-Focused Model (Based on Quickstart users.yml)# models/users.ymlname: usersdescription: > Contains comprehensive information about registered users, including their unique identifiers, contact details, registration timestamp, geographical location, and lifetime purchase value. This model serves as a central hub for understanding user demographics and engagement.dimensions: - name: id # Note: Quickstart SQL uses user_id, but YML uses id. Relationship refers to 'id' as source_col. description: > A unique alphanumeric string identifying each user. Essential for joining with other user-related data. type: string primary_key: true # Assuming this is the primary key - name: email description: The primary email address associated with the user account, used for communication and login. type: string - name: user_created_at description: > The precise date and time (UTC) when the user first registered their account. Useful for cohort analysis and understanding user tenure. Indexed for efficient searching. type: timestamp searchable: true - name: country description: > The country where the user is located, typically derived from their registration information or IP address. Stored as a two-letter ISO 3166-1 alpha-2 country code (e.g., 'US', 'GB', 'CA'). type: stringmetrics: # Note: Quickstart SQL defines lifetime_value as a column. YML defines it as a metric. Following YML structure. - name: lifetime_value description: > The cumulative total amount spent by the user across all purchases, measured in USD. Represents the total monetary value brought in by the user. type: number # Aggregation (e.g., sum) might be needed depending on source data. - name: average_order_value_over_50 description: The average value of orders placed by the user where the order total exceeds $50. expr: "AVG(CASE WHEN orders.order_total > 50 THEN orders.order_total ELSE NULL END)" # Assumes relationship 'orders' existsrelationships: - name: orders source_col: id # Corresponds to the 'id' dimension in this 'users' model ref_col: user_id # Corresponds to the 'user_id' dimension in the related 'orders' model cardinality: one-to-many # One user can have multiple orders description: > Links this user to their corresponding order events recorded in the 'orders' model. Allows analysis of user purchasing behavior.# Example: Entity-Focused Model (customers.yml)# models/customers.ymlname: customersdescription: > Represents individual customer entities. Contains contact information, registration details, location, and derived metrics like lifetime value. This is a central model for understanding customer attributes and behavior segments. Essential for joining customer data with transactions, support interactions, etc.dimensions: - name: id description: > The unique identifier for a customer, typically corresponds to the primary key in the source customer table. Essential for joining with event models like orders. type: string - name: name description: The full name of the customer. Used for display and personalization. type: string - name: email description: > The customer's primary email address. Used for communication, login identification, and potentially linking across different systems. Assumed to be unique per customer. type: string - name: signup_year description: > The year the customer first registered or created their account. Useful for cohort analysis based on tenure (e.g., comparing behavior of 2022 vs 2023 signups). type: integer - name: country description: > The country where the customer is located, typically derived from registration or geo-IP lookup. Often stored as ISO 3166-1 alpha-2 code (e.g., 'US', 'GB'). Used for geographical segmentation and analysis. type: string - name: user_created_at description: > The precise date and time (UTC recommended) when the customer account was created. Provides granular detail for time-based cohorting or analyzing initial user activity. type: timestamp searchable: truemetrics: - name: lifetime_value description: > The total cumulative revenue generated from this customer across all their orders. Represents the historical monetary value of the customer. Calculation assumes an underlying aggregation or pre-calculation in the source data, or relies on a join with the 'orders' model. expr: "SUM(orders.amount)" # Needs relationship 'orders' - name: average_order_value_over_50 description: > The average value of orders placed by this customer where the individual order amount exceeded $50. Requires joining with the 'orders' model via the 'orders' relationship. expr: "AVG(CASE WHEN orders.amount > 50 THEN orders.amount ELSE NULL END)" # Needs relationship 'orders'relationships: - name: orders # Link back TO orders model source_col: id # Key in customers (current model) ref_col: customer_id # Key in orders (related model) description: > Links this customer to all their associated order records in the 'orders' model. Enables analysis of purchasing history and behavior for each customer. cardinality: one-to-many # One customer can have many orders type: LEFT
Characteristics: Almost always include a timestamp dimension. They contain dimensions describing the event (e.g., URL for page view, order status, transaction type, user agent for login). They typically contain measures quantifying the event (e.g., order_amount, view_duration_seconds, transaction_value). Summary metrics related to the event itself might be included.
Relationships: Event models usually have relationships (relationships) linking back to multiple Entity-Focused models (e.g., an orders event links to customers, products, and employees).
Filters: Common filters relate to time (last_7_days, this_quarter) or event attributes (successful_logins, high_priority_tickets).
# Example: Event-Focused Model (Based on Quickstart orders.sql)# models/orders.ymlname: ordersdescription: > Represents individual customer order transactions. Contains details about each order, including identifier, customer linkage, timing, status, and financial value. This is an event-based model crucial for sales analysis, revenue tracking, and understanding purchase patterns.dimensions: - name: order_id description: > The unique identifier for each order transaction. Serves as the primary key for this model. Used for looking up specific orders or counting unique orders. type: string - name: customer_id # Changed from user_id for consistency description: > Foreign key referencing the 'id' in the 'customers' model. Links the order back to the specific customer who placed it. Essential for customer-centric analysis. type: string - name: order_date # Changed from order_timestamp description: > The date the order was placed. Used for time-series analysis, filtering by date ranges (e.g., daily, weekly, monthly revenue), and cohort analysis based on order time. type: date searchable: true - name: status description: > The current fulfillment or payment status of the order (e.g., 'pending', 'shipped', 'delivered', 'cancelled'). Used for operational reporting and filtering orders by state. type: string options: ["pending", "shipped", "delivered", "cancelled"] - name: ordered_within_30_days_of_signup # From quickstart logic description: > A boolean flag indicating if this order was placed within 30 days of the associated customer's signup date (`customers.user_created_at`). Useful for analyzing early customer purchasing behavior. Requires join to 'customers' model to calculate. type: booleanmeasures: - name: amount # Changed from order_total description: > The total monetary value of the order, typically in a specific currency (e.g., USD). This is the primary value used for revenue calculations (SUM, AVG). type: number - name: item_count # Added measure description: The total number of distinct items included in the order. Used for analyzing order size. type: integerfilters: - name: orders_last_30_days description: Selects orders placed within the last 30 calendar days. expr: "orders.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" - name: first_month_orders description: > Selects orders placed within 30 days of the associated customer's signup date. Relies on the pre-calculated 'ordered_within_30_days_of_signup' dimension. expr: "orders.ordered_within_30_days_of_signup"relationships: - name: customer # Renamed from 'users' source_col: customer_id # Key in orders (current) ref_col: id # Key in customers (related) description: > Links this order back to the specific customer who placed it, using the 'customers' model. Enables retrieving customer details (e.g., name, email, country) for order analysis. cardinality: many-to-one type: LEFT # - name: products # Example linking to a products model # source_col: product_id # Assuming a product_id field in orders # ref_col: id # Assuming id is PK in products # description: Links the order line to the specific product. # cardinality: many-to-one # type: LEFT
These models center around a core business metric or Key Performance Indicator (KPI). They often correspond to fact tables in a star or snowflake schema and serve as the central point for analyzing a specific metric across various business dimensions.
Characteristics: Primarily contain measures that contribute to the core metric(s). The dimensions are typically foreign keys (e.g., date_id, customer_id, product_id, region_id) that link to separate Entity-Focused models. They might also include degenerate dimensions (attributes directly on the fact record).
Metrics: The main purpose is often to define one or more significant metrics (e.g., total_revenue, active_users, conversion_rate) calculated from the base measures.
Relationships: Heavily rely on entities to connect the foreign key dimensions back to their respective dimension/entity models.
Filters: Often include time-based filters (last_quarter, year_to_date) or filters related to the metric calculation itself.
# Example: Metric-Focused Model (Monthly Revenue Fact)name: monthly_revenuedescription: Central fact model for analyzing monthly revenue across different business dimensions.dimensions: # Foreign Keys linking to dimension/entity models - name: month_start_date # Links to a date dimension model type: date - name: region_id # Links to a regions model type: string - name: customer_segment_id # Links to a customer_segments model type: string - name: sales_rep_id # Links to a sales_reps model type: string # You might also have direct measures if the metric uses themmeasures: - name: revenue_amount description: The base revenue amount for a given record. type: number agg: summetrics: - name: total_revenue description: The total calculated revenue for the selected dimensions. expr: "SUM(revenue_amount)" # Metric derived from the measurefilters: - name: revenue_last_quarter description: Revenue generated in the previous full quarter. # Example expr assumes a date dimension model provides quarter info expr: "month_start_date >= DATE_TRUNC('QUARTER', DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH)) AND month_start_date < DATE_TRUNC('QUARTER', CURRENT_DATE())"relationships: # Define relationships to the dimension/entity models - name: date_dimension source_col: date_id # Assuming date_id in the date dimension model ref_col: month_start_date # Field in this model # ... other relationship details - name: regions source_col: region_id # Key in regions model ref_col: region_id # Key in this model # ... other relationship details - name: customer_segments source_col: segment_id # Key in segments model ref_col: customer_segment_id # Key in this model # ... other relationship details - name: sales_reps source_col: rep_id # Key in reps model ref_col: sales_rep_id # Key in this model # ... other relationship details
Best Practices & Combining Approaches:
Start with Entities and Events: Most semantic layers are built primarily around Entity-Focused and Event-Focused models, as these directly represent the core nouns and verbs of your business.
Use Relationships: Define relationships clearly to connect your models. This is how Buster understands how events relate to entities (e.g., which customer placed which order).
Mix and Match: Real-world data often requires a combination. An orders model (Event) will link to customers and products (Entities).
Keep Models Focused: Aim for each model to represent a single, well-defined concept (one entity or one type of event).
Metric-Focused for Performance: Consider Metric-Focused models if you have very large datasets and need pre-aggregation for performance, or if a specific, central metric links many otherwise disparate entities.
Your model definitions must be structured with each model residing in its own separate YAML file. Buster will discover and load models from all .yml files within the specified model paths in your buster.yml.
This approach, organizing models into separate files, often grouped into subdirectories based on business domain (e.g., models/finance/, models/marketing/, models/product/), is essential for improving discoverability and maintainability as your semantic layer grows.
Here’s a more detailed example of a page_views model, demonstrating various optional components:
# models/web_analytics/page_views.ymlname: page_viewsdescription: > Captures each instance of a user viewing a page on the website. This event-based model is fundamental for web analytics, tracking user engagement, content popularity, and navigation paths. Each record represents a single page load event.dimensions: - name: view_id description: > A unique identifier assigned to each individual page view event. Serves as the primary key for this model and is essential for counting distinct views. type: string - name: view_timestamp description: > The precise date and time (UTC recommended) when the page view occurred. Critical for time-series analysis, sessionization, and understanding user activity patterns over time. type: timestamp searchable: true # Optimizes queries filtering or sorting by view time. - name: url description: > The full URL of the web page that was viewed. Used to analyze traffic to specific pages, content performance, and site navigation. type: string - name: viewed_within_7_days_of_signup description: > A boolean flag indicating if this page view occurred within the first 7 days of the associated user's registration. Helpful for analyzing early user engagement with the website. Typically requires a join to a 'users' model to determine signup date. type: boolean measures: - name: view_duration_seconds description: > The duration, in seconds, that the user spent on the page during this specific view. Note: Accuracy may depend on tracking method (e.g., time between pings or navigation events). Used for engagement analysis (e.g., SUM for total time on site, AVG for typical page dwell time). type: integer # agg: sum # Aggregation is defined by metrics, not directly on measures in spec metrics: - name: total_page_views expr: "COUNT(page_views.view_id)" # Added model prefix for clarity description: > The total number of page views recorded in the queried context. Represents the overall volume of page impressions. - name: average_view_duration expr: "AVG(page_views.view_duration_seconds)" description: > The average time, in seconds, users spent on a page per view, calculated across all page views in the queried context. Indicates typical engagement duration for a page. filters: - name: views_last_7_days expr: "page_views.view_timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)" description: Selects page view events that occurred within the last 7 calendar days. relationships: - name: users # Corresponds to a 'users' or 'customers' model source_col: user_id # Assumes a user_id FK in page_views ref_col: id # Assumes 'id' is the PK in the related users/customers model cardinality: many-to-one # Many page views can belong to one user description: > Links this page view event to the specific user who performed the action. Enables user-centric analysis of web activity. type: LEFT - name: products # Example: Corresponds to a 'products' model if pages are product-specific source_col: product_id # Assumes a product_id FK in page_views (if applicable) ref_col: id # Assumes 'id' is the PK in the related products model cardinality: many-to-one # A page view might relate to one product (e.g. product detail page) description: > Links this page view event to a specific product, if the viewed page is associated with one (e.g., a product detail page). Enables analysis of product page engagement. type: LEFT