Buster AI Agent Guidelines

This document provides guidance for AI agents (Claude, Windsurf, Cursor, etc.) when working with Buster configuration files and semantic models.

Note for AI Agents: All Buster documentation is available via HTTP requests. Simply:

What is Buster?

Buster is a platform for building and deploying AI data analysts that connects to various data sources and uses a semantic layer to understand your data structure and business logic. The platform enables natural language querying of your data warehouse with accurate, relevant responses.

Core Components

buster.yml

This is the primary configuration file that defines your Buster project settings.

Basic Structure

data_source_name: demo_db
schema: analytics  
database: buster

model_paths:
  - models/
  
semantic_model_paths:  # Optional: Custom paths for semantic models
  - semantic_models/

Multi-Project Structure

Buster.yml now supports configuring multiple projects in a single file:

projects:
  - path: .
    data_source_name: new_one
    schema: ont-ont
    database: postgres
    model_paths:
      - models/
    semantic_model_paths:  # Optional: Custom paths for semantic models
      - semantic_models/
  
  - path: ./another-project
    data_source_name: another_db
    schema: analytics
    database: warehouse
    model_paths:
      - models/
    semantic_model_paths:
      - semantic_models/

Semantic Models

Semantic models define a business-friendly abstraction layer over your data warehouse. They are organized in YAML files and consist of these key components:

Model Structure

models:
  - name: orders              # Required: Unique identifier (snake_case)
    description: "Order transactions from our e-commerce platform"  # Required: Clear explanation
    dimensions:               # Optional: Attributes for slicing/filtering
      - name: order_id
        description: "Unique identifier for each order"
        type: string
        primary_key: true
      - name: created_at
        description: "When the order was created"
        type: timestamp
    measures:                 # Optional: Quantifiable aggregations
      - name: revenue
        description: "Total order value"
        type: decimal
        agg: sum
    metrics:                  # Optional: Derived KPIs/calculations
      - name: average_order_value
        description: "Average $ amount per order"
        expr: "sum(revenue) / count(order_id)"
      - name: revenue_by_period
        description: "Revenue over a specified time period"
        expr: "sum(case when created_at >= dateadd(day, -{{days}}, current_date()) then revenue else 0 end)"
        args:
          - name: days
            type: integer
            description: "Number of days to include"
            default: 30
    filters:                  # Optional: Reusable query conditions
      - name: is_complete
        description: "Filter for completed orders"
        expr: "status = 'complete'"  # The exact value depends on your data
      - name: orders_above_amount
        description: "Orders with revenue above threshold"
        expr: "revenue > {{min_amount}}"
        args:
          - name: min_amount
            type: number
            description: "Minimum order amount"
            default: 100
    relationships:            # Optional: Relationships to other models (previously called entities)
      - name: customer
        description: "The customer who placed this order"
        primary_key: id           # Field in the related customer model
        foreign_key: customer_id  # Field in this orders model
        cardinality: many_to_one  # Many orders to one customer
        type: LEFT                # LEFT JOIN is the default

Key Concepts for Effective Models

Models

  • Purpose: Define meaningful business concepts or entities
  • Types:
    • Entity-focused: Representing business objects (customers, products)
    • Event-focused: Representing occurrences (orders, page views)
    • Metric-focused: Centered around core business KPIs
  • Best Practices:
    • Use snake_case for naming
    • Provide clear, concise descriptions
    • Organize by business domain in separate files
    • Name files after the main model they contain

Data Model Components Hierarchy

Buster’s semantic layer organizes data into a logical hierarchy:

Dimensions

  • Definition: The lowest level properties about a data object (non-numeric attributes)
  • Purpose: Attributes used for filtering, grouping, or display
  • Examples: Names, dates, IDs, categories, locations, statuses
  • Common Types:
    • string: Text values (e.g., customer_name, product_category)
    • timestamp: Date and time values (e.g., created_at, order_date)
    • boolean: True/false values (e.g., is_active, is_completed)
  • Properties:
    • name: Identifier (required) - Should match the exact column name in your database
    • description: Explanation (required)
    • type: Data type (recommended)
    • primary_key: Boolean designating unique identifier (optional)
    • searchable: Boolean for natural language search (optional)
    • expr: SQL expression (optional) - If not provided, the dimension name is used as the column identifier
  • Best Practices:
    • Use column names as dimension names for direct mapping to your database
    • Include time-based dimensions for temporal analysis
    • Mark primary keys for joining and uniqueness
    • Mark searchable fields for important attributes
    • Use expr only when transformation is needed from raw data

Measures

  • Definition: The lowest level numeric properties about a data object
  • Purpose: Raw quantitative values that can be aggregated
  • Examples: Amounts, counts, durations, prices, quantities
  • Common Types:
    • decimal: Precise numeric values (e.g., price, revenue)
    • integer: Whole numbers (e.g., quantity, views)
  • Aggregation Types:
    • sum: Add values together
    • avg: Calculate average
    • count: Count occurrences
    • min: Find minimum value
    • max: Find maximum value
    • count_distinct: Count unique values
  • Properties:
    • name: Identifier (required) - Should match the exact column name in your database
    • description: Explanation (required)
    • type: Data type (recommended)
    • agg: Aggregation type (required)
    • expr: SQL expression (optional) - If not provided, the measure name is used as the column identifier
  • Best Practices:
    • Use column names as measure names for direct mapping to your database
    • Choose appropriate aggregation for business meaning
    • Use clear, specific names describing the quantity
    • Include unit of measurement in description when applicable
    • Use expr only when transformation is needed from raw data

Metrics

  • Definition: High-level abstractions or calculations derived from dimensions and measures
  • Purpose: Complex business KPIs and calculations
  • Examples: Average order value, conversion rate, retention rate, year-over-year growth
  • Properties:
    • name: Identifier (required) - Descriptive name for the calculation
    • description: Explanation (required)
    • expr: SQL expression (optional) - Must always return a numeric value
      • If not provided, uses the column with the same name as the metric
    • args: Parameters for dynamic metrics (optional)
  • Syntax:
    • Use standard SQL expressions
    • Reference fields from the same model directly by name
    • Reference fields from other models using dot notation
  • Best Practices:
    • Centralize business logic in metrics
    • Use descriptive names indicating calculation
    • If referencing a single column directly, name the metric after the column
    • Document formula and business significance
    • Parameterize metrics when appropriate
    • Ensure all metrics return numeric values

Filters

  • Definition: High-level abstractions that provide reusable filtering conditions
  • Purpose: Pre-defined conditions that must resolve to a boolean
  • Examples: is_active_customer, recent_orders, high_value_transaction
  • Properties:
    • name: Identifier (required) - Descriptive name for the filter
    • description: Explanation (required)
    • expr: SQL expression (optional) - Must always resolve to TRUE or FALSE
      • If not provided, uses the boolean column with the same name as the filter
  • Syntax:
    • Use standard SQL boolean expressions
    • Can reference dimensions and measures from the model directly by name
    • Common patterns include comparison operators (equals, greater than, less than, IN, BETWEEN)
  • Best Practices:
    • Define common filtering patterns without assuming specific values in the data
    • If referencing a single boolean column directly, name the filter after the column
    • Use for time-based filters (e.g., recent_period, current_year)
    • Create concept-based filters that describe what the filter does, not specific values
    • Ensure expressions always evaluate to boolean values
    • Avoid assuming specific values exist in a column; consult actual data first

Relationships

  • Definition: Connections between different data models
  • Purpose: Define how models relate to each other
  • Examples: users-to-orders, products-to-reviews
  • Properties:
    • name: Relationship name (required)
    • primary_key: Unique identifier field (required)
    • foreign_key: Reference field in format model.field (required)
    • cardinality: Type of relationship (recommended)
    • join_type: SQL join type (optional, defaults to LEFT)
  • Cardinality Types:
    • one_to_one: Each record connects to exactly one record
    • one_to_many: Parent record connects to multiple child records
    • many_to_one: Multiple records connect to one parent record
    • many_to_many: Many-to-many relationship
  • Best Practices:
    • Only define relationships when there’s evidence they exist in the data
    • Use descriptive relationship names that reflect business concepts
    • Document the business meaning of relationships
    • Define relationships bidirectionally when appropriate
    • Verify column data types match between related fields

Tips for AI Agents

When generating or modifying Buster configuration files:

  1. Start with Basic Structure:

    • Define required fields first (data_source_name, schema, database)
    • Then add optional configurations as needed
  2. Ensure Semantic Clarity:

    • Use descriptive names that reflect business concepts
    • Include clear descriptions for all components
    • Maintain consistent naming conventions (snake_case)
  3. Model Organization:

    • Organize models by business domain
    • Keep related models in the same file
    • Split large model collections into multiple files
  4. Relationship Mapping:

    • Ensure primary and foreign keys are correctly specified
    • Define appropriate cardinality for each relationship
    • Consider join type implications for queries
  5. Metric Definition:

    • Use SQL expressions that match the target database dialect
    • Ensure calculations are mathematically sound
    • Reference fields using correct notation
  6. Field Types:

    • Specify appropriate data types for dimensions and measures
    • Choose correct aggregation types for measures
    • Mark primary keys and searchable fields appropriately
  7. Validation Checks:

    • Ensure all required fields are present
    • Verify that references between models are valid
    • Check that expressions use correct field names and syntax
  8. Avoid Data Assumptions:

    • Never make assumptions about specific values in columns
    • Use generic pattern descriptions rather than assuming particular statuses, categories, or values
    • Consult the actual data or schema documentation before creating filters or metrics
    • Create semantic models that adapt to the user’s actual data values, not presumed values
    • Don’t assume relationships between models unless clearly evident in the schema
  9. Relationship Discovery Guidelines:

    • Only create relationships/joins when you have evidence from column names or schema analysis
    • Acceptable approaches include:
      • Finding matching column patterns (e.g., id in one table, user_id in another)
      • Analyzing foreign key constraints from database metadata
      • Following explicit instructions or documentation about table relationships
    • When analyzing multiple models at once, look for conventional patterns like:
      • Primary key matches foreign key name (e.g., users.idorders.user_id)
      • Table name incorporated in foreign key (e.g., user_id referencing the users table)
    • Always verify relationship assumptions with the user when possible
  10. Use dbt Metadata Commands:

  • Leverage dbt’s metadata commands to gather accurate schema information without affecting production:
    • dbt ls - List all models in the project
    • dbt docs generate - Generate documentation (then examine metadata files)
    • dbt parse - Parse project files without running models
    • dbt compile - Compile SQL without executing it
    • dbt describe - View upstream/downstream dependencies of a model
  • IMPORTANT: Never run commands that execute models (dbt run, dbt build, etc.)
  • Extract valuable information from dbt metadata such as:
    • Column names, types, and descriptions
    • Model dependencies and lineage
    • Table properties and configurations
    • Existing documentation and tests
  • Use this metadata to make informed decisions when creating semantic models

Following these guidelines will help create effective Buster configurations that accurately represent business concepts and enable powerful natural language querying of data warehouses.

SQL Data Modeling with dbt

Buster works best when the underlying data models align well with the semantic layer. Using dbt as the primary modeling tool helps create a clean, modular data structure that maps effectively to Buster’s semantic models.

Entity-Based Modeling Approach

The SQL data models should mirror the entity-focused approach of the semantic layer, with tables structured to represent business entities and their relationships.

dbt Model Organization

  • Sources: Raw data from warehouse (src_*)
  • Staging: Cleaned, typed data (stg_*)
  • Intermediate: Business logic processing (int_*)
  • Entity Models: Core business entities (e.g., users, products, orders)
  • Derived Models: Advanced calculations and aggregations (derived_*)
models/
├── domains/                 # Business domain models
│   ├── customers/           # Customer domain
│   │   ├── users.sql        # User entity model
│   │   ├── accounts.sql     # Account entity model
│   │   └── ...
│   ├── sales/               # Sales domain
│   │   ├── orders.sql       # Order entity model
│   │   ├── products.sql     # Product entity model
│   │   └── ...
│   └── ...
├── derived/                 # Derived metrics and aggregations
│   ├── derived_user_metrics.sql
│   ├── derived_sales_performance.sql
│   └── ...
├── intermediate/            # Transformation steps
│   ├── int_order_items.sql
│   └── ...
└── staging/                 # Cleaned source data
    ├── stg_source1.sql
    ├── stg_source2.sql
    └── ...

Entity Model Pattern

-- users.sql
WITH source AS (
    SELECT * FROM {{ ref('stg_users') }}
),

-- Additional transformation/enrichment CTEs as needed

final AS (
    SELECT
        id,                    -- Primary key
        first_name,            -- Identity attributes
        last_name,
        email,
        segment,               -- Categorization attributes
        created_at,            -- Temporal attributes
        updated_at,
        
        -- Include both dimensional and measure fields directly on the entity
        lifetime_value,        -- Measure: calculated or aggregated values
        is_active,             -- Dimension: status flags
        last_login_date        -- Dimension: activity timestamps
        
        -- All attributes related to this entity in one model
    FROM source
    LEFT JOIN {{ ref('int_user_activity') }} USING (id)  -- Join intermediate calculations if needed
)

SELECT * FROM final

Entity Relationship Model Pattern

-- orders.sql
WITH order_source AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

order_items AS (
    SELECT * FROM {{ ref('stg_order_items') }}
),

-- Join or additional transformation CTEs

final AS (
    SELECT
        id,                    -- Primary key
        user_id,               -- Relationship to users entity
        created_at,            -- Event timestamp
        status,                -- State attributes
        
        -- Include both dimensional and measure fields
        items_count,           -- Measure
        order_total,           -- Measure
        tax_amount,            -- Measure
        shipping_amount,       -- Measure
        
        payment_method,        -- Dimension
        channel,               -- Dimension
        is_first_purchase      -- Calculated dimension
    FROM order_source
    LEFT JOIN order_items USING (id)
    -- Additional joins as needed
)

SELECT * FROM final

Derived Model Pattern

-- derived_sales_performance.sql
WITH user_orders AS (
    SELECT
        user_id,
        SUM(order_total) as total_spent,
        COUNT(id) as order_count,
        MIN(created_at) as first_order_date,
        MAX(created_at) as last_order_date
    FROM {{ ref('orders') }} -- Reference to entity model
    WHERE status = 'complete' -- Status depends on your data
    GROUP BY 1
),

user_attributes AS (
    SELECT 
        id as user_id,
        segment,
        created_at as joined_date
    FROM {{ ref('users') }} -- Reference to entity model
),

final AS (
    SELECT
        user_attributes.user_id,
        segment,
        joined_date,
        
        -- Entity fields from orders
        total_spent,
        order_count,
        first_order_date,
        last_order_date,
        
        -- Derived calculations
        total_spent / NULLIF(order_count, 0) as average_order_value,
        DATEDIFF('day', first_order_date, last_order_date) as customer_tenure_days,
        DATEDIFF('day', last_order_date, CURRENT_DATE()) as days_since_last_order,
        
        -- Complex derived metrics
        CASE 
            WHEN days_since_last_order <= 30 THEN 'active'
            WHEN days_since_last_order <= 90 THEN 'at_risk'
            ELSE 'inactive'
        END as activity_status
    FROM user_attributes
    LEFT JOIN user_orders USING (user_id)
)

SELECT * FROM final

Aligning SQL Models with Semantic Layer

When designing SQL models to work optimally with Buster’s semantic layer:

  1. Entity-Based Alignment:

    • Create entity-based models that directly mirror your semantic models
    • Include both dimensional attributes and measures on the entity model
    • Structure models around business concepts rather than database structure
  2. Field Naming Consistency:

    • Use consistent field names between SQL models and semantic models
    • Follow snake_case naming convention for all database objects
    • Include business-friendly names that accurately describe the data concepts
  3. Data Organization:

    • Group related entities by business domain
    • Keep the model structure flat and intuitive
    • Put attributes directly on the relevant entity rather than normalizing excessively
  4. Data Type Considerations:

    • Choose appropriate SQL data types that align with semantic types
    • Use TIMESTAMP types for all date-time fields
    • Consider using DECIMAL types with appropriate precision for monetary values
  5. Relationship Implementation:

    • Ensure entity models contain the necessary keys for relationships
    • Use consistent naming for relationship fields (e.g., entity_id pattern)
    • Create derived models to materialize complex relationships when needed
  6. Performance Optimization:

    • Create appropriate indexes for frequently queried fields
    • Consider materializing important entities with calculated attributes
    • Pre-compute complex derivations to improve query performance
  7. Documentation and Testing:

    • Include column descriptions in dbt model files
    • Add documentation that explains business logic for complex calculations
    • Implement dbt tests for primary keys, relationships, and business rules
  8. Utilize dbt Metadata for Better Modeling:

    • Run dbt docs generate to extract rich metadata about the existing models
    • Examine model structure, column types, and relationships from manifest.json
    • Use dbt compile to see how models are transformed without executing them
    • Analyze upstream and downstream dependencies with dbt describe
    • Access descriptions and documentation from existing dbt projects

By aligning your SQL data modeling approach with Buster’s semantic layer, you create a cohesive data ecosystem where each entity model directly maps to business concepts in the semantic layer, making it intuitive for users to query through Buster.