Buster is a platform for building autonomous AI agents that automate data engineering workflows. Agents can execute complex end-to-end tasks (like documentation updates, code reviews, schema migrations, and data quality monitoring) without human intervention. Agents are defined as YAML configuration files in your repository and run automatically in response to triggers (pull requests, schedules, or custom events).

Why Buster?

Data engineering involves countless repetitive tasks that require context about your data, code, and infrastructure:
  • Documentation Drift — Models change, but documentation falls behind. Manual updates are tedious and error-prone.
  • Code Review Overhead — PRs need checks for performance issues, breaking changes, naming conventions, and test coverage.
  • Schema Changes — Upstream schema changes require updating staging models, propagating changes to marts, and updating tests.
  • Data Quality Monitoring — Regular checks for freshness, null rates, anomalies, referential integrity.
These tasks require understanding both code and data, making them perfect for AI agents with access to your warehouse and repository.

How it works

Auto-document your dbt project

When you connect your dbt repository and data warehouse, Buster automatically deploys dozens of agents in parallel to index and document your project. Agents pull metadata, run profiling queries on your models, discover patterns, and generate comprehensive documentation for every model and column. This automated documentation becomes the foundation that powers all of your agents, giving them an exceptionally deep understanding of your data models, business logic, and data patterns.
Example of an auto-generated documentation file:
version: 2

models:
  - name: sales_order_detail
    description: |
      Individual line items representing products sold within each sales order.
      
      Purpose: Line-item transaction table enabling revenue analysis, product performance tracking, discount effectiveness measurement, and basket composition analysis. Foundation for calculating revenue metrics, product-level profitability, and customer purchasing patterns. Used extensively by metrics models for calculating CLV, average order value, gross profit, and product-specific KPIs.
      
      Contents: One row per product line item on a sales order. Composite key: (salesOrderID, salesOrderDetailID). Scale: ~121K line items across ~31K orders spanning Sept 2022 to July 2025 (date-shifted to align with current date).
      
      Lineage: Direct pass-through from stg_sales_order_detail, which sources from sales.salesorderdetail. Staging layer calculates lineTotal field and applies date shifting to modifiedDate.
      
      Patterns:
      - Order simplicity: Most orders contain few items (avg 3.9 items per order). Single-item orders are extremely common, representing the dominant purchasing pattern.
      - Quantity concentration: 58% of line items are quantity 1, 71% are quantity 1-2. Bulk purchases (qty >10) represent <3% but can reach qty 44.
      - Product concentration: Top 10 products (out of 259) account for 20% of line items. Product 870 alone appears in 3.7% of all line items.
      - Discount sparsity: 97% of line items have no discount (unitPriceDiscount = 0). When discounts apply, they're typically 2%, 5%, 10%, 15%, or 20%.
      - Special offer dominance: 95% use specialOfferID = 1 (likely "No Discount" baseline offer), making non-promotional sales the norm.
      - Carrier tracking: 45% of line items have null carrierTrackingNumber, suggesting orders not yet shipped or using ship methods without tracking.
      - Price distribution: Highly skewed - median unit price $54.94, but ranges from $1.37 to $3578.27. High-value items (>$2000) appear in ~6% of line items.
      - Line total pattern: Log-normal distribution with median $183.94, mean $989.34. Most line items are modest value, but tails extend to $22K+ for high-quantity luxury purchases.
      
      Usage Guidance:
      Foundational fact table for sales analytics. Essential for calculating revenue totals, analyzing product performance, measuring discount impact, and understanding purchasing behavior. Most revenue metrics aggregate lineTotal; product analysis groups by productID; discount analysis filters or segments by unitPriceDiscount or specialOfferID. For customer behavior analysis, aggregate to order level first via salesOrderID to avoid over-counting multi-item orders. For product profitability, join to product table for cost data then calculate margin (lineTotal - cost). When analyzing average order value, aggregate line items by order first to get order-level totals.
      
      Critical Context:
      - lineTotal is calculated in staging as (unitPrice * orderQty * (1 - unitPriceDiscount)) and represents net revenue after discounts but before taxes/freight. This is the primary revenue metric field.
      - All dates shifted forward using shift_date() macro to make dataset feel current (max date aligns with March 28, 2025). Historical patterns span ~3 years.
      - Null carrierTrackingNumber doesn't indicate data quality issue - reflects legitimate business states (orders not shipped yet, certain ship methods, or in-store pickup).
      - salesOrderDetailID is unique within entire table (not just within order) - serves as primary key alone, though conceptually represents line item number within order.
      - unitPrice reflects actual selling price at time of sale (may differ from product.listPrice due to negotiated pricing, promotions, or price changes over time).
      - High orderQty outliers (>20) typically involve accessories or components sold in bulk, not bikes.
      - No line items exist without corresponding order in sales_order_header - referential integrity is clean.

    relationships:
      - name: sales_order_header
        description: >
          Business relationship: Every line item belongs to exactly one sales order. Order header provides order-level context (customer, dates, shipping, totals, status) that applies to all line items within that order. Join to get customer attribution, order timing, territory assignment, shipping details, and order-level calculated fields (purchase context filters, consultation level, etc.).
          Join considerations: Many-to-one from detail to header. Each salesOrderID in details appears in header exactly once. Each order in header typically has multiple detail rows (avg 3.9 line items per order, but distribution is right-skewed with many single-item orders).
          Coverage: 100% of line items match to header. Clean referential integrity - no orphaned details.
          Cardinality notes: Standard fact-to-dimension pattern. When joining, expect row count to remain same (detail-level grain preserved). When aggregating metrics from details, group by salesOrderID first to get order-level aggregates before further analysis to avoid over-representing multi-item orders.
        source_col: salesOrderID
        ref_col: salesOrderID
        cardinality: many-to-one
      
      - name: product
        description: >
          Business relationship: Each line item represents sale of one specific product. Product table provides descriptive attributes (name, category, subcategory, costs, specifications, attributes) needed for product performance analysis, profitability calculation, and inventory analysis.
          Join considerations: Many-to-one from detail to product. Each productID in details matches exactly one product. Popular products appear across many line items (top product appears in 3.7% of line items; long tail products may appear only once).
          Coverage: 100% of line items match to products. Only in-catalog products can be sold.
          Cardinality notes: Standard fact-to-dimension pattern. Use INNER JOIN (all details have valid productID). Join preserved detail-level grain. For product-level metrics, aggregate details by productID: SUM(lineTotal) for revenue, COUNT(*) for transaction count, SUM(orderQty) for units sold, AVG(unitPrice) for typical selling price.
        source_col: productID
        ref_col: productID
        cardinality: many-to-one

    columns:      
      - name: salesOrderDetailID
        description: |
          Primary key uniquely identifying each line item across the entire table. Represents sequential line item numbering across all orders.
          Range: 1-100827 with perfect uniqueness (~121K current rows, suggesting some IDs retired or skipped). Despite name suggesting "detail within order", this ID is globally unique, not just within an order. Conceptually represents the line item number, but implemented as table-wide identifier.
          Use as primary key for uniqueness. Combine with salesOrderID for composite business key if needed for readability, though salesOrderDetailID alone guarantees uniqueness. No nulls, no duplicates. Sequential but not gapless - skipped numbers are normal (order cancellations, returns, or system behavior).
        data_type: BIGINT
      
      - name: carrierTrackingNumber
        description: |
          Shipment tracking identifier assigned by carrier for this line item. Enables shipment tracking and delivery confirmation.
          Format: Standardized XX00-X000-XX pattern (e.g., "52CC-460F-B3"). ~2K distinct tracking numbers across line items. 45% null, indicating orders not yet shipped, ship methods without tracking, in-store pickup, or bundled shipments where tracking applies at order level rather than line level.
          Use null vs non-null to segment shipped vs unshipped items. However, interpret carefully - null doesn't definitively mean "not shipped" as some fulfillment methods legitimately don't generate line-level tracking. For true shipment analysis, prefer sales_order_header.shipDate which provides order-level shipping status. Tracking numbers aren't strictly unique - multiple line items from same order may share tracking when shipped together.
        data_type: VARCHAR
      
      - name: orderQty
        description: |
          Quantity of product units ordered on this line item. Drives revenue via multiplication with unitPrice.
          Highly concentrated at low quantities: 58% are qty 1, 71% are qty 1-2, 89% are qty 1-4. Long tail extends to qty 44, but quantities above 10 represent <3% of line items. Mean: 2.4, median: 1, heavily right-skewed distribution.
          Represents customer purchase decision quantity. Multiply by unitPrice to calculate pre-discount subtotal, or use calculated lineTotal field which accounts for both quantity and discount. High quantities (>20) almost always involve accessories, components, or clothing - not bikes (bikes rarely exceed qty 5). For inventory and fulfillment analysis, this is units to pick/ship. For revenue analysis, this amplifies unit economics (price and discount) into total line value.
          Watch out for: Outlier quantities can distort averages. When analyzing "typical" purchase behavior, consider median or restrict to qty ≤10. For total units sold analysis, SUM(orderQty) grouped appropriately. No nulls - quantity always specified.
        data_type: BIGINT
        options:
          - value: 1
            description: Single unit - dominant pattern; ~58% of line items
          - value: 2
            description: Two units; ~13% of line items
          - value: 3
            description: Three units; ~9% of line items
          - value: 4
            description: Four units; ~7% of line items
          - value: 5
            description: Five units; ~4% of line items

Define agents as code

Agents are YAML files committed to your repository. Unlike traditional scripts where you write procedural code, agents are declarative—you describe what you want accomplished in natural language. Here’s a complete agent:
name: docs-updater
description: Automatically updates dbt model documentation when models change

triggers:
  - type: pull_request
    on_changed_files: "models/**/*.sql"

tools:
  preset: standard

prompt: |
  When dbt models change in this PR:
  1. Profile each changed model using the warehouse
  2. Update the YAML documentation with descriptions and stats
  3. Validate the YAML with dbt parse
  4. Commit the changes to this PR branch
Key Components:
  • Triggers — When the agent runs: pull_request, scheduled, event, or manual
  • Tools — What the agent can do: safe (read-only), standard (read/write), or unrestricted (full access including bash)
  • Prompt — Your instructions in natural language describing what to accomplish
When a developer opens a PR that changes a model file, this agent automatically runs, profiles the model, updates documentation, and commits the changes—no manual intervention required.
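The same building blocks combine for other workflows. As a rough sketch (the cron field and the Slack channel below are hypothetical illustrations, not confirmed Buster syntax), a scheduled, read-only monitoring agent might look like this:
name: freshness-monitor
description: Summarizes warehouse freshness and null rates every morning

triggers:
  - type: scheduled
    cron: "0 7 * * *"  # hypothetical scheduling field; see the Creating Agents guide for the exact syntax

tools:
  preset: safe  # read-only preset; the agent can query but cannot write

prompt: |
  Every morning:
  1. Check freshness and null rates for the core marts
  2. Summarize anything outside normal ranges
  3. Post the summary to #data-pipeline
The safe preset keeps this agent strictly read-only, which suits monitoring tasks that should never modify your repository or warehouse.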
See the Creating Agents guide for complete configuration options including granular control over permissions, file access patterns, specialized triggers, and more.

Agents execute in secure sandboxes

When triggered, Buster:
  1. Clones your repository into an isolated container
  2. Connects to your data warehouse using configured credentials
  3. Loads project context including all auto-generated documentation and code
  4. Reads the agent configuration and constructs the execution context
  5. Executes the agent with access only to permitted tools and resources
  6. Takes actions like creating PRs, running SQL, or sending notifications
  7. Logs everything for audit trails and debugging
Security First: Agents run in isolated sandboxes with precisely controlled permissions. You define which files they can access, what SQL they can run, and what actions they can take. All operations are logged for compliance and audit requirements.
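For illustration only, a scoped-down tools section might look something like the sketch below. The keys other than preset are hypothetical stand-ins for the kinds of controls described above, not Buster's documented schema; see the Creating Agents guide for the actual permission options:
tools:
  preset: standard
  # Hypothetical keys sketching the file-access and SQL scoping described above;
  # they are not Buster's documented configuration schema.
  allowed_paths:
    - "models/**"
    - "docs/**"
  sql_access: read_only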

Agents do what you tell them

Agents execute whatever workflows you define in your prompts—simple or complex, rigid or adaptive. This gives you complete flexibility to design agents for virtually any data engineering task.
Simple directive:
prompt: |
  When a model changes, update its documentation and commit to the PR.
Multi-step conditional workflow:
prompt: |
  Check if the null rate in orders.revenue exceeds 5%.
  
  If it does:
    - Determine if this is a new issue or ongoing problem
    - Query recent orders to find the pattern
    - Create a GitHub issue with investigation queries
    - Send critical alert to #data-incidents
    - Page the on-call engineer if affecting production
  
  If null rate is 2-5%:
    - Send warning to #data-pipeline
    - Create low-priority issue
  
  If null rate < 2%:
    - No action needed (within normal range)
Complex cross-system orchestration:
prompt: |
  When upstream schema changes are detected:
  
  1. Identify all downstream models affected by the change
  2. For each affected model:
     - Analyze the impact (breaking vs non-breaking)
     - Update the staging model to handle the new schema
     - Propagate changes to downstream marts
     - Update tests and documentation
  3. Run dbt build on affected models to validate changes
  4. If successful:
     - Create a PR with all changes
     - Comment with impact summary and test results
     - Tag appropriate team members for review
  5. If failures occur:
     - Create detailed issue with error analysis
     - Send alert to #data-engineering with remediation steps
     - Schedule follow-up check in 1 hour
You control the complexity, the decision logic, the error handling, and the actions taken. Agents adapt to your specific context and instructions.