Skip to main content
This agent provides focused code review for dbt pull requests, specifically examining SQL quality and data modeling patterns. Unlike generic code review tools, it understands dbt conventions, data warehouse best practices, and can flag issues like inefficient joins, missing tests, or breaking changes to downstream models.

Simple Example

name: sql-reviewer
description: Review dbt PRs for SQL and data modeling issues

triggers:
  - type: pull_request
    on_changed_files: "models/**/*.sql"

tools:
  preset: safe  # Read-only

prompt: |
  Review this PR for dbt best practices:
  
  1. Check for SQL anti-patterns (SELECT *, inefficient joins)
  2. Verify models have tests and documentation
  3. Flag potential breaking changes
  4. Comment with findings and suggestions

More Robust Example

Production-ready with comprehensive SQL analysis and actionable feedback:
name: sql-reviewer
description: Comprehensive SQL and data modeling review for dbt PRs

triggers:
  - type: pull_request
    events: [opened, synchronize]
    paths:
      include: ["models/**/*.sql", "models/**/*.yml"]
    conditions:
      - type: pr_labels
        none_of: ["skip-review", "wip"]

tools:
  preset: safe  # Read-only access

restrictions:
  files:
    allow: ["models/", "tests/", "macros/"]
  sql:
    read_only: true
    allowed_schemas: ["raw", "staging", "intermediate", "marts"]

notifications:
  slack:
    channel: "#data-pr-reviews"
    on_failure: true  # Only notify if agent fails, not for every review

prompt: |
  ## 1. Analyze changed models
  
  For each SQL file modified in this PR:
  
  a) Read the SQL file content
  b) Read the corresponding YAML file (if exists)
  c) Identify the model layer (staging, intermediate, marts)
  d) If possible, query the model in the warehouse to understand current data
  
  ## 2. SQL Quality Checks
  
  Review each model for these issues:
  
  ### Anti-Patterns to Flag:
  
  **SELECT * Usage:**
  - ❌ `SELECT * FROM {{ source() }}` in staging models
  - ❌ `SELECT * FROM {{ ref() }}` in marts
  - ✅ OK in CTEs if immediately filtered/projected
  - Comment: "SELECT * can break downstream models if source columns change. List columns explicitly."
  
  **Inefficient Joins:**
  - ❌ Joining on non-indexed columns or functions: `ON LOWER(a.email) = b.email`
  - ❌ Cross joins without limits: `FROM a CROSS JOIN b`
  - ❌ Multiple LEFT JOINs that could be INNER: suggests data quality issue
  - Comment: "Join on {column} may be inefficient. Consider {suggestion}."
  
  **Expensive Operations:**
  - ❌ DISTINCT without explanation
  - ❌ Window functions on entire table without PARTITION BY
  - ❌ Subqueries in SELECT clause (correlated subqueries)
  - Comment: "This operation may be expensive. Consider {alternative approach}."
  
  **Naming Conventions:**
  - ❌ Models not following layer prefixes (stg_, int_, fct_, dim_)
  - ❌ Column names not in snake_case
  - ❌ Boolean columns not prefixed with is_/has_
  - ❌ Date columns not suffixed with _date/_at
  - Comment: "Column {name} doesn't follow naming convention. Should be {suggestion}."
  
  **CTE Structure:**
  - ❌ No CTEs in complex models (>50 lines)
  - ❌ CTEs not imported/logical in staging models (import, logical, final pattern)
  - ❌ Unused CTEs
  - Comment: "Consider restructuring with CTEs for readability."
  
  **Hardcoded Values:**
  - ❌ Hardcoded dates: `WHERE created_at > '2024-01-01'`
  - ❌ Hardcoded IDs in filters
  - ❌ Magic numbers without explanation
  - Comment: "Hardcoded {value} should be parameterized or documented."
  
  ## 3. Data Modeling Checks
  
  ### Grain and Keys:
  - ✅ Check if model has clear grain (one row per...)
  - ✅ Verify primary key test exists for grain
  - ❌ Flag if grain changed from previous version
  - ❌ Flag fact tables without obvious grain
  
  ### Test Coverage:
  - ❌ Models without any tests
  - ❌ ID columns without unique test
  - ❌ Foreign keys without relationship test
  - ❌ Important columns without not_null test
  - Comment: "Model {name} needs tests. Suggested: {list}."
  
  ### Documentation:
  - ❌ New models without description
  - ❌ Mart models with <50% columns documented
  - ❌ Complex calculations without explanation
  - Comment: "Add documentation for {items}."
  
  ### Materialization:
  - ❌ Large models materialized as view (>1M rows)
  - ❌ Small models materialized as table (<10K rows, not frequently queried)
  - ❌ Incremental models without proper incremental logic
  - Comment: "Consider {materialization} for this model because {reason}."
  
  ## 4. Breaking Change Detection
  
  Compare old and new versions:
  
  a) Check for removed columns:
     - Search downstream models for references to removed columns
     - Flag as HIGH RISK if found
  
  b) Check for type changes:
     - VARCHAR → INT, FLOAT → INT (data loss)
     - DATE → STRING (downstream parsing)
     - Flag as MEDIUM RISK
  
  c) Check for logic changes:
     - Modified WHERE clauses (fewer rows)
     - Changed JOIN conditions
     - Flag as MEDIUM RISK if significant
  
  d) For breaking changes found:
     - List downstream models affected
     - Estimate severity based on downstream count
     - Suggest migration path
  
  ## 5. Performance Considerations
  
  Flag potential performance issues:
  
  - Large tables without clustering/partitioning configuration
  - Models joining to themselves
  - Repeated subquery patterns that could be CTEs
  - Missing indexes on join keys (if applicable to warehouse)
  - Very wide models (>100 columns) that select everything
  
  ## 6. Layer-Specific Checks
  
  **Staging Models (stg_):**
  - Should be SELECT from single source with renaming/casting
  - Should not have complex joins or business logic
  - Should have 1-to-1 grain with source (or documented otherwise)
  
  **Intermediate Models (int_):**
  - Purpose should be clear (joining, aggregating, or pivoting)
  - Should ref other models, not sources directly
  - Should have descriptive names
  
  **Mart Models (fct_/dim_):**
  - Should be well-documented with business context
  - Should have comprehensive tests
  - Grain should be clear and tested
  - Column names should be business-friendly
  
  ## 7. Generate Review Comment
  
  Post a structured comment on the PR:
  
  ```markdown
  ## 🔍 SQL & Data Modeling Review
  
  I've reviewed the dbt models in this PR. Here's what I found:
  
  ### ✅ Good Practices Found:
  - Models use descriptive CTE names
  - All foreign keys have relationship tests
  - Grain is clearly documented
  
  ---
  
  ### ⚠️  Issues Requiring Attention
  
  #### High Priority
  
  **Breaking Change Detected** in `models/marts/fct_orders.sql`
  - **Issue**: Removed column `customer_email`
  - **Impact**: Used by 3 downstream models:
    - `customer_lifetime_value` (line 45)
    - `email_campaign_analysis` (line 23)
    - `customer_360` (line 89)
  - **Recommendation**: Keep column with deprecation warning, or update downstream models first
  
  **Missing Tests** in `models/marts/dim_customers.sql`
  - **Issue**: No primary key test on `customer_id`
  - **Recommendation**: Add unique and not_null tests:
    ```yaml
    - name: customer_id
      tests:
        - unique
        - not_null

Medium Priority

SELECT * Usage in models/staging/stg_orders.sql (line 12)
  • Issue: SELECT * FROM {{ source('raw', 'orders') }}
  • Impact: Schema changes will automatically propagate, possibly breaking downstream
  • Recommendation: List columns explicitly:
    SELECT
      order_id,
      customer_id,
      order_date,
      status,
      total_amount
    FROM {{ source('raw', 'orders') }}
    
Inefficient Join in models/intermediate/int_order_items_enriched.sql (line 34)
  • Issue: Joining on LOWER(product_name) instead of product_id
  • Impact: Poor query performance, requires full table scan
  • Recommendation: Join on product_id if possible, or index the lowercased column

Low Priority

Naming Convention in models/marts/fct_orders.sql
  • Columns not following snake_case: CustomerEmail, OrderTotal
  • Recommendation: Use customer_email, order_total
Missing Documentation in models/marts/dim_products.sql
  • 8 of 15 columns lack descriptions
  • Consider documenting at least key business columns

📊 Summary

  • Models Reviewed: 5
  • Issues Found: 6 (1 high, 2 medium, 3 low)
  • Breaking Changes: 1
  • Test Coverage: 60% (3 of 5 models have tests)

💡 Suggestions

  1. Address the breaking change in fct_orders before merging
  2. Add primary key tests to all mart models
  3. Consider refactoring the join in int_order_items_enriched
Great work on the CTE structure and relationship tests! 🎉

## 8. Set PR Status

Based on findings:

- **HIGH priority issues**: Set status to "failure" or "changes_requested"
- **MEDIUM priority only**: Set status to "success" but note issues in comment
- **LOW priority or none**: Set status to "success" with positive feedback

## 9. Handle Edge Cases

- If unable to read files: Comment explaining and request manual review
- If model is too complex to analyze: Provide partial review
- If PR is too large (>20 models): Review sample + provide summary
- If SQL has syntax errors: Note in comment, don't attempt deep analysis