name: sql-reviewer
description: Comprehensive SQL and data modeling review for dbt PRs
triggers:
- type: pull_request
events: [opened, synchronize]
paths:
include: ["models/**/*.sql", "models/**/*.yml"]
conditions:
- type: pr_labels
none_of: ["skip-review", "wip"]
tools:
preset: safe # Read-only access
restrictions:
files:
allow: ["models/", "tests/", "macros/"]
sql:
read_only: true
allowed_schemas: ["raw", "staging", "intermediate", "marts"]
notifications:
slack:
channel: "#data-pr-reviews"
on_failure: true # Only notify if agent fails, not for every review
prompt: |
## 1. Analyze changed models
For each SQL file modified in this PR:
a) Read the SQL file content
b) Read the corresponding YAML file (if exists)
c) Identify the model layer (staging, intermediate, marts)
d) If possible, query the model in the warehouse to understand current data
## 2. SQL Quality Checks
Review each model for these issues:
### Anti-Patterns to Flag:
**SELECT * Usage:**
- ❌ `SELECT * FROM {{ source() }}` in staging models
- ❌ `SELECT * FROM {{ ref() }}` in marts
- ✅ OK in CTEs if immediately filtered/projected
- Comment: "SELECT * can break downstream models if source columns change. List columns explicitly."
**Inefficient Joins:**
- ❌ Joining on non-indexed columns or functions: `ON LOWER(a.email) = b.email`
- ❌ Cross joins without limits: `FROM a CROSS JOIN b`
- ❌ Multiple LEFT JOINs that could be INNER: suggests data quality issue
- Comment: "Join on {column} may be inefficient. Consider {suggestion}."
**Expensive Operations:**
- ❌ DISTINCT without explanation
- ❌ Window functions on entire table without PARTITION BY
- ❌ Subqueries in SELECT clause (correlated subqueries)
- Comment: "This operation may be expensive. Consider {alternative approach}."
**Naming Conventions:**
- ❌ Models not following layer prefixes (stg_, int_, fct_, dim_)
- ❌ Column names not in snake_case
- ❌ Boolean columns not prefixed with is_/has_
- ❌ Date columns not suffixed with _date/_at
- Comment: "Column {name} doesn't follow naming convention. Should be {suggestion}."
**CTE Structure:**
- ❌ No CTEs in complex models (>50 lines)
- ❌ CTEs not imported/logical in staging models (import, logical, final pattern)
- ❌ Unused CTEs
- Comment: "Consider restructuring with CTEs for readability."
**Hardcoded Values:**
- ❌ Hardcoded dates: `WHERE created_at > '2024-01-01'`
- ❌ Hardcoded IDs in filters
- ❌ Magic numbers without explanation
- Comment: "Hardcoded {value} should be parameterized or documented."
## 3. Data Modeling Checks
### Grain and Keys:
- ✅ Check if model has clear grain (one row per...)
- ✅ Verify primary key test exists for grain
- ❌ Flag if grain changed from previous version
- ❌ Flag fact tables without obvious grain
### Test Coverage:
- ❌ Models without any tests
- ❌ ID columns without unique test
- ❌ Foreign keys without relationship test
- ❌ Important columns without not_null test
- Comment: "Model {name} needs tests. Suggested: {list}."
### Documentation:
- ❌ New models without description
- ❌ Mart models with <50% columns documented
- ❌ Complex calculations without explanation
- Comment: "Add documentation for {items}."
### Materialization:
- ❌ Large models materialized as view (>1M rows)
- ❌ Small models materialized as table (<10K rows, not frequently queried)
- ❌ Incremental models without proper incremental logic
- Comment: "Consider {materialization} for this model because {reason}."
## 4. Breaking Change Detection
Compare old and new versions:
a) Check for removed columns:
- Search downstream models for references to removed columns
- Flag as HIGH RISK if found
b) Check for type changes:
- VARCHAR → INT, FLOAT → INT (data loss)
- DATE → STRING (downstream parsing)
- Flag as MEDIUM RISK
c) Check for logic changes:
- Modified WHERE clauses (fewer rows)
- Changed JOIN conditions
- Flag as MEDIUM RISK if significant
d) For breaking changes found:
- List downstream models affected
- Estimate severity based on downstream count
- Suggest migration path
## 5. Performance Considerations
Flag potential performance issues:
- Large tables without clustering/partitioning configuration
- Models joining to themselves
- Repeated subquery patterns that could be CTEs
- Missing indexes on join keys (if applicable to warehouse)
- Very wide models (>100 columns) that select everything
## 6. Layer-Specific Checks
**Staging Models (stg_):**
- Should be SELECT from single source with renaming/casting
- Should not have complex joins or business logic
- Should have 1-to-1 grain with source (or documented otherwise)
**Intermediate Models (int_):**
- Purpose should be clear (joining, aggregating, or pivoting)
- Should ref other models, not sources directly
- Should have descriptive names
**Mart Models (fct_/dim_):**
- Should be well-documented with business context
- Should have comprehensive tests
- Grain should be clear and tested
- Column names should be business-friendly
## 7. Generate Review Comment
Post a structured comment on the PR:
```markdown
## 🔍 SQL & Data Modeling Review
I've reviewed the dbt models in this PR. Here's what I found:
### ✅ Good Practices Found:
- Models use descriptive CTE names
- All foreign keys have relationship tests
- Grain is clearly documented
---
### ⚠️ Issues Requiring Attention
#### High Priority
**Breaking Change Detected** in `models/marts/fct_orders.sql`
- **Issue**: Removed column `customer_email`
- **Impact**: Used by 3 downstream models:
- `customer_lifetime_value` (line 45)
- `email_campaign_analysis` (line 23)
- `customer_360` (line 89)
- **Recommendation**: Keep column with deprecation warning, or update downstream models first
**Missing Tests** in `models/marts/dim_customers.sql`
- **Issue**: No primary key test on `customer_id`
- **Recommendation**: Add unique and not_null tests:
```yaml
- name: customer_id
tests:
- unique
- not_null