This agent performs comprehensive audits of your dbt repository to keep documentation, tests, and observability up to date without requiring changes in every PR. It runs on a schedule to identify gaps, enforce standards, and create issues for improvement—reducing per-PR friction while maintaining quality.

Simple Example

name: repo-auditor
description: Weekly audit of dbt repository quality

triggers:
  - type: scheduled
    cron: "0 8 * * 1"  # Monday at 8 AM

tools:
  preset: safe  # Read-only

prompt: |
  Audit the dbt repository:
  1. Find models without documentation
  2. Find models without tests
  3. Check for naming convention violations
  4. Create a report issue with findings

More Robust Example

Production-ready with comprehensive checks and actionable reports:
name: repo-auditor
description: Comprehensive weekly audit of dbt repository quality

triggers:
  - type: scheduled
    cron: "0 7 * * 1"  # Every Monday at 7 AM
    timezone: "America/New_York"
    only_on_weekdays: true
    context:
      audit_type: weekly_full
      create_issues: true

tools:
  preset: safe  # Read-only access

restrictions:
  files:
    allow: ["models/", "tests/", "macros/", ".buster/audits/"]
  sql:
    read_only: true

notifications:
  slack:
    channel: "#data-quality"
    on_success: true

prompt: |
  ## Audit Categories
  
  Run comprehensive repository audit across multiple dimensions:
  
  1. Documentation Coverage
  2. Test Coverage
  3. Naming Conventions
  4. File Organization
  5. Performance & Materialization
  6. Dependency Health
  7. Code Quality
  
  ---
  
  ## 1. Documentation Coverage Audit
  
  ### A. Find undocumented models
  
  For each model file:
  - Check if corresponding YAML exists
  - Check if model has description
  - Check if columns have descriptions
  - Calculate coverage percentage
  
  ```bash
  # Find all models
  find models/ -name "*.sql" -type f

  # For each model, check for YAML documentation
  # Parse the YAML to verify descriptions exist
  ```
### B. Calculate coverage metrics

Example output:

```
Documentation Coverage by Layer:
- Staging: 234/245 models (95.5%)
- Intermediate: 89/112 models (79.5%)
- Marts: 145/156 models (92.9%)

Column Documentation:
- Staging: 1,234/1,890 columns (65.3%)
- Intermediate: 456/678 columns (67.3%)
- Marts: 890/945 columns (94.2%)

Overall: 468/513 models documented (91.2%)
```
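The per-model check can be sketched in Python. This assumes the schema YAML has already been parsed into a dict (e.g. with PyYAML); `check_model_docs` is a hypothetical helper, not a dbt API:

```python
# Sketch: report documentation gaps for one parsed schema.yml model entry.
def check_model_docs(model: dict) -> dict:
    columns = model.get("columns", [])
    documented = [c for c in columns if c.get("description")]
    return {
        "name": model["name"],
        "has_description": bool(model.get("description")),
        "column_coverage": len(documented) / len(columns) if columns else 0.0,
        "undocumented_columns": [
            c["name"] for c in columns if not c.get("description")
        ],
    }

model = {
    "name": "fct_orders",
    "description": "One row per order.",
    "columns": [
        {"name": "order_id", "description": "Primary key."},
        {"name": "customer_id"},  # no description: counted as a gap
    ],
}
print(check_model_docs(model))
```

Aggregating these results per layer yields the coverage percentages shown above.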

### C. Identify priority gaps

High Priority (should be documented):
  • Mart models without descriptions
  • Models with >10 downstream dependencies
  • Models referenced in exposures
  • Public models (if using access config)
Medium Priority:
  • Intermediate models
  • Models changed in last 30 days
Low Priority:
  • Staging models (often straightforward)
  • Experimental/temp models

---

## 2. Test Coverage Audit

### A. Find models without tests

Check each model for:
  • Unique tests on potential key columns
  • Not null tests on required columns
  • Relationship tests on foreign keys
  • Any custom tests

Example distribution:

```
Models by Test Count:
- 0 tests: 67 models (HIGH RISK)
- 1-2 tests: 143 models
- 3-5 tests: 189 models
- 6+ tests: 114 models (WELL TESTED)
```

### B. Identify missing key tests

For each model:
  • Look for columns ending in _id or _key
  • Check if they have a unique test
  • Look for obvious foreign key relationships
  • Check if relationship tests exist

Example findings:

```
Missing Primary Key Tests:
- fct_orders.order_id - No unique test
- dim_customers.customer_id - No unique test
- fct_sessions.session_id - No unique test

Missing Foreign Key Tests:
- fct_orders.customer_id → dim_customers
- fct_sessions.user_id → dim_users
- int_order_items.product_id → dim_products

Missing Not Null Tests:
- Critical columns without not_null:
  * fct_orders: order_date, customer_id
  * dim_customers: email, created_at
```
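A rough heuristic for the key-column check, sketched in Python. Note it also flags foreign keys, which want relationship tests rather than unique tests; the input shape is illustrative, not dbt's manifest format:

```python
# Sketch: flag likely key columns that have no unique test configured.
def missing_key_tests(model: dict) -> list[str]:
    """Return columns ending in _id/_key without a unique test."""
    flagged = []
    for col in model.get("columns", []):
        name = col["name"]
        if name.endswith(("_id", "_key")) and "unique" not in col.get("tests", []):
            flagged.append(name)
    return flagged

model = {
    "name": "fct_orders",
    "columns": [
        {"name": "order_id", "tests": ["not_null"]},  # key without a unique test
        {"name": "customer_id", "tests": []},         # likely FK: wants a relationship test
        {"name": "order_date", "tests": []},          # not a key column, skipped
    ],
}
print(missing_key_tests(model))  # ['order_id', 'customer_id']
```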

### C. Test execution health

If possible, check recent test results:

```bash
# Parse dbt run_results.json for non-passing tests
jq '.results[] | select(.status != "pass")' target/run_results.json
```
Identify:
  • Tests that frequently fail
  • Tests that are flaky
  • Tests that take very long
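The same filter as the jq one-liner above can be done in Python; the field names follow dbt's run_results artifact, though the exact schema varies by dbt version:

```python
import json

# Sketch: extract non-passing results from run_results.json.
def failing_results(run_results: dict) -> list[dict]:
    return [
        {"unique_id": r["unique_id"], "status": r["status"]}
        for r in run_results["results"]
        if r["status"] != "pass"
    ]

# Inlined sample; in practice: json.load(open("target/run_results.json"))
raw = json.dumps({
    "results": [
        {"unique_id": "test.unique_fct_orders_order_id", "status": "pass"},
        {"unique_id": "test.not_null_fct_orders_order_date", "status": "fail"},
    ]
})
print(failing_results(json.loads(raw)))
```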

---

## 3. Naming Convention Audit

### A. Model naming

Check that each model follows naming conventions.

Expected patterns:
  • Staging: stg_{source}__{table}
  • Intermediate: int_{description}
  • Facts: fct_{business_entity}
  • Dimensions: dim_{business_entity}

Example violations:

```
Staging layer violations:
- staging_orders.sql → Should be stg_ecommerce__orders.sql
- raw_customers.sql → Should be stg_crm__customers.sql

Mart layer violations:
- customer_table.sql → Should be dim_customers.sql
- orders_fact.sql → Should be fct_orders.sql
- final_revenue.sql → Should be fct_revenue.sql

Files not following layer prefix:
- legacy_report.sql → No layer prefix
- temp_analysis.sql → Temp file in production?
```
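A sketch of the filename check in Python; the regexes encode the patterns above and are assumptions based on common dbt style guides:

```python
import re

# Expected name pattern per layer (illustrative, adjust to your standards).
PATTERNS = {
    "staging": re.compile(r"^stg_[a-z0-9_]+__[a-z0-9_]+$"),
    "intermediate": re.compile(r"^int_[a-z0-9_]+$"),
    "marts": re.compile(r"^(fct|dim)_[a-z0-9_]+$"),
}

def naming_violations(layer: str, model_names: list[str]) -> list[str]:
    """Return model names that do not match the layer's expected pattern."""
    pattern = PATTERNS[layer]
    return [m for m in model_names if not pattern.match(m)]

print(naming_violations("staging", ["stg_ecommerce__orders", "staging_orders"]))
# ['staging_orders']
print(naming_violations("marts", ["dim_customers", "customer_table"]))
# ['customer_table']
```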

### B. Column naming

Check columns for:
  • snake_case usage
  • Boolean prefixes (is_, has_, should_)
  • Date suffixes (_date, _at, _timestamp)
  • ID suffixes (_id, _key)

```sql
-- Sample query to check column patterns
SELECT
  table_schema,
  table_name,
  column_name
FROM information_schema.columns
WHERE table_schema IN ('staging', 'marts')
  AND (
    column_name LIKE '% %'                -- Has spaces
    OR column_name != LOWER(column_name)  -- Not lowercase
    OR column_name LIKE '%ID'             -- Should be _id
  )
```
Common Issues:
  • CamelCase columns: CustomerID, OrderDate
  • Uppercase: USER_ID, CREATED_AT
  • Inconsistent naming: custID vs customer_id
  • Missing prefixes: active → should be is_active
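The column checks above can be sketched with a couple of regexes (a rough heuristic, not a complete linter):

```python
import re

# Illustrative snake_case check: lowercase start, then lowercase/digits/underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def column_issues(name: str) -> list[str]:
    """Flag common column-naming problems."""
    issues = []
    if not SNAKE_CASE.match(name):
        issues.append("not snake_case")
    if name.endswith("ID"):
        issues.append("use _id suffix")
    return issues

print(column_issues("CustomerID"))   # ['not snake_case', 'use _id suffix']
print(column_issues("customer_id"))  # []
```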

---

## 4. File Organization Audit

### A. Directory structure

Check for proper organization.

Expected structure:

```
models/
  ├── staging/
  │   ├── source1/
  │   │   ├── _source1__models.yml
  │   │   ├── _source1__sources.yml
  │   │   └── stg_source1__*.sql
  │   └── source2/
  ├── intermediate/
  │   └── int_*.sql
  └── marts/
      ├── core/
      ├── marketing/
      └── finance/
```
Issues Found:
  • Models in wrong directories
  • Staging models not grouped by source
  • Missing _schema.yml files
  • Orphaned files
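A sketch of the structural check in Python, operating on a list of repo-relative paths (illustrative; in practice you'd collect these by walking the real models/ tree):

```python
from pathlib import PurePosixPath

# Sketch: check staging models are grouped by source with a co-located
# _<source>__models.yml.
def organization_issues(paths: list[str]) -> list[str]:
    issues = []
    dirs_with_yaml = {
        str(PurePosixPath(p).parent)
        for p in paths if p.endswith("__models.yml")
    }
    for p in paths:
        path = PurePosixPath(p)
        if path.suffix != ".sql" or path.parts[:2] != ("models", "staging"):
            continue
        if len(path.parts) != 4:  # models/staging/<source>/<model>.sql
            issues.append(f"{p}: not grouped under a source directory")
        elif str(path.parent) not in dirs_with_yaml:
            issues.append(f"{p}: directory missing a _models.yml")
    return issues

paths = [
    "models/staging/ecommerce/_ecommerce__models.yml",
    "models/staging/ecommerce/stg_ecommerce__orders.sql",
    "models/staging/stg_crm__contacts.sql",  # orphaned: no source subdirectory
]
print(organization_issues(paths))
```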

### B. YAML organization

Check for:
  • One schema.yml per directory OR per source
  • Source definitions co-located with staging
  • Consistent naming (_schema.yml vs schema.yml)

---

## 5. Performance & Materialization Audit

### A. Materialization appropriateness

Check each model's materialization against these rules:
  • Views: For simple transformations, low query frequency
  • Tables: For expensive transformations, high query frequency
  • Incremental: For large, append-only datasets
  • Ephemeral: For DRY code, not queried directly

Example issues:

```
Over-Materialized (table but could be view):
- dim_date: 365 rows, materialized as table
- dim_business_days: 252 rows, queried 2x/month
Recommendation: Change to view

Under-Materialized (view but should be table):
- fct_sessions: 50M rows, queried 100x/day
- int_orders_enriched: Complex joins, used by 5 models
Recommendation: Change to table or incremental

Missing Incremental:
- fct_events: 500M rows, append-only, full-refresh daily
Recommendation: Convert to incremental
```
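Those rules can be sketched as a rough heuristic; the thresholds here are illustrative assumptions to tune per warehouse, not dbt defaults:

```python
# Sketch: recommend a materialization from row count and query frequency.
def recommend_materialization(rows: int, queries_per_day: float,
                              append_only: bool = False) -> str:
    if append_only and rows > 100_000_000:
        return "incremental"
    if rows > 10_000_000 or queries_per_day > 50:
        return "table"
    return "view"

print(recommend_materialization(rows=365, queries_per_day=0.1))         # view
print(recommend_materialization(rows=50_000_000, queries_per_day=100))  # table
print(recommend_materialization(rows=500_000_000, queries_per_day=1,
                                append_only=True))                      # incremental
```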

### B. Performance hints

Look for expensive patterns:
  • Large CTEs that could be intermediate models
  • Repeated subqueries
  • Window functions without PARTITION BY
  • DISTINCT without explanation

---

## 6. Dependency Health Audit

### A. Dependency graph analysis

```bash
# Generate dependency graph
dbt ls --select +fct_orders --output json
```

Check for:

Circular Dependencies (should never happen):
- ERROR: model_a → model_b → model_c → model_a

Long Dependency Chains (>8 levels deep):
- source → stg → int1 → int2 → int3 → int4 → int5 → fct
- Recommendation: Simplify the chain

Unused Models (no downstream dependencies):
- int_legacy_orders (last modified 180 days ago)
- int_temp_analysis (last modified 45 days ago)
- Recommendation: Remove or document purpose

High-Fan-Out Models (>20 downstream dependencies):
- stg_salesforce__accounts (45 downstream)
- dim_customers (38 downstream)
- Note: Changes to these have high impact
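The chain-depth and cycle checks can be sketched over a simple ref graph (a plain dict mapping each model to its upstream refs; not dbt's manifest format):

```python
# Sketch: measure dependency depth and detect cycles via DFS.
def longest_chain(graph: dict[str, list[str]], node: str,
                  seen: frozenset = frozenset()) -> int:
    if node in seen:
        raise ValueError(f"circular dependency at {node}")
    deps = graph.get(node, [])
    if not deps:
        return 1
    return 1 + max(longest_chain(graph, d, seen | {node}) for d in deps)

graph = {
    "fct_orders": ["int_orders"],
    "int_orders": ["stg_shop__orders"],
    "stg_shop__orders": [],
}
print(longest_chain(graph, "fct_orders"))  # 3

graph["stg_shop__orders"] = ["fct_orders"]  # introduce a cycle
try:
    longest_chain(graph, "fct_orders")
except ValueError as err:
    print(err)  # reports the circular dependency
```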

### B. Source freshness

Check source freshness configurations:
  • Are all sources monitored?
  • Are thresholds reasonable?
  • Any sources consistently failing?

---

## 7. Code Quality Audit

### A. SQL patterns

Search for common anti-patterns:

```bash
# SELECT * usage
grep -r "SELECT \*" models/ --include="*.sql" | wc -l

# Hardcoded date values
grep -r "WHERE.*'2024-" models/ --include="*.sql"

# Very long IN clauses (100+ characters)
grep -r "IN ([^)]\{100,\})" models/ --include="*.sql"
```

### B. Macro usage

Check for:
  • Repeated code that could be macroized
  • Unused macros
  • Undocumented macros

### C. Config consistency

Check dbt_project.yml:
  • Consistent materialization defaults
  • Appropriate schema naming
  • Tags being used effectively

---

## Generate Comprehensive Report

### Create Summary Issue

Post the weekly audit results as a GitHub issue:
## 📊 Weekly Repository Audit - {{date}}

Comprehensive quality check of the dbt repository.

---

## Executive Summary

| Category | Score | Status | Trend |
|----------|-------|--------|-------|
| Documentation | 91.2% | ✅ Excellent | ↗️ +2.1% |
| Test Coverage | 78.5% | ⚠️  Good | → +0.3% |
| Naming Conventions | 88.9% | ✅ Good | ↗️ +1.5% |
| File Organization | 95.1% | ✅ Excellent | → 0% |
| Performance | 73.2% | ⚠️  Fair | ↘️ -3.2% |
| Dependencies | 92.4% | ✅ Excellent | → +0.1% |
| Code Quality | 81.7% | ✅ Good | ↗️ +2.8% |

**Overall Health**: 86.3% (↗️ +0.8% from last week)

---

## 🔴 High Priority Issues (Action Required)

### Missing Tests on Critical Models (12 models)

Mart models without primary key tests:
- [ ] `fct_orders` - Missing unique test on `order_id`
- [ ] `fct_sessions` - Missing unique test on `session_id`
- [ ] `dim_customers` - Missing unique test on `customer_id`

**Recommendation**: Add unique + not_null tests to all primary keys
**Estimated Effort**: 2 hours
**Assigned**: @data-quality-team

### Performance Issues (5 models)

Models that should be incremental:
- [ ] `fct_events` - 500M rows, full-refresh taking 45 min
- [ ] `fct_page_views` - 200M rows, full-refresh taking 20 min

**Recommendation**: Convert to incremental materialization
**Estimated Savings**: ~$450/month in compute costs
**Assigned**: @data-platform-team

---

## 🟡 Medium Priority Issues (Improve When Possible)

### Documentation Gaps (45 models)

Intermediate models without descriptions:
- `int_orders_enriched`, `int_customer_segments`, `int_revenue_rollup`
- [Full list: 45 models](link_to_detailed_report)

**Recommendation**: Add model and column descriptions
**Estimated Effort**: 6 hours

### Naming Convention Violations (23 files)

Files not following naming standards:
- `staging_orders.sql` → Should be `stg_ecommerce__orders.sql`
- `customer_table.sql` → Should be `dim_customers.sql`
- [Full list: 23 violations](link_to_detailed_report)

**Recommendation**: Rename files to follow conventions
**Estimated Effort**: 3 hours

---

## ✅ Wins This Week

- Added 15 new tests across marts layer (+3.2% coverage)
- Documented 8 previously undocumented models
- Removed 3 unused intermediate models
- Improved naming consistency in staging layer

---

## 📈 Trends

**Improving**:
- Documentation coverage (91% → 93% over last month)
- Code quality score (79% → 82%)

**Declining**:
- Performance score (76% → 73%) - need to address incremental models

**Stable**:
- Test coverage (~78% for 3 months)
- File organization (consistently high)

---

## 🎯 Next Week's Goals

1. Add tests to all marts primary keys
2. Convert top 2 slow models to incremental
3. Document 20 intermediate models
4. Fix top 10 naming violations

**Target Overall Health**: 88%

---

## 📎 Detailed Reports

- [Full Documentation Report](link)
- [Test Coverage Details](link)
- [Naming Violations List](link)
- [Performance Analysis](link)
- [Dependency Graph](link)

---

*Generated by Buster Repo Auditor*  
*Next audit: {{next_week_date}}*

### Send Slack Summary

Post a summary to #data-quality:
📊 **Weekly Repo Audit Complete**

**Overall Health**: 86.3% (↗️ +0.8%)

**Priority Actions**:
• 12 models need tests 🔴
• 5 models should be incremental 🔴
• 45 models need documentation 🟡

**Wins**:
✅ +15 new tests added
✅ +8 models documented
✅ 3 unused models removed

**Full Report**: {{issue_link}}

Great work team! Let's tackle those high-priority items this week. 💪

### Store Audit History

Save results for trend tracking to `.buster/audits/audit_{{date}}.json`:

```json
{
  "date": "2024-01-15",
  "scores": {
    "documentation": 91.2,
    "tests": 78.5,
    "naming": 88.9,
    "organization": 95.1,
    "performance": 73.2,
    "dependencies": 92.4,
    "code_quality": 81.7,
    "overall": 86.3
  },
  "counts": {
    "total_models": 513,
    "documented_models": 468,
    "tested_models": 403,
    "high_priority_issues": 17,
    "medium_priority_issues": 68
  },
  "issue_url": "https://github.com/..."
}
```
Use this to track improvements over time and generate trend charts.
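Week-over-week trends can then be computed by diffing two stored audits; a minimal sketch (file contents inlined here; in practice you'd read the JSON files from .buster/audits/):

```python
import json

# Sketch: compute score deltas between two audit snapshots.
def score_deltas(prev: dict, curr: dict) -> dict[str, float]:
    return {
        k: round(curr["scores"][k] - prev["scores"][k], 1)
        for k in curr["scores"]
        if k in prev["scores"]
    }

prev = json.loads('{"scores": {"documentation": 89.1, "overall": 85.5}}')
curr = json.loads('{"scores": {"documentation": 91.2, "overall": 86.3}}')
print(score_deltas(prev, curr))  # {'documentation': 2.1, 'overall': 0.8}
```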