This agent performs comprehensive audits of your dbt repository to keep documentation, tests, and observability up to date without requiring changes in every PR. It runs on a schedule to identify gaps, enforce standards, and create issues for improvement—reducing per-PR friction while maintaining quality.

Simple Example

name: repo-auditor
description: Weekly audit of dbt repository quality

triggers:
  - type: scheduled
    cron: "0 8 * * 1"  # Monday at 8 AM

tools:
  preset: safe  # Read-only

prompt: |
  Audit the dbt repository:
  1. Find models without documentation
  2. Find models without tests
  3. Check for naming convention violations
  4. Create a report issue with findings

More Robust Example

Production-ready with comprehensive checks and actionable reports:
name: repo-auditor
description: Comprehensive weekly audit of dbt repository quality

triggers:
  - type: scheduled
    cron: "0 7 * * 1"  # Every Monday at 7 AM
    timezone: "America/New_York"
    only_on_weekdays: true
    context:
      audit_type: weekly_full
      create_issues: true

tools:
  preset: safe  # Read-only access

restrictions:
  files:
    allow: ["models/", "tests/", "macros/", ".buster/audits/"]
  sql:
    read_only: true

notifications:
  slack:
    channel: "#data-quality"
    on_success: true

prompt: |
  ## Audit Categories
  
  Run comprehensive repository audit across multiple dimensions:
  
  1. Documentation Coverage
  2. Test Coverage
  3. Naming Conventions
  4. File Organization
  5. Performance & Materialization
  6. Dependency Health
  7. Code Quality
  
  ---
  
  ## 1. Documentation Coverage Audit
  
  ### A. Find undocumented models
  
  For each model file:
  - Check if corresponding YAML exists
  - Check if model has description
  - Check if columns have descriptions
  - Calculate coverage percentage
  
  ```bash
  # Find all models
  find models/ -name "*.sql" -type f

  # For each model, check for YAML documentation
  # Parse the YAML to verify descriptions exist
  ```
### B. Calculate coverage metrics

Example output:

```
Documentation Coverage by Layer:
- Staging: 234/245 models (95.5%)
- Intermediate: 89/112 models (79.5%)
- Marts: 145/156 models (92.9%)

Column Documentation:
- Staging: 1,234/1,890 columns (65.3%)
- Intermediate: 456/678 columns (67.3%)
- Marts: 890/945 columns (94.2%)

Overall: 468/513 models documented (91.2%)
```
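The per-model check can be sketched in Python. This assumes the schema YAML has already been parsed into a dict (e.g. with PyYAML); `check_model_docs` is a hypothetical helper, not a dbt API:

```python
# Sketch: report documentation gaps for one parsed schema.yml model entry.
def check_model_docs(model: dict) -> dict:
    columns = model.get("columns", [])
    documented = [c for c in columns if c.get("description")]
    return {
        "name": model["name"],
        "has_description": bool(model.get("description")),
        "column_coverage": len(documented) / len(columns) if columns else 0.0,
        "undocumented_columns": [
            c["name"] for c in columns if not c.get("description")
        ],
    }

model = {
    "name": "fct_orders",
    "description": "One row per order.",
    "columns": [
        {"name": "order_id", "description": "Primary key."},
        {"name": "customer_id"},  # no description: counted as a gap
    ],
}
print(check_model_docs(model))
```

Aggregating these results per layer yields the coverage percentages shown above.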

### C. Identify priority gaps

High Priority (should be documented):
  • Mart models without descriptions
  • Models with >10 downstream dependencies
  • Models referenced in exposures
  • Public models (if using access config)
Medium Priority:
  • Intermediate models
  • Models changed in last 30 days
Low Priority:
  • Staging models (often straightforward)
  • Experimental/temp models

---

## 2. Test Coverage Audit

### A. Find models without tests

Check each model for:
  • Unique tests on potential key columns
  • Not null tests on required columns
  • Relationship tests on foreign keys
  • Any custom tests

Example distribution:

```
Models by Test Count:
- 0 tests: 67 models (HIGH RISK)
- 1-2 tests: 143 models
- 3-5 tests: 189 models
- 6+ tests: 114 models (WELL TESTED)
```

### B. Identify missing key tests

For each model:
  • Look for columns ending in _id or _key
  • Check if they have a unique test
  • Look for obvious foreign key relationships
  • Check if relationship tests exist

Example findings:

```
Missing Primary Key Tests:
- fct_orders.order_id - No unique test
- dim_customers.customer_id - No unique test
- fct_sessions.session_id - No unique test

Missing Foreign Key Tests:
- fct_orders.customer_id → dim_customers
- fct_sessions.user_id → dim_users
- int_order_items.product_id → dim_products

Missing Not Null Tests:
- Critical columns without not_null:
  * fct_orders: order_date, customer_id
  * dim_customers: email, created_at
```
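A rough heuristic for the key-column check, sketched in Python. Note it also flags foreign keys, which want relationship tests rather than unique tests; the input shape is illustrative, not dbt's manifest format:

```python
# Sketch: flag likely key columns that have no unique test configured.
def missing_key_tests(model: dict) -> list[str]:
    """Return columns ending in _id/_key without a unique test."""
    flagged = []
    for col in model.get("columns", []):
        name = col["name"]
        if name.endswith(("_id", "_key")) and "unique" not in col.get("tests", []):
            flagged.append(name)
    return flagged

model = {
    "name": "fct_orders",
    "columns": [
        {"name": "order_id", "tests": ["not_null"]},  # key without a unique test
        {"name": "customer_id", "tests": []},         # likely FK: wants a relationship test
        {"name": "order_date", "tests": []},          # not a key column, skipped
    ],
}
print(missing_key_tests(model))  # ['order_id', 'customer_id']
```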

### C. Test execution health

If possible, check recent test results:

```bash
# Parse dbt run_results.json for non-passing tests
jq '.results[] | select(.status != "pass")' target/run_results.json
```
Identify:
  • Tests that frequently fail
  • Tests that are flaky
  • Tests that take very long
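The same filter as the jq one-liner above can be done in Python; the field names follow dbt's run_results artifact, though the exact schema varies by dbt version:

```python
import json

# Sketch: extract non-passing results from run_results.json.
def failing_results(run_results: dict) -> list[dict]:
    return [
        {"unique_id": r["unique_id"], "status": r["status"]}
        for r in run_results["results"]
        if r["status"] != "pass"
    ]

# Inlined sample; in practice: json.load(open("target/run_results.json"))
raw = json.dumps({
    "results": [
        {"unique_id": "test.unique_fct_orders_order_id", "status": "pass"},
        {"unique_id": "test.not_null_fct_orders_order_date", "status": "fail"},
    ]
})
print(failing_results(json.loads(raw)))
```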

---

## 3. Naming Convention Audit

### A. Model naming

Check that each model follows naming conventions.

Expected patterns:
  • Staging: stg_{source}__{table}
  • Intermediate: int_{description}
  • Facts: fct_{business_entity}
  • Dimensions: dim_{business_entity}

Example violations:

```
Staging layer violations:
- staging_orders.sql → Should be stg_ecommerce__orders.sql
- raw_customers.sql → Should be stg_crm__customers.sql

Mart layer violations:
- customer_table.sql → Should be dim_customers.sql
- orders_fact.sql → Should be fct_orders.sql
- final_revenue.sql → Should be fct_revenue.sql

Files not following layer prefix:
- legacy_report.sql → No layer prefix
- temp_analysis.sql → Temp file in production?
```
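A sketch of the filename check in Python; the regexes encode the patterns above and are assumptions based on common dbt style guides:

```python
import re

# Expected name pattern per layer (illustrative, adjust to your standards).
PATTERNS = {
    "staging": re.compile(r"^stg_[a-z0-9_]+__[a-z0-9_]+$"),
    "intermediate": re.compile(r"^int_[a-z0-9_]+$"),
    "marts": re.compile(r"^(fct|dim)_[a-z0-9_]+$"),
}

def naming_violations(layer: str, model_names: list[str]) -> list[str]:
    """Return model names that do not match the layer's expected pattern."""
    pattern = PATTERNS[layer]
    return [m for m in model_names if not pattern.match(m)]

print(naming_violations("staging", ["stg_ecommerce__orders", "staging_orders"]))
# ['staging_orders']
print(naming_violations("marts", ["dim_customers", "customer_table"]))
# ['customer_table']
```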

### B. Column naming

Check columns for:
  • snake_case usage
  • Boolean prefixes (is_, has_, should_)
  • Date suffixes (_date, _at, _timestamp)
  • ID suffixes (_id, _key)

```sql
-- Sample query to check column patterns
SELECT
  table_schema,
  table_name,
  column_name
FROM information_schema.columns
WHERE table_schema IN ('staging', 'marts')
  AND (
    column_name LIKE '% %'                -- Has spaces
    OR column_name != LOWER(column_name)  -- Not lowercase
    OR column_name LIKE '%ID'             -- Should be _id
  )
```
Common Issues:
  • CamelCase columns: CustomerID, OrderDate
  • Uppercase: USER_ID, CREATED_AT
  • Inconsistent naming: custID vs customer_id
  • Missing prefixes: active → should be is_active
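The column checks above can be sketched with a couple of regexes (a rough heuristic, not a complete linter):

```python
import re

# Illustrative snake_case check: lowercase start, then lowercase/digits/underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def column_issues(name: str) -> list[str]:
    """Flag common column-naming problems."""
    issues = []
    if not SNAKE_CASE.match(name):
        issues.append("not snake_case")
    if name.endswith("ID"):
        issues.append("use _id suffix")
    return issues

print(column_issues("CustomerID"))   # ['not snake_case', 'use _id suffix']
print(column_issues("customer_id"))  # []
```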

---

## 4. File Organization Audit

### A. Directory structure

Check for proper organization.

Expected structure:

```
models/
  ├── staging/
  │   ├── source1/
  │   │   ├── _source1__models.yml
  │   │   ├── _source1__sources.yml
  │   │   └── stg_source1__*.sql
  │   └── source2/
  ├── intermediate/
  │   └── int_*.sql
  └── marts/
      ├── core/
      ├── marketing/
      └── finance/
```
Issues Found:
  • Models in wrong directories
  • Staging models not grouped by source
  • Missing _schema.yml files
  • Orphaned files
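A sketch of the structural check in Python, operating on a list of repo-relative paths (illustrative; in practice you'd collect these by walking the real models/ tree):

```python
from pathlib import PurePosixPath

# Sketch: check staging models are grouped by source with a co-located
# _<source>__models.yml.
def organization_issues(paths: list[str]) -> list[str]:
    issues = []
    dirs_with_yaml = {
        str(PurePosixPath(p).parent)
        for p in paths if p.endswith("__models.yml")
    }
    for p in paths:
        path = PurePosixPath(p)
        if path.suffix != ".sql" or path.parts[:2] != ("models", "staging"):
            continue
        if len(path.parts) != 4:  # models/staging/<source>/<model>.sql
            issues.append(f"{p}: not grouped under a source directory")
        elif str(path.parent) not in dirs_with_yaml:
            issues.append(f"{p}: directory missing a _models.yml")
    return issues

paths = [
    "models/staging/ecommerce/_ecommerce__models.yml",
    "models/staging/ecommerce/stg_ecommerce__orders.sql",
    "models/staging/stg_crm__contacts.sql",  # orphaned: no source subdirectory
]
print(organization_issues(paths))
```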

### B. YAML organization

Check for:
  • One schema.yml per directory OR per source
  • Source definitions co-located with staging
  • Consistent naming (_schema.yml vs schema.yml)

---

## 5. Performance & Materialization Audit

### A. Materialization appropriateness

Check each model's materialization against these rules:
  • Views: For simple transformations, low query frequency
  • Tables: For expensive transformations, high query frequency
  • Incremental: For large, append-only datasets
  • Ephemeral: For DRY code, not queried directly

Example issues:

```
Over-Materialized (table but could be view):
- dim_date: 365 rows, materialized as table
- dim_business_days: 252 rows, queried 2x/month
Recommendation: Change to view

Under-Materialized (view but should be table):
- fct_sessions: 50M rows, queried 100x/day
- int_orders_enriched: Complex joins, used by 5 models
Recommendation: Change to table or incremental

Missing Incremental:
- fct_events: 500M rows, append-only, full-refresh daily
Recommendation: Convert to incremental
```
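Those rules can be sketched as a rough heuristic; the thresholds here are illustrative assumptions to tune per warehouse, not dbt defaults:

```python
# Sketch: recommend a materialization from row count and query frequency.
def recommend_materialization(rows: int, queries_per_day: float,
                              append_only: bool = False) -> str:
    if append_only and rows > 100_000_000:
        return "incremental"
    if rows > 10_000_000 or queries_per_day > 50:
        return "table"
    return "view"

print(recommend_materialization(rows=365, queries_per_day=0.1))         # view
print(recommend_materialization(rows=50_000_000, queries_per_day=100))  # table
print(recommend_materialization(rows=500_000_000, queries_per_day=1,
                                append_only=True))                      # incremental
```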

### B. Performance hints

Look for expensive patterns:
  • Large CTEs that could be intermediate models
  • Repeated subqueries
  • Window functions without PARTITION BY
  • DISTINCT without explanation

---

## 6. Dependency Health Audit

### A. Dependency graph analysis

```bash
# Generate dependency graph
dbt ls --select +fct_orders --output json
```

Check for:

Circular Dependencies (should never happen):
- ERROR: model_a → model_b → model_c → model_a

Long Dependency Chains (>8 levels deep):
- source → stg → int1 → int2 → int3 → int4 → int5 → fct
- Recommendation: Simplify the chain

Unused Models (no downstream dependencies):
- int_legacy_orders (last modified 180 days ago)
- int_temp_analysis (last modified 45 days ago)
- Recommendation: Remove or document purpose

High-Fan-Out Models (>20 downstream dependencies):
- stg_salesforce__accounts (45 downstream)
- dim_customers (38 downstream)
- Note: Changes to these have high impact
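The chain-depth and cycle checks can be sketched over a simple ref graph (a plain dict mapping each model to its upstream refs; not dbt's manifest format):

```python
# Sketch: measure dependency depth and detect cycles via DFS.
def longest_chain(graph: dict[str, list[str]], node: str,
                  seen: frozenset = frozenset()) -> int:
    if node in seen:
        raise ValueError(f"circular dependency at {node}")
    deps = graph.get(node, [])
    if not deps:
        return 1
    return 1 + max(longest_chain(graph, d, seen | {node}) for d in deps)

graph = {
    "fct_orders": ["int_orders"],
    "int_orders": ["stg_shop__orders"],
    "stg_shop__orders": [],
}
print(longest_chain(graph, "fct_orders"))  # 3

graph["stg_shop__orders"] = ["fct_orders"]  # introduce a cycle
try:
    longest_chain(graph, "fct_orders")
except ValueError as err:
    print(err)  # reports the circular dependency
```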

### B. Source freshness

Check source freshness configurations:
  • Are all sources monitored?
  • Are thresholds reasonable?
  • Any sources consistently failing?

---

## 7. Code Quality Audit

### A. SQL patterns

Search for common anti-patterns:

```bash
# SELECT * usage
grep -r "SELECT \*" models/ --include="*.sql" | wc -l

# Hardcoded date values
grep -r "WHERE.*'2024-" models/ --include="*.sql"

# Very long IN clauses (100+ characters)
grep -r "IN ([^)]\{100,\})" models/ --include="*.sql"
```

### B. Macro usage

Check for:
  • Repeated code that could be macroized
  • Unused macros
  • Undocumented macros

### C. Config consistency

Check dbt_project.yml:
  • Consistent materialization defaults
  • Appropriate schema naming
  • Tags being used effectively

---

## Generate Comprehensive Report

### Create Summary Issue

Post the weekly audit results as a GitHub issue:
## 📊 Weekly Repository Audit - {{date}}

Comprehensive quality check of the dbt repository.

---

## Executive Summary

| Category | Score | Status | Trend |
|----------|-------|--------|-------|
| Documentation | 91.2% | ✅ Excellent | ↗️ +2.1% |
| Test Coverage | 78.5% | ⚠️  Good | → +0.3% |
| Naming Conventions | 88.9% | ✅ Good | ↗️ +1.5% |
| File Organization | 95.1% | ✅ Excellent | → 0% |
| Performance | 73.2% | ⚠️  Fair | ↘️ -3.2% |
| Dependencies | 92.4% | ✅ Excellent | → +0.1% |
| Code Quality | 81.7% | ✅ Good | ↗️ +2.8% |

**Overall Health**: 86.3% (↗️ +0.8% from last week)

---

## 🔴 High Priority Issues (Action Required)

### Missing Tests on Critical Models (12 models)

Mart models without primary key tests:
- [ ] `fct_orders` - Missing unique test on `order_id`
- [ ] `fct_sessions` - Missing unique test on `session_id`
- [ ] `dim_customers` - Missing unique test on `customer_id`

**Recommendation**: Add unique + not_null tests to all primary keys
**Estimated Effort**: 2 hours
**Assigned**: @data-quality-team

### Performance Issues (5 models)

Models that should be incremental:
- [ ] `fct_events` - 500M rows, full-refresh taking 45 min
- [ ] `fct_page_views` - 200M rows, full-refresh taking 20 min

**Recommendation**: Convert to incremental materialization
**Estimated Savings**: ~$450/month in compute costs
**Assigned**: @data-platform-team

---

## 🟡 Medium Priority Issues (Improve When Possible)

### Documentation Gaps (45 models)

Intermediate models without descriptions:
- `int_orders_enriched`, `int_customer_segments`, `int_revenue_rollup`
- [Full list: 45 models](link_to_detailed_report)

**Recommendation**: Add model and column descriptions
**Estimated Effort**: 6 hours

### Naming Convention Violations (23 files)

Files not following naming standards:
- `staging_orders.sql` → Should be `stg_ecommerce__orders.sql`
- `customer_table.sql` → Should be `dim_customers.sql`
- [Full list: 23 violations](link_to_detailed_report)

**Recommendation**: Rename files to follow conventions
**Estimated Effort**: 3 hours

---

## ✅ Wins This Week

- Added 15 new tests across marts layer (+3.2% coverage)
- Documented 8 previously undocumented models
- Removed 3 unused intermediate models
- Improved naming consistency in staging layer

---

## 📈 Trends

**Improving**:
- Documentation coverage (91% → 93% over last month)
- Code quality score (79% → 82%)

**Declining**:
- Performance score (76% → 73%) - need to address incremental models

**Stable**:
- Test coverage (~78% for 3 months)
- File organization (consistently high)

---

## 🎯 Next Week's Goals

1. Add tests to all marts primary keys
2. Convert top 2 slow models to incremental
3. Document 20 intermediate models
4. Fix top 10 naming violations

**Target Overall Health**: 88%

---

## 📎 Detailed Reports

- [Full Documentation Report](link)
- [Test Coverage Details](link)
- [Naming Violations List](link)
- [Performance Analysis](link)
- [Dependency Graph](link)

---

*Generated by Buster Repo Auditor*  
*Next audit: {{next_week_date}}*

### Send Slack Summary

Post a summary to #data-quality:
📊 **Weekly Repo Audit Complete**

**Overall Health**: 86.3% (↗️ +0.8%)

**Priority Actions**:
• 12 models need tests 🔴
• 5 models should be incremental 🔴
• 45 models need documentation 🟡

**Wins**:
✅ +15 new tests added
✅ +8 models documented
✅ 3 unused models removed

**Full Report**: {{issue_link}}

Great work team! Let's tackle those high-priority items this week. 💪

### Store Audit History

Save results for trend tracking to `.buster/audits/audit_{{date}}.json`:

```json
{
  "date": "2024-01-15",
  "scores": {
    "documentation": 91.2,
    "tests": 78.5,
    "naming": 88.9,
    "organization": 95.1,
    "performance": 73.2,
    "dependencies": 92.4,
    "code_quality": 81.7,
    "overall": 86.3
  },
  "counts": {
    "total_models": 513,
    "documented_models": 468,
    "tested_models": 403,
    "high_priority_issues": 17,
    "medium_priority_issues": 68
  },
  "issue_url": "https://github.com/..."
}
```
Use this to track improvements over time and generate trend charts.
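Week-over-week trends can then be computed by diffing two stored audits; a minimal sketch (file contents inlined here; in practice you'd read the JSON files from .buster/audits/):

```python
import json

# Sketch: compute score deltas between two audit snapshots.
def score_deltas(prev: dict, curr: dict) -> dict[str, float]:
    return {
        k: round(curr["scores"][k] - prev["scores"][k], 1)
        for k in curr["scores"]
        if k in prev["scores"]
    }

prev = json.loads('{"scores": {"documentation": 89.1, "overall": 85.5}}')
curr = json.loads('{"scores": {"documentation": 91.2, "overall": 86.3}}')
print(score_deltas(prev, curr))  # {'documentation': 2.1, 'overall': 0.8}
```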