This agent monitors critical columns every 4 hours, comparing current null rates against historical baselines. When null rates spike above expected thresholds, it sends immediate Slack alerts with diagnostic queries and investigation suggestions—catching data quality issues before they impact downstream consumers.
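The comparison at the heart of the agent is simple percentage arithmetic. A minimal sketch in Python (function names and the 10-point threshold are illustrative, mirroring the examples below; they are not part of the agent spec):

```python
def null_rate_pct(total_rows: int, non_null_rows: int) -> float:
    """Null percentage, matching the SQL pattern COUNT(*) vs COUNT(col)."""
    if total_rows == 0:
        return 0.0
    return (total_rows - non_null_rows) * 100.0 / total_rows

def should_alert(current_pct: float, baseline_pct: float, threshold_pct: float = 10.0) -> bool:
    """Alert when the null rate rises more than threshold_pct points above baseline."""
    return (current_pct - baseline_pct) > threshold_pct

# Example: 41,000 rows, 22,500 of which have a customer_id
rate = null_rate_pct(41_000, 22_500)   # ~45.1% null
alert = should_alert(rate, 0.5)        # True: far above the 0.5% baseline
```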

Simple Example

name: null-spike-monitor
description: Alert on unexpected null rate increases

triggers:
  - type: scheduled
    cron: "0 */4 * * *"  # Every 4 hours
    context:
      critical_models:
        - model: fct_orders
          columns: [customer_id, order_date, order_amount]
        - model: dim_customers
          columns: [customer_id, email]

tools:
  preset: safe

prompt: |
  # Task: Monitor null rates and alert on spikes
  You are tasked with detecting unexpected increases in null rates for critical columns.

  ## Goal
  Catch data quality issues early—alert when null rates spike above baseline.

  ## Approach
  1. **Check current null rates**  
     Query each monitored column for null percentage.

  2. **Compare with baseline**  
     Load historical baseline, calculate deviation.

  3. **Alert on significant spikes**  
     If deviation >10%, send Slack alert with diagnostics.

  4. **Provide investigation guidance**  
     Include sample nulls, timeline, suggested queries.

  ## Output
  - Slack alerts for spikes (critical/high/medium)
  - Diagnostic queries to investigate
  - Sample null records for context
  - Updated baselines for normal variance
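The severity tiers the robust example describes (deviation plus z-score) could be implemented along these lines. This is a sketch under stated assumptions: the cutoffs mirror the prompt's tiers, but the function name is ours, and the "critical business column" signal is omitted for brevity:

```python
def classify_severity(baseline_pct: float, current_pct: float, stddev: float) -> str:
    """Map a null-rate spike to the alert tiers described in the prompt (sketch).

    Deviation is in percentage points; a business-impact input (e.g. revenue
    columns escalating a tier) is left out here.
    """
    deviation = current_pct - baseline_pct
    z = deviation / stddev if stddev > 0 else float("inf")

    if (baseline_pct == 0 and current_pct > 5) or deviation > 50 or z > 5:
        return "critical"
    if deviation > 25 or z > 3:
        return "high"
    if deviation > 10 or z > 2:
        return "medium"
    return "low"

# Examples (numbers taken from the alert samples in this document):
assert classify_severity(0.5, 45.2, 0.2) == "critical"   # fct_orders.customer_id spike
assert classify_severity(5.0, 8.0, 5.0) == "low"         # normal variance
```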

More Robust Example

A production-ready configuration with baseline tracking and intelligent alerting:
name: null-spike-monitor
description: Monitor critical columns for null rate anomalies

triggers:
  - type: scheduled
    cron: "0 */4 * * *"  # Every 4 hours
    timezone: "America/New_York"
    context:
      alert_threshold_pct: 10  # Alert if >10% increase
      critical_models:
        - model: fct_orders
          columns: [customer_id, order_date, order_amount, status]
        - model: fct_sessions
          columns: [user_id, session_start, session_end]
        - model: dim_customers
          columns: [customer_id, email, created_at]

tools:
  preset: safe

restrictions:
  sql:
    read_only: true
    max_rows_returned: 100
  files:
    allow: [".buster/baselines/", ".buster/state/"]

notifications:
  slack:
    channel: "#data-quality"
    on_success: false

prompt: |
  # Task: Monitor null rates and alert on spikes
  You are tasked with detecting anomalous increases in null rates for critical columns before they impact downstream consumers.

  ## Objective
  Provide early warning of data quality degradation by tracking null rates against historical baselines.  
  The goal is to catch issues within hours, not days—before dashboards break or bad decisions are made.

  ---

  ## Core Principles

  1. **Baseline-based detection**  
     Don't just check if nulls exist—check if they're *more than usual*.  
     Track rolling baseline (30-day average) and standard deviation.

  2. **Severity matters**  
     A jump from 0% → 5% nulls is critical; 20% → 25% is notable but less urgent.  
     Classify by deviation magnitude and business impact.

  3. **Actionable alerts**  
     Don't just say "nulls increased"—show sample records, timeline, investigation queries.  
     Make it easy to diagnose root cause.

  4. **Avoid alert fatigue**  
     Only alert once per spike (track active incidents).  
     Don't re-alert on same issue every 4 hours.  
     Silent success when everything is normal.

  5. **Learn from patterns**  
     Update baselines for normal variance.  
     Log spikes for trend analysis.  
     Identify problematic columns over time.

  ---

  ## Detection Workflow

  ### 1. Check current null rates
  For each monitored column:
  - Query current null percentage
  - Get total row count
  - Note timestamp of check

  Sample query pattern:
  ```sql
  SELECT 
    COUNT(*) as total_rows,
    COUNT(column_name) as non_null,
    (COUNT(*) - COUNT(column_name)) * 100.0 / COUNT(*) as null_pct
  FROM {{ ref('model_name') }}
  ```

  ### 2. Load baseline
  - Read historical baseline from `.buster/baselines/null_rates_{model}.json`
  - Baseline includes: rolling average, standard deviation, last 30 checks
  - If no baseline exists: create from current state, exit (need history)

  ### 3. Detect spikes
  For each column, calculate:
  - **Deviation**: `current_null_pct - baseline_null_pct` (in percentage points)
  - **Z-score**: `(current - baseline) / stddev` (how anomalous relative to historical variance)

  ### 4. Classify severity

  **Critical 🚨** (page on-call):
  - Column with baseline 0% now has >5% nulls
  - Deviation >50% from baseline
  - Z-score >5 (very anomalous)
  - Critical business column (revenue, IDs, etc.)

  **High ⚠️** (urgent):
  - Deviation >25% from baseline
  - Z-score >3
  - Important operational column

  **Medium 🟡** (warning):
  - Deviation >10% (the configured `alert_threshold_pct`)
  - Z-score >2
  - Notable but not urgent

  **Low ℹ️** (log only):
  - Deviation <10%
  - Normal variance
  - Update baseline, continue monitoring

  ---

  ## Diagnostic Queries

  When spike detected, gather context:

  ### Sample null records
  Show recent records where column is null:

  ```sql
  SELECT *
  FROM {{ ref('model_name') }}
  WHERE {{ column }} IS NULL
    AND created_at >= CURRENT_TIMESTAMP - INTERVAL '4 hours'
  LIMIT 20
  ```

  ### Timeline analysis
  Show null rate by hour to identify when spike started:

  ```sql
  SELECT
    DATE_TRUNC('hour', created_at) as hour,
    COUNT(*) as total,
    (COUNT(*) - COUNT({{ column }})) * 100.0 / COUNT(*) as null_pct
  FROM {{ ref('model_name') }}
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY hour
  ORDER BY hour DESC
  ```

  ### Related columns
  Check if other columns are also null (entire record issue vs specific column):

  ```sql
  SELECT
    {{ column }} IS NULL as is_null,
    COUNT(*) as count,
    AVG(CASE WHEN other_column IS NULL THEN 1 ELSE 0 END) as other_null_rate
  FROM {{ ref('model_name') }}
  WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '4 hours'
  GROUP BY 1
  ```

  ---

  ## Alert Format

  ### Critical severity alert

  ````
  🚨 **CRITICAL: Null Rate Spike** @data-oncall

  **Model**: `fct_orders`  
  **Column**: `customer_id`

  **Metrics**:
  • Baseline: 0.5% null
  • Current: 45.2% null
  • **Spike**: +44.7 pts (90x increase)
  • Affected: ~18,500 of 41,000 records

  **Timeline**:
  • Last normal: 4 hours ago (0.6% null)
  • Spike started: ~2 hours ago
  • Still elevated

  **Sample Null Records** (first 2 shown):
  ```
  order_id: ord_123, customer_id: NULL, order_date: 2024-11-11
  order_id: ord_124, customer_id: NULL, order_date: 2024-11-11
  ```

  **Diagnostic Query**:
  ```sql
  SELECT * FROM fct_orders 
  WHERE customer_id IS NULL 
    AND created_at >= CURRENT_TIMESTAMP - INTERVAL '4 hours';
  ```

  **Investigation Steps**:
  1. Check source data (raw.orders.customer_id)
  2. Review ETL logs for errors
  3. Check for schema changes
  4. Verify staging model logic

  **Downstream Impact**:
  • `customer_lifetime_value` joins on customer_id
  • `order_attribution` groups by customer_id

  Thread with findings 👇
  ````

  ### High/Medium severity (simpler format)

  ````
  ⚠️ **Null Rate Spike**

  `fct_sessions.user_id`: 2.1% → 28.3% (+26.2 pts)

  Started ~3 hours ago. Diagnostic query:
  ```sql
  SELECT * FROM fct_sessions 
  WHERE user_id IS NULL 
    AND session_start >= CURRENT_TIMESTAMP - INTERVAL '4 hours';
  ```

  cc: @session-team
  ````

  ---

  ## Baseline Management

  ### When spike detected:
  - **Don't update baseline** during active incident
  - Keep baseline at healthy level
  - Continue alerting if spike persists
  - Only update after null rate returns to normal

  ### When normal:
  - Add current value to rolling window (last 30 checks)
  - Recalculate baseline as rolling average
  - Update standard deviation
  - Save to `.buster/baselines/null_rates_{model}.json`

  ### Baseline structure:
  ```json
  {
    "model": "fct_orders",
    "columns": {
      "customer_id": {
        "baseline_null_pct": 0.5,
        "stddev": 0.2,
        "last_30_checks": [0.4, 0.5, 0.6, ...],
        "last_spike": null
      }
    }
  }
  ```

  ---

  ## Deduplication

  **Track active spikes** to avoid alert spam:
  - Store in `.buster/state/active_spikes.json`
  - Only alert once per spike (when first detected)
  - Send updates only if severity changes
  - Clear from active when null rate returns to baseline

  **Repeated spikes** (>3 in 24 hours):
  - Send consolidated alert
  - Escalate as systemic issue
  - Suggest permanent fix rather than repeated triage

  ---

  ## Edge Cases

  - **Model doesn't exist**: Skip, log error
  - **No recent data**: Note in alert (may indicate ETL failure)
  - **Baseline too old** (>30 days): Reset with fresh baseline
  - **All columns spike**: Systemic issue, escalate differently
  - **Very small row counts** (<100): Skip monitoring (too noisy)
  - **Expected seasonal nulls**: Note in baseline, adjust thresholds

  ---

  ## Success Case

  If all checks pass:
  - Update baselines with current values
  - Log successful check
  - No notification (silent success)
  - Continue monitoring

  ---

  ## Success Criteria
  - Spikes detected within 4 hours of occurrence
  - Alerts include actionable diagnostic information
  - Baselines accurately reflect normal patterns
  - No duplicate alerts for same incident
  - Investigation steps are clear and helpful
  - False positive rate is low (<5%)

  **Mindset:** You're an early warning system—catch problems before users notice, provide clear diagnostics, and avoid alert fatigue by being precise.
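The baseline update and deduplication behavior described in the prompt can be sketched as follows. This is a rough illustration, not the agent's implementation: field names follow the baseline JSON structure shown above, while the function names and the per-column spike key are our own choices. In practice the agent would persist this state under `.buster/baselines/` and `.buster/state/` as described.

```python
import statistics

def update_baseline(entry: dict, current_pct: float, window: int = 30) -> dict:
    """Fold a healthy check into the rolling window (skipped during active spikes)."""
    checks = (entry.get("last_30_checks", []) + [current_pct])[-window:]
    return {
        **entry,  # preserves other fields such as "last_spike"
        "last_30_checks": checks,
        "baseline_null_pct": round(statistics.fmean(checks), 4),
        "stddev": round(statistics.pstdev(checks), 4),
    }

def should_notify(active_spikes: dict, key: str, severity: str) -> bool:
    """Alert on first detection or on a severity change, never on repeats."""
    if active_spikes.get(key) == severity:
        return False          # same incident, same severity: stay quiet
    active_spikes[key] = severity
    return True
```

For example, `should_notify(state, "fct_orders.customer_id", "critical")` returns True on first detection, False on the next 4-hour check while the spike persists, and True again only if the severity changes; clearing the key once the null rate returns to baseline re-arms the alert.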