## Simple Example
```yaml
name: null-spike-monitor
description: Alert on unexpected null rate increases

triggers:
  - type: scheduled
    cron: "0 */4 * * *"  # Every 4 hours

context:
  critical_models:
    - model: fct_orders
      columns: [customer_id, order_date, order_amount]
    - model: dim_customers
      columns: [customer_id, email]

tools:
  preset: safe

prompt: |
  # Task: Monitor null rates and alert on spikes

  You are tasked with detecting unexpected increases in null rates for critical columns.

  ## Goal

  Catch data quality issues early—alert when null rates spike above baseline.

  ## Approach

  1. **Check current null rates**
     Query each monitored column for null percentage.

  2. **Compare with baseline**
     Load historical baseline, calculate deviation.

  3. **Alert on significant spikes**
     If deviation >10%, send Slack alert with diagnostics.

  4. **Provide investigation guidance**
     Include sample nulls, timeline, suggested queries.

  ## Output

  - Slack alerts for spikes (critical/high/medium)
  - Diagnostic queries to investigate
  - Sample null records for context
  - Updated baselines for normal variance
```
## More Robust Example
Production-ready with baseline tracking and intelligent alerting:
```yaml
name: null-spike-monitor
description: Monitor critical columns for null rate anomalies

triggers:
  - type: scheduled
    cron: "0 */4 * * *"  # Every 4 hours
    timezone: "America/New_York"

context:
  alert_threshold_pct: 10  # Alert if >10% increase
  critical_models:
    - model: fct_orders
      columns: [customer_id, order_date, order_amount, status]
    - model: fct_sessions
      columns: [user_id, session_start, session_end]
    - model: dim_customers
      columns: [customer_id, email, created_at]

tools:
  preset: safe
  restrictions:
    sql:
      read_only: true
      max_rows_returned: 100
    files:
      allow: [".buster/baselines/", ".buster/state/"]  # baselines plus spike-tracking state

notifications:
  slack:
    channel: "#data-quality"
    on_success: false

prompt: |
```

The `prompt` block spells out the full monitoring instructions:
# Task: Monitor null rates and alert on spikes
You are tasked with detecting anomalous increases in null rates for critical columns before they impact downstream consumers.
## Objective
Provide early warning of data quality degradation by tracking null rates against historical baselines.
The goal is to catch issues within hours, not days—before dashboards break or bad decisions are made.
---
## Core Principles
1. **Baseline-based detection**
Don't just check if nulls exist—check if they're *more than usual*.
Track rolling baseline (30-day average) and standard deviation.
2. **Severity matters**
A jump from 0% → 5% nulls is critical; 20% → 25% is notable but carries a different urgency.
Classify by deviation magnitude and business impact.
3. **Actionable alerts**
Don't just say "nulls increased"—show sample records, timeline, investigation queries.
Make it easy to diagnose root cause.
4. **Avoid alert fatigue**
Only alert once per spike (track active incidents).
Don't re-alert on same issue every 4 hours.
Silent success when everything is normal.
5. **Learn from patterns**
Update baselines for normal variance.
Log spikes for trend analysis.
Identify problematic columns over time.
---
## Detection Workflow
### 1. Check current null rates
For each monitored column:
- Query current null percentage
- Get total row count
- Note timestamp of check
Sample query pattern:
```sql
SELECT
  COUNT(*) as total_rows,
  COUNT(column_name) as non_null,
  (COUNT(*) - COUNT(column_name)) * 100.0 / COUNT(*) as null_pct
FROM {{ ref('model_name') }}
```
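
For orientation, here is a minimal sketch of how a runner might apply the query pattern above to every monitored column. The `run_query` helper, the plain table names (instead of `{{ ref() }}`), and the `CRITICAL_MODELS` dict are illustrative assumptions, not part of the agent config.

```python
from datetime import datetime, timezone

# Assumption: mirrors the critical_models config above.
CRITICAL_MODELS = {
    "fct_orders": ["customer_id", "order_date", "order_amount", "status"],
    "fct_sessions": ["user_id", "session_start", "session_end"],
    "dim_customers": ["customer_id", "email", "created_at"],
}

def check_null_rates(run_query):
    """Compute the current null percentage for every monitored column.

    `run_query(sql)` is a hypothetical helper that executes read-only SQL
    and returns a list of dict rows.
    """
    results = []
    for model, columns in CRITICAL_MODELS.items():
        for column in columns:
            row = run_query(f"""
                SELECT
                  COUNT(*) AS total_rows,
                  (COUNT(*) - COUNT({column})) * 100.0 / NULLIF(COUNT(*), 0) AS null_pct
                FROM {model}
            """)[0]
            results.append({
                "model": model,
                "column": column,
                "total_rows": row["total_rows"],
                "null_pct": row["null_pct"],
                "checked_at": datetime.now(timezone.utc).isoformat(),
            })
    return results
```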
### 2. Load baseline
- Read historical baseline from `.buster/baselines/null_rates_{model}.json`
- Baseline includes: rolling average, standard deviation, last 30 checks
- If no baseline exists: create from current state, exit (need history)
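
A sketch of the load-or-bootstrap step, assuming the `.buster/baselines/null_rates_{model}.json` layout above; the function name and bootstrap shape are illustrative.

```python
import json
from pathlib import Path

BASELINE_DIR = Path(".buster/baselines")

def load_baseline(model: str, current_rates: dict[str, float]) -> dict | None:
    """Return the stored baseline for `model`, or seed one and return None.

    Returning None signals "no history yet": skip spike detection this run.
    """
    path = BASELINE_DIR / f"null_rates_{model}.json"
    if path.exists():
        return json.loads(path.read_text())
    # No history: create the baseline from the current state and exit.
    baseline = {
        "model": model,
        "columns": {
            col: {
                "baseline_null_pct": pct,
                "stddev": 0.0,
                "last_30_checks": [pct],
                "last_spike": None,
            }
            for col, pct in current_rates.items()
        },
    }
    BASELINE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(baseline, indent=2))
    return None
```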
### 3. Detect spikes
For each column, calculate:
- **Deviation**: `current_null_pct - baseline_null_pct`
- **Z-score**: `(current - baseline) / stddev` (statistical significance)
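
Both metrics are simple arithmetic; a minimal sketch (the zero-stddev guard is an added assumption):

```python
def spike_metrics(current_pct: float, baseline_pct: float, stddev: float) -> tuple[float, float]:
    """Return (deviation in percentage points, z-score) for one column."""
    deviation = current_pct - baseline_pct
    # Guard against a flat baseline (stddev == 0) to avoid division by zero.
    z_score = deviation / stddev if stddev > 0 else (float("inf") if deviation > 0 else 0.0)
    return deviation, z_score
```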
### 4. Classify severity
**Critical 🚨** (page on-call):
- Column with baseline 0% now has >5% nulls
- Deviation >50% from baseline
- Z-score >5 (very anomalous)
- Critical business column (revenue, IDs, etc.)
**High ⚠️** (urgent):
- Deviation >25% from baseline
- Z-score >3
- Important operational column
**Medium 🟡** (warning):
- Deviation >10% (threshold)
- Z-score >2
- Notable but not urgent
**Low ℹ️** (log only):
- Deviation <10%
- Normal variance
- Update baseline, continue monitoring
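
One way to turn these buckets into code; how the business-criticality flag interacts with the numeric thresholds is an assumption, since the buckets above list it as a separate criterion.

```python
def classify_severity(deviation: float, z_score: float,
                      baseline_pct: float, current_pct: float,
                      business_critical: bool = False) -> str:
    """Map deviation and z-score onto the severity buckets described above."""
    if (baseline_pct == 0 and current_pct > 5) or deviation > 50 or z_score > 5:
        return "critical"
    if deviation > 25 or z_score > 3:
        # Assumption: business-critical columns escalate one level.
        return "critical" if business_critical else "high"
    if deviation > 10 or z_score > 2:
        return "high" if business_critical else "medium"
    return "low"
```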
---
## Diagnostic Queries
When spike detected, gather context:
### Sample null records
Show recent records where column is null:
```sql
SELECT *
FROM {{ ref('model_name') }}
WHERE {{ column }} IS NULL
  AND created_at >= CURRENT_TIMESTAMP - INTERVAL '4 hours'
LIMIT 20
```
### Timeline analysis
Show null rate by hour to identify when spike started:
```sql
SELECT
  DATE_TRUNC('hour', created_at) as hour,
  COUNT(*) as total,
  (COUNT(*) - COUNT({{ column }})) * 100.0 / COUNT(*) as null_pct
FROM {{ ref('model_name') }}
WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC
```
### Related columns
Check if other columns are also null (entire record issue vs specific column):
```sql
SELECT
  {{ column }} IS NULL as is_null,
  COUNT(*) as count,
  AVG(CASE WHEN other_column IS NULL THEN 1 ELSE 0 END) as other_null_rate
FROM {{ ref('model_name') }}
WHERE created_at >= CURRENT_TIMESTAMP - INTERVAL '4 hours'
GROUP BY 1
```
---
## Alert Format
### Critical severity alert
````
🚨 **CRITICAL: Null Rate Spike** @data-oncall
**Model**: `fct_orders`
**Column**: `customer_id`
**Metrics**:
• Baseline: 0.5% null
• Current: 45.2% null
• **Spike**: +44.7% (90x increase)
• Affected: ~18,500 of 41,000 records
**Timeline**:
• Last normal: 4 hours ago (0.6% null)
• Spike started: ~2 hours ago
• Still elevated
**Sample Null Records** (first 3):
```
order_id: ord_123, customer_id: NULL, order_date: 2024-11-11
order_id: ord_124, customer_id: NULL, order_date: 2024-11-11
```
**Diagnostic Query**:
```sql
SELECT * FROM fct_orders
WHERE customer_id IS NULL
AND created_at >= CURRENT_TIMESTAMP - INTERVAL '4 hours';
```
**Investigation Steps**:
1. Check source data (raw.orders.customer_id)
2. Review ETL logs for errors
3. Check for schema changes
4. Verify staging model logic
**Downstream Impact**:
• `customer_lifetime_value` joins on customer_id
• `order_attribution` groups by customer_id
Thread with findings 👇
````
### High/Medium severity (simpler format)
````
⚠️ **Null Rate Spike**
`fct_sessions.user_id`: 2.1% → 28.3% (+26.2%)
Started ~3 hours ago. Diagnostic query:
```sql
SELECT * FROM fct_sessions
WHERE user_id IS NULL
  AND session_start >= CURRENT_TIMESTAMP - INTERVAL '4 hours';
```
cc: @session-team
````
---
## Baseline Management
### When spike detected:
- **Don't update baseline** during active incident
- Keep baseline at healthy level
- Continue alerting if spike persists
- Only update after null rate returns to normal
### When normal:
- Add current value to rolling window (last 30 checks)
- Recalculate baseline as rolling average
- Update standard deviation
- Save to `.buster/baselines/null_rates_{model}.json`
### Baseline structure:
```json
{
  "model": "fct_orders",
  "columns": {
    "customer_id": {
      "baseline_null_pct": 0.5,
      "stddev": 0.2,
      "last_30_checks": [0.4, 0.5, 0.6, ...],
      "last_spike": null
    }
  }
}
```
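
A sketch of the "when normal" update path against the structure above, using Python's statistics module for the rolling average and standard deviation; the function name and rounding are illustrative.

```python
import statistics

def update_baseline(column_state: dict, current_pct: float) -> dict:
    """Fold a normal (non-spike) reading into the rolling 30-check window."""
    window = (column_state.get("last_30_checks", []) + [current_pct])[-30:]
    column_state["last_30_checks"] = window
    column_state["baseline_null_pct"] = round(statistics.fmean(window), 3)
    column_state["stddev"] = round(statistics.pstdev(window), 3)
    return column_state
```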
---
## Deduplication
**Track active spikes** to avoid alert spam:
- Store in `.buster/state/active_spikes.json`
- Only alert once per spike (when first detected)
- Send updates only if severity changes
- Clear from active when null rate returns to baseline
**Repeated spikes** (>3 in 24 hours):
- Send consolidated alert
- Escalate as systemic issue
- Suggest permanent fix rather than repeated triage
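
One possible shape for the active-spike tracking; the state-file keys and the `should_alert` signature are assumptions layered on the rules above (the repeated-spike escalation could read `alert_count`).

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STATE_PATH = Path(".buster/state/active_spikes.json")

def should_alert(model: str, column: str, severity: str) -> bool:
    """Alert on a new spike or a severity change; stay quiet on repeats."""
    state = json.loads(STATE_PATH.read_text()) if STATE_PATH.exists() else {}
    key = f"{model}.{column}"
    active = state.get(key)
    if active and active["severity"] == severity:
        return False  # already alerted on this incident at this severity
    state[key] = {
        "severity": severity,
        "first_detected": (active or {}).get(
            "first_detected", datetime.now(timezone.utc).isoformat()
        ),
        "alert_count": (active or {}).get("alert_count", 0) + 1,
    }
    STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
    STATE_PATH.write_text(json.dumps(state, indent=2))
    return True
```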
---
## Edge Cases
- **Model doesn't exist**: Skip, log error
- **No recent data**: Note in alert (may indicate ETL failure)
- **Baseline too old** (>30 days): Reset with fresh baseline
- **All columns spike**: Systemic issue, escalate differently
- **Very small row counts** (<100): Skip monitoring (too noisy)
- **Expected seasonal nulls**: Note in baseline, adjust thresholds
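
A couple of these guards expressed as a pre-check; `baseline_age_days` is a hypothetical value the runner would derive, since the baseline structure above does not store a timestamp.

```python
def skip_reason(total_rows: int, baseline_age_days: float | None) -> str | None:
    """Return a reason to skip monitoring this column, or None to proceed."""
    if total_rows < 100:
        return "row count under 100; too noisy to monitor"
    if baseline_age_days is not None and baseline_age_days > 30:
        return "baseline older than 30 days; reset with a fresh baseline"
    return None
```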
---
## Success Case
If all checks pass:
- Update baselines with current values
- Log successful check
- No notification (silent success)
- Continue monitoring
---
## Success Criteria
- Spikes detected within 4 hours of occurrence
- Alerts include actionable diagnostic information
- Baselines accurately reflect normal patterns
- No duplicate alerts for same incident
- Investigation steps are clear and helpful
- False positive rate is low (<5%)
**Mindset:** You're an early warning system—catch problems before users notice, provide clear diagnostics, and avoid alert fatigue by being precise.