Synthetic Monitoring
Series
Observability in Depth
Real-user monitoring (RUM) tells you what users experienced. Synthetic monitoring tells you what they would experience, before they arrive. If your checkout flow breaks at 3 AM in Frankfurt and no users are active in that region, RUM sees nothing. A synthetic check running every 60 seconds from Frankfurt fires an alert within a minute of the first failure.
That is the core value proposition: synthetics work when real users don't. But most teams either over-invest in synthetic tests that merely duplicate their uptime check (HTTP 200, done) or under-invest and treat synthetics as a checkbox. This post covers the design principles that make a synthetic suite genuinely useful.
What Synthetics Catch That RUM Cannot
Synthetics and RUM are complementary — the mistake is treating one as a substitute for the other.
Check Design Principles
Principle 1: Check the user journey, not the URL
A 200 from /checkout means nothing if the page renders but the payment button is missing. Scripted browser checks that assert on DOM state are far more valuable than HTTP status checks.
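To make the difference concrete, here is a minimal sketch in plain JavaScript that compares a status-only verdict with a content-aware one. The `evaluateCheckout` helper, the `{ status, body }` response shape, and the `data-testid="pay-button"` marker are hypothetical names chosen for illustration; a real scripted browser check would query the rendered DOM rather than raw HTML.

```javascript
// Hypothetical helper: judge a checkout response by status AND content.
// `response` is assumed to look like { status: number, body: string }.
function evaluateCheckout(response) {
  const failures = [];
  if (response.status !== 200) {
    failures.push(`unexpected status ${response.status}`);
  }
  // Assumed marker for the payment button; adjust to your own markup.
  if (!response.body.includes('data-testid="pay-button"')) {
    failures.push('payment button missing');
  }
  return { ok: failures.length === 0, failures };
}

// A page that returns 200 but has lost its payment button.
const broken = { status: 200, body: '<main><h1>Checkout</h1></main>' };
console.log(evaluateCheckout(broken));
```

A status-only check would pass the `broken` response; the content-aware version reports it as a failure.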
Principle 2: One check = one user scenario
Avoid multi-purpose checks. A checkout_happy_path check that also validates the returns flow becomes a maintenance burden and produces ambiguous failures. Small, focused scenarios are easier to maintain, and a failure points directly at the broken flow.
Principle 3: Location diversity matches your SLO scope
If your SLO covers "global availability," run checks from at least three geographic regions. A CDN misconfiguration often affects one region while leaving others intact.
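As a rough illustration of why several locations matter, the sketch below classifies one round of probe results by how many regions are failing. The `classifyOutage` function and the shape of its input are assumptions made for this example, not part of any monitoring product.

```javascript
// Hypothetical helper: classify probe results by how many regions fail.
// `results` maps a region name to true (probe passed) or false (probe failed).
function classifyOutage(results) {
  const regions = Object.keys(results);
  const failing = regions.filter((region) => !results[region]);
  if (failing.length === 0) return { scope: 'healthy', failing };
  if (failing.length === regions.length) return { scope: 'global', failing };
  return { scope: 'regional', failing };
}

const sample = { Frankfurt: false, Singapore: true, Virginia: true };
console.log(classifyOutage(sample));
```

With a single location, every failure is indistinguishable from a global outage; with three, a one-region failure stands out as regional.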
Principle 4: Check frequency proportional to failure impact
| Check type | Frequency | Justification |
|---|---|---|
| Health endpoint | 30s | Foundation for alerting |
| Auth flow | 2m | High impact, cheap to run |
| Checkout happy path | 5m | Stateful, slower to execute |
| Payment integration | 10m | Third-party rate limits apply |
| Full E2E purchase | 30m | Expensive, leaves DB state |
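The cost side of these frequencies is simple arithmetic. The sketch below converts each interval from the table into executions per day, assuming every check runs from three probe locations (an assumption for this example; the table itself does not state a location count).

```javascript
// Intervals in seconds, copied from the table above.
const intervals = {
  'Health endpoint': 30,
  'Auth flow': 120,
  'Checkout happy path': 300,
  'Payment integration': 600,
  'Full E2E purchase': 1800,
};

// Executions per day for one check across `locations` probe locations.
function runsPerDay(intervalSeconds, locations) {
  return Math.floor(86400 / intervalSeconds) * locations;
}

// Assumption: three probe locations per check.
for (const [name, seconds] of Object.entries(intervals)) {
  console.log(`${name}: ${runsPerDay(seconds, 3)} runs/day`);
}
```

Under that assumption the health endpoint runs 8,640 times a day while the full E2E purchase runs 144 times, which is the difference that matters when a check hits rate-limited third parties or leaves database state.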
Grafana Synthetic Monitoring: HTTP Probe
Grafana's synthetic monitoring (built on Blackbox Exporter and k6) exposes checks as Prometheus-compatible metrics:
# Grafana Synthetic Monitoring API probe definition
{
  "job": "checkout-health",
  "target": "https://api.example.com/health",
  "frequency": 30000,
  "timeout": 5000,
  "probeLocations": ["Frankfurt", "Singapore", "Virginia"],
  "settings": {
    "http": {
      "method": "GET",
      "headers": [
        {"name": "Accept", "value": "application/json"}
      ],
      "validStatusCodes": [200],
      "failIfBodyNotMatchesRegexp": ["\"status\":\"ok\""],
      "tlsConfig": {
        "insecureSkipVerify": false
      }
    }
  },
  "labels": {
    "team": "platform",
    "env": "production"
  }
}
The probe generates metrics you can alert on:
# Alert: probe failing from multiple regions
count by (job) (
  probe_success{job="checkout-health"} == 0
) >= 2
Scripted Browser Checks with k6 + Playwright
For user-journey checks, the k6 browser module (which exposes a Playwright-compatible API) gives you a real browser with OTel trace emission:
// k6 browser check: checkout happy path
import { browser } from 'k6/browser';
import { check } from 'k6';

export const options = {
  scenarios: {
    checkout: {
      executor: 'constant-vus',
      vus: 1,
      duration: '2m',
      // Browser scenarios must declare the browser type.
      options: {
        browser: { type: 'chromium' },
      },
    },
  },
};

export default async function () {
  const page = await browser.newPage();
  try {
    // Step 1: Load product page
    await page.goto('https://shop.example.com/products/ABC');
    await page.waitForSelector('[data-testid="add-to-cart"]');

    // Step 2: Add to cart
    await page.click('[data-testid="add-to-cart"]');
    // Locator methods return promises, so resolve them before calling check().
    const cartCount = await page.locator('[data-testid="cart-count"]').textContent();
    check(cartCount, {
      'cart count updated': (count) => count === '1',
    });

    // Step 3: Proceed to checkout
    await page.goto('https://shop.example.com/checkout');
    await page.waitForSelector('[data-testid="payment-form"]', { timeout: 3000 });
    const formVisible = await page.locator('[data-testid="payment-form"]').isVisible();
    check(page, {
      'payment form visible': () => formVisible,
      'SSL secure indicator': (p) => p.url().startsWith('https://'),
    });
  } finally {
    await page.close();
  }
}
Run this from Grafana Cloud k6 or self-hosted k6 OSS with the metrics going to Prometheus:
# k6 prometheus remote write output
export K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write
export K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true
k6 run --out experimental-prometheus-rw checkout.js
Certificate and TLS Expiry Monitoring
This is the "silly" failure that takes down production more often than it should. Blackbox Exporter gives you probe_ssl_earliest_cert_expiry:
# prometheus.yml: TLS expiry scrape
scrape_configs:
  - job_name: tls_expiry
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com
          - https://shop.example.com
          - https://auth.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
# Alert rule (belongs in a Prometheus rules file, under a group's rules: list)
- alert: SSLCertExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
  for: 1h
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in < 14 days"
Canary Validation with Synthetics
Run synthetic checks against your canary deployment before routing real traffic. In Argo Rollouts you can express this gate as an AnalysisTemplate:
# Argo Rollouts AnalysisTemplate using synthetic probe
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: synthetic-gate
spec:
  metrics:
    - name: checkout-probe-success
      interval: 60s
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            avg_over_time(
              probe_success{job="checkout-health", env="canary"}[5m]
            )
If the probe success rate drops below 99% during canary and the failed measurements exceed failureLimit, the analysis fails and the rollout aborts and rolls back to the stable version, before any real users experience the failure.
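The gate logic can be modelled in a few lines. The sketch below is a simplified stand-in, not Argo Rollouts' actual implementation: it assumes a measurement counts as failed when it falls below the success threshold, and that the gate fails once the number of failed measurements exceeds `failureLimit`. The `evaluateGate` name and its return shape are hypothetical.

```javascript
// Simplified model of an analysis gate, not Argo Rollouts' actual code.
// Assumption: a measurement fails when it is below `threshold`, and the gate
// fails once the count of failed measurements exceeds `failureLimit`.
function evaluateGate(measurements, { threshold = 0.99, failureLimit = 2 } = {}) {
  let failed = 0;
  for (const value of measurements) {
    if (value < threshold) failed += 1;
    if (failed > failureLimit) return { verdict: 'abort', failed };
  }
  return { verdict: 'proceed', failed };
}

// Two failed measurements: still within the limit.
console.log(evaluateGate([1, 0.98, 1, 0.97, 1]));
// Three failed measurements: over the limit, so the gate aborts.
console.log(evaluateGate([1, 0.98, 0.9, 0.95, 1]));
```

In this model a single noisy measurement does not abort the rollout, which is the trade-off a non-zero failure limit makes: fewer false rollbacks in exchange for a slower reaction to a genuinely broken canary.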
Key Takeaways
- Synthetics catch pre-traffic failures, third-party outages, and certificate expiry before real users encounter them — RUM cannot cover these cases because it requires actual traffic.
- Check the user journey (DOM assertions, form submissions) not just the HTTP status code — a 200 response with a broken UI is a failure.
- Frequency should be proportional to failure impact and inversely proportional to execution cost; health endpoints warrant 30s intervals, while expensive, stateful flows such as a full E2E purchase run every 30m.
- Multi-region checks (minimum three locations) are required for SLOs that claim global availability — regional CDN and DNS failures are a common failure mode.
- TLS certificate expiry alerts at 14 days give you enough runway to rotate certificates through normal change management without emergency procedures.
- Integrating synthetic checks into canary rollout gates (Argo Rollouts, Flagger) turns passive monitoring into active deployment safety rails.