Observability in Depth

Synthetic Monitoring

Ravinder

Real-user monitoring tells you what users experienced. Synthetic monitoring tells you what they would experience — before they arrive. If your checkout flow breaks at 3 AM in Frankfurt and no users are active in that region, RUM sees nothing. A synthetic check running every 60 seconds from Frankfurt fires an alert within a minute of the first failure.

That is the core value proposition: synthetics work when real users don't. But most teams either over-invest in synthetic tests that duplicate their uptime check (HTTP 200, done) or under-invest and treat it as a checkbox. This post covers the design principles that make a synthetic suite genuinely useful.

What Synthetics Catch That RUM Cannot

flowchart LR
  subgraph Synthetic["Synthetic Monitoring"]
    S1["Pre-traffic failures\n(off-hours, low-traffic regions)"]
    S2["Third-party dependency failures\n(payment gateway, CDN)"]
    S3["Auth/certificate expiry\n(before users hit it)"]
    S4["Regression in new deploys\n(canary validation)"]
  end
  subgraph RUM["Real-User Monitoring"]
    R1["Actual user experience\n(real device, network)"]
    R2["Geographic distribution\n(where users actually are)"]
    R3["Business funnel drop-off\n(real sessions)"]
  end
  style Synthetic fill:#2ecc71,color:#000
  style RUM fill:#3498db,color:#fff

Synthetics and RUM are complementary — the mistake is treating one as a substitute for the other.

Check Design Principles

Principle 1: Check the user journey, not the URL

A 200 from /checkout means nothing if the page renders but the payment button is missing. Scripted browser checks that assert on DOM state are far more valuable than HTTP status checks.
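
To make the difference concrete, here is a minimal sketch in plain JavaScript of a check verdict that looks past the status code. The function name `evaluateCheckoutPage` and the `data-testid="pay-button"` marker are hypothetical, and the string match only stands in for the DOM assertion a scripted browser check would perform.

```javascript
// Hypothetical helper: decide whether a checkout page response counts as healthy.
// `status` is the HTTP status code, `html` is the response body as a string.
function evaluateCheckoutPage(status, html) {
  const failures = [];
  if (status !== 200) {
    failures.push(`unexpected status ${status}`);
  }
  // Assumed marker for the payment button; a browser check would query the DOM instead.
  if (!html.includes('data-testid="pay-button"')) {
    failures.push('payment button missing from page');
  }
  return { ok: failures.length === 0, failures };
}

// A 200 response whose body lacks the payment button still fails the check.
const broken = evaluateCheckoutPage(200, '<main><h1>Checkout</h1></main>');
console.log(broken.ok, broken.failures);
```

A status-only check would have passed the second case; the content assertion is what turns it into a failure.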

Principle 2: One check = one user scenario

Avoid multi-purpose checks. A checkout_happy_path check that also validates the returns flow becomes a maintenance burden and produces ambiguous failures. Small, focused scenarios are easier to maintain, and when one fails you know exactly which journey broke.

Principle 3: Location diversity matches your SLO scope

If your SLO covers "global availability," run checks from at least three geographic regions. A CDN misconfiguration often affects one region while leaving others intact.

Principle 4: Check frequency proportional to failure impact

| Check type | Frequency | Justification |
|---|---|---|
| Health endpoint | 30s | Foundation for alerting |
| Auth flow | 2m | High impact, cheap to run |
| Checkout happy path | 5m | Stateful, slower to execute |
| Payment integration | 10m | Third-party rate limits apply |
| Full E2E purchase | 30m | Expensive, leaves DB state |
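
The trade-off behind these intervals can be put into numbers. The sketch below uses the intervals from the table and assumes three probe locations (an illustrative figure, not a requirement) to estimate how many executions each check produces per day. The function name `executionsPerDay` is hypothetical.

```javascript
// Intervals come from the frequency table; the location count is an assumption.
const checks = [
  { name: 'Health endpoint', intervalSeconds: 30 },
  { name: 'Auth flow', intervalSeconds: 120 },
  { name: 'Checkout happy path', intervalSeconds: 300 },
  { name: 'Payment integration', intervalSeconds: 600 },
  { name: 'Full E2E purchase', intervalSeconds: 1800 },
];
const LOCATIONS = 3;
const SECONDS_PER_DAY = 86400;

// Each location runs the check once per interval.
function executionsPerDay(intervalSeconds, locations) {
  return (SECONDS_PER_DAY / intervalSeconds) * locations;
}

for (const c of checks) {
  const runs = executionsPerDay(c.intervalSeconds, LOCATIONS);
  // A failure that starts right after a run is not seen until roughly one interval later.
  console.log(`${c.name}: ${runs} runs/day, up to ~${c.intervalSeconds}s before the next run sees a failure`);
}
```

Shorter intervals buy faster detection at the price of more executions, which matters most for checks that hit rate-limited third parties or write to the database.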

Grafana Synthetic Monitoring: HTTP Probe

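Grafana's synthetic monitoring (built on Blackbox Exporter and k6) exposes checks as Prometheus-compatible metrics. An illustrative HTTP check definition follows; field names can differ between API versions, so verify them against the current Synthetic Monitoring API reference:
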
{
  "job": "checkout-health",
  "target": "https://api.example.com/health",
  "frequency": 30000,
  "timeout": 5000,
  "probeLocations": ["Frankfurt", "Singapore", "Virginia"],
  "settings": {
    "http": {
      "method": "GET",
      "headers": [
        {"name": "Accept", "value": "application/json"}
      ],
      "validStatusCodes": [200],
      "failIfBodyNotMatchesRegexp": ["\"status\":\"ok\""],
      "tlsConfig": {
        "insecureSkipVerify": false
      }
    }
  },
  "labels": {
    "team": "platform",
    "env": "production"
  }
}

The probe generates metrics you can alert on:

# Alert: probe failing from multiple regions
count by (job) (
  probe_success{job="checkout-health"} == 0
) >= 2
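
The logic of that alert expression can be sketched in plain JavaScript. The location names mirror the probe definition above, the result values are made up, and `shouldAlert` is a hypothetical helper rather than anything Grafana or Prometheus provides.

```javascript
// resultsByLocation maps a probe location to its latest probe_success value:
// 1 = success, 0 = failure.
function shouldAlert(resultsByLocation, minFailingLocations = 2) {
  const failing = Object.entries(resultsByLocation)
    .filter(([, success]) => success === 0)
    .map(([location]) => location);
  return { alert: failing.length >= minFailingLocations, failing };
}

// One failing location stays below the threshold; two trigger the alert.
console.log(shouldAlert({ Frankfurt: 0, Singapore: 1, Virginia: 1 }));
console.log(shouldAlert({ Frankfurt: 0, Singapore: 0, Virginia: 1 }));
```

Requiring two failing locations filters out a single flaky probe, but it also means a genuine single-region outage does not fire this particular alert.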

Scripted Browser Checks with k6 Browser

For user-journey checks, the k6 browser module drives a real Chromium browser through an API that is largely compatible with Playwright, and it can emit OTel traces:

// k6 browser check: checkout happy path
import { browser } from 'k6/browser';
import { check } from 'k6';

export const options = {
  scenarios: {
    checkout: {
      executor: 'constant-vus',
      vus: 1,
      duration: '2m',
      // Browser scenarios must declare which browser type to launch
      options: {
        browser: { type: 'chromium' },
      },
    },
  },
};

export default async function () {
  const page = await browser.newPage();

  try {
    // Step 1: Load product page
    await page.goto('https://shop.example.com/products/ABC');
    await page.waitForSelector('[data-testid="add-to-cart"]');

    // Step 2: Add to cart
    await page.click('[data-testid="add-to-cart"]');
    // Locator methods return promises: await the value first, then assert on it.
    // Comparing the unresolved promise to '1' would always be false.
    const cartCount = await page.locator('[data-testid="cart-count"]').textContent();
    check(cartCount, {
      'cart count updated': text => (text || '').trim() === '1',
    });

    // Step 3: Proceed to checkout
    await page.goto('https://shop.example.com/checkout');
    await page.waitForSelector('[data-testid="payment-form"]', { timeout: 3000 });

    // Same pattern: resolve visibility before handing it to check()
    const paymentFormVisible = await page.locator('[data-testid="payment-form"]').isVisible();
    check(page, {
      'payment form visible': () => paymentFormVisible,
      'SSL secure indicator': p =>
        p.url().startsWith('https://'),
    });

  } finally {
    await page.close();
  }
}

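Run this from Grafana Cloud k6 or self-hosted k6 OSS with the metrics going to Prometheus. For the self-hosted path, Prometheus must be started with --web.enable-remote-write-receiver to accept the writes, and with --enable-feature=native-histograms if you keep the native-histogram setting below: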

# k6 prometheus remote write output
export K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write
export K6_PROMETHEUS_RW_TREND_AS_NATIVE_HISTOGRAM=true
k6 run --out experimental-prometheus-rw checkout.js

Certificate and TLS Expiry Monitoring

This is the "silly" failure that takes down production more often than it should. Blackbox Exporter exposes probe_ssl_earliest_cert_expiry, the expiry time (as a Unix timestamp) of the earliest-expiring certificate in the chain:

# prometheus.yml — TLS expiry scrape
scrape_configs:
  - job_name: tls_expiry
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com
          - https://shop.example.com
          - https://auth.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
 
# Alert rule (belongs in a separate rules file under groups[].rules, not in prometheus.yml)
- alert: SSLCertExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
  for: 1h
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in < 14 days"
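
The arithmetic in that alert expression is easy to check by hand. Here is a minimal JavaScript sketch of the same comparison, where `certExpiryStatus` is a hypothetical helper and the timestamps are in seconds, matching what probe_ssl_earliest_cert_expiry and time() return.

```javascript
const SECONDS_PER_DAY = 86400;

// notAfterUnix: certificate expiry as a Unix timestamp in seconds.
// nowUnix: current time as a Unix timestamp in seconds.
function certExpiryStatus(notAfterUnix, nowUnix, thresholdDays = 14) {
  const secondsLeft = notAfterUnix - nowUnix;
  return {
    daysLeft: Math.floor(secondsLeft / SECONDS_PER_DAY),
    // Same strict comparison as the alert: expiry - time() < 86400 * 14
    expiringSoon: secondsLeft < SECONDS_PER_DAY * thresholdDays,
  };
}

// A certificate with ten days left falls inside the 14-day window.
const now = Math.floor(Date.now() / 1000);
console.log(certExpiryStatus(now + 10 * SECONDS_PER_DAY, now));
```

Because the comparison is strict, a certificate with exactly 14 days remaining does not yet match; the alert starts matching once the remaining time drops below that, and fires after the `for: 1h` hold.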

Canary Validation with Synthetics

Run synthetic checks against your canary deployment before routing real traffic. Argo Rollouts supports this through an AnalysisTemplate that queries Prometheus with its built-in provider:

# Argo Rollouts AnalysisTemplate using synthetic probe
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: synthetic-gate
spec:
  metrics:
    - name: checkout-probe-success
      interval: 60s
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            avg_over_time(
              probe_success{job="checkout-health", env="canary"}[5m]
            )

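If more than two measurements fall below a 99% probe success rate during the canary phase (failureLimit: 2), the analysis fails and Argo Rollouts aborts the rollout, sending traffic back to the stable version before the change reaches most real users.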
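
To see how the threshold and the failure limit interact, here is a small JavaScript simulation of the gate. It assumes that a metric is marked failed once the number of failed measurements exceeds failureLimit; `evaluateAnalysis` is a hypothetical helper and the measurement values are made up, so treat it as a sketch rather than a description of Argo Rollouts internals.

```javascript
// measurements: successive query results (probe success ratio between 0 and 1),
// one per analysis interval.
function evaluateAnalysis(measurements, threshold = 0.99, failureLimit = 2) {
  let failed = 0;
  for (const value of measurements) {
    // Mirrors successCondition: result[0] >= 0.99
    if (!(value >= threshold)) failed += 1;
    // Assumed semantics: the metric fails once failures exceed failureLimit.
    if (failed > failureLimit) return { verdict: 'Failed', failed };
  }
  return { verdict: 'Successful', failed };
}

// A single dip is tolerated; three bad measurements abort the rollout.
console.log(evaluateAnalysis([1, 1, 0.98, 1]));
console.log(evaluateAnalysis([0.5, 0.5, 0.5]));
```

With a 60s interval, this configuration tolerates two bad minutes before the third one fails the gate, which is the window in which a broken canary can still receive traffic.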

Key Takeaways

  • Synthetics catch pre-traffic failures, third-party outages, and certificate expiry before real users encounter them — RUM cannot cover these cases because it requires actual traffic.
  • Check the user journey (DOM assertions, form submissions) not just the HTTP status code — a 200 response with a broken UI is a failure.
  • Frequency should be proportional to failure impact and balanced against the cost of each run; health endpoints warrant 30s intervals, while full E2E purchase flows can run every 30m.
  • Multi-region checks (minimum three locations) are required for SLOs that claim global availability — regional CDN and DNS failures are a common failure mode.
  • TLS certificate expiry alerts at 14 days give you enough runway to rotate certificates through normal change management without emergency procedures.
  • Integrating synthetic checks into canary rollout gates (Argo Rollouts, Flagger) turns passive monitoring into active deployment safety rails.