Integrations and automation
If you already run synthetic checks, metrics, or log-based alerts in tools such as Pingdom or Grafana, we recommend first connecting them through Incido monitors and correlation groups. That path keeps one coherent incident per product area, rolls additional failing signals into the same record, and, when you configure it, can resolve the incident automatically once enough monitors recover. You trade a modest amount of Dashboard setup for a large reduction in the custom automation you would otherwise have to maintain on top of the API.
The sections below walk through why that model works, a concrete B2B SaaS example with multiple monitors on the same subsystem, and what commonly goes wrong when teams map their real alerts into Incido. Programmatic creates through the API (with deduplication keys) remain valuable for pipelines, ticketing, and runbooks; they appear at the end, together with what you still own when you bypass monitors.
At a glance: monitoring path
This is the flow we want most teams to standardize on first. Each external signal uses its own webhook URL and becomes its own Incido monitor; monitors that describe the same outage belong in the same correlation group so Incido opens one ongoing incident and updates it as more signals fire or recover—including optional automatic resolution when your thresholds say the platform is healthy again.
The diagram uses the Northbridge example from later on this page: three different tools, three webhooks, three Incido monitors, one correlation group, one incident. You might add more monitors later; the merge pattern stays the same.
First signal - outside-in uptime. Pingdom (or any similar synthetic product) calls your customer-facing API health URL from the public internet. When it flips to DOWN, it POSTs to one Incido monitor URL configured as the Pingdom type, so Incido marks that monitor unhealthy without custom parsing.
Second signal - real error traffic. Grafana (or another metrics stack) fires when the same API’s 5xx percentage or error budget crosses a threshold—the load balancer can still return 200 to a tiny probe while customers see write failures. That alert’s notification channel points at a second Incido monitor URL, usually the Grafana type.
Third signal - async work stalling. Your worker fleet, queue, or APM tool sends a webhook when job backlog, stuck inbound webhooks, or failed-job count crosses a limit. A generic Incido monitor evaluates that payload with JSONLogic, so Incido can mark the monitor unhealthy when background processing is impaired, which often tracks the same outage as API timeouts or stale data.
All three monitors attach to the same correlation group (Northbridge platform health, key northbridge-core in the walkthrough below), so the first qualifying failure creates a single incident and the others enrich that record instead of opening parallel incidents.
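The generic monitor's JSONLogic evaluation can be sketched as follows. The payload shape and the rule are illustrative, not Incido's actual schema, and the evaluator below covers only the operators this rule uses (real JSONLogic implementations support many more):

```python
# Hypothetical worker-health payload; field names are illustrative,
# not Incido's actual webhook schema.
payload = {"queue": "webhooks-inbound", "backlog": 4200, "failed_jobs": 17}

# A JSONLogic rule of the kind a generic monitor might evaluate:
# unhealthy when backlog exceeds 1000 OR failed job count exceeds 50.
rule = {"or": [
    {">": [{"var": "backlog"}, 1000]},
    {">": [{"var": "failed_jobs"}, 50]},
]}

def evaluate(rule, data):
    """Minimal JSONLogic evaluator covering only the operators above."""
    if not isinstance(rule, dict):
        return rule                       # literal value
    (op, args), = rule.items()
    if op == "var":
        return data.get(args)             # look up a field in the payload
    vals = [evaluate(a, data) for a in args]
    if op == "or":
        return any(vals)
    if op == ">":
        return vals[0] > vals[1]
    raise ValueError(f"unsupported operator: {op}")

print(evaluate(rule, payload))  # True: backlog 4200 > 1000, so mark unhealthy
```

When the rule evaluates true, the monitor goes unhealthy and participates in the group exactly like the Pingdom and Grafana monitors.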
Field-by-field behavior, monitor types (Pingdom, Grafana, generic JSONLogic), and threshold semantics are documented in Monitors and correlation groups. Incidents describes how monitor-created incidents move through stages and what customers see when those incidents publish.
Why monitors first
A correlation group ties several monitors to a single ongoing incident for that slice of your product. When a second or third monitor in the same group turns unhealthy, Incido does not open a duplicate incident for the same event—it updates the existing one: additional affected components, possible severity escalation when a monitor is configured to force a higher severity, and the same public timeline your subscribers already follow.
That aggregation is the answer to a common real-world shape: you might have a synthetic check on a public API, a Grafana alert on elevated 5xx rates for the same service, and a separate signal that background jobs or queues are failing. All of them can reflect one underlying platform problem. Putting them in one group expresses “these alerts corroborate each other” instead of “each alert is its own incident.”
Automatic resolution is the other major advantage over a bare API integration. When automatic resolution is enabled on the group and enough monitors return to healthy, Incido moves the incident toward Post Incident or Closed (according to your workflow transitions) and can notify subscribers that recovery is underway. Automatic resolution does not run while any monitor with force trigger enabled is still unhealthy, even if the overall count of unhealthy monitors is at or below the resolution threshold; a signal you marked as always worth paging on can therefore hold the incident open until it recovers. If you create incidents only through the API, you are responsible for every transition, including when to resolve and whether subscribers should be told. Monitors encode much of that closure logic for you once the thresholds match how your team thinks about "green again."
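The force-trigger gating can be sketched as a single check. This is an illustration of the rule described above, not Incido's real resolution engine, and the monitor names are made up:

```python
from dataclasses import dataclass

@dataclass
class Monitor:
    name: str
    healthy: bool
    force_trigger: bool = False

def may_auto_resolve(monitors, resolution_threshold):
    """Automatic resolution is blocked while any force-trigger monitor
    is unhealthy, even if the overall unhealthy count is already at or
    below the resolution threshold."""
    unhealthy = [m for m in monitors if not m.healthy]
    if any(m.force_trigger for m in unhealthy):
        return False
    return len(unhealthy) <= resolution_threshold

group = [
    Monitor("pingdom-health", healthy=True),
    Monitor("grafana-5xx", healthy=True),
    Monitor("worker-backlog", healthy=False, force_trigger=True),
]

# Only one monitor is unhealthy, which meets the threshold, but it has
# force trigger enabled, so the incident stays open.
print(may_auto_resolve(group, resolution_threshold=1))  # False
```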
Unrelated failures should live in separate correlation groups. A database outage that only affects an internal admin tool does not belong in the same group as customer-facing API monitors unless you deliberately want one incident to represent both.
Example: Northbridge (B2B SaaS)
Imagine Northbridge, a fictional B2B SaaS product. Customers integrate against a public REST API; the product also runs async workers that process webhooks and long-running jobs. Your observability stack already exposes several independent signals. You want one disciplined incident story when “the platform is sick,” not three parallel incidents because three tools fired.
Create one correlation group in Incido—say Northbridge platform health—and give it a stable group key (for example northbridge-core) if you use configuration export or API automation later; the display name is what operators read day to day. The monitoring path diagram above matches that arrangement: Pingdom on /v1/health, Grafana on 5xx for that API, and a generic webhook for job or backlog distress, all tied to northbridge-core so Incido keeps one incident. Each tool posts to its own monitor URL. When Pingdom keeps failing on the health URL because the endpoint is unreachable, Incido shows that monitor as unhealthy inside the group. Combined with your trigger threshold, that sustained synthetic failure is the simple outside-in signal that the platform is unreachable for customers, without a separate incident opening for every probe cycle.
In more detail: a Pingdom (or similar) check on the public API health endpoint answers “can we complete a simple request from the outside?” A Grafana alert rule fires when 5xx rate or error budget crosses a threshold for the same API—often a better signal than pure uptime when the service is up but failing requests. A generic monitor ingests webhooks from your queue or worker observability when job backlog or failure rate indicates that async processing is impaired, which often correlates with the same outage customers feel as timeouts or stale data. Optionally add a fourth monitor: a regional synthetic or edge probe (still generic or Pingdom depending on vendor) if you care separately about EU vs US reachability; it still belongs in this group if a regional blip should participate in the same incident narrative.
Assign affected components and component impact per monitor so the public status page stays accurate when only part of the surface is hurt. Use force severity on the monitor that should win when, for example, the worker monitor represents total pipeline failure while the API monitor only shows partial degradation.
Trigger threshold expresses how many monitors must be unhealthy before Incido creates an incident. Set it to 1 when any single signal is authoritative enough to open an incident—typical when each monitor is already a high-quality slice of truth. Set it to 2 or higher when you want corroboration: the API synthetic might flap during a deploy, but if the Grafana error-rate alert and the worker health signal are both red, you are confident this is a real platform event worth a single customer-facing incident. That is the “stronger signal when multiple monitors agree” pattern.
Activation threshold, when you use it, keeps the incident in Triage until enough monitors are unhealthy to justify moving to Active and fully customer-visible escalation—useful when you want internal visibility before you commit to the public story. Details and constraints are in Monitors and correlation groups.
Resolution threshold and automatic resolution define when unhealthy counts have dropped far enough that Incido may close the loop without a human clicking resolve—something you do not get for free from a one-off POST /incidents integration unless you build equivalent logic yourself.
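The interaction of the three thresholds can be sketched as one decision over the group's unhealthy count. The threshold values (trigger 2, activation 3, resolution 0) are illustrative choices for the Northbridge example, and the returned labels are shorthand, not Incido's real state names:

```python
def incident_stage(unhealthy_count, trigger=2, activation=3, resolution=0):
    """Illustrative threshold semantics for one correlation group:
    - once `trigger` monitors are unhealthy, an incident exists,
    - once `activation` monitors are unhealthy, it escalates to Active,
    - once the count drops to `resolution`, automatic resolution may run."""
    if unhealthy_count >= activation:
        return "Active"                # enough corroboration for full escalation
    if unhealthy_count >= trigger:
        return "Triage"                # incident exists, internal visibility
    if unhealthy_count <= resolution:
        return "eligible for automatic resolution"
    return "no incident action"

print(incident_stage(1))  # no incident action: a lone flapping synthetic
print(incident_stage(2))  # Triage: two signals corroborate each other
print(incident_stage(3))  # Active: all three monitors agree
print(incident_stage(0))  # eligible for automatic resolution
```

Raising `trigger` demands more corroboration before any incident opens; raising `activation` keeps the incident in Triage longer before the fully customer-visible stage.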
When the mapping feels wrong
These situations usually point to grouping or thresholds, not to Incido ignoring webhooks.
We get several incidents for what feels like one outage. The usual cause is monitors sitting in different correlation groups when they should share one, or an API create opening a second incident alongside a monitor-driven one because API deduplication keys are unrelated to monitor grouping. Align groups with blast radius and keep a deliberate rule for when the API is allowed to create parallel records.
Incidents open too slowly. The trigger threshold may be higher than your risk tolerance, or monitors that should corroborate each other were split across groups. Lower the threshold or consolidate monitors that describe the same customer-visible failure mode.
The incident resolved in Incido but we believe the platform is still impaired. Check automatic resolution and resolution threshold settings, and whether a force trigger monitor is still unhealthy—Incido will not auto-resolve while that monitor remains down. Also confirm no second incident or maintenance still affects the same components on your status pages.
Alerts flap and customers see noise. You may need a higher trigger or activation threshold, or calmer upstream alert rules in Grafana or your vendor—the tuning often happens outside Incido, but the group thresholds are where you express how much agreement you require before committing to a public incident.
Programmatic integration and deduplication keys
Pipelines, ticketing systems, and internal runbooks often need to create or update incidents and maintenances directly. That is a legitimate second path, but it does not replace monitors for health signals: when you use the API, you own idempotent deduplication keys, stage transitions, resolution timing, and how those actions line up with subscriber notifications. The monitor stack gives you that discipline for free; on the API path you must implement it yourself.
A deduplication key is a short, stable identifier (lowercase letters, numbers, and dashes, within product limits) that names the underlying problem, not a single HTTP request. Retries and duplicate webhooks should reuse the same key so they converge on one record.
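One common way to get a key of that shape is to normalize a human-readable problem name. This is a sketch under assumptions: the 64-character cap is a placeholder for whatever the product limit actually is, and the `deduplication_key` field name in the body is hypothetical, so check the API reference for the real schema:

```python
import json
import re

def dedup_key(raw):
    """Normalize a problem description into lowercase letters, numbers,
    and dashes. The length limit is product-specific; 64 is a guess."""
    key = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return key[:64]

body = {
    "title": "Background job backlog above limit",
    # Hypothetical field name; consult the API reference for the real one.
    "deduplication_key": dedup_key("Background job backlog above limit"),
}
print(json.dumps(body))
```

Because the key names the underlying problem rather than the request, a retried webhook or a replayed pipeline step produces the same key and converges on the same record.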
For incidents, Incido deduplicates only against incidents in Triage or Active: the same key returns the existing incident instead of creating another. After Post Incident or Closed, you may reuse the key for a later outage. For maintenances, deduplication applies while a maintenance is in Draft, Scheduled, or In Progress; see Maintenances. Incident and maintenance updates can carry their own deduplication keys so repeated posts do not duplicate timeline entries; see the API reference for field names.
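The incident-side behavior can be modeled with a tiny in-memory store. This is an illustration of the stage-scoped deduplication described above, not Incido's implementation; the dict fields are made up:

```python
OPEN_STAGES = {"Triage", "Active"}

def create_incident(store, dedup_key, title):
    """A repeated create with the same key converges on the open incident;
    once that incident reaches Post Incident or Closed, the key is free
    again. `store` is a plain list standing in for Incido's database."""
    for incident in store:
        if incident["dedup_key"] == dedup_key and incident["stage"] in OPEN_STAGES:
            return incident                      # idempotent: reuse the open record
    incident = {"dedup_key": dedup_key, "title": title, "stage": "Triage"}
    store.append(incident)
    return incident

store = []
a = create_incident(store, "api-5xx-spike", "Elevated API errors")
b = create_incident(store, "api-5xx-spike", "Retry of the same webhook")
assert a is b and len(store) == 1                # retries converge on one incident

a["stage"] = "Closed"
c = create_incident(store, "api-5xx-spike", "A later, unrelated outage")
assert c is not a and len(store) == 2            # key is reusable after closure
```

The same shape applies to maintenances, with Draft, Scheduled, and In Progress playing the role of the open stages.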
When the API path is enough on its own, it is usually because the event is not a repeating health signal from probes—for example opening an incident from a deployment gate or linking a maintenance window from a change calendar.
The diagram below is about who calls whom. The deduplication key is not a third step after the API—it is a field your system sends on each create so identical retries map to the same incident or maintenance. Without that discipline, you must implement your own idempotency elsewhere.
The create request body still carries the deduplication key; the arrow only names the transport so the diagram stays readable.
Troubleshooting
Webhooks seem ignored. Confirm the monitor URL and secret, the monitor type matches your vendor’s payload, and the monitor is enabled. Payload matching is covered in Monitors and correlation groups.
An API maintenance create returned an existing row. Expected when the same deduplication key is reused while another maintenance in Draft, Scheduled, or In Progress already holds that key. Use a new key or finish the existing maintenance first.
Monitors and API both fire for the same outage. Expect two incidents unless you rely on one path or align API keys and operational practice with monitor grouping; monitor aggregation does not merge arbitrary API creates.