Webhooks done right: delivery, retries, and signing
Webhook best practices for B2B SaaS integrations. At-least-once delivery, retries with backoff, idempotency, signed payloads, and replay protection that hold up in production.
A webhook is a promise. The partner promises to tell your system the moment something changes, and your system promises to receive that message, prove it is genuine, and act on it exactly once. Most webhook integrations honor the first half of that promise on day one and quietly break the second half the first time anything goes wrong. The endpoint was mid-deploy, the payload was forged, the same event arrived twice, the database write timed out. None of these are rare. They are the normal weather a webhook receiver runs in, and the difference between a webhook integration that works and one that loses data is whether you designed for that weather or hoped it would stay sunny.
This guide is about the part teams skip. Standing up an endpoint that returns 200 is an afternoon. Building a receiver that survives at-least-once delivery, verifies every payload, processes each event exactly once, retries the right failures, and never silently drops a message is the actual work. We will walk through delivery guarantees, retries and backoff, idempotency, signing, and replay protection, in the order you should reason about them, and end with the mistakes that show up in nearly every webhook postmortem.
If you are still deciding whether to use webhooks at all, start with our guide to webhooks vs polling, which covers the tradeoff and the hybrid pattern most production integrations land on. This post assumes you have decided to consume webhooks and want to do it well.
The 60-second version
If you only read one section, read this one:
- Treat delivery as at-least-once, never exactly-once. The same event will arrive more than once, and some events will arrive out of order. Design for both from the start.
- Acknowledge fast, process later. Verify, store the raw event durably, return 200 in milliseconds, and do the real work off a queue. Slow processing inside the request looks like a failed delivery and triggers retries.
- Be idempotent by event id. Record which event ids you have handled and make reprocessing a no-op. This is the single property that makes retries and duplicates safe.
- Verify the signature before you trust a byte. Your endpoint is public. Check the HMAC signature on the raw body, reject failures, and use a constant-time comparison.
- Protect against replay. A captured valid request can be resent. Bind a timestamp into the signature, reject stale requests, and reject ids you have already seen.
- Retry with backoff, then dead-letter. Retry transient failures on an increasing schedule, cap the attempts, and park the rest where you can replay them. Alert when the dead-letter queue fills.
- Reconcile behind the webhook. A periodic poll that backfills anything the push dropped turns silent permanent loss into temporary, self-correcting lag.
Delivery guarantees: what "at-least-once" really means
Before you write a line of receiver code, get clear on what the sender promises, because almost every webhook system in production promises at-least-once delivery, and almost none promise exactly-once. The distinction decides your whole architecture.
At-least-once means the sender will keep trying until it gets a success response, so a given event may be delivered more than once. If your endpoint returns 200 but your network drops the response on the way back, the sender never sees the success and tries again. You handled the event; the sender does not know that; you get a duplicate. This is not a bug in the sender. It is the only honest guarantee a distributed system can make over an unreliable network.
Exactly-once is what everyone wants and what no sender can truly deliver end to end. The closest you get is at-least-once delivery plus idempotent processing on your side, which produces exactly-once effects. That is the real target: you cannot stop duplicates from arriving, but you can make sure a duplicate changes nothing.
Ordering is not guaranteed either. Webhooks are independent HTTP requests. Under retries especially, a "subscription.updated" event can land before the "subscription.created" event it depends on. If your logic assumes arrival order equals event order, it will corrupt state the first time a retry reshuffles things.
The practical consequences fall out directly from these facts, along with one more that the later sections cover: some events never arrive at all.
| The guarantee you actually get | The design it forces |
|---|---|
| Delivery is at-least-once | Process idempotently; duplicates must be safe |
| Effects must be exactly-once | Dedupe by event id before acting |
| Ordering is not guaranteed | Key off ids and timestamps, not arrival order |
| Some events never arrive at all | Reconcile with a periodic poll |
Read that table as a contract. Every reliable webhook receiver is an implementation of those four rows, and the rest of this guide is how you build each one.
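As a minimal sketch of the ordering row in that table, the snippet below applies an event only when its sender-side timestamp is newer than what is already stored. The field names (status, last_event_at, occurred_at) and the in-memory dict are assumptions for illustration; a real partner may expose a sequence number or version instead.

```python
from dataclasses import dataclass


@dataclass
class SubscriptionState:
    status: str
    last_event_at: int  # sender-side event time in epoch seconds (hypothetical field)


def apply_event(state_by_id: dict, sub_id: str, status: str, occurred_at: int) -> bool:
    """Apply an event only if it is newer than the stored state.

    Returns True when the state changed, False when the event was stale.
    """
    current = state_by_id.get(sub_id)
    if current is not None and occurred_at <= current.last_event_at:
        return False  # an older (or same-age) event arrived late; ignore it
    state_by_id[sub_id] = SubscriptionState(status=status, last_event_at=occurred_at)
    return True


states = {}
apply_event(states, "sub_1", "active", occurred_at=200)   # the later event lands first
apply_event(states, "sub_1", "pending", occurred_at=100)  # the earlier event lands late
```

The late event is dropped rather than overwriting newer state. Treating equal timestamps as stale is a design choice in this sketch; if the partner's timestamps are coarse, a sequence number is a safer tiebreaker.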
Acknowledge fast, process later
The most common architectural mistake is doing the real work inside the request. The naive receiver accepts the POST, verifies it, writes to three tables, calls a downstream service, and then returns 200. It works in the demo. In production it fails twice over.
First, senders enforce a timeout, usually a few seconds. If your processing takes longer than that window, even occasionally under load, the sender gives up waiting, treats the delivery as failed, and retries. Now you are processing the same event a second time while looking unreliable to the partner, who sees a timeout in their delivery log.
Second, any failure deep in processing, such as a downstream 500 or a lock timeout, throws away an event you already accepted unless you happened to store it first.
The fix is a clean split between receiving and processing:
- Verify the signature on the raw request body.
- Persist the raw event durably, ideally onto a queue, keyed by the event id.
- Return 200 immediately, before any business logic runs.
- Process off the queue, asynchronously, with its own retries.
This is sometimes called the accept-then-process or store-and-forward pattern, and it changes the failure model entirely. Once the raw event is durably stored, a processing failure is recoverable: you retry from the stored copy instead of asking the partner to redeliver. The HTTP request becomes a thin, fast acknowledgment whose only job is to verify and durably capture. The HTTP POST method reference on MDN is a useful refresher on the semantics your endpoint is implementing here, since a webhook receiver is fundamentally a POST handler with strict latency and trust requirements.
A good target is to return 200 in tens of milliseconds. If your acknowledgment path is touching more than the signature check and one durable write, it is doing too much.
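The acknowledgment path described above can be sketched as a single function that returns an HTTP status. Everything specific here is an assumption: SQLite stands in for whatever durable store or queue you use, the signature is assumed to be hex-encoded HMAC-SHA256 of the raw body, and the event id is assumed to live in a top-level "id" field.

```python
import hashlib
import hmac
import json
import sqlite3


def make_inbox(conn: sqlite3.Connection) -> None:
    # Hypothetical inbox table; a worker would later pick up rows where processed = 0.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " event_id TEXT PRIMARY KEY,"
        " raw_body BLOB NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )


def receive(conn: sqlite3.Connection, secret: bytes, raw_body: bytes, signature_hex: str) -> int:
    """Verify, durably capture, acknowledge. No business logic runs here."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401  # never store or process an unverified payload
    try:
        event_id = json.loads(raw_body)["id"]
    except (ValueError, KeyError, TypeError):
        return 400  # verified but unusable: no event id to key on
    with conn:  # commit the durable write before acknowledging
        conn.execute(
            "INSERT OR IGNORE INTO inbox (event_id, raw_body) VALUES (?, ?)",
            (event_id, raw_body),
        )
    return 200
```

A duplicate delivery hits the primary key, is ignored, and still gets a 200, so the sender stops retrying. The real work happens in a separate worker that reads from the inbox.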
Idempotency: the property that makes retries safe
Because delivery is at-least-once, your receiver will process the same logical event more than once. Idempotency is the guarantee that doing so changes nothing the second time. Without it, every duplicate is a chance to double-charge, double-provision, or double-notify, and duplicates are not edge cases here. They are routine.
The mechanism is straightforward. Every well-built webhook carries a unique event id. Before you apply an event, check whether you have already recorded that id:
- Keep a processed-events table keyed by event id, with a unique constraint.
- On each event, try to insert the id first. If the insert succeeds, this is the first time; process it. If it violates the unique constraint, you have seen this event; return success and do nothing.
- Make the insert and the processing atomic where it matters, so you never record an id as handled when the work behind it failed, or vice versa. A single transaction that both marks the id and applies the change is the cleanest version.
The mental model worth internalizing is the formal one: an idempotent operation produces the same result whether applied once or many times. The MDN definition of idempotent is a concise reference, and it is exactly the property you are engineering into your processing layer.
A few practical notes:
- Use the partner's event id, not your own request id. A retry of the same event reuses the same event id, which is precisely what lets you catch it. A per-request id would change on each retry and defeat the dedupe.
- Keep processed ids long enough to outlast the sender's retry window. If the partner retries for 24 hours, a dedupe table that forgets ids after an hour will let a late retry through as a fresh event.
- Idempotency covers ordering too, partially. If you key state changes off ids and timestamps rather than arrival order, an out-of-order duplicate is easier to handle correctly.
| Without idempotency | With idempotency by event id |
|---|---|
| A retry re-applies the event | A retry is detected and skipped |
| A duplicate double-charges or double-provisions | A duplicate is a no-op that returns 200 |
| Out-of-order retries corrupt state | Ids and timestamps decide outcome, not arrival |
| You fear retries and disable them | Retries are safe, so you can be aggressive about them |
Retries and backoff: which failures to retry, and how
Retries are both the sender's job and yours. The partner retries failed deliveries to your endpoint; your processing layer retries failed work off the queue. Both need a backoff schedule, and both need to know which failures are worth retrying.
Retry transient failures. Do not retry permanent ones. A 503 from a downstream service, a lock timeout, a brief network blip: retry these, because the next attempt may succeed. A malformed payload, a validation error, a 4xx that says the input is wrong: do not retry these, because the next attempt will fail identically. Retrying a permanent failure just burns attempts and delays the moment you notice the real problem.
Back off on an increasing schedule. Immediate retries hammer a service that is already struggling. A typical schedule grows the gap each time: a few seconds, then tens of seconds, then minutes, then hours. Exponential backoff with a cap is the standard shape.
Add jitter. If a downstream service blips and a thousand queued events all retry on the same fixed schedule, they retry in lockstep and hammer it in synchronized waves. Randomizing each delay by a small fraction spreads the load and is the difference between a recovery and a self-inflicted outage.
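A capped exponential schedule with jitter fits in a few lines. The numbers below (a 2 second base, a one hour cap, plus or minus 20 percent jitter) are assumptions chosen for illustration, not recommendations from any particular sender.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 3600.0,
                  jitter: float = 0.2, rng=random) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    The delay doubles each attempt up to `cap`, then is randomized
    by up to +/- `jitter` (a fraction of the capped delay).
    """
    raw = min(cap, base * (2 ** (attempt - 1)))
    spread = raw * jitter
    return raw + rng.uniform(-spread, spread)


# Nominal delays before jitter: 2s, 4s, 8s, 16s, ... capped at 3600s.
schedule = [backoff_delay(n) for n in range(1, 6)]
```

Because the jitter is applied after the cap, a capped delay can land slightly above the cap. If the cap is a hard limit in your system, clamp the final value as well.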
Cap the attempts, then dead-letter. Retrying forever is just a slow way to hide a problem. After a set number of attempts, move the event to a dead-letter queue rather than dropping it or looping indefinitely. A parked event you can inspect and replay is recoverable. A dropped one is a support ticket you have not received yet.
Respect 429 and Retry-After. If you retry by calling back into the partner's API as part of processing, and they return 429 Too Many Requests, back off for at least the interval their Retry-After header specifies. The 429 status and its companions are defined in RFC 6585, and a well-behaved client treats them as instructions, not suggestions.
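Retry-After can carry either a number of seconds or an HTTP date, so a client has to handle both. The helper below is a sketch using only the standard library; the 30 second fallback for a missing or unparseable header is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default: float = 30.0) -> float:
    """Translate a Retry-After header value into seconds to wait."""
    if header_value is None:
        return default
    value = header_value.strip()
    try:
        return max(0.0, float(int(value)))  # delay-seconds form, e.g. "120"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

In a retry loop, wait for the larger of this value and your own backoff delay, so the partner's instruction sets a floor rather than replacing your schedule.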
| Failure type | Example | Retry? |
|---|---|---|
| Transient infrastructure | 503, lock timeout, network blip | Yes, with backoff and jitter |
| Rate limited | 429 with Retry-After | Yes, after the stated interval |
| Permanent client error | Malformed payload, validation failure | No, send to dead-letter for inspection |
| Exhausted attempts | Still failing after the cap | Stop, dead-letter, alert |
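The table above translates into a small retry loop. This is a sketch under stated assumptions: TransientError and PermanentError are hypothetical marker exceptions your handler would raise, the dead-letter queue is a plain list, and the loop sleeps in-process where a real worker would usually re-enqueue the event with a delay. Jitter is left out here to keep the control flow visible.

```python
import time


class TransientError(Exception):
    """Hypothetical marker: the next attempt may succeed (503, lock timeout)."""


class PermanentError(Exception):
    """Hypothetical marker: the input is wrong and will fail identically."""


def process_with_retries(event, handler, dead_letters, max_attempts=5, sleep=time.sleep) -> bool:
    """Run handler(event), retrying transient failures and dead-lettering the rest."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except PermanentError as exc:
            dead_letters.append((event, f"permanent: {exc}"))
            return False  # retrying would fail the same way
        except TransientError as exc:
            if attempt == max_attempts:
                dead_letters.append((event, f"exhausted after {attempt} attempts: {exc}"))
                return False
            sleep(min(60.0, 2.0 ** attempt))  # capped exponential delay
    return False
```

Any exception that is neither marker type propagates to the caller, which is deliberate: an unclassified failure is a bug to look at, not something to retry blindly.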
Signing and verification: prove the payload is real
Your webhook endpoint is a public URL. Anyone who learns it can POST to it, which means anyone can try to feed your system fake events: a forged "payment succeeded," a spoofed "account upgraded." Signature verification is the line between a webhook and an open door, and it is non-negotiable for any endpoint that triggers consequential work.
The standard mechanism is an HMAC signature. The partner shares a secret with you out of band. For each delivery, they compute a keyed hash of the raw request body using that secret and send it in a header. You recompute the same hash on your side and compare. If they match, the payload came from someone who holds the secret and was not altered in transit. HMAC is specified in RFC 2104, and understanding the construction helps you avoid the implementation traps.
Three things have to be right or the verification is theater:
- Hash the raw bytes, before any parsing. Sign and verify the exact body the partner sent. If you parse the JSON and re-serialize it, key ordering or whitespace changes will make a valid signature fail. Capture the raw body in your framework before any middleware touches it, and hash that.
- Use a constant-time comparison. Comparing the two signatures with an ordinary string equality leaks timing information an attacker can use to forge a signature byte by byte. Use the constant-time comparison your crypto library provides, not ==.
- Reject on failure with a clear status. A signature mismatch should return 401 and never reach your processing layer. Log it, because a spike in mismatches is either a misconfiguration or an attack, and both are worth knowing about.
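The first two rules above look like this in Python's standard library. The algorithm (HMAC-SHA256) and the hex encoding are assumptions; use whatever the specific partner documents. The example also shows why hashing a re-serialized body fails.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Recompute the HMAC over the raw bytes and compare in constant time."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


secret = b"shared-secret-from-partner"   # hypothetical value
raw = b'{"id":"evt_1","amount":100}'     # exact bytes as received
signature = hmac.new(secret, raw, hashlib.sha256).hexdigest()

ok_raw = verify_signature(secret, raw, signature)

# Parsing and re-serializing changes whitespace, so the bytes no longer match.
reserialized = json.dumps(json.loads(raw)).encode()
ok_reserialized = verify_signature(secret, reserialized, signature)
```

Here ok_raw is True and ok_reserialized is False, even though both bodies decode to the same JSON object.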
OWASP's REST Security Cheat Sheet is a solid independent reference for the input validation and authentication concerns that surround a public endpoint like this, and it is worth reading alongside whatever the specific partner documents.
Replay protection: a valid signature is not enough
Here is the subtle attack a signature alone does not stop. Suppose someone captures a genuine, correctly signed webhook request, perhaps from logs, a proxy, or a compromised intermediary. They cannot alter it without breaking the signature, but they can resend it unchanged, as many times as they like. Every copy verifies perfectly, because every copy is a real, signed request. That is a replay attack, and against an endpoint that provisions access or moves money, replaying a single valid "upgrade granted" event repeatedly is a real problem.
Two defenses, used together, close it:
- Bind a timestamp into the signature and reject stale requests. Good webhook signing schemes sign the timestamp along with the body, and send the timestamp in the same header. You verify the signature, then check that the timestamp is recent, within a tolerance of a few minutes that accounts for clock skew. A captured request replayed an hour later fails the freshness check even though its signature is valid. The tolerance must be tight enough to matter and loose enough to survive normal clock drift.
- Dedupe by event id, which you already do. Your idempotency layer is also replay protection. A replayed event carries the same event id you already recorded, so even a replay inside the timestamp window is caught and ignored. This is one more reason the processed-events table earns its keep.
Replay is a class of attack that the broader security community documents well; the OWASP API Security Project catalogs the API-level risks worth designing against, replay among them. The pattern to remember is that authenticity and freshness are different properties: the signature proves authenticity, the timestamp and the dedupe table prove freshness. You need both.
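A timestamp-bound check can be sketched as follows. The signed message layout (timestamp, a dot, then the raw body), the HMAC-SHA256 algorithm, and the five minute tolerance are all assumptions for illustration; the partner's documentation defines the real layout.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed freshness window; tune for expected clock skew


def sign(secret: bytes, timestamp: int, raw_body: bytes) -> str:
    """Hypothetical scheme: HMAC over '<timestamp>.<raw body>'."""
    message = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_fresh(secret: bytes, timestamp: int, raw_body: bytes, signature_hex: str,
                 now=None, tolerance: int = TOLERANCE_SECONDS) -> bool:
    """Authenticity first (signature), then freshness (timestamp window)."""
    expected = sign(secret, timestamp, raw_body)
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return False  # forged, tampered, or timestamp was altered
    now = time.time() if now is None else now
    return abs(now - timestamp) <= tolerance
```

Because the timestamp is part of the signed message, an attacker cannot refresh it on a captured request without invalidating the signature. A replay inside the window still verifies, which is why the event-id dedupe remains the second layer.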
| Defense | Stops |
|---|---|
| HMAC signature on raw body | Forged or tampered payloads |
| Constant-time comparison | Timing attacks against the signature check |
| Timestamp in signature + freshness window | Replay of an old captured request |
| Dedupe by event id | Replay inside the window, and ordinary duplicates |
The dead-letter queue and reconciliation backstop
Even with verification, idempotency, and retries, some events will not make it through. A bug in your processing code, a downstream outage longer than your retry window, a malformed payload you cannot handle: these end up in the dead-letter queue, and that is the system working as designed. The dead-letter queue is where you put events you could not process so that they are recoverable rather than lost.
Two practices make it valuable instead of decorative:
- Alert when events land there. A dead-letter queue nobody watches is just a slower way to lose data. When it fills past a threshold, page someone. The first person to notice a broken integration should be you, not the customer whose data went stale.
- Build a replay path. You should be able to reprocess a parked event, or a batch of them, once you have fixed the cause. Replay runs through the same idempotent processing as live events, so reprocessing is safe even if some of the batch already succeeded.
Behind all of this sits the final backstop: reconciliation. Assume, despite everything, that some events never arrived at all, whether dropped during a deploy or lost when retries were exhausted. A periodic poll that compares your copy of the data against the partner's source of truth and backfills any gaps converts silent permanent loss into temporary, self-correcting lag. We cover the full hybrid pattern, webhooks for speed plus a reconciliation poll for completeness, in webhooks vs polling, and it is the single highest-value addition to any webhook integration.
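The comparison at the heart of a reconciliation poll is small. This sketch assumes you can fetch a mapping of record id to a version marker (for example an updated-at timestamp) from both your own store and the partner's list endpoint; how you fetch them, and the names used here, are hypothetical.

```python
def find_gaps(local: dict, remote: dict) -> list:
    """Return the record ids that are missing or stale locally.

    Both arguments map record id -> version marker (e.g. updated_at as an int).
    """
    stale = []
    for record_id, remote_version in remote.items():
        local_version = local.get(record_id)
        if local_version is None or local_version < remote_version:
            stale.append(record_id)
    return sorted(stale)


local_copy = {"cust_1": 100, "cust_2": 150}
partner_truth = {"cust_1": 100, "cust_2": 175, "cust_3": 90}
to_backfill = find_gaps(local_copy, partner_truth)
```

Here to_backfill contains cust_2 (a missed update) and cust_3 (a missed creation). Each id is then refetched and applied through the same idempotent processing path as a live event.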
Webhook best practices across a partner portfolio
None of this is exotic. It is the standard kit for consuming webhooks, and the more connectors you run, the more it pays to standardize it once rather than reinvent it per partner. A team that builds three integrations and writes three different signature checks, three different dedupe schemes, and three different retry policies has tripled both the work and the surface area for bugs. A team that builds one hardened receiver pattern and applies it everywhere has a convention support can reason about and new engineers can learn once.
This is part of treating your integrations as a portfolio with shared conventions rather than a pile of one-offs. The same discipline applies in the other direction too. If you are the partner producing webhooks for others to consume, the quality of your signing, your retry policy, and your delivery guarantees directly determines how hard you are to build against. An API that pushes signed, retried, well-documented events with stable ids is a pleasure to integrate; one that forces every consumer to guess at delivery semantics is a tax on everyone who connects. The full set of decisions that make your API easy to build on is in our guide to a partner-ready API, and documenting your webhook behavior precisely is a chapter of that work.
Common mistakes, and the fix
Treating webhooks as fire-and-forget. The fix: build the full receiver. Verify signatures, ack fast, dedupe by event id, retry with backoff, dead-letter, and reconcile. A webhook without these is not reliable sync, it is a hope that nothing ever fails.
Doing the real work inside the request. The fix: verify, store the raw event durably, return 200, and process off a queue. Slow in-request processing trips the sender's timeout, which triggers duplicate deliveries and makes your endpoint look unreliable to the partner.
Skipping idempotency. The fix: dedupe on the partner's event id with a unique constraint, and make reprocessing a no-op. Duplicates and retries are normal, and an integration that double-applies an event corrupts data its first week live.
Verifying a parsed body instead of the raw bytes. The fix: capture and hash the exact body the partner sent, before any parsing or re-serialization, and compare in constant time. Re-serialized JSON breaks valid signatures, and a non-constant-time compare leaks timing information an attacker can use to forge a signature.
Signing without replay protection. The fix: bind a timestamp into the signature, reject stale requests, and lean on your event-id dedupe. A valid signature only proves authenticity, not freshness, so a captured request can be replayed without it.
Retrying forever, or not at all. The fix: retry transient failures with capped exponential backoff and jitter, then dead-letter the rest and alert. Infinite retries hide problems; zero retries drop recoverable events on the first blip.
FAQ
What does at-least-once delivery mean for webhooks? It means the sender keeps retrying until it gets a success response, so the same event can be delivered more than once. You cannot prevent duplicates; you handle them by processing idempotently, which produces exactly-once effects even though delivery is at-least-once.
How do I stop processing the same webhook twice? Dedupe on the event id the partner sends. Keep a processed-events table with a unique constraint, try to record the id before acting, and if it is already present, return 200 and do nothing. Use the partner's stable event id, not a per-request id, so retries reuse the same id.
How should I verify a webhook signature? Hash the raw request body, exactly as sent, with the shared secret using the partner's specified HMAC algorithm, then compare your computed signature to the one in the header using a constant-time comparison. Reject mismatches with 401 before any processing. Never hash a parsed and re-serialized body.
What is a webhook replay attack and how do I prevent it? It is resending a captured, validly signed request to trigger the same effect repeatedly. Prevent it by binding a timestamp into the signature and rejecting requests outside a short freshness window, and by deduping on event id so a replayed event is recognized and ignored.
How many times should I retry a failed webhook, and on what schedule? Retry transient failures only, on exponential backoff with jitter, capped at a fixed number of attempts, often somewhere between five and a dozen over a window of hours. After the cap, move the event to a dead-letter queue and alert. Do not retry permanent failures like malformed payloads.
What goes in a dead-letter queue? Events you could not process after exhausting retries, plus events that failed permanently, like ones with unparseable payloads. Park them where you can inspect and replay them, alert when the queue fills, and reprocess through the same idempotent path once the cause is fixed.
Do webhooks guarantee ordering? No. They arrive as independent HTTP requests and can land out of order, especially under retries. Key your state changes off event ids and timestamps rather than arrival order, and use a reconciliation poll to converge on the partner's truth regardless of the order events showed up.
Further reading
- RFC 2104, HMAC: Keyed-Hashing for Message Authentication for the construction behind webhook signature verification.
- OWASP REST Security Cheat Sheet for the input validation and authentication concerns around a public endpoint.
- MDN on idempotency for the formal property your processing layer must implement.
- RFC 6585, additional HTTP status codes which defines 429 Too Many Requests, the response your retries must respect.
The short version
A reliable webhook integration is a receiver built for the failure modes that are guaranteed to happen. Delivery is at-least-once, so duplicates and out-of-order events are routine, not edge cases. Acknowledge fast by verifying and durably storing the raw event, then process off a queue. Make processing idempotent by deduping on the partner's event id, so retries and duplicates change nothing. Verify the HMAC signature on the raw body with a constant-time comparison, and add replay protection with a signed timestamp and a freshness window. Retry transient failures with capped backoff and jitter, dead-letter the rest, alert, and reconcile with a periodic poll so nothing stays lost.
Do that work and webhooks deliver the speed customers feel without the silent data loss engineers dread. Skip it and the integration is fast right up until the first failure, and then it is worse than the polling it replaced.
If you want help designing the reliable webhook stack for a specific integration, or making your own API easy for partners to build against, that is exactly what a Partner Audit is for. We review your product, API, and partner potential, then define what to build, who to approach, and how to ship it.