Monitoring and maintaining shipped integrations

What to watch in a shipped integration, how to alert on it, and how to survive partner API changes without a launch-week fire.

An integration ships, the launch announcement goes out, and the team moves on to the next item in the backlog. Three weeks later a customer emails support: their records stopped syncing on Tuesday. Nobody noticed, because nobody was watching. The partner quietly changed a field name in their API, your sync job started failing silently, and the first signal you got was an angry customer who had already lost trust in the integration you spent a quarter building.

This is the part of integration work that gets the least planning and causes the most damage. Shipping an integration is a project with a deadline. Running one is a commitment with no end date, and the failure mode is not a dramatic outage. It is slow erosion: a sync that drifts, a webhook that stops arriving, a latency creep that turns a fast integration into a slow one, all of it invisible until a customer feels it. An integration you do not monitor is an integration you are slowly losing.

This guide covers what to watch in a shipped integration, how to turn those signals into alerts that reach the right person, and how to handle the thing that breaks integrations more than your own bugs: the partner changing their API underneath you. It pairs with our guide to API versioning for integrations and our guide to making your API partner-ready, which cover the build side of the same surface.

The 60-second version

  • A shipped integration is a commitment, not a deliverable. The dangerous failures are slow and silent, not loud.
  • Watch four signal families: errors, latency, sync health, and partner-side changes. Each one fails differently and needs its own check.
  • Alert on symptoms the customer feels, not on every internal blip. A page should mean someone needs to act now.
  • Sync failures are the silent killer. Track records that should have synced and did not, not just calls that errored.
  • Partner API changes are a when, not an if. Subscribe to their changelog, watch for deprecation signals, and build for graceful degradation.
  • Give every integration a named owner and a runbook. An integration owned by "the team" is owned by no one when it breaks at 2 AM.
  • Rate limits and auth expiry are scheduled failures. Treat a 429 or a token expiry as something you plan for, not something you discover in production.

Why a shipped integration is the start, not the finish

The mental model that causes most integration pain is treating launch as the finish line. The code is written, the demo worked, the announcement is out, so the work is done. But an integration sits on top of two systems you do not fully control: your own platform, which keeps changing, and the partner's platform, which changes on a schedule you do not set and often do not see. Every release on either side is a chance for the integration to break.

The failures that matter are rarely the loud ones. A total outage gets noticed in minutes, because everything stops and someone shouts. The expensive failures are partial and quiet. A sync job that processes 98 percent of records and silently drops the other 2 percent. A webhook handler that started returning errors after a deploy, so events queue up on the partner's side and arrive hours late, or not at all. A latency increase that pushes a real-time feel into a sluggish one without ever crossing a threshold anyone set. None of these page anyone. All of them erode the thing you built.

There is a trust dimension here that goes beyond uptime. When a customer adopts an integration, they stop doing a manual task and start depending on automation. The moment that automation fails quietly, they are worse off than before they trusted you, because now the work is not happening and they do not know it. The bar for a shipped integration is not "mostly works." It is "fails loudly enough that you fix it before the customer notices." That bar is only reachable with monitoring designed for it.

A useful frame from site reliability practice is to monitor what your users experience, not just what your servers report. Google's chapter on monitoring distributed systems is a good independent reference for the difference between symptom-based alerting, which catches what users feel, and cause-based alerting, which floods you with internal noise. The same distinction is the backbone of integration monitoring.

What to watch: the four signal families

Integration monitoring is not one dashboard. It is four different questions, each with its own data and its own failure shape. Watch all four, because a green light on one tells you nothing about the other three.

Signal family | The question it answers | What a bad reading looks like
Errors | Are calls failing, and which ones | A spike in 5xx, or a steady drip of 4xx on one endpoint
Latency | Are calls slow enough to hurt the experience | The p95 climbing past where the workflow feels real-time
Sync health | Is the data that should move actually moving | Records that should have synced and did not, growing over time
Partner-side change | Did the API we depend on change | A new deprecation header, a schema shift, a quietly different response

The trap is watching only the first one. Errors are the easiest to instrument and the loudest when they happen, so teams build error dashboards and call it monitoring. But an integration can have a clean error rate and still be broken: the calls succeed, they just return stale or incomplete data, or they succeed slowly enough that the customer experience degrades. Each family below is a check you cannot skip.

Watching errors without drowning in them

Errors are where monitoring usually starts, and the first mistake is treating all of them the same. A useful error view separates failures by who caused them and whether you can act. A 500 from the partner's API is their problem to fix but your problem to handle. A 422 from your own malformed request is a bug in your integration code. A 401 means your credentials expired. Each points at a different fix, so a single "error rate" number hides more than it shows.

Group errors by the dimensions that change what you do about them:

Error class | Likely cause | Your move
401 / 403 | Expired or revoked credentials, missing scope | Refresh the token, re-auth, check the scope grant
422 | Your request is malformed or violates a new validation rule | Fix the payload, check whether the partner changed validation
429 | You are exceeding the partner's rate limit | Back off, batch, or request a higher limit
5xx | The partner's service is failing | Retry with backoff, degrade gracefully, do not hammer them
Timeouts | Network or a slow partner endpoint | Set sane timeouts, retry idempotent calls only

Two practices keep an error view honest. First, track the error rate as a proportion of total calls, not a raw count, so a traffic spike does not look like an incident and a low-traffic failure does not hide. Second, separate the error rate per partner endpoint, because a single failing endpoint is invisible in an aggregate that averages it against the healthy ones. The endpoint-level view is what turns "errors are up" into "the contacts sync endpoint started returning 422 after their Tuesday release," which is an actionable sentence.
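A minimal sketch of those two practices, assuming your call logs can be reduced to an endpoint label and a status code. The `CallRecord` type, the class names, and the use of status 0 for a timeout are illustrative choices, not part of any partner's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    endpoint: str  # hypothetical endpoint label, e.g. "contacts.sync"
    status: int    # HTTP status code; 0 stands in for a timeout (assumption)


def classify(status: int) -> str:
    """Map a status code to the error classes in the table above."""
    if status == 0:
        return "timeout"
    if status in (401, 403):
        return "auth"
    if status == 422:
        return "validation"
    if status == 429:
        return "rate_limit"
    if status >= 500:
        return "partner_5xx"
    if status >= 400:
        return "other_4xx"
    return "ok"


def error_rates(calls):
    """Return {endpoint: {error_class: share of that endpoint's calls}}."""
    totals = defaultdict(int)
    by_class = defaultdict(lambda: defaultdict(int))
    for call in calls:
        totals[call.endpoint] += 1
        error_class = classify(call.status)
        if error_class != "ok":
            by_class[call.endpoint][error_class] += 1
    return {
        endpoint: {
            error_class: count / totals[endpoint]
            for error_class, count in by_class[endpoint].items()
        }
        for endpoint in totals
    }
```

Fed ten calls to a hypothetical contacts endpoint with four 422s, plus ninety healthy calls elsewhere, the aggregate error rate is 4 percent while the per-endpoint view shows the contacts endpoint failing 40 percent of the time with a validation error. The second number is the one you can act on.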

A note on 429 specifically, because it is the error teams most often misread as random. A rate-limit response is not a malfunction; it is the partner telling you to slow down, and both the status code and the well-behaved response to it are standardized (RFC 6585). The MDN reference for the 429 Too Many Requests status describes the Retry-After header the partner may send, which tells your client how long to wait, either as a number of seconds or as a date. Honoring it turns a recurring error into a non-event. We go deeper on the build side of this in our webhooks versus polling guide, because a polling integration is far more likely to hit limits than a webhook-driven one.
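Honoring Retry-After mostly comes down to parsing it correctly. The sketch below uses only the Python standard library; the function name, the one-second default, and the five-minute cap are assumptions chosen for illustration, not values any partner recommends.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0, cap=300.0):
    """Turn a Retry-After header value into a wait time in seconds.

    The header may hold a number of seconds or an HTTP-date.
    `default` and `cap` are illustrative, not partner guidance.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        return min(float(value), cap)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable value: fall back rather than crash
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", so never return a negative wait.
    return min(max((when - now).total_seconds(), 0.0), cap)
```

The cap is a design choice: it stops a malformed or very large value from parking a sync job for hours without anyone noticing. If the partner really does ask for a long wait, that is worth a ticket rather than a silent sleep.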

Latency: the slow degradation nobody alerts on

Latency is the signal most teams forget, because it rarely fails outright. A call that used to take 200 milliseconds and now takes two seconds has not errored. It still returns a 200. But if that call sits in the path of something a customer waits on, the integration now feels broken even though every health check is green. Latency is how an integration degrades without ever triggering an error alert.

Watch latency as a distribution, not an average. The average hides the tail, and the tail is what customers feel. If your average response is 300 milliseconds but your p99 is eight seconds, one in a hundred calls is a bad experience, and at any real volume that is a steady stream of unhappy moments. Track p50, p95, and p99 separately, and alert on the high percentiles, because that is where degradation shows up first.

Percentile | What it tells you | Why it matters
p50 (median) | The typical experience | A rising median means broad slowdown, not an outlier
p95 | The experience of your slowest 5 percent | Where degradation becomes noticeable at scale
p99 | The tail, your worst regular calls | The calls that generate support tickets and timeouts
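To see how an average hides the tail, here is a small sketch that computes the mean alongside p50, p95, and p99 using the nearest-rank method. The function names are hypothetical, and a real metrics system will usually compute percentiles for you, often with a slightly different interpolation.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize a window of latency samples, in milliseconds."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }
```

With 98 calls at 200 ms and 2 calls at 8,000 ms, the mean is 356 ms and both p50 and p95 are still 200 ms. Only p99 shows the 8,000 ms calls, which is why the high percentiles are where the alert belongs.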

Measure latency at the boundary you control: the time from when your code calls the partner to when you get a usable response, including retries. That number is the honest one, because it is what your own workflow waits on. A partner can report fast server times while your real experience is slow due to network, retries, or rate-limit backoff. Set your latency budget against the customer-facing requirement, not the partner's internal metric, and alert when the budget is at risk rather than when it is already blown.
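One way to measure at that boundary is to wrap the whole attempt loop in a single timer, so retries and backoff count toward the number you record. This is a sketch under assumptions: `timed_call`, its parameters, and the linear backoff are hypothetical, and it should only wrap calls that are safe to retry.

```python
import time


def timed_call(call, attempts=3, backoff_s=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Run `call` with retries and return (result, total_elapsed_seconds).

    The elapsed time covers every attempt and every backoff sleep,
    because that is what the calling workflow actually waits on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    start = clock()
    for attempt in range(1, attempts + 1):
        try:
            result = call()
            return result, clock() - start
        except Exception:  # broad on purpose for the sketch; narrow it in real code
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            sleep(backoff_s * attempt)
```

The `clock` and `sleep` parameters exist so the function can be tested without real waiting. In production you would leave them at their defaults and send the elapsed value to whatever records your latency distribution.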

Sync health: catching the records that quietly never moved

This is the signal family that catches the failure in the opening story, and it is the one almost no error dashboard will ever show you. Sync health asks a different question from the others: not "did our calls work" but "did the data that should have moved actually move." An integration can have a perfect error rate and a healthy latency profile and still be failing every customer who relies on it, because the records that should have synced are quietly sitting still.

The reason errors miss this is structural. A sync can fail without any single call erroring. A webhook that never arrives produces no error, because there was no call to fail. A record filtered out by a logic bug produces no error, because your code decided, wrongly, that it did not need syncing. A partial batch that processes the first 800 of 1,000 records and times out may log one timeout while 200 records silently never moved. To catch any of these, you have to monitor the outcome, not the call.

Build sync health on reconciliation, not on call success:

  • Count what should have synced against what did. If 1,000 records changed on the source side in the last hour, did 1,000 corresponding updates land on the destination side? A growing gap is a failing sync, even with a zero error rate.
  • Track sync lag, not just sync success. A record that synced four hours late is a different problem from one that synced on time, and a creeping lag is an early warning that something upstream is slowing or queuing.
  • Watch for the silent stop. A webhook stream that goes quiet looks identical to "nothing happened." Alert on the absence of expected events, not only on bad events, so a partner outage that stops delivery actually pages someone.
  • Reconcile on a schedule. Run a periodic full or sampled comparison between the two sides, independent of the real-time sync, so drift gets caught even when every individual event looked fine.

The reconciliation check is the single highest-value piece of integration monitoring, because it is the only one that verifies the integration is doing its actual job. Everything else confirms the machinery is running. Reconciliation confirms the work is getting done.
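As a rough illustration of the checks listed above, the sketch below compares two snapshots of record timestamps and flags a quiet event stream. It assumes you can pull a mapping of record ID to last-changed time from each side; the function names, the 15-minute lag allowance, and the 30-minute silence window are all hypothetical values you would tune per integration.

```python
from datetime import datetime, timedelta, timezone


def reconcile(source, destination, now, max_lag=timedelta(minutes=15)):
    """Compare two {record_id: timestamp} snapshots.

    `source` holds when each record last changed upstream. `destination`
    holds the source timestamp of the version last written downstream.
    Records changed within `max_lag` are given time to arrive.
    """
    missing, stale = [], []
    for record_id, changed_at in source.items():
        if now - changed_at <= max_lag:
            continue  # still inside the allowed sync lag
        if record_id not in destination:
            missing.append(record_id)
        elif destination[record_id] < changed_at:
            stale.append(record_id)
    return {
        "expected": len(source),
        "missing": sorted(missing),
        "stale": sorted(stale),
        "gap": len(missing) + len(stale),
    }


def stream_is_silent(last_event_at, now, expected_within=timedelta(minutes=30)):
    """True when no event has arrived for longer than we would expect."""
    return now - last_event_at > expected_within
```

The `gap` value is the number to graph and alert on. It can grow while the error rate stays at zero, which is exactly the case the opening story describes.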

Turning signals into alerts that mean something

Monitoring data is useless if nobody looks at it, and a dashboard nobody opens protects no one. The job of alerting is to pull a human in at exactly the moment a human is needed, and no other time. Get this wrong in either direction and the monitoring fails: too few alerts and you miss the incident, too many and the team learns to ignore the ones that matter.

The principle that keeps alerting sane is to page on symptoms, not causes. A symptom is something a customer would feel: sync lag past an agreed threshold, the reconciliation gap growing, the p99 latency over budget, the error rate on a critical endpoint above a line. A cause is an internal detail: CPU is high, a queue is deep, one retry failed. Causes are useful for debugging once you are already looking, but they make terrible pages, because most of them resolve on their own and none of them are guaranteed to hurt anyone.

Severity | Example condition | Where it goes
Page now | Sync stopped, reconciliation gap spiking, auth fully broken | On-call, immediately
Ticket today | Error rate up on one endpoint, latency budget at risk | Owner's queue, same day
Watch | A deprecation header appeared, slow lag creep | Dashboard and weekly review

Two rules keep the page channel trustworthy. First, every page must be actionable: if the on-call person cannot do anything about it right now, it should not have paged. An alert that fires and resolves before anyone can act just trains the team to silence alerts. Second, alert on a trend with a threshold and a duration, not on a single bad data point, so one slow call or one transient 500 does not wake anyone. "p99 over budget for five minutes" is a signal. One slow request is noise. The goal is that when the page fires, the on-call engineer believes it, because the system has never cried wolf.
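The "threshold and a duration" rule is simple enough to sketch. Most alerting tools offer this as a built-in setting, so treat the class below as an illustration of the logic rather than something to deploy; the class name and its interface are hypothetical.

```python
from datetime import datetime, timedelta


class SustainedThresholdAlert:
    """Fires only when a metric stays over a threshold for a full duration."""

    def __init__(self, threshold, duration):
        self.threshold = threshold
        self.duration = duration
        self._breach_started = None  # when the current breach began, if any

    def observe(self, value, at):
        """Record one reading; return True if the alert should fire now."""
        if value <= self.threshold:
            self._breach_started = None  # recovery resets the clock
            return False
        if self._breach_started is None:
            self._breach_started = at
        return at - self._breach_started >= self.duration
```

With a 2,000 ms threshold and a five-minute duration, a single 8,000 ms reading followed by a normal one never fires. A run of readings over 2,000 ms fires only once the run is five minutes old, which is the difference between a signal and one slow request.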

Partner API changes: the failure you did not write

Most integration failures over a long enough horizon are not your bugs. They are the partner changing their API. A field gets renamed, an endpoint gets deprecated, a default changes, validation tightens, an auth flow gets revised. You did nothing, your code is unchanged, and the integration breaks anyway, because the ground it stood on moved. Treating partner changes as a surprise is the most common reason a healthy integration suddenly is not.

The good news is that responsible partners signal changes before they make them, and there is a standard for the most important signal. The IETF's Sunset HTTP header specification, RFC 8594, defines a header a partner can return to tell you when a resource will stop working, which lets your monitoring detect a deprecation the moment the partner starts advertising it rather than the day it takes effect. If a partner sends Sunset or Deprecation headers, watch for them in your responses and alert on their first appearance. That single check converts a future outage into a scheduled task with lead time.
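A header check like this can run on every response your client already receives. The sketch assumes the response headers are available as a plain mapping; the function name and the shape of the returned dictionary are hypothetical. The Sunset value is parsed as an HTTP-date, while the Deprecation value is passed through untouched, because partners format it in different ways.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def deprecation_signals(headers, now=None):
    """Return lifecycle warnings found in one response's headers.

    `headers` is any mapping of header name to value.
    """
    now = now or datetime.now(timezone.utc)
    lowered = {name.lower(): value for name, value in headers.items()}
    signals = {}
    if "deprecation" in lowered:
        signals["deprecation"] = lowered["deprecation"]  # kept raw; formats vary
    if "sunset" in lowered:
        signals["sunset_raw"] = lowered["sunset"]
        try:
            sunset_at = parsedate_to_datetime(lowered["sunset"])
            if sunset_at.tzinfo is None:
                sunset_at = sunset_at.replace(tzinfo=timezone.utc)
            signals["days_until_sunset"] = (sunset_at - now).days
        except (TypeError, ValueError):
            pass  # unparseable date: the raw value is still reported
    return signals
```

An empty result is the normal case. The first non-empty result for a given endpoint is the event worth recording, because it marks the start of your lead time.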

Beyond headers, build a habit of watching the partner the way a partner engineer watches you:

  • Subscribe to the partner's changelog and developer announcements. This is the cheapest possible monitoring, and it is the one most teams skip after launch. Someone should own reading it.
  • Watch for deprecation and sunset signals in responses. A new warning header, a changed Deprecation field, or a documentation note is your lead time. Use it.
  • Detect schema drift automatically. If a response gains, loses, or retypes a field, your code may keep running while silently mishandling data. A schema check on responses catches a contract change before it corrupts your sync.
  • Build for graceful degradation. When a partner endpoint fails or changes shape, the integration should degrade to a clear, contained failure rather than a cascade. Fail one record loudly, not the whole batch silently.
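The schema check in the list above does not need a full validation library to be useful. Here is a deliberately small sketch that compares the top-level fields of a decoded JSON response against a recorded expectation. The function names and the expected-schema format are assumptions, and it ignores nested objects and nullable fields, which a real check would need to handle.

```python
def schema_of(payload):
    """Describe a decoded JSON object as {field_name: type_name}, top level only."""
    return {key: type(value).__name__ for key, value in payload.items()}


def schema_drift(expected, payload):
    """Compare a response against an expected {field_name: type_name} mapping."""
    observed = schema_of(payload)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "added": sorted(set(observed) - set(expected)),
        "retyped": sorted(
            key for key in set(expected) & set(observed)
            if expected[key] != observed[key]
        ),
    }
```

A renamed field shows up as one entry under `missing` and one under `added`, which is the pattern from the opening story. Routing a non-empty result to the owner's queue turns a silent contract change into a same-day ticket.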

There is a security angle to partner changes that is easy to miss. A changed auth flow, a new scope requirement, or a token-handling change is both a functional break and a security event, and the way you store and rotate partner credentials matters as much as how you call the API. The OWASP API Security Project is a solid independent reference for the risks that live in integration code specifically, from broken authentication to excessive data exposure, and it is worth reviewing when a partner revises anything in their auth surface. We cover the versioning side of staying compatible through change in our integration versioning guide.

Ownership and the runbook: who fixes it at 2 AM

The best monitoring in the world fails if the alert reaches no one, or reaches someone who does not know what to do. The last mile of maintaining a shipped integration is organizational, not technical: a named owner and a runbook. An integration owned by "the team" is owned by no one, and the discovery that nobody owns it always happens at the worst possible moment, when it is already broken and the customer is already emailing.

Give every shipped integration a single named owner, the person responsible for its health, its alerts, and its response to partner changes. That owner does not have to be the one who fixes every incident, but they are the one who makes sure it gets fixed and that the runbook stays current. When ownership is clear, an alert has a destination. When it is diffuse, an alert is a hot potato that lands in a shared channel and gets ignored until it becomes a customer escalation.

A runbook turns a 2 AM page from a research project into a checklist. It does not need to be long. It needs to answer the questions an on-call engineer will have when they are woken up by an integration they may not have built:

  • What does this integration do, and who relies on it? One paragraph, so the responder understands the blast radius.
  • What are the dashboards and where are they? Direct links, not "search for it."
  • What are the common failures and their fixes? Auth expiry, rate limits, the partner's known weak endpoints, each with the first thing to try.
  • How do we contact the partner, and what is their status page? When the failure is on their side, the runbook should say how to confirm that and who to reach.
  • How do we degrade or pause safely? Sometimes the right move at 2 AM is to pause the sync cleanly and fix it in daylight, and the runbook should say how.

Review the runbook on a schedule, ideally whenever the integration or the partner's API changes. A runbook that describes an integration as it was two quarters ago is worse than none, because it sends the responder down a path that no longer exists. The same discipline of ownership and review that keeps your own API healthy, covered in our partner-ready API guide, keeps the integrations built on top of it healthy too.

Common mistakes, and the fix

Treating launch as the finish line. The fix: budget for the running cost from the start, with a named owner, a dashboard, and alerts in place before the launch announcement, not after the first incident.

Monitoring only errors. The fix: watch all four signal families, errors, latency, sync health, and partner-side change. A clean error rate hides stale data, slow calls, and silent sync stops.

Skipping reconciliation. The fix: count what should have synced against what did, on a schedule, independent of real-time success. It is the only check that verifies the integration is doing its job.

Alerting on causes instead of symptoms. The fix: page on what a customer would feel, with a threshold and a duration. Send internal causes to a dashboard for debugging, not to the on-call phone.

Assuming the partner's API will not change. The fix: subscribe to their changelog, watch for sunset and deprecation signals in responses, detect schema drift, and design for graceful degradation when it changes anyway.

Leaving the integration unowned. The fix: a single named owner and a current runbook, reviewed when anything changes. An unowned integration is an outage waiting for the worst possible time.

FAQ

What should I monitor in a shipped integration? Four signal families: errors grouped by cause and endpoint, latency as a distribution rather than an average, sync health through reconciliation rather than call success, and partner-side changes through changelogs and deprecation signals. Watching only errors is the most common gap, because an integration can have a clean error rate while returning stale data, running slowly, or silently dropping records.

Why is sync health separate from error monitoring? Because a sync can fail without any call erroring. A webhook that never arrives produces no error, a record wrongly filtered out produces no error, and a batch that times out partway may log one timeout while hundreds of records silently never moved. Reconciliation, counting what should have synced against what did, is the only check that catches these, and it is the single highest-value piece of integration monitoring.

How do I avoid alert fatigue? Page on symptoms a customer would feel, not on internal causes, and require every alert to be actionable and to fire on a threshold held over a duration rather than on a single data point. Route lower-severity signals to a dashboard or a daily queue instead of the on-call phone. The goal is that when a page fires, the on-call engineer believes it, because the system has never paged for something that did not matter.

What do I do when a partner changes their API? Catch it early and degrade gracefully. Subscribe to the partner's changelog, watch responses for Sunset and Deprecation headers and for schema drift, and design the integration so a changed or failing endpoint produces a clear, contained failure rather than a cascade. Treat a changed auth flow as a security event as well as a functional one, and revisit how you store and rotate partner credentials when it happens.

How should I handle rate limits in a live integration? Treat a 429 as expected behavior, not a malfunction. Honor the Retry-After header when the partner sends one, back off and batch rather than retrying immediately, and if you are consistently hitting the limit, request a higher one or move from polling to webhooks. A polling integration hits limits far more often than a webhook-driven one, which is one reason to prefer webhooks where the partner supports them.

Who should own a shipped integration? One named person responsible for its health, alerts, and response to partner changes, backed by a runbook that any on-call engineer can follow at 2 AM. An integration owned by "the team" is owned by no one, and that gap is always discovered at the worst moment. The owner keeps the runbook current and makes sure incidents get fixed, even if they do not personally fix each one.

How much monitoring is enough for a small team? Start with the highest-value checks and grow from there: a reconciliation gap alert, a sync-lag alert, an error-rate alert per critical endpoint, and a watch for sunset headers. That handful catches the failures that actually reach customers. You do not need a full observability platform on day one. You need the symptom alerts that fire before a customer emails support.

The short version

A shipped integration is a commitment with no end date, and the failures that hurt are slow and silent, not loud. Monitoring it means watching four signal families, not one: errors grouped by cause, latency as a distribution, sync health through reconciliation, and partner-side changes through changelogs and deprecation headers. Alert on the symptoms a customer would feel, with thresholds that keep the page channel trustworthy, and give every integration a named owner and a current runbook so the alert reaches someone who can act.

None of this is exotic. It is the discipline of treating a live integration as a product you operate, not a project you finished. The teams that do it catch the drift before the customer does, and the integration they spent a quarter building keeps earning its place instead of quietly eroding.

If you want an outside pair of eyes on exactly that, a Partner Audit reviews your integrations, your monitoring, and your partner readiness, then gives you a concrete plan: what to watch, what to alert on, and how to stay ahead of the next partner API change.
