Rate limits and quotas partners can design around

An independent guide to API rate limiting partners can design around. Limit strategies, the headers to return, backoff and retries, tiered quotas, and how to communicate limits clearly.

A partner ships an integration on Friday, traffic climbs over the weekend, and on Monday their app starts getting rejected with no warning and no clue how long to wait. They did nothing wrong. They simply hit a limit in your API that they could not see and could not plan for. Rate limiting is not the villain in that story. The missing headers, the silent rejections, and the limit nobody documented are. A good rate limit is one a partner can design around before they ever hit it, and the difference between that limit and the one in the Monday story is mostly engineering you control.

This is an independent guide to rate limiting for an API that partners build on. It is written from the side of the team running the API, the producer, because that is where the choices that help or hurt partners get made. The specifics of any one platform change, and the right numbers depend on your infrastructure and your traffic, so treat the figures here as illustrations rather than recommendations. What does not change is the shape of the work: pick a limit strategy, return the headers that make the limit visible, give clients a clear way to back off, structure quotas by tier, and communicate all of it so a partner can plan. That shape is what this guide covers.

It pairs with our guide to a partner-ready API, our breakdown of API error design, and our webhooks vs polling comparison, since polling pressure and rate limits are two sides of the same load question.

The 60-second version

  • Rate limiting protects everyone, including partners. A limit keeps one noisy client from degrading the API for the rest, which is a service to the partners who behave.
  • Pick a strategy that fits your traffic. Fixed windows, sliding windows, and token buckets behave differently under bursts. Choose the one that matches how your partners actually call you.
  • Make the limit visible in every response. Return headers that say what the limit is, how much is left, and when it resets, so a client can pace itself instead of guessing.
  • Tell clients how to back off. A 429 with a clear retry signal turns a rejection into a polite "try again in N seconds" rather than a wall.
  • Structure quotas by tier, not one number for everyone. Different partners have different needs, and a tiered model lets you say yes to scale without saying yes to abuse.
  • Communicate limits before partners hit them. Document the numbers, the headers, and the backoff behavior, so the limit is a design constraint, not a surprise.
  • Design for graceful degradation. When a partner approaches a limit, the best APIs slow them down predictably rather than cutting them off without warning.

Why rate limiting helps partners, not just you

It is easy to frame rate limiting as the API protecting itself from partners. That framing is backwards, and it leads to limits that feel adversarial. A rate limit is how the API protects every well-behaved partner from the one client that, through a bug or a runaway loop, would otherwise consume all the capacity and degrade the service for everyone else. The partner whose integration keeps working during someone else's incident is the one your limit just protected.

That reframing matters because it changes how you design the limit. If the goal is to punish heavy use, you build a blunt wall and a partner hits it with no warning. If the goal is fair sharing of a finite resource, you build a limit that is visible, predictable, and accompanied by the information a client needs to stay under it. The same number can be either, depending on whether you surface it. The work in this guide is mostly about making a necessary limit feel fair, because a fair limit is one partners design around instead of fighting.

There is a reliability argument too. An API with no limits is an API one bad client can take down, which means an outage for every partner at once. A sensible limit is a stability feature, and stability is the thing partners value most in something they have built a product on top of. The limit is not the cost of integrating with you. It is part of what makes integrating with you safe.

Step 1: choose a limit strategy

The first decision is the algorithm, because different strategies behave very differently under the bursty, uneven traffic real partners generate. The common mistake is to pick the simplest one without thinking about how partners actually call the API, then discover it either rejects legitimate bursts or lets through floods it should have caught. Match the strategy to your traffic.

The three approaches most APIs start from:

Fixed window. Count requests in a fixed clock interval, say per minute, and reject anything over the limit until the window rolls over. It is the simplest to implement and the easiest to explain, which is its appeal. Its weakness is the boundary: a client can send a full window's worth of requests at the end of one minute and another full window's worth at the start of the next, briefly doubling the intended rate right at the edge.
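As a rough illustration of the boundary problem, the sketch below implements a fixed-window counter and replays a burst that straddles a window edge. The class name, the `allow` method, the key `partner-a`, and the limit of 100 per 60 seconds are all hypothetical choices for the example, and the caller passes in the time so the behavior is easy to follow.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per `window_s` clock interval.

    Illustrative only: counts for old windows are never evicted here.
    """

    def __init__(self, limit, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)  # (key, window index) -> requests seen

    def allow(self, key, now):
        # Requests are bucketed by which clock window `now` falls into.
        window = int(now // self.window_s)
        if self.counts[(key, window)] >= self.limit:
            return False
        self.counts[(key, window)] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_s=60)

# 100 requests just before the window rolls over, 100 just after.
late = sum(limiter.allow("partner-a", now=59.5) for _ in range(100))
early = sum(limiter.allow("partner-a", now=60.5) for _ in range(100))
burst_in_one_second = late + early  # both batches are accepted
```

Both batches pass because each lands in a different clock window, so the client briefly gets twice the intended rate.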

Sliding window. Track requests over a rolling interval rather than a fixed clock boundary, so the limit applies to "the last 60 seconds" at any moment rather than "this calendar minute." It smooths out the boundary problem of the fixed window at the cost of more bookkeeping. It is a good default when bursts at window edges are causing real load spikes.
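One way to picture the extra bookkeeping is a sliding-window log that stores a timestamp per accepted request. This is a minimal sketch, not a production design: the class and method names are hypothetical, and real deployments often use an approximation that needs less memory than a full log.

```python
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests in any rolling `window_s` seconds."""

    def __init__(self, limit, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = {}  # key -> deque of accepted request timestamps

    def allow(self, key, now):
        log = self.hits.setdefault(key, deque())
        # Drop timestamps that have aged out of the rolling window.
        while log and log[0] <= now - self.window_s:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


limiter = SlidingWindowLimiter(limit=100, window_s=60.0)

# The same edge burst that slips past a fixed window.
late = sum(limiter.allow("partner-a", now=59.5) for _ in range(100))
early = sum(limiter.allow("partner-a", now=60.5) for _ in range(100))
```

Here the second batch is rejected, because the first 100 requests are still inside the last 60 seconds when it arrives.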

Token bucket. Refill a bucket of tokens at a steady rate, spend one token per request, and reject when the bucket is empty. It allows short bursts up to the bucket size while enforcing a sustained average rate over time, which often matches real usage well: partners are quiet, then do a burst of work, then go quiet again. The token bucket is forgiving of legitimate bursts while still capping sustained abuse.
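The refill-and-spend behavior can be sketched in a few lines. The names `TokenBucket`, `capacity`, and `refill_per_s` are assumptions made for this example, and the numbers (a bucket of 10 refilled at one token per second) are placeholders rather than recommendations.

```python
class TokenBucket:
    """Token bucket: bursts up to `capacity`, sustained rate of `refill_per_s`."""

    def __init__(self, capacity, refill_per_s, now=0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)  # start full so a quiet client can burst
        self.updated = now

    def allow(self, now):
        # Add the tokens earned since the last call, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = max(self.updated, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=10, refill_per_s=1.0)

burst = sum(bucket.allow(now=0.0) for _ in range(15))  # 10 accepted, 5 rejected
later = sum(bucket.allow(now=5.0) for _ in range(8))   # 5 tokens refilled, 5 accepted
```

The burst size and the sustained rate are separate knobs, which is what makes this strategy a comfortable fit for quiet-then-bursty partners.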

A quick way to choose:

| If your traffic is | Reach for | Because |
| --- | --- | --- |
| Steady and you want simplicity | Fixed window | Easiest to build and explain |
| Bursty at window edges | Sliding window | Removes the boundary doubling problem |
| Quiet then bursty, then quiet | Token bucket | Allows bursts, caps sustained rate |
| A mix across many partners | Token bucket per partner | Fairness with burst tolerance |

Whichever you pick, apply the limit per partner or per credential, not globally across all of them. A single global limit means one busy partner can starve everyone else, which is the exact failure rate limiting is supposed to prevent. Scope the limit to the credential so each partner gets their own fair share.
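A small simulation shows why the scope matters. It is a sketch under simplified assumptions: a single window, a plain counter instead of a real algorithm, and two made-up credentials named `noisy` and `quiet`.

```python
from collections import Counter


def simulate(requests, limit, scope):
    """Count accepted requests per credential within one window.

    `requests` is a list of credential ids in arrival order.
    `scope` is "global" (one shared counter) or "credential" (one counter each).
    """
    used = Counter()
    accepted = Counter()
    for credential in requests:
        key = credential if scope == "credential" else "*"
        if used[key] < limit:
            used[key] += 1
            accepted[credential] += 1
    return accepted


# A runaway loop on one partner, followed by modest traffic from another.
traffic = ["noisy"] * 1000 + ["quiet"] * 10

shared = simulate(traffic, limit=100, scope="global")
scoped = simulate(traffic, limit=100, scope="credential")
```

With the shared counter the quiet partner gets nothing, because the noisy one has already used the whole budget. With per-credential counters the quiet partner's ten requests all succeed.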

Step 2: return the headers that make the limit visible

A limit a client cannot see is a limit a client will hit. The single highest-leverage thing you can do for partners is to put the state of their limit in every response, so a well-written client can pace itself and never trigger a rejection in the first place. This is the difference between a limit that surprises partners and one they design around, and it costs you a few headers.

There is a widely used convention for this. On every response, return how many requests the client is allowed in the current window, how many remain, and when the window resets:

| Header | What it tells the client |
| --- | --- |
| RateLimit-Limit | The ceiling for the current window |
| RateLimit-Remaining | How many requests are left before rejection |
| RateLimit-Reset | When the window resets and the count refills |

The exact header names vary by platform, and some APIs use an X- prefixed variant of the same three values, so document whichever you choose. Document the format of the reset value too, since some APIs send a number of seconds until the reset and others send a timestamp. What matters is that all three pieces of information are present on normal responses, not only on the rejection. A client that can read remaining and reset on a successful call can slow itself down as it approaches the limit, spreading work out instead of slamming into the wall and bouncing off it.
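To make the pacing idea concrete, here is a minimal sketch of both sides: a producer helper that builds the three headers, and a client helper that turns them into a delay. The function names are hypothetical, and the sketch assumes the reset value is a whole number of seconds until the window resets, which is only one of the formats in use.

```python
def rate_limit_headers(limit, remaining, reset_in_s):
    """Producer side: the three values to attach to every response."""
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        # Assumption: seconds until reset, not an absolute timestamp.
        "RateLimit-Reset": str(max(0, int(reset_in_s))),
    }


def pacing_delay(headers):
    """Client side: seconds to wait between calls so the remaining
    budget is spread across the rest of the window."""
    remaining = int(headers["RateLimit-Remaining"])
    reset_in_s = int(headers["RateLimit-Reset"])
    if remaining <= 0:
        return float(reset_in_s)  # nothing left, wait for the reset
    return reset_in_s / remaining


headers = rate_limit_headers(limit=100, remaining=20, reset_in_s=40)
delay = pacing_delay(headers)  # 40 seconds / 20 requests
```

A client that sleeps for `delay` between calls uses up its remaining budget just as the window resets, instead of spending it all at once and then stalling.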

When the client does cross the line, the right status code is 429, "Too Many Requests," which exists for exactly this case. Pair it with a clear signal of how long to wait, covered next. The principle is the same one in our API error design guidance: an error should tell the client what happened and what to do about it, and a 429 with a retry signal does both. A bare 429 with no headers tells the client only that it failed, which leaves it guessing, and guessing clients retry too fast and make the problem worse.
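A 429 that says what happened and what to do might be assembled like the sketch below. The JSON field names (`error`, `message`, `retry_after_seconds`) are invented for the example, not a standard, and the function is framework-agnostic on purpose: it simply returns a status, headers, and body.

```python
import json
from http import HTTPStatus


def too_many_requests(limit, retry_after_s):
    """Build a rate-limit rejection as (status, headers, body)."""
    status = HTTPStatus.TOO_MANY_REQUESTS  # 429
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after_s),
        # Same three headers as on successful responses.
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": str(retry_after_s),
    }
    # Hypothetical body shape; use whatever your error format already defines.
    body = json.dumps({
        "error": "rate_limited",
        "message": f"Rate limit exceeded. Retry in {retry_after_s} seconds.",
        "retry_after_seconds": retry_after_s,
    })
    return int(status), headers, body


status, headers, body = too_many_requests(limit=100, retry_after_s=30)
```

Keeping the limit headers on the rejection means a client needs only one parsing path for both successful and limited responses.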

Step 3: tell clients how to back off

Returning a 429 is only half the job. The other half is telling the client when to try again, because a rejection without a wait time invites the worst possible response: an immediate retry, then another, hammering an API that is already signaling it is at capacity. The backoff behavior you guide clients toward is what separates a limit that recovers gracefully from one that turns into a retry storm.

The clearest signal is the Retry-After header on a 429 response, which tells the client how long to wait before trying again, most commonly as a number of seconds (the header also allows an HTTP date). When you can compute it, send it, because it removes all guesswork: the client waits the stated time and tries once. This is the most partner-friendly thing a rate-limited endpoint can do, and it is cheap to add.
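On the client side, reading Retry-After takes only a few lines. The sketch below handles both forms the header can take, a number of seconds or an HTTP date, using the standard library. The function name and the choice to return `None` for unparseable values are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return the wait in seconds, or None if the value cannot be parsed."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "30"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 30, tzinfo=timezone.utc)
from_seconds = retry_after_seconds("30")
from_date = retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now)
```

A caller that gets `None` back can fall back to its own backoff schedule rather than retrying immediately.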

When you cannot give an exact time, guide clients toward exponential backoff with jitter:

  • Exponential backoff means each retry waits longer than the last, say one second, then two, then four, so a client that is being limited steps back progressively instead of retrying at a fixed fast interval.
  • Jitter means adding a small random amount to each wait, so a thousand clients that all got limited at the same instant do not all retry at the same instant. Without jitter, synchronized retries create a thundering herd that re-triggers the limit the moment it lifts.
  • A retry ceiling means giving up after a sensible number of attempts and surfacing the failure, rather than retrying forever. A client stuck in an infinite retry loop is a client generating load with no path to success.
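The first two ideas in the list above can be sketched as a single delay function. The base of one second, the 30-second cap, and the half-second jitter range are placeholder values, and the function name is hypothetical. Each simulated client gets its own random source so the example shows two clients drifting apart.

```python
import random


def backoff_delay(retry_number, base_s=1.0, cap_s=30.0, jitter_s=0.5, rng=random):
    """Seconds to wait before retry `retry_number` (1 for the first retry)."""
    # Double the base wait on each retry, up to a ceiling on any single wait.
    exponential = min(cap_s, base_s * 2 ** (retry_number - 1))
    # Add a small random amount so clients do not retry in lockstep.
    return exponential + rng.uniform(0.0, jitter_s)


# Two clients limited at the same instant, each with its own random source.
client_a = random.Random(1)
client_b = random.Random(2)
schedule_a = [backoff_delay(n, rng=client_a) for n in (1, 2, 3)]
schedule_b = [backoff_delay(n, rng=client_b) for n in (1, 2, 3)]
```

Both schedules follow the same roughly 1, 2, 4 second shape, but the jitter offsets differ, so the two clients do not hit the API at the same moment.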

A simple backoff flow for a client to follow:

| Attempt | Wait before retry | Note |
| --- | --- | --- |
| 1 (initial) | none | The request that got the 429 |
| 2 | Retry-After, or ~1s + jitter | Respect the header if present |
| 3 | ~2s + jitter | Double the base wait |
| 4 | ~4s + jitter | Keep doubling |
| 5+ | stop and surface | Do not retry forever |
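The flow in the table could look like this in a client or SDK. It is a sketch rather than a reference implementation: `send` stands in for whatever function performs the request and returns a `(status, headers, body)` tuple, `RateLimitedError` is a made-up exception, and the sketch honors Retry-After only in its seconds form.

```python
import random
import time


class RateLimitedError(Exception):
    """Raised when the retry ceiling is reached while still being limited."""


def call_with_backoff(send, max_attempts=4, base_s=1.0, jitter_s=0.25,
                      sleep=time.sleep, rng=random):
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts:
            break  # retry ceiling reached, surface the failure
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            wait = float(retry_after)  # the server said exactly how long
        else:
            # ~1s, ~2s, ~4s ... plus jitter, doubling on each retry.
            wait = base_s * 2 ** (attempt - 1) + rng.uniform(0.0, jitter_s)
        sleep(wait)
    raise RateLimitedError(f"still rate limited after {max_attempts} attempts")
```

Passing `sleep` and `rng` as parameters is a testing convenience: a test can record the waits instead of actually sleeping.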

You cannot force partners to implement this, but you can make it the obvious path by returning Retry-After, documenting the backoff you expect, and ideally shipping it in any SDK you provide so the right behavior is the default. An SDK that handles backoff correctly means most partners get it for free, which is the same logic as our partner-ready API guidance: the producer makes the right thing easy.

Step 4: structure quotas by tier

A single rate limit applied to every partner is a compromise that fits none of them: too low for the partner running real volume, too high for the trial account you do not yet trust. Tiered quotas solve this by matching the limit to the relationship, so you can grant scale to partners who have earned it without exposing the API to abuse from accounts you know nothing about.

The idea is to define a small number of tiers, each with its own limits, and assign partners to a tier based on their plan, their stage, or their track record:

| Tier | Typical fit | Limit posture |
| --- | --- | --- |
| Free or trial | New, unproven accounts | Conservative, enough to evaluate |
| Standard | Paying partners in production | Comfortable for normal volume |
| Scale or enterprise | High-volume, established partners | High, often negotiated |
| Internal or trusted | First-party and vetted partners | Highest, monitored closely |

A few principles keep a tiered model honest. Make the tiers and their limits public where it makes sense, so a partner can see what scaling up gets them and plan a migration before they outgrow the current tier. Give partners a clear path to a higher tier, whether that is upgrading a plan or requesting an increase, because a partner who cannot grow on your API will eventually build around it or leave. And separate the rate limit (requests per unit of time) from any longer-window quota (requests per day or month) if you use both, because they answer different questions and a partner needs to see both to plan capacity.
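In configuration terms, a tier can carry both numbers side by side. The sketch below is illustrative only: the tier keys, the figures, and the result strings are placeholders invented for the example, not suggested values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    requests_per_minute: int  # rate limit: how fast
    requests_per_day: int     # quota: how much


# Placeholder figures purely for illustration.
TIERS = {
    "trial": Tier("Free or trial", requests_per_minute=60, requests_per_day=5_000),
    "standard": Tier("Standard", requests_per_minute=600, requests_per_day=200_000),
    "scale": Tier("Scale or enterprise", requests_per_minute=6_000, requests_per_day=5_000_000),
}


def check(tier, used_this_minute, used_today):
    """Report which constraint, if any, blocks the next request."""
    if used_today >= tier.requests_per_day:
        return "quota_exceeded"  # wait for the daily reset or upgrade
    if used_this_minute >= tier.requests_per_minute:
        return "rate_limited"    # back off briefly and retry
    return "ok"


trial = TIERS["trial"]
```

Reporting the two outcomes separately matters to the client: a rate limit clears in seconds, while an exhausted quota will not clear until the longer window rolls over.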

The tiering logic is the same one behind partner program tiers: a tier is a way to offer more to partners who have demonstrated they will use it well, while keeping a sensible default for everyone else. The limit is not just a technical control. It is part of the commercial relationship, and treating it that way lets you say yes to growth deliberately.

Step 5: communicate limits before partners hit them

The best rate limit in the world fails its partners if they discover it by hitting it. Everything in the previous steps, the strategy, the headers, the backoff, the tiers, only helps a partner who knows it exists before they ship. Communication is not the last step bolted on after the engineering. It is what turns the engineering into something a partner can design around.

What clear communication of limits includes:

  • The numbers, in the docs. State the actual limits for each tier, the window they apply over, and any separate daily or monthly quota. A partner sizing an integration needs the real figures, not "reasonable use."
  • The headers, explained. Document the exact header names you return, what each one means, and that they appear on normal responses, so partners build clients that read and respect them.
  • The backoff you expect. Tell partners to honor Retry-After, to use exponential backoff with jitter otherwise, and to cap their retries. Spelling it out gets you better-behaved clients.
  • What a 429 looks like. Show the exact response a client gets when limited, including the status code and headers, so partners can handle it correctly on the first try.
  • How to request more. Give partners a clear path to a higher tier or a temporary increase for a known traffic event, so a launch does not turn into a wall.
  • Advance notice of changes. If you tighten a limit, tell partners before it takes effect, the same way you would for any breaking change, because a limit change can break a working integration.

This is the same discipline as documenting errors well, which we cover in API error design: the behavior a client must handle should be written down before the client encounters it. A partner who reads your limits, builds a client that reads your headers and backs off correctly, and picks the right tier will rarely hit a hard rejection. That is the goal. The limit still exists and still protects the service, but it has become a design constraint the partner planned for rather than an incident they ran into.

Common mistakes, and the fix

Treating the limit as a wall instead of a signal. The fix: return the limit, the remaining count, and the reset time on every response, so clients pace themselves before they hit the ceiling. A visible limit is one partners design around; an invisible one is one they crash into.

Returning a bare 429 with no retry guidance. The fix: send Retry-After when you can compute it, and document exponential backoff with jitter when you cannot. A rejection with no wait time invites immediate retries, which turn a brief limit into a retry storm.

Applying one global limit across all partners. The fix: scope the limit per partner or per credential, so one busy client cannot starve the rest. A global limit recreates the exact unfairness rate limiting is meant to prevent.

One rate limit for every partner regardless of relationship. The fix: define a small set of tiers with limits that match the relationship, and give partners a clear path to scale up. A single number is too low for your biggest partners and too generous for accounts you do not yet trust.

Documenting the limit only after partners complain. The fix: publish the numbers, the headers, the backoff expectation, and the 429 shape before partners build. A limit discovered by hitting it is a support ticket; a limit read in the docs is a design input.

Tightening a limit without notice. The fix: treat a stricter limit as a breaking change and give advance notice, because lowering a limit can break an integration that was working fine yesterday.

FAQ

What is the difference between a rate limit and a quota? A rate limit caps how fast a client can call you, usually requests per second or per minute, and protects the service from bursts. A quota caps how much a client can call you over a longer window, usually per day or per month, and is often tied to a plan. Many APIs use both: a per-minute rate limit to protect stability and a daily or monthly quota to match the commercial tier. A partner needs to see both to size an integration, so document them separately.

Which rate limiting algorithm should I use? It depends on your traffic. A fixed window is simplest but lets bursts double up at window boundaries. A sliding window fixes that boundary problem with more bookkeeping. A token bucket allows short legitimate bursts while capping the sustained rate, which often matches how partners actually call an API: quiet, then a burst of work, then quiet again. For most partner-facing APIs a per-partner token bucket is a sensible default, but match the choice to your real traffic patterns.

What status code should a rate-limited request return? 429, "Too Many Requests," which exists for exactly this case. Pair it with a Retry-After header when you can, so the client knows how long to wait, and include the same limit, remaining, and reset headers you return on normal responses. A 429 with a clear retry signal tells the client both what happened and what to do, which is what any good error response should do.

How should clients respond to a 429? Honor the Retry-After header if it is present and wait at least that long before trying again. If there is no such header, use exponential backoff, doubling the wait between attempts, with a small random jitter so many clients do not all retry at the same instant, and stop after a sensible number of attempts rather than retrying forever. If you ship an SDK, build this in so partners get the right behavior by default.

How do I set the actual limit numbers? The right numbers depend on your infrastructure, your costs, and your traffic, so there is no universal answer, and you should not copy figures from another API. Start from what your service can sustain comfortably with headroom, set tier limits that match the relationships you want to support, and watch real usage. It is easier to raise a limit that turned out to be conservative than to lower one partners have already built against, so start sensible and loosen deliberately.

How is rate limiting related to webhooks and polling? Polling is one of the biggest sources of rate-limit pressure, because a client checking for changes on a tight interval generates constant load whether or not anything changed. Offering webhooks lets partners react to events instead of polling for them, which reduces the call volume hitting your limits. We compare the two in webhooks vs polling, and the connection is direct: good event delivery is partly a rate-limiting strategy.
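The arithmetic behind that pressure is easy to check. In this sketch the 10-second interval, the 20 watched resources, and the 50 daily events are made-up figures, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polls_per_day(interval_s, resources=1):
    """Requests one client makes per day by polling each resource on a fixed interval."""
    return (SECONDS_PER_DAY // interval_s) * resources


one_resource = polls_per_day(interval_s=10)                   # 8,640 calls a day
twenty_resources = polls_per_day(interval_s=10, resources=20)  # 172,800 calls a day

# With webhooks the producer sends roughly one delivery per event,
# so 50 changes a day means about 50 deliveries instead.
events_per_day = 50
```

The polling load is the same whether or not anything changed, which is why it tends to dominate a partner's rate-limit budget.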

Further reading

  • MDN on the 429 status code for the standard meaning and intended use of "Too Many Requests."
  • MDN on the Retry-After header for how to tell a client when it may try again.
  • RFC 6585, which defines the 429 status code in the HTTP standard.
  • RFC 9110 for the broader HTTP semantics that status codes and headers fit into.
  • OWASP for the security perspective, since rate limiting is also a control against abuse and brute-force traffic.

The short version

Rate limiting is a service to your partners, not a tax on them, because it protects every well-behaved client from the one that would otherwise consume all the capacity. Pick a strategy that fits your traffic; a token bucket scoped per partner is a sensible default, since it tolerates legitimate bursts while capping sustained load. Make the limit visible by returning the ceiling, the remaining count, and the reset time on every response, so a good client paces itself and never hits the wall. When a client does cross the line, return a 429 with a clear retry signal, and guide clients toward exponential backoff with jitter so a brief limit does not become a retry storm. Structure quotas by tier so you can grant scale to partners who have earned it without exposing the API to accounts you do not yet trust. And communicate all of it, the numbers, the headers, the backoff, and any changes, before partners hit the limit, because a documented limit is a design constraint a partner plans for, while an undocumented one is an incident they run into.

If you want help making your API something partners can build on with confidence, a Partner Audit reviews your API, your developer experience, and your integration surface, then hands you a concrete plan for what to improve and where partners get stuck.
