If you build API integrations, you have probably run into this nightmare scenario: Your app is connected to a 3rd-party platform (Salesforce, Shopify, Google, etc.). Everything works flawlessly for days. Then, suddenly, the user's connection drops, and they have to manually re-authenticate. You check the logs, and you just see a bunch of 401 Unauthorized errors with no clear explanation.
More often than not, you are dealing with a Refresh Token Race Condition.
Here is exactly why it happens and the architecture pattern to fix it once and for all.
The Problem: The Concurrency Trap
When an access token expires, your server uses the refresh token to get a new one. Many APIs use Refresh Token Rotation, meaning once a refresh token is used, it is immediately invalidated and a new one is issued.
But what happens when your app fires 5 parallel background jobs at the exact same time, and the access token happens to be expired?
- All 5 jobs realize the access token is dead.
- All 5 jobs simultaneously send the same refresh token to the API provider.
- Job #1 succeeds, gets a new token pair, and the provider immediately revokes the old refresh token.
- Jobs #2, #3, #4, and #5 hit the provider a millisecond later with the now-revoked refresh token.
- The API provider flags this as a security risk (token reuse) and revokes ALL tokens.
- Your integration breaks. The user is logged out.
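To make the sequence above concrete, here is a minimal simulation. `RotatingProvider` is a hypothetical in-memory stand-in, not any real provider's API, and it assumes the provider revokes the whole grant when it sees a rotated token reused. The jobs run one after another here, which reproduces the same interleaving deterministically: every job has already read the same refresh token before the first refresh completes.

```python
import secrets


class RotatingProvider:
    """Toy stand-in for an OAuth server that rotates refresh tokens."""

    def __init__(self) -> None:
        self.current = secrets.token_hex(8)  # the only valid refresh token
        self.used: set[str] = set()          # refresh tokens already redeemed
        self.revoked = False                 # True once reuse is detected

    def refresh(self, refresh_token: str) -> str:
        if self.revoked:
            raise PermissionError("grant revoked")
        if refresh_token in self.used:
            # Assumed policy: reuse of a rotated token revokes the whole grant.
            self.revoked = True
            raise PermissionError("refresh token reuse detected")
        if refresh_token != self.current:
            raise PermissionError("unknown refresh token")
        self.used.add(refresh_token)
        self.current = secrets.token_hex(8)
        return self.current


def simulate(jobs: int = 5) -> tuple[int, int, bool]:
    """Return (successful refreshes, failed refreshes, grant revoked?)."""
    provider = RotatingProvider()
    # Every job reads the same stored refresh token before any of them refreshes.
    snapshots = [provider.current for _ in range(jobs)]
    succeeded = failed = 0
    for token in snapshots:
        try:
            provider.refresh(token)
            succeeded += 1
        except PermissionError:
            failed += 1
    return succeeded, failed, provider.revoked


if __name__ == "__main__":
    print(simulate())  # (1, 4, True)
```

One job wins, four fail, and the grant ends up revoked, so even the winner's new tokens are useless.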
To fix this, you need to ensure that only one process can ever attempt to refresh a token at a time, while the other processes wait for the new token. You can achieve this using a Distributed Lock (usually via Redis).
Here is the logic flow you need to implement in your request interceptor/middleware:
Step 1: Check the token
Before making an external API call, check if the access token is expired. If it’s valid, proceed as normal.
Step 2: Acquire the Lock
If the token is expired, attempt to acquire a Redis lock (e.g., lock:token_refresh:user_123).
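- If you GET the lock: You are the "Leader." Give the lock a short expiry so that a Leader that crashes mid-refresh cannot block everyone else forever.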
- If you DON'T get the lock: Another job is already refreshing the token. Do not make a refresh request. Instead, put this thread to sleep (e.g., check back every 100ms) until the lock is released, then use the freshly updated token from your database.
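A rough sketch of the lock step, assuming `client` is a redis-py `redis.Redis` instance (any object with compatible `set` and `eval` methods works). The function names, the 10-second TTL, and the random owner value are illustrative choices, not something the flow above prescribes. The owner value lets a job release only a lock it still holds.

```python
import secrets
from typing import Optional

# Delete the key only if it still holds our value, so a job whose lock
# already expired cannot release a lock that now belongs to someone else.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def try_acquire(client, key: str, ttl_ms: int = 10_000) -> Optional[str]:
    """Return an owner value if we became the Leader, else None."""
    owner = secrets.token_hex(16)
    # SET key value NX PX ttl: succeeds only if the key does not exist yet.
    if client.set(key, owner, nx=True, px=ttl_ms):
        return owner
    return None


def release(client, key: str, owner: str) -> bool:
    """Release the lock only if we still own it."""
    return client.eval(RELEASE_SCRIPT, 1, key, owner) == 1
```

A caller would do something like `owner = try_acquire(client, "lock:token_refresh:user_123")` and treat `None` as "I am a Follower, wait and re-read the token."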
Step 3: The "Leader" Refreshes
The job that got the lock (the Leader) calls the 3rd-party API to get the new tokens.
Step 4: Save and Release
The Leader saves the new Access and Refresh tokens to your database, and then releases the Redis lock.
Step 5: The "Followers" Resume
The other parallel jobs that were sleeping notice the lock is gone, pull the new access token from the database, and execute their API calls successfully.
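Putting the five steps together, the whole flow might look roughly like this. To keep the sketch runnable without a server, `InMemoryLock` stands in for the Redis lock and a plain dict stands in for your database. Both are placeholders, as are `Tokens`, `get_access_token`, and `refresh_fn` (the call to the 3rd-party API). The Leader also re-reads the token after taking the lock, in case another job finished refreshing a moment earlier.

```python
import threading
import time
from dataclasses import dataclass


@dataclass
class Tokens:
    access: str
    refresh: str
    expires_at: float  # Unix timestamp


class InMemoryLock:
    """Stand-in for the Redis lock so the sketch runs without a server."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        return self._lock.acquire(blocking=False)

    def release(self) -> None:
        self._lock.release()


def get_access_token(store: dict, lock: InMemoryLock, refresh_fn,
                     poll_s: float = 0.1, timeout_s: float = 5.0) -> str:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        tokens = store["tokens"]
        # Step 1: a valid token needs no refresh.
        if tokens.expires_at > time.time():
            return tokens.access
        # Step 2: try to become the Leader.
        if lock.try_acquire():
            try:
                # Re-read: another job may have refreshed just before us.
                tokens = store["tokens"]
                if tokens.expires_at <= time.time():
                    # Steps 3 and 4: refresh, then save before releasing.
                    store["tokens"] = refresh_fn(tokens.refresh)
                return store["tokens"].access
            finally:
                lock.release()
        # Step 5: Follower sleeps, then loops back to re-read the store.
        time.sleep(poll_s)
    raise TimeoutError("token refresh did not finish in time")
```

If five threads call `get_access_token` with the same `store` and `lock` while the token is expired, `refresh_fn` should run once and all five should return the new access token. The timeout is there so a Follower gives up instead of polling forever if the refresh never completes.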
Pro-Tip: Add an "Expiration Buffer"
Don't wait for a 401 Unauthorized to trigger a refresh. If a token expires at 12:00:00, treat it as expired at 11:55:00. With this 5-minute buffer, a background job that picks up a task either holds a token with time left on it or is funneled into the lock flow above before the token actually dies.
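The buffer check itself is tiny. A minimal sketch, assuming you store the expiry as a Unix timestamp (the name `needs_refresh` and the `now` parameter are just for illustration and testing):

```python
import time
from typing import Optional

EXPIRY_BUFFER_S = 300  # 5 minutes, matching the example above


def needs_refresh(expires_at: float, now: Optional[float] = None,
                  buffer_s: float = EXPIRY_BUFFER_S) -> bool:
    """True once we are inside the buffer window before the real expiry."""
    if now is None:
        now = time.time()
    return now >= expires_at - buffer_s
```

Use this in Step 1 in place of a plain "is it expired?" comparison. Pick a buffer comfortably shorter than the token's lifetime, or every request will trigger a refresh.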
Has anyone else lost their mind debugging this in the past? What locking mechanism do you use to prevent it?