How do I deal with an incident during a release?

Incidents are arguably the scariest moments in the development lifecycle. The whole team keeps receiving emails, and POs start pulling members into calls. What should you do? I was in exactly this situation last week, so I am noting down what I did to handle the incident.

1. Identify the Issue

In my case, the incident surfaced right after the release as unusual API calls to a particular endpoint. AWS CloudWatch suddenly emailed the team, and we immediately jumped on an emergency call. The first thing we did was gather as much information as possible about the incident by asking some questions:

Which API reached the threshold?
When did it happen? Was it constant, or did it affect only a small number of users?

Monitoring tools like AWS CloudWatch, Sentry, or our log service let us quickly identify the root cause, or at least understand what the problem was. After about an hour of discussion, we had narrowed the problem down to unusual calls to one particular endpoint.
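The triage questions above (which API, when, and how many users) can also be answered with a small script over exported logs. The following is only a sketch: the LogEntry shape, the findSpikes function, and the threshold value are hypothetical and not taken from our real log service, so adapt them to whatever your monitoring tool exports.

```typescript
// Hypothetical shape of one access-log entry; adapt it to your real log format.
interface LogEntry {
  timestamp: string; // ISO 8601, e.g. "2024-05-01T10:00:12Z"
  endpoint: string;
  userId: string;
}

interface Bucket {
  endpoint: string;
  minute: string; // minute bucket, e.g. "2024-05-01T10:00"
  calls: number;
  users: Set<string>;
}

interface Spike {
  endpoint: string;
  minute: string;
  calls: number;
  distinctUsers: number;
}

// Count calls per endpoint per minute and report the buckets above the threshold.
function findSpikes(entries: LogEntry[], threshold: number): Spike[] {
  const buckets = new Map<string, Bucket>();

  for (const entry of entries) {
    // The first 16 characters of an ISO timestamp identify the minute.
    const minute = entry.timestamp.slice(0, 16);
    const key = `${entry.endpoint}|${minute}`;
    const bucket: Bucket =
      buckets.get(key) ??
      { endpoint: entry.endpoint, minute, calls: 0, users: new Set<string>() };
    bucket.calls += 1;
    bucket.users.add(entry.userId);
    buckets.set(key, bucket);
  }

  const spikes: Spike[] = [];
  buckets.forEach((bucket) => {
    if (bucket.calls > threshold) {
      spikes.push({
        endpoint: bucket.endpoint,
        minute: bucket.minute,
        calls: bucket.calls,
        distinctUsers: bucket.users.size,
      });
    }
  });

  // Busiest buckets first.
  return spikes.sort((a, b) => b.calls - a.calls);
}
```

A high call count with a low distinctUsers value suggests that a few clients are hammering the endpoint, while a high value for both points to something affecting everyone.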

2. Reproduce the Issue

Once we had some guesses about the root cause, reproducing the issue was crucial to verify that we were implementing the correct fix. A reliable reproduction also makes it easier for testers, POs, and change managers to approve the changes. Here’s how we tried to reproduce the issue:

Tried different browser settings, such as incognito mode and a fully cleared cache.
Reached out to the user whose requests triggered the alert to find out what had happened.

Unfortunately, we couldn't pin down the exact steps to reproduce the issue, but we couldn't ignore it either: we knew there were unusual calls to a certain API. So, how did we handle this endpoint call?

3. Fix the Issue

When we checked the code, we verified that the mechanism for calling the endpoint hadn't changed in the last three months, and many people used the API daily without issues. In our project, we used React Query for fetching and caching data. Here’s what I did:

Double-checked the query client configuration.
Reviewed recent deployments to see if any changes might have affected the API.
Analyzed logs to find patterns or anomalies related to the unusual API calls.
Consulted with team members about potential edge cases or overlooked scenarios that might cause the issue.
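As an illustration of the first check, the sketch below flags query options that tend to increase request volume. It is not our actual configuration or the fix we shipped. The option names (staleTime, refetchOnWindowFocus, refetchInterval, retry) and their fallback defaults follow React Query's documented behavior, but the QueryOptionsSubset interface is a simplified local mirror, and auditQueryOptions and MIN_POLL_INTERVAL_MS are hypothetical names introduced only for this example.

```typescript
// Simplified local mirror of a few React Query option names. In a real project
// these live under defaultOptions.queries on the QueryClient or on each query.
interface QueryOptionsSubset {
  staleTime?: number;
  refetchOnWindowFocus?: boolean;
  refetchInterval?: number | false;
  retry?: number | boolean;
}

// Hypothetical cut-off: polling more often than this deserves a second look.
const MIN_POLL_INTERVAL_MS = 5000;

// Return human-readable warnings for settings that can multiply API calls.
function auditQueryOptions(options: QueryOptionsSubset): string[] {
  const warnings: string[] = [];

  // Fall back to React Query's defaults when an option is not set.
  const staleTime = options.staleTime ?? 0;
  const refetchOnWindowFocus = options.refetchOnWindowFocus ?? true;
  const refetchInterval = options.refetchInterval ?? false;
  const retry = options.retry ?? 3;

  if (staleTime === 0) {
    warnings.push("staleTime is 0: cached data is always stale, so new mounts refetch");
  }
  if (refetchOnWindowFocus) {
    warnings.push("refetchOnWindowFocus is on: stale queries refetch on every tab focus");
  }
  if (
    refetchInterval !== false &&
    refetchInterval > 0 &&
    refetchInterval < MIN_POLL_INTERVAL_MS
  ) {
    warnings.push(`refetchInterval is ${refetchInterval} ms: the endpoint is polled aggressively`);
  }
  if (retry === true) {
    warnings.push("retry is true: failing requests are retried indefinitely");
  } else if (typeof retry === "number" && retry > 3) {
    warnings.push(`retry is ${retry}: each failure can trigger many extra calls`);
  }

  return warnings;
}
```

Calling auditQueryOptions({}) returns two warnings, because the library defaults already combine a staleTime of 0 with refetching on window focus. That combination is worth ruling out when a single user generates far more calls than expected.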

By following these steps, we addressed the incident effectively.

Handling incidents during a release requires a systematic approach to identify, reproduce, and fix issues. By staying calm, gathering information, and working methodically, you can resolve incidents efficiently and minimize their impact on the project. Always remember to review your processes and learn from each incident to improve your response strategies for the future. #tuanhuydev