One of the scariest moments in the development lifecycle is an incident. The whole team starts receiving alert emails, and POs begin pulling members into calls. What should you do? I was in exactly that situation last week, so I am noting down what I did to handle the incident.
We heard about the incident right after the release: in my case, there were unusual API calls hitting a particular endpoint. AWS CloudWatch suddenly emailed the team, and we immediately jumped onto an emergency call. The first thing we did was gather as much information about the incident as possible, asking what had changed in the release and when the unusual traffic started.
By checking monitoring tools like AWS CloudWatch, Sentry, and our log service, we could quickly narrow down the root cause, or at least understand the scope of the problem. After about an hour of discussion, we confirmed that the unusual calls were all coming from a single endpoint.
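To make the "narrowing down" concrete, here is a minimal sketch of how exported access-log entries could be counted per endpoint to surface an outlier. The log shape, field names, and threshold are illustrative assumptions, not our actual CloudWatch setup:

```typescript
// Hypothetical log entry shape; adjust to whatever your log export produces.
interface LogEntry {
  endpoint: string;
  timestamp: number; // epoch milliseconds
}

// Count how many calls each endpoint received.
function countByEndpoint(entries: LogEntry[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of entries) {
    counts.set(e.endpoint, (counts.get(e.endpoint) ?? 0) + 1);
  }
  return counts;
}

// Flag endpoints whose call count exceeds `factor` times the median count.
// The median is more robust than the mean when one endpoint dominates.
function findOutliers(counts: Map<string, number>, factor = 3): string[] {
  const sorted = [...counts.values()].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  return [...counts.entries()]
    .filter(([, n]) => n > factor * median)
    .map(([endpoint]) => endpoint);
}
```

In practice a CloudWatch alarm does this detection for you; a script like this is mainly useful afterwards, to confirm which endpoint is the anomaly before the call ends.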
Once we had a hypothesis about the root cause, reproducing the issue was crucial to verify that we were implementing the correct fix. Reliable reproduction steps also help testers, POs, and change managers approve the changes more easily, so we set out to reproduce the issue.
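One way to hunt for reproduction steps is to group the logged calls by session and replay each suspicious session's sequence in a test environment. The sketch below shows the grouping step; the record shape and field names are assumptions for illustration:

```typescript
// Hypothetical call record pulled from logs; field names are assumed.
interface CallRecord {
  sessionId: string;
  endpoint: string;
  timestamp: number; // epoch milliseconds
}

// For each session, return the ordered list of endpoints it called,
// so a suspicious session can be replayed step by step elsewhere.
function sequencesBySession(records: CallRecord[]): Map<string, string[]> {
  const bySession = new Map<string, CallRecord[]>();
  for (const r of records) {
    const list = bySession.get(r.sessionId) ?? [];
    list.push(r);
    bySession.set(r.sessionId, list);
  }
  const result = new Map<string, string[]>();
  for (const [id, list] of bySession) {
    list.sort((a, b) => a.timestamp - b.timestamp);
    result.set(id, list.map((r) => r.endpoint));
  }
  return result;
}
```

A session whose sequence ends in a burst of identical calls is a good candidate to replay first.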
Unfortunately, we couldn't pin down exact steps to reproduce the issue, but we couldn't ignore it either. We knew there were unusual calls hitting one API, so how should we handle that endpoint?
When we checked the code, we verified that the mechanism for calling the endpoint hadn't changed in the last three months, and many people used the API daily without issues. In our project, we use React Query for data fetching and caching, so I focused on how the query for this endpoint was configured.
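React Query already deduplicates concurrent requests for the same query key and serves cached data within its `staleTime` window, which is exactly the behavior that can prevent an endpoint from being hammered. To show the idea without depending on React, here is a plain-TypeScript sketch that imitates that mechanism; the class and names are illustrative, not React Query's actual API:

```typescript
// Illustrative sketch of the caching + deduplication idea React Query
// applies per query key. Not React Query's real implementation.
type Fetcher<T> = () => Promise<T>;

class QueryCache {
  private inFlight = new Map<string, Promise<unknown>>();
  private cached = new Map<string, { value: unknown; at: number }>();

  constructor(private staleTimeMs: number) {}

  async fetch<T>(key: string, fetcher: Fetcher<T>): Promise<T> {
    // A still-fresh cached value is returned without a network call.
    const hit = this.cached.get(key);
    if (hit && Date.now() - hit.at < this.staleTimeMs) {
      return hit.value as T;
    }
    // Deduplication: concurrent callers share one in-flight promise.
    const pending = this.inFlight.get(key);
    if (pending) return pending as Promise<T>;

    const p = fetcher().then((value) => {
      this.cached.set(key, { value, at: Date.now() });
      this.inFlight.delete(key);
      return value;
    });
    this.inFlight.set(key, p);
    return p;
  }
}
```

With React Query itself, the equivalent lever is setting a sensible `staleTime` (and avoiding unstable query keys) so re-renders don't translate into repeated calls to the same endpoint.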
This approach let us address the incident effectively.
Handling incidents during a release requires a systematic approach to identify, reproduce, and fix issues. By staying calm, gathering information, and working methodically, you can resolve incidents efficiently and minimize their impact on the project. Always remember to review your processes and learn from each incident to improve your response strategies for the future. #tuanhuydev