How running with default configs in production kept us busy for days

#DebuggingFeb

👩‍💻 A little about me

Firstly, thank you for showing interest in this article. This is my first article, and I couldn’t have asked for a better topic to write about.

One of the things I like best about being a developer is the art of debugging.

Though it often gets to a point where we feel helpless after trying the 101 ideas we came up with to debug an issue, the feeling of accomplishment after finally finding it is unmatched. I am sure most of us have experienced that moment of pride in getting to the root cause.

More than fixing an issue, investigating it and getting to its root cause is what excites me 🤩

💡 Context

It’s a Spring Cloud Stream based microservice that uses Apache Kafka as the message queue. The application is a consumer service that is part of a consumer group with 3-5 instances. The expected volume is ~10k+ messages in the queue at specific times of the day. The consumer reads every message, transforms it and loads it into a Postgres DB. Processing 10k messages usually takes ~7-10 minutes.
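
For reference, here is a minimal sketch of what such a consumer binding could look like; the binding, topic and group names are hypothetical, not our actual config. The shared group is what lets the 3-5 instances split the topic's partitions between them.

```yaml
# Minimal Spring Cloud Stream binding sketch; names are illustrative only.
spring:
  cloud:
    stream:
      bindings:
        ingest-in-0:
          destination: incoming-events    # Kafka topic that receives the ~10k messages
          group: ingest-consumer-group    # shared by the 3-5 instances, so partitions are split across them
```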

⚔️ Error handling

If there are validation errors, it pushes the events to a Dead Letter Queue (DLQ).

If there are transient errors (temporary errors that will not recur if the processing is retried), the consumer retries processing the event 3 times. If it still fails, the event is pushed to the DLQ.
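
A rough sketch of how this kind of retry-then-DLQ behaviour is typically wired up with the Kafka binder's DLQ support (again, binding and topic names are hypothetical, not our actual config):

```yaml
# Sketch of retry + DLQ settings using the Kafka binder's DLQ support.
spring:
  cloud:
    stream:
      bindings:
        ingest-in-0:
          consumer:
            maxAttempts: 3          # total processing attempts per message before giving up
      kafka:
        bindings:
          ingest-in-0:
            consumer:
              enableDlq: true       # records that exhaust their retries are published to the DLQ
              dlqName: ingest-dlq   # hypothetical DLQ topic name
```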

🔎 Problem

The consumers kept processing the same messages over and over again, and this went on for around 12 hours before it was identified.

🩹 Temporary measure

We added a feature flag in the consumer to simply consume every message and skip its processing. This was an easy way of flushing the messages in the queue. [This was possible because we had the flexibility to push records to the queue manually, so the messages were not lost and we could replay them later.]
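
The flag itself was nothing fancy; conceptually it boils down to a property like the one below (the property name is made up), which the consumer checks before doing any real work.

```yaml
# Hypothetical property backing the "consume but skip processing" feature flag.
app:
  consumer:
    skip-processing: true   # when true, messages are read and acknowledged but not processed
```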

🪤 Impact

Due to the continuous processing of messages over 12 hours across all the instances of the consumer group, the CPU utilisation of the EC2 instances increased drastically. (We certainly did not design it to work this way.) And since the messages were never processed successfully, the data did not get reflected in the other systems.

🔦 Root cause

As part of the message retry implementation, we went with the Spring Kafka default configs and set the messages to be retried 3 times, once every 15 minutes; if a message still failed, it was pushed to the DLQ topic.

This is where we found the problem. By default, Kafka expects each consumer to poll for messages at least once every 5 minutes (the max.poll.interval.ms consumer config, which defaults to 300000 ms). If the consumer does not poll within that window, Kafka assumes it is no longer able to read and process messages and kicks it out of the consumer group.

At that point, the consumer rejoins the group the next time it polls, which with our 15-minute back-off was only after the retry wait was over, and then the same thing happened again. Since the offsets for these messages were never committed, every rebalance handed the same messages back out, and all the 3-5 consumer instances ended up processing them over and over in a loop.
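
To make the collision concrete, here is a sketch of the kind of retry settings described above, assuming the standard Spring Cloud Stream consumer retry properties (the binding name is hypothetical). Because these retries block the listener thread, it sleeps through the back-off and never calls poll(), so a failing message can keep the consumer silent for roughly 45 minutes while Kafka only tolerates 5.

```yaml
# Sketch of the retry settings that caused the loop (values as described above).
spring:
  cloud:
    stream:
      bindings:
        ingest-in-0:
          consumer:
            maxAttempts: 3                   # processing attempts per message
            backOffInitialInterval: 900000   # 15 minutes between attempts, in ms
            backOffMaxInterval: 900000
            backOffMultiplier: 1.0           # keep the interval fixed at 15 minutes
# ~3 x 15 minutes of blocking retries vs. max.poll.interval.ms = 300000 ms (5 minutes):
# the broker evicts the consumer long before the retries finish.
```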

🔆 Solution

We tuned the Kafka configs based on the performance of our consumer and updated the Spring Cloud Stream retry config to a much smaller interval (retry every 30 seconds, up to 5 times), so that all the retries for a message complete well within the 5-minute poll interval.
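
Roughly, the updated settings look like the sketch below (same hypothetical binding name as earlier). Five attempts spaced 30 seconds apart add up to a couple of minutes at most, which stays comfortably inside the default 5-minute max.poll.interval.ms window.

```yaml
# Sketch of the updated retry settings: short, bounded back-off.
spring:
  cloud:
    stream:
      bindings:
        ingest-in-0:
          consumer:
            maxAttempts: 5                  # up to 5 processing attempts per message
            backOffInitialInterval: 30000   # 30 seconds between attempts
            backOffMaxInterval: 30000
            backOffMultiplier: 1.0          # keep the interval fixed at 30 seconds
```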

🔖 Lessons learnt from this not-so-minor issue

  1. Do not always go with the default configs, especially for the libraries you use. Know the configs and make mindful decisions about the configuration in prod.

  2. Even if you’re confident that your change is too small to break anything in prod, test it in the lower environments anyway! (We tested the retry behaviour with 1-minute intervals in the local environment because we didn't want to wait 45 minutes to watch 3 retries at 15-minute intervals; that shortcut is exactly why the issue was not caught in the lower environments.)

  3. We had to improve the monitoring we had configured in AWS CloudWatch. It took us a day to even identify that this was happening, because we had not configured metrics for such errors in the service.

📿 My debugging mantra

  1. We have a fix for every problem, but understanding the problem is more important.

  2. Break the problem down as much as possible and find solutions one by one, until you reach the eureka moment! 😎

Thanks for reading till the end! I hope this article was useful, and I would love to hear your feedback!