Why "don't touch if it works" feels broken

The first tongue-in-cheek rule of programming says: "don't touch if it works". It means you should not spend much time re-checking things that have proved reliable in a working system. But sometimes things still go wrong.

Introduction

Currently, I'm working on a system that consists of microservices. A couple of our services were written years ago and do their job well. One of them is a module for sending push notifications with Amazon Simple Notification Service (SNS).
The idea of this module was to get a list of registered devices, keep only the required ones (based on input filter parameters) and send a one-to-one notification to each device. The service is well optimized for large device lists and can send notifications to a big list of devices in a short time. It worked this way for two or three years and everyone was happy. But...

Small change

This push module was also able to send individual messages to one specific device (we used it for testing and for sending personalized notifications). This was not the main function, but it also worked well and we had been sending such messages for a long time. Individual messages were sent with higher priority and did not get stuck in a queue while a push to a big list of devices was in progress.
One day we realized that filtering devices inside the push module no longer met our new requirements. We already had a centralized module (let's call it "sender") which also filtered users and their devices (for sending SMS or e-mails). So two modules now had duplicated user-filtering logic (the "sender" module and the push module). The plan was to move all filtering into the "sender" module and leave the push module with one job: sending notifications. As the push module already supported individual notifications, the fastest fix was to have the "sender" module send every notification as an individual one. This solution worked well for 4 to 6 months. But then...

Unexpected problems

Summer is the time when we usually run big promotion campaigns and send a lot more push notifications than on a regular day. But this year the sending speed became much worse. Instead of sending a million notifications in 20 minutes, it was now taking 2-3 hours, which is not acceptable! We started to dig in.
The first thought was that the problem was on the Amazon side (because the push module had been working well for years and, we thought, there had been no big changes in it). But based on our tests Amazon SNS worked well and had no maintenance going on at that time.
Then we started to blame our "sender" service for slow filtering (again, no one thought about the push module itself, as it had worked well for years!). But "sender" also worked well and was not our bottleneck.

Why we were not able to identify this

When sending messages, the push module uses a few queues inside RabbitMQ: one queue is used for formatting messages, another for looking up the device's real Amazon SNS endpoint in the database and filtering out inactive devices, and a third for actually pushing messages to devices. When sending notifications to a big list of users we had always seen a lot of pending messages in the queues, so a large backlog did not catch our attention when we examined the RabbitMQ queues after noticing the speed problem. And at some point...

Finally

I decided to check how the push module works inside (I had worked with this module before, but still was not 100% sure how it works from end to end). I took a sheet of paper and a pencil and started to write down the path of messages through the module's queues.
I found that the module has 2 types of queues for each stage (the stages I described above).
The first type of queues, I will call them "immediate" queues, was used for sending individual messages. They had one processing worker per queue and no prefetch of messages from the queue, and were designed to run fast and independently of any messages in the queues of the second kind.
The second kind of queues, I will call them "push" queues, was used for the messages that went through the push module's own filtering. These queues had 3 processing workers each, with a prefetch of 100 messages per worker (to reduce the number of network calls).
So after our "small change" the push module started to use only the "immediate" queues, which were processed 3-4 times slower than the "push" queues.
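To make the difference between the two kinds of queues concrete, here is a minimal sketch in Python. The worker and prefetch numbers come from the description above; the class name, the field names and the choice to model "no prefetch" as a prefetch of 1 are my assumptions for illustration, not the module's real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueProfile:
    """Hypothetical description of one kind of queue in the push module."""
    name: str
    workers: int   # processing workers attached to each queue
    prefetch: int  # messages one worker may hold unacknowledged at a time

    def max_in_flight(self) -> int:
        # Upper bound on messages being handled at once for one queue.
        return self.workers * self.prefetch


# Assumption: "no prefetch" is modelled as one message at a time per worker.
IMMEDIATE = QueueProfile("immediate", workers=1, prefetch=1)
PUSH = QueueProfile("push", workers=3, prefetch=100)

for profile in (IMMEDIATE, PUSH):
    print(profile.name, profile.max_in_flight())
```

The in-flight number is only an upper bound on concurrency, not a throughput figure. The slowdown we actually measured was the 3-4 times mentioned above.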

Solution

The solution was pretty simple - move messages for big push campaigns to the optimized "push" queues and keep individual messages in the slower but independent "immediate" queues. To do this I added a new flag to the message that says whether it is an individual "fast" message or not, and the "sender" module is responsible for setting this flag. The push module now puts non-individual messages into the "push" queues and individual messages into the "immediate" queues. So the problem is solved - we send promotional campaign messages fast, and at the same time individual messages do not get stuck in the queue and reach the users. Everyone is happy!
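The routing by flag can be sketched roughly like this. It is a simplified Python illustration, not the module's real code: the `individual` field, the stage names and the queue naming scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PushMessage:
    device_id: str
    body: str
    # Hypothetical name for the flag that the "sender" module sets.
    individual: bool = False


def choose_queue(message: PushMessage, stage: str) -> str:
    """Pick the queue for a given processing stage based on the flag."""
    kind = "immediate" if message.individual else "push"
    # Assumed naming scheme: "<stage>.<kind>", e.g. "format.push".
    return f"{stage}.{kind}"


campaign = PushMessage(device_id="device-1", body="Summer sale!")
personal = PushMessage(device_id="device-2", body="Your order shipped", individual=True)

print(choose_queue(campaign, "format"))   # format.push
print(choose_queue(personal, "format"))   # format.immediate
```

Defaulting the flag to the bulk path is a design choice in this sketch: a caller that forgets to set it ends up in the optimized "push" queues rather than flooding the "immediate" ones.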

Conclusions

A service should do one thing and do it well. In our case the service also did an unrelated job, which is a design problem.
Services need to be monitored. If we had had good monitoring that showed us something was going wrong, this possibly would not have become a problem.
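As an illustration of the kind of check that could have caught this earlier, here is a small Python sketch that compares the observed sending rate with an expected one. The baseline uses the numbers from this story (a million notifications in 20 minutes); the function names and the 50% tolerance are assumptions, and a real setup would rather feed such numbers into a proper monitoring system.

```python
def notifications_per_second(sent: int, elapsed_seconds: float) -> float:
    """Average sending rate over the measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return sent / elapsed_seconds


def is_too_slow(sent: int, elapsed_seconds: float,
                expected_rate: float, tolerance: float = 0.5) -> bool:
    """True if the observed rate fell below tolerance * expected_rate."""
    return notifications_per_second(sent, elapsed_seconds) < expected_rate * tolerance


# Baseline from the story: 1,000,000 notifications in 20 minutes.
BASELINE_RATE = notifications_per_second(1_000_000, 20 * 60)

# The same campaign taking 2 hours should raise an alert.
if is_too_slow(1_000_000, 2 * 60 * 60, BASELINE_RATE):
    print("ALERT: push campaign is sending much slower than expected")
```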

So don't trust anyone, even past you. And if possible, try to check and predict how your changes will affect the system before moving them to prod (customers don't like being testers most of the time). Test, test and test your solutions in a test environment before moving to prod :)

Thank you for your attention!

P.S. If you are interested in microservices and don't know where to start, I recommend the book "Building Microservices: Designing Fine-Grained Systems" by Sam Newman. It is a good read about ways to build and connect your microservices. It helped me understand how to build microservices and how to set up good communication between them.
