
Why does "don't touch if it works" feel broken


The first semi-serious rule of programming says: "if it works, don't touch it". It means you shouldn't spend much time re-checking things that have proved reliable in a working system. But sometimes things go wrong anyway.

Introduction

I'm currently working on a system built from microservices, and we have a couple of services that were written years ago and do their job well. One of them is a module for sending push notifications through Amazon Simple Notification Service (SNS).
The idea of this module is to take a list of registered devices, keep only the required ones (based on input filter parameters), and send a one-to-one notification to each device. The service is well optimized for large device lists and can push notifications to a huge audience in a short time. It worked this way for two or three years and everyone was happy. But...
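For context, a one-to-one push through SNS boils down to publishing to the device's platform endpoint ARN. Here is a minimal sketch with boto3; the region, ARN and payload are made-up placeholders, not our real setup:

```python
import json
import boto3

sns = boto3.client("sns", region_name="eu-west-1")  # region is an assumption

def send_push(endpoint_arn: str, body: str) -> None:
    """Publish one notification to a single device's SNS platform endpoint."""
    sns.publish(
        TargetArn=endpoint_arn,
        MessageStructure="json",
        Message=json.dumps({
            "default": body,
            # platform-specific payloads (GCM/APNS) would go here; simplified for the sketch
        }),
    )

# send_push("arn:aws:sns:eu-west-1:123456789012:endpoint/GCM/my-app/abc123", "Summer sale starts today!")
```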

Small change

The push module was also able to send individual messages to a single specific device (we used this for testing and for personalized notifications). This wasn't its main function, but it also worked well, and we had been sending such messages for a long time. Individual messages were sent with higher priority and didn't get stuck in a queue behind a push to a big list of devices.
One day we realized that filtering devices inside the push module no longer met our requirements. We already had a centralized module (let's call it "sender") that also filtered users and their devices (for sending SMS and e-mails), so two modules now duplicated the user-filtering logic. The sensible plan was to move all filtering into the "sender" module and leave the push module with a single responsibility: sending notifications. Since the push module already supported individual notifications, the fastest fix was to simply start sending individual notifications from the "sender" module. This solution worked well for four to six months. But then...

Unexpected problems

Summer is when we usually run big promotional campaigns and send far more push notifications than on a regular day. But this year the sending speed dropped dramatically: instead of delivering a million notifications in 20 minutes, it was now taking 2-3 hours, which is not acceptable! We started to dig in.
Our first thought was that the problem was on Amazon's side (after all, the push module had worked well for years and, as far as we knew, had no big changes). But our tests showed that Amazon SNS was working fine and had no ongoing maintenance at the time.
Then we started to blame our "sender" service for filtering too slowly (again, nobody even considered the push module itself, because it had worked well for years!). But the "sender" also turned out to be fine and was not the bottleneck.

Why we were not able to identify this

When sending messages, the push module uses several queues inside RabbitMQ: one queue for formatting messages, another for looking up the device's real Amazon SNS endpoint in the database and filtering out inactive devices, and a third for actually pushing messages to devices. While sending notifications to a big list of users we always had a lot of pending messages in the queues, so nothing looked unusual when we examined RabbitMQ after noticing the speed problems. And at some point...

Finally

I decided to check how the push module actually works inside (I had worked with this module before, but still wasn't 100% sure how it works end to end). I took a sheet of paper and a pencil and started to trace the path of messages through the module's queues.
I found that the module has two types of queues for each stage (the stages I described above).
The first type, which I'll call "immediate" queues, was used for sending individual messages. Each had a single processing worker and no message prefetch, and was designed to be fast and independent of whatever was sitting in the queues of the second kind.
The second kind, which I'll call "push" queues, was used for the messages the push module had filtered itself. These queues had three processing workers each, with a prefetch of 100 messages per worker (to reduce the number of network calls).
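To make the difference concrete, here is a minimal sketch of the two consumer profiles in Python with pika. The queue names and handler are my own placeholders; only the worker counts and prefetch values come from the description above:

```python
import pika

def run_worker(queue_name: str, prefetch_count: int, on_message) -> None:
    """Consume one RabbitMQ queue with the given prefetch."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=queue_name, durable=True)
    channel.basic_qos(prefetch_count=prefetch_count)  # how many unacked messages a worker buffers
    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    channel.start_consuming()

# "Immediate" queue: one worker process, no read-ahead -- low latency for single messages.
#   run_worker("push.deliver.immediate", prefetch_count=1, on_message=handle_message)
# "Push" queue: three worker processes, each prefetching 100 messages -- throughput for campaigns.
#   run_worker("push.deliver.bulk", prefetch_count=100, on_message=handle_message)
```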
So after our "small change", the push module started to use only the "immediate" queues, which processed messages 3-4 times slower than the "push" queues.

Solution

The solution was pretty simple: route messages for big push campaigns to the optimized "push" queues, and individual messages to the slower but independent "immediate" queues. To do this, I added a new flag to the message saying whether it is an individual "fast" message or not, and the "sender" module became responsible for setting this flag. Now the push module puts non-individual messages into the "push" queues and individual messages into the "immediate" queues. The problem is solved: promotional campaigns are sent fast, and at the same time individual messages don't get stuck in a queue and reach their users. Everyone is happy!
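A rough sketch of that routing decision, again in Python with pika; the flag name, queue names and message shape are illustrative placeholders rather than the real contract:

```python
import json
import pika

IMMEDIATE_QUEUE = "push.deliver.immediate"  # single worker, no prefetch
BULK_QUEUE = "push.deliver.bulk"            # three workers, prefetch of 100

def enqueue_notification(channel, message: dict) -> None:
    """Route a notification based on the flag set by the "sender" module."""
    queue = IMMEDIATE_QUEUE if message.get("is_individual") else BULK_QUEUE
    channel.basic_publish(
        exchange="",
        routing_key=queue,
        body=json.dumps(message),
        properties=pika.BasicProperties(delivery_mode=2),  # keep the message persistent
    )

# Campaign messages go to the bulk queue, individual ones to the immediate queue:
#   enqueue_notification(channel, {"device_id": "device-123", "is_individual": False})
#   enqueue_notification(channel, {"device_id": "device-456", "is_individual": True})
```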

Conclusions

  • A service should do one thing and do it well. In our case, the push module was also doing an unrelated job (filtering devices), which was a design problem.
  • Services need to be monitored. If we had had good monitoring that showed us something went wrong, this probably wouldn't have become a problem (see the sketch after this list).
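For example, even a tiny periodic check through the RabbitMQ management HTTP API can flag a growing backlog or a queue with no consumers. A rough sketch; the URL, credentials and threshold are placeholders:

```python
import requests

RABBIT_API = "http://localhost:15672/api/queues"  # management plugin must be enabled
MAX_READY = 50_000                                # hypothetical backlog threshold

def check_queue_backlogs() -> None:
    """Warn about RabbitMQ queues with a large backlog or no consumers."""
    response = requests.get(RABBIT_API, auth=("guest", "guest"), timeout=5)
    response.raise_for_status()
    for queue in response.json():
        ready = queue.get("messages_ready", 0)
        consumers = queue.get("consumers", 0)
        if ready > MAX_READY or consumers == 0:
            print(f"ALERT: {queue['name']}: {ready} messages ready, {consumers} consumers")

# Run this periodically (cron, scheduled task) and feed the output into your alerting.
```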
So don't trust anyone, not even your past self. Try to check and predict how your changes will affect the system before moving them to prod (customers don't like being testers most of the time). And whenever possible, test, test and test your solutions in a test environment before moving to prod :)

Thank you for your attention!

P.S. If you're interested in microservices and don't know where to start, I recommend the book "Building Microservices: Designing Fine-Grained Systems" by Sam Newman. It's a good read about ways to build and connect your microservices, and it helped me understand how to build them and how to design good communication between them.
