Long live Nxs-backup v3.0!
Four years ago, the Nixys team decided to build our own backup tool.
nxs-backup is a tool that solves the following problems:
- organizing backups with built-in tools;
- delivery of backups to different storages;
- rotation of backups created in storages.
Let’s start by recalling the basic requirements we had for the tool when we wrote the previous version:
Backup data of the most commonly used software:
- Files (discrete and incremental backups)
- MySQL (logical/physical backups)
- PostgreSQL (logical/physical backups)
- MongoDB
- Redis
Store backups in remote repositories:
- S3
- FTP
- SSH
- SMB
- NFS
- WebDAV
Since then we have had new requirements for our tool. Below I will describe each of them in more detail.
Running a binary on any Linux without rebuilding from source code
Over time, the list of systems we work with has grown significantly. We now serve projects that, in addition to the standard deb- and rpm-compatible distributions, use systems such as Arch, SUSE, ALT, and others.
It was difficult to run nxs-backup on the latter systems, since we only built deb and rpm packages and supported a limited list of system versions. On some servers we rebuilt the whole package, on others only the binary, and on some we simply had to run from source. One project also used servers with ARM processors, which we had to tinker with as well.
Working with the old version was very inconvenient for the engineers because of the need to deal with the sources, and installation and updates in this mode were much more time-consuming: instead of configuring 10 servers an hour, we had to spend an hour on just one server.
We knew it was much better to have a binary without system dependencies that could run on any distribution and avoid problems with library versions and architectural differences between systems. We wanted our tool to work the same way.
A minimal Docker image with nxs-backup and ENV support in configuration files
Lately, a lot of projects run in container environments. These projects also need backups, so we run nxs-backup in containers too. And for container environments it is very important to minimize image size and to be able to work with environment variables.
The old version did not allow working with environment variables. The main problem was that passwords had to be stored directly in the config. Because of this, instead of a set of variables containing only passwords, we had to put the entire config into a variable. Editing large environment variables requires more concentration from engineers and makes troubleshooting a bit more difficult.
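To illustrate the idea, here is a minimal sketch of the general technique: reading a config file and substituting environment-variable references before parsing it, so only the secrets live in the environment. This uses the standard library's os.ExpandEnv and is not the actual nxs-backup implementation; the file name is a placeholder.

```go
package main

import (
	"fmt"
	"os"
)

// expandConfig substitutes ${VAR} / $VAR references in the raw config text
// with values from the environment, so secrets such as passwords can be
// kept out of the file itself. Generic illustration, not nxs-backup code.
func expandConfig(path string) (string, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return os.ExpandEnv(string(raw)), nil
}

func main() {
	// e.g. the config contains a line like: password: "${MYSQL_PASSWORD}"
	cfg, err := expandConfig("nxs-backup.conf")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(cfg)
}
```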
Also, the old version forced us to use an already large Debian image into which we additionally had to install a number of libraries and applications to make backups work correctly, which bloated the size further.
Even with the slim version of the image, the minimum size we got was about 250 MB, which is a lot for one small utility. In some cases this affected backup start time, because the image took a long time to download to the node. We wanted an image no larger than 50 MB.
Working with remote storages without FUSE
Another problem for container environments is using FUSE to mount remote storages.
As long as you are running backups on the host it is fine: install the right packages, enable FUSE in the kernel, and you are good to go.
Things get more interesting when you need FUSE in a container. Without privilege escalation and direct access to the host kernel, the task cannot be solved, and that is a significant degradation of security.
This has to be negotiated with customers, and not all of them are willing to relax their security policies. Because of that we had to build horrible workarounds that I would rather not remember. Besides, the extra layer increases the probability of failures and requires additional monitoring of the state of the mounted resources.
It is safer and more stable to work with remote storages using their API directly.
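As a sketch of what "using the API directly" looks like, here is a minimal example of uploading a finished backup archive to an S3-compatible storage with the minio-go client. The client library, endpoint, bucket, and paths are just illustrative assumptions, not necessarily what nxs-backup itself uses.

```go
package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Connect to an S3-compatible storage directly over its API --
	// no FUSE mount and no extra privileges are needed in a container.
	client, err := minio.New("s3.example.com", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Upload a finished backup archive as a regular S3 object.
	_, err = client.FPutObject(context.Background(),
		"backups", "mysql/2023-01-01.tar.gz", "/var/backups/2023-01-01.tar.gz",
		minio.PutObjectOptions{ContentType: "application/gzip"})
	if err != nil {
		log.Fatal(err)
	}
}
```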
Monitoring status and sending notifications not only to email
Teams are not using email much in their day-to-day work these days. It’s understandable: it’s much faster to discuss something in a group chat or on a call. This is why Telegram, Slack, Mattermost, MS Teams, and other similar products are so widespread.
Here, too, we have a bot that receives various alerts and notifies us about them. And of course, we’d like to see messages about backups failing in the work chat, not in the mail, among hundreds of other emails. By the way, some customers also want to see information about failures in their Slack or other messenger.
In addition, we have long wanted to be able to track the status and see the details of jobs in real time. To do this, however, we need to change the format of the application and turn it into a daemon.
Insufficient performance
Another major pain point was the lack of performance in certain scenarios.
One client has a huge file dump of nearly a terabyte, made up entirely of small files: texts, pictures. We build incremental copies of this data and have the following problem: a yearly copy takes three days to build. That’s right, the old version simply cannot digest this volume in less than a day.
Given the circumstances, we effectively have no way to restore data for a specific date, which does not suit us.
Searching for a solution
All of the problems above caused, to a greater or lesser extent, quite tangible pain for the IT team, forcing it to spend precious time on work that, however important, could have been avoided. Moreover, in certain situations they also created risks for business owners: the probability of ending up without data for a particular day is extremely low, but it is non-zero. We refused to put up with this state of affairs.
Perhaps something has changed in four years and new tools have appeared, we thought. We had to check, and also see whether anything had changed in the tools we had already reviewed.
The result of our search was disappointing. We audited and looked into a couple of new tools that we hadn’t considered before. Here they are:
- restic
- rubackup
But, like the previously considered tools, these didn’t suit us either, because they didn’t fully meet our requirements. The result of our work is a new version of nxs-backup.
Nxs-backup 3.0
Key features of the new version:
- All storages and all types of backups implement appropriate interfaces (see the sketch after this list). Jobs and storages are initialized at startup, not at runtime.
- Remote storages are no longer mounted via FUSE; they are accessed via their APIs. We use different libraries to do this.
- Thanks to the go-nxs-appctx mini-framework for applications, which we use in our projects, you can now use environment variables in configs.
- The ability to send log events via hooks has been added. You can set different levels and receive only errors or only events of the desired level.
- Running time and resource consumption when working with a large number of objects have been significantly reduced.
- The application is built for different processor architectures without using C libraries.
- We also changed the delivery format. Now it is a tar archive on GitHub or a Docker image with the binary inside, which has no system dependencies.
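To make the first point more concrete, here is a rough sketch of what interface-based storages and jobs can look like. The names and method sets are hypothetical illustrations of the design, not the actual nxs-backup types.

```go
package backup

import "context"

// Storage is an illustrative interface that every remote storage
// (S3, FTP, SSH, SMB, NFS, WebDAV, local) could implement.
type Storage interface {
	// DeliverBackup uploads a finished local backup to the remote path.
	DeliverBackup(ctx context.Context, localPath, remotePath string) error
	// DeleteOldBackups applies the rotation policy on the remote side.
	DeleteOldBackups(ctx context.Context, retentionDays int) error
	Close() error
}

// Job is an illustrative interface for a backup job
// (files, MySQL, PostgreSQL, MongoDB, Redis, ...).
type Job interface {
	// DoBackup creates the backup locally and delivers it to every storage.
	DoBackup(ctx context.Context, storages []Storage) error
}
```

With this shape, all jobs and storages can be constructed and validated once at startup, so a misconfigured storage fails fast instead of breaking a job halfway through a run.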
Now backups just work on any Linux with kernel 2.6 or higher. This made it very easy to work with non-standard systems and sped up building Docker images. The image itself was reduced to 23 MB (including the extra mysql and psql clients).
We have tried to keep most of the configs and application logic, but some changes are still present. All of them are related to optimization and fixing bugs of the previous version.
For example, we moved the connection parameters for remote storages into the main config, so you no longer have to specify them for each type of backup every time.
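The idea, sketched below with hypothetical field names (this is not the actual nxs-backup config schema), is that storage connections are declared once and jobs only reference them by name:

```go
package backup

// Config sketches a layout where storage connections are declared once
// in the main config and referenced by name from individual jobs.
type Config struct {
	StorageConnects []StorageConnect // defined once, globally
	Jobs            []JobConfig
}

type StorageConnect struct {
	Name     string // e.g. "s3_main"
	Type     string // s3, ftp, ssh, smb, nfs, webdav
	Endpoint string
	User     string
	Password string
}

type JobConfig struct {
	Name     string
	Type     string   // files, mysql, postgresql, ...
	Storages []string // references by name, e.g. ["s3_main"]
}
```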
Performance testing
Of course, we were interested in the performance results. We tested incremental and discrete file backups of a data directory containing 10 million files.
The comparisons were:
- Bash script with tar under the hood,
- Python version of nxs-backup (hereafter denoted as nb 2),
- Go-lang version of nxs-backup (hereafter denoted as nb 3),
- Restic utility.
You can see the test results below.
As you can see, we got a significant increase in performance, making our tool less demanding on RAM while maintaining its logic and ease of use.
In fact, we expected that we would encounter certain difficulties. It would have been foolish to think otherwise. But two problems caused us the most pain.
Memory leak or suboptimal algorithm
Back in the previous version of nxs-backup, we used our own implementation of file archiving. The logic behind that decision was to try to avoid using external utilities to create backups, and working with files was the easiest possible step.
In practice, the solution worked, although it was not very effective with a large number of files, as you could see from the tests. At the time we put that down to Python’s peculiarities and hoped to see a significant difference when we switched to Go.
When we finally got around to load testing the new version, we got depressing results. There was no performance gain, and memory consumption was even higher than before.
We were looking for a solution. We read a bunch of articles and studies on the subject, but they all said that using “filepath.Walk” and “filepath.WalkDir” was the best way to go. And the performance of these functions only improves as new versions of the language are released.
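For context, here is a simplified sketch of the walk-based approach: traverse the backup root with filepath.WalkDir and collect per-file metadata that a later run can compare against to decide what goes into an incremental copy. The root path and metadata format are illustrative, and at tens of millions of files this kind of in-memory index is exactly where the pressure builds up.

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
)

// collectMeta walks the backup root and records size and mtime of every
// regular file. An incremental copy can be built by comparing this map
// with the one saved on the previous run. Simplified sketch only.
func collectMeta(root string) (map[string]string, error) {
	meta := make(map[string]string)
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		meta[path] = fmt.Sprintf("%d:%d", info.Size(), info.ModTime().UnixNano())
		return nil
	})
	return meta, err
}

func main() {
	meta, err := collectMeta("/var/www/uploads")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("indexed %d files", len(meta))
}
```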
In an attempt to optimize memory consumption, we even made mistakes when creating incremental copies. True, the broken variants were actually more efficient. For obvious reasons, we didn’t use them.
In the end it all came down to the number of files we needed to process. We tested 10 million. The garbage collector just doesn’t seem to have enough time to clean up such a volume of generated variables.
As a result, realizing that we might sink too much time into this, we decided to abandon our implementation for the time being in favor of a time-tested and truly effective solution: GNU tar.
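In practice this means shelling out to tar and letting its snapshot mechanism track changes. Below is a minimal sketch of such an invocation from Go using GNU tar's --listed-incremental option; the paths are placeholders and this is not claimed to be the exact command nxs-backup runs.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Delegate incremental archiving to GNU tar. --listed-incremental
	// keeps a snapshot file with file metadata, so the next run archives
	// only what has changed since the previous one.
	cmd := exec.Command("tar",
		"--create", "--gzip",
		"--file", "/var/backups/files-2023-01-01.tar.gz",
		"--listed-incremental", "/var/backups/files.snar",
		"/var/www/uploads",
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("tar failed: %v\n%s", err, out)
	}
}
```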
Perhaps we will come back to the idea of our own implementation later, for example with the release of Go 1.20 where we can work directly with allocated memory space, or when we come up with a better solution for handling tens of millions of files.
Such a different FTP
Another problem surfaced with FTP. It turns out that different servers behave differently for the same requests.
And this is a really serious problem: for the same request you either get a normal response, or an error that seems to have nothing to do with your request, or no error at all when you expect one.
So, we had to give up the “prasad83/goftp” library in favor of the simpler “jlaffaye/ftp” library, because the former could not work correctly with the Selectel server. The error was that when connecting, the former tried to get a list of files in the working directory and got a permission error on the parent directory. With “jlaffaye/ftp” there is no such problem, because it is simpler and does not send any extra requests to the server on its own.
The next problem was connection breakage in the absence of requests. Not all servers behave this way, but some do. Therefore we had to check before each request whether the connection was still alive and reconnect if necessary.
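A minimal sketch of that keep-alive check with jlaffaye/ftp might look like this: send a NOOP before each operation and re-dial if it fails. The host and credentials are placeholders, and the real logic obviously needs retry and error handling beyond this.

```go
package main

import (
	"log"
	"time"

	"github.com/jlaffaye/ftp"
)

// ensureConn returns a live FTP connection, reconnecting if the server
// has silently dropped the idle session.
func ensureConn(conn *ftp.ServerConn) (*ftp.ServerConn, error) {
	if conn != nil && conn.NoOp() == nil {
		return conn, nil // the old connection is still alive
	}
	// Either there was no connection yet or the NOOP failed: reconnect.
	c, err := ftp.Dial("ftp.example.com:21", ftp.DialWithTimeout(10*time.Second))
	if err != nil {
		return nil, err
	}
	if err := c.Login("backup", "secret"); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	conn, err := ensureConn(nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Quit()
}
```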
The cherry on top was the problem of fetching files from the server, or rather fetching a non-existent file. Some servers return an error when you try to access such a file; others return a valid object implementing the io.Reader interface, which can even be read, only you get an empty byte slice.
All of these situations were discovered empirically and have to be handled on our side.
Conclusions
Most importantly, we fixed the problems of the old version, the things that were affecting engineers’ work and creating certain risks for business.
You can find the source code and file issues in our repository on GitHub.