
Finding the cause of slow redis on a virtual machine

  • Jan 31st 2022

In our work, we often have to deal with very different configurations of our clients' server infrastructure. Today I would like to share my experience of finding the root cause of redis-cluster failures, a search that was very time consuming but could have been completed much faster. Perhaps our experience will save some beginner administrators time when solving similar problems.

Problem statement

There are 3 servers in a local network with the following set of resources:

  • 8 Intel® Xeon® E3-1270 v5 CPUs @ 3.60GHz
  • 64 GB DDR3 RAM 1600MHz
  • SSD 447 GB with /var/lib/redis partition

Each of them (two physical servers and one virtual machine) runs two redis instances: a master and a replica. Read requests can be handled by either instance; write requests only by the master. A graphical representation of the configuration is shown below:

While working on one of the tasks, we noticed an incorrect distribution of redis instances across the servers:

One of the master-replica pairs had swapped roles. To restore the intended layout, we performed a manual failover and brought the cluster back to its original state.
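
For reference, this kind of manual failover is triggered from the replica that should be promoted back to master. A minimal sketch, assuming the instance on 10.0.0.3 listens on port 6381 as in the logs below:

# Run against the replica that should become the master again
redis-cli -h 10.0.0.3 -p 6381 CLUSTER FAILOVER

# Verify the resulting cluster layout
redis-cli -h 10.0.0.3 -p 6381 CLUSTER NODES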

And everything would have been fine, except that every day, after the cron job started at 1 am, the master on server 10.0.0.3 moved to server 10.0.0.1 again.

Master Log

# Connection with slave client id #123 lost. 
# Failover auth denied to 10.0.0.1:6382 its master is up
# Configuration change detected. Reconfiguring myself as a replica of 10.0.0.1:6382

Replica Log

FAIL message received from 10.0.0.2:6381 about 10.0.0.3:6381 
# Cluster state changed: fail 
# Failover auth granted to 10.0.0.1:6382 for epoch 799 
# Cluster state changed: ok
Clear FAIL state for node 10.0.0.3:6381: master without slots is reachable again. 
Marking node 10.0.0.3:6381 as failing (quorum reached). 
Clear FAIL state for node 10.0.0.3:6381: slave is reachable again.

So, when the cron job starts, data is written to the master, the master stops responding and the quorum kicks it out; then it appears on the radar again, but rejoins the cluster as a replica.
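
How quickly the other nodes give up on an unresponsive master is governed by the cluster-node-timeout setting, so its value is worth checking. A minimal sketch, assuming the same address and port as above:

# Maximum time (in milliseconds) a node may be unreachable before the cluster marks it as failing
redis-cli -h 10.0.0.3 -p 6381 CONFIG GET cluster-node-timeout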

Restoring order in the cluster succeeds, and the master works as a master again until the next cron run.

Finding the cause of the cluster configuration failure

Why does a cron job that runs on all the servers with the same configuration take a long time on one server and cause no tangible problems on the others? And every time the layout of masters and replicas changed, it was the same single master-replica pair that was affected.

Resources? Could the 10.0.0.3 server have worse resources than the others? No: the server configurations are identical; the virtual server has 10% less RAM, but it is not fully utilized.
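
As a quick sanity check, overall and per-instance memory usage can be compared like this (a sketch; the address and port are assumptions):

# Overall memory usage on the host
free -h

# Memory actually used by one redis instance
redis-cli -h 10.0.0.3 -p 6381 INFO memory | grep used_memory_human
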
Network issues? Perhaps there are networking problems between the servers? No, there are none: throughput is the same between all the servers:

 # iperf -c 10.0.0.3 
------------------------------------------------------------
Client connecting to 10.0.0.3, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.2 port 47646 connected with 10.0.0.3 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.10 GBytes 946 Mbits/sec

Is there a problem with the drives? Perhaps writing or reading takes longer? Reading on server 10.0.0.3 really is slower:

Server-s1# hdparm -Tt /dev/sda 
/dev/sda:
 Timing cached reads:   33894 MB in  2.00 seconds = 16967.92 MB/sec
 Timing buffered disk reads: 538 MB in  3.00 seconds = 179.22 MB/sec

Server-v3# hdparm -Tt /dev/xvda5
/dev/xvda5:
 Timing cached reads:   7352 MB in  1.99 seconds = 3700.67 MB/sec
 Timing buffered disk reads: 186 MB in  3.00 seconds =  61.98 MB/sec 
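
The hdparm figures can be cross-checked with dd using direct I/O, which bypasses the page cache and so approximates the buffered read number. The device name below is taken from the output above:

# Read 1 GB directly from the device, bypassing the page cache
dd if=/dev/xvda5 of=/dev/null bs=1M count=1024 iflag=direct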

At this point we were running out of ideas. The redis-benchmark results clearly show that redis on server 10.0.0.3 is simply 8-10 times slower across all metrics.

Redis-benchmark on a physical machine:

# redis-benchmark -h 10.0.0.2 -p 6381 -q 
PING_INLINE: 223214.28 requests per second
PING_BULK: 237529.69 requests per second
SET: 222717.16 requests per second
GET: 237529.69 requests per second
INCR: 227272.73 requests per second
LPUSH: 213219.61 requests per second
LPOP: 241545.89 requests per second
SADD: 234741.78 requests per second
SPOP: 225225.22 requests per second
LPUSH (needed to benchmark LRANGE): 218340.61 requests per second
LRANGE_100 (first 100 elements): 242718.45 requests per second
LRANGE_300 (first 300 elements): 242718.45 requests per second
LRANGE_500 (first 450 elements): 230414.75 requests per second
LRANGE_600 (first 600 elements): 222222.23 requests per second
MSET (10 keys): 177304.97 requests per second

Redis-benchmark on a virtual machine:

# redis-benchmark -h 10.0.0.3 -p 6381 -q 
PING_INLINE: 25542.79 requests per second
PING_BULK: 27434.84 requests per second
SET: 30797.66 requests per second
GET: 26673.78 requests per second
INCR: 31113.88 requests per second
LPUSH: 32133.68 requests per second
LPOP: 32331.07 requests per second
SADD: 29594.55 requests per second
SPOP: 30826.14 requests per second
LPUSH (needed to benchmark LRANGE): 30988.54 requests per second
LRANGE_100 (first 100 elements): 26413.10 requests per second
LRANGE_300 (first 300 elements): 27100.27 requests per second
LRANGE_500 (first 450 elements): 22706.63 requests per second
LRANGE_600 (first 600 elements): 29429.08 requests per second
MSET (10 keys): 23468.67 requests per second

Okay, maybe the degradation is caused by the slower read speed measured above. But by a factor of 10? By this point, all the links on the first pages of Google results had already been checked several times. Then, on an ObjectRocket page, I came across a mention that the hypervisor may be to blame, along with the practical advice not to run redis on virtual machines under Xen:

Probably the two most common operational choices which cause Redis to slow down is to 1) put it on a virtual machine – especially a Xen hypervisor based one and 2) heavy disk persistence.

So, what kind of hypervisor do we have on the virtual machine?

# virt-what
xen
xen-hvm
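
The same can be double-checked without virt-what, for example (a sketch; both commands are commonly available on Xen guests):

# systemd's own detection of the virtualization technology
systemd-detect-virt

# On Xen guests the hypervisor type is also exposed via sysfs
cat /sys/hypervisor/type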

And it was only at this point that I discovered the page in the redis documentation explaining that redis relies on the fork operation:

Latency generated by fork
In order to generate the RDB file in background, or to rewrite the Append Only File if AOF persistence is enabled, Redis has to fork background processes. The fork operation (running in the main thread) can induce latency by itself.
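
A quick way to see how expensive fork() is on a particular instance is the latest_fork_usec counter that redis reports itself (a sketch; the address and port are assumptions):

# Duration of the most recent fork() in microseconds
redis-cli -h 10.0.0.3 -p 6381 INFO stats | grep latest_fork_usec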

And the twist of the story is that under Xen this operation is slow:

Fork time in different systems
Modern hardware is pretty fast at copying the page table, but Xen is not. The problem with Xen is not virtualization-specific, but Xen-specific.

The developers even tell Amazon EC2 users which instance types to use to avoid this problem:

For EC2 users, make sure you use HVM based modern EC2 instances, like m3.medium. Otherwise fork() is too slow.
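
It is also worth checking how often a given instance actually forks: both background RDB snapshots and AOF rewrites go through fork() and are driven by the persistence settings (a sketch; the address and port are assumptions):

# RDB snapshot schedule and AOF state - both trigger a background fork()
redis-cli -h 10.0.0.3 -p 6381 CONFIG GET save
redis-cli -h 10.0.0.3 -p 6381 CONFIG GET appendonly

# Whether a background save or AOF rewrite is running right now, and how the last one ended
redis-cli -h 10.0.0.3 -p 6381 INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status|aof_rewrite_in_progress'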

Summary

Analyzing this problem, which was discovered by chance, turned out to be quite a long journey. I will briefly outline the stages we had to go through to solve it:

  • Confirming that the described problem recurred every 24 hours
  • Verifying that a detailed analysis of its causes was warranted
  • Reviewing the backlog of related tasks on the project, on which a lot of time had been spent without finding a solution
  • Comparing the available and used resources on the three machines
  • Ruling out network problems
  • Checking disk health and speed
  • Analyzing the redis-benchmark data
  • A costly search through public sources for the causes of this and similar problems
  • A detailed review of the documentation

Conclusions

There can be many causes of slow server performance. When looking for them, do not forget to check the official documentation first, especially if it is well written. It may not only describe your problem in detail, but also offer options for solving it.

You can find more of our projects on our GitHub page.