Reliably Monitoring MySQL Replication
Replication is a wonderful thing for your clients. Having a 'hot spare' of their database(s) for redundancy, or being able to off-load read operations from the main database to increase performance, giving your client peace-of-mind about their data and application. I won't go into setting up MySQL Replication; there are more than a few guides on that already out there (here's the official documentation). Once you do have Replication running, you need to make sure that it remains running, reliably, all the time. How best to accomplish this?
The Way Monitoring Had Been
The typical method is to use SLAVE STATUS to look at information about the setup.
mysql> SHOW SLAVE STATUS\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: stg04-cf.copperfroghosting.net Master_User: root Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000006 Read_Master_Log_Pos: 106 Relay_Log_File: stg06-cf-relay-bin.000002 Relay_Log_Pos: 251 Relay_Master_Log_File: mysql-bin.000006 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 106 Relay_Log_Space: 409 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: 1 row in set (0.00 sec)
There are a few key pieces of information provided here.
Slave_IO_Runningtells us if the Slave is able to connect to the Master.
Slave_SQL_Runningindicates if data received from the Master is being processed.
Last_SQL_Error, are all pretty much what they say, the last error number and error from the IO or SQL threads
Seconds_Behind_Mastershows difference between last timestamp read in the binlogs and current time. This is important to understand. It does not report directly the delay between when information is updated/inserted on the master and recorded on the slave. A slow network can cause an artificially inflated number, as well as long running queries or blocking/locking operations.
Until recently, we had been relying on
Seconds_Behind_Master to tell us if Replication was working, and if it was behind the Master by any appreciable level. And, of course, we found ourselves in a perfect storm situation where Replication had silently failed. Data was being sent over to the Slave, was being read by the IO thread, but even though the SQL thread was reporting no errors, the data was not inserted into the Slave. Due to the binlogs being read,
Seconds_Behind_Master was reporting 0.
Solving The Problem
So how do we solve the problem presented by
Seconds_Behind_Master not being 100% reliable? By relying on the replication itself to give us the data. The Percona Toolkit has a couple of very handy scripts for this very purpose. On the Master, we now use pt-heartbeat to insert data.
mysql> select * from heartbeat\G *************************** 1. row *************************** ts: 2013-12-13T14:50:06.001550 server_id: 1 file: mysql-bin.000009 position: 1639186 relay_master_log_file: NULL exec_master_log_pos: NULL 1 row in set (0.00 sec)
relay_master_log_file will be NULL if you are using row based replication.
pt-heartbeat will update this row at an interval you specify, down to 0.1 seconds. Then, we read this same data from the Slave, and can calculate the actual time delay from when the data was inserted to when it became available to read. This can be accomplished just using pt-heartbeat with it's --monitor or --check switches, or, Percona also provides a set of Nagios plugins for monitoring MySQL. Metal Toad employs the pmp-check-mysql-replication-delay plugin. We also considered using pmp-check-pt-table-checksum to verify the integrity of a few selected tables, but you must use statement based replication, which is not possible for the applications we host.
So, don't rely on MySQL's own data! Use Replication to verify Replication is still working, and have a system in place to notify you the moment there is an issue.
If you want to test all the pieces, pt-slave-delay can be used to artificially force the Slave to lag behind the master.