Reliably Monitoring MySQL Replication

Replication is a wonderful thing for your clients. Having a 'hot spare' of their database(s) for redundancy, or being able to off-load read operations from the main database to increase performance, giving your client peace-of-mind about their data and application. I won't go into setting up MySQL Replication; there are more than a few guides on that already out there (here's the official documentation). Once you do have Replication running, you need to make sure that it remains running, reliably, all the time. How best to accomplish this?

The Way Monitoring Had Been

The typical method is to use SLAVE STATUS to look at information about the setup.

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: stg04-cf.copperfroghosting.net
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000006
          Read_Master_Log_Pos: 106
               Relay_Log_File: stg06-cf-relay-bin.000002
                Relay_Log_Pos: 251
        Relay_Master_Log_File: mysql-bin.000006
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 106
              Relay_Log_Space: 409
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
1 row in set (0.00 sec)

There are a few key pieces of information provided here.

  • Slave_IO_Running tells us if the Slave is able to connect to the Master.
  • Slave_SQL_Running indicates if data received from the Master is being processed.
  • Last_IO_Errno, Last_IO_Error, Last_SQL_Errno, Last_SQL_Error, are all pretty much what they say, the last error number and error from the IO or SQL threads
  • Seconds_Behind_Master shows difference between last timestamp read in the binlogs and current time. This is important to understand. It does not report directly the delay between when information is updated/inserted on the master and recorded on the slave. A slow network can cause an artificially inflated number, as well as long running queries or blocking/locking operations.

Until recently, we had been relying on Seconds_Behind_Master to tell us if Replication was working, and if it was behind the Master by any appreciable level. And, of course, we found ourselves in a perfect storm situation where Replication had silently failed. Data was being sent over to the Slave, was being read by the IO thread, but even though the SQL thread was reporting no errors, the data was not inserted into the Slave. Due to the binlogs being read, Seconds_Behind_Master was reporting 0.

Solving The Problem

So how do we solve the problem presented by Seconds_Behind_Master not being 100% reliable? By relying on the replication itself to give us the data. The Percona Toolkit has a couple of very handy scripts for this very purpose. On the Master, we now use pt-heartbeat to insert data.

mysql> select * from heartbeat\G
*************************** 1. row ***************************
                   ts: 2013-12-13T14:50:06.001550
            server_id: 1
                 file: mysql-bin.000009
             position: 1639186
relay_master_log_file: NULL
  exec_master_log_pos: NULL
1 row in set (0.00 sec)

The relay_master_log_file and relay_master_log_file will be NULL if you are using row based replication.

pt-heartbeat will update this row at an interval you specify, down to 0.1 seconds. Then, we read this same data from the Slave, and can calculate the actual time delay from when the data was inserted to when it became available to read. This can be accomplished just using pt-heartbeat with it's --monitor or --check switches, or, Percona also provides a set of Nagios plugins for monitoring MySQL. Metal Toad employs the pmp-check-mysql-replication-delay plugin. We also considered using pmp-check-pt-table-checksum to verify the integrity of a few selected tables, but you must use statement based replication, which is not possible for the applications we host.

So, don't rely on MySQL's own data! Use Replication to verify Replication is still working, and have a system in place to notify you the moment there is an issue.

If you want to test all the pieces, pt-slave-delay can be used to artificially force the Slave to lag behind the master.

Add new comment

Restricted HTML

  • Allowed HTML tags: <a href hreflang> <em> <strong> <cite> <blockquote cite> <code> <ul type> <ol start type> <li> <dl> <dt> <dd> <h2 id> <h3 id> <h4 id> <h5 id> <h6 id>
  • You can enable syntax highlighting of source code with the following tags: <code>, <blockcode>, <cpp>, <java>, <php>. The supported tag styles are: <foo>, [foo].
  • Web page addresses and email addresses turn into links automatically.
  • Lines and paragraphs break automatically.
By submitting this form, you accept the Mollom privacy policy.

About the Author