Aug 30, 2011

Practical Multi-Master Replication using Shard Filters

Earlier this month I published an article on this blog describing the system of record approach to multi-master replication.  As mentioned in that article my colleagues and I at Continuent have been working on improving Tungsten to make system of record design patterns easier to implement.  This article describes how to set up system of record using Tungsten Replicator shard filters, which are a new feature in Tungsten 2.0.4.  By doing so we will create a multi-master configuration that avoids replication loops and transaction conflicts.  On top of that, it is quite easy to set up.

There are many possible system of record patterns depending on how many schemas are shared and across how many masters.  The following diagram shows three of them.  Unlike many so-called MySQL multi-master implementations, all masters are live and accept updates.  (Schemes such as MySQL-MHA, by contrast, make the extra masters read-only.  Don't be fooled!)

For today's exercise we will implement the basic system of record.  Once you understand this you can quickly set up other multi-master scenarios.  

Defining Shard Master Locations

The first step is to tell Tungsten where each shard is mastered.  By mastered we mean that one master receives application updates on that shard, whereas all other masters have copies only or may not even contain the shard at all.  Tungsten uses a variant of CSV (comma-separated values) format where the first line contains column names.  You can have any amount of whitespace between entries.  Create a file called shard.map with your favorite editor and type in the following lines.

shard_id     master  critical
tungsten_nyc nyc     false
tungsten_sjc sjc     false
acme         sjc     false
pinnacle     nyc     false

The first column is the name of the shard.  This must be unique, because a shard can live on only one master.  The next column is the "home" master for the shard.  This is the one and only master that should receive updates for that shard.  The third column defines whether the shard is critical and requires full serialization.  It will be linked to parallel replication in a later release.

It turns out you do not need to add entries for Tungsten catalog schemas such as tungsten_nyc.  Tungsten Replicator will create them automatically.  They are shown here for completeness only.  

Creating Replication Services

Next we need to define services to replicate bi-directionally between DBMS servers and set options to filter shards using the ShardFilter class, which is new in Tungsten 2.0.4.  The shard filter helps ensure that shards replicate from their home masters only and not from other locations.   If you do not know what replication services are, you can find a description of them in this article.  

Multi-master replication is easy to mis-configure, so to prevent accidents we will tell the shard filter to generate an error any time it processes a shard it has never seen before.  The replication service will immediately fail, which signals that we have to update shard definitions.  This is the safest way to implement system of record or any multi-master configuration for that matter.  It is generally easier to restart replication after correcting the configuration than to mix up data, which can lead to major outages.  

The first step is to set up replication services for each master.  These read the binlog and make transactions available to slave replication services.  Here are the commands.  Note that the sjc master is on host logos1, while the nyc master is on logos2.  The remaining examples use these names consistently.

# Define common master settings. 
COMMON_MASTER_OPTS="--datasource-user=tungsten --datasource-password=secret \
 --home-directory=/opt/continuent --svc-parallelization-type=disk \
 --svc-extractor-filters=shardfilter \
 --property=replicator.filter.shardfilter.unknownShardPolicy=error"

# Set up sjc master. 
tools/tungsten-installer --master-slave -a --master-host=logos1 \
--cluster-hosts=logos1 --service-name=sjc $COMMON_MASTER_OPTS --start-and-report

# Set up nyc master. 
tools/tungsten-installer --master-slave -a --master-host=logos2 \
--cluster-hosts=logos2 --service-name=nyc $COMMON_MASTER_OPTS --start-and-report 

The --svc-extractor-filters option adds shard filtering immediately after event extraction.  The unknownShardPolicy=error setting will cause the masters to die if they process an undefined shard.  
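
If you want to double-check that these options took effect, you can inspect the generated replicator configuration.  The following is only a sketch: the path assumes the default /opt/continuent layout used above and the usual static-<service>.properties naming, so adjust it to your installation.

# Confirm the extractor-side shard filter settings for the sjc master
# (path and file name are assumptions based on the default layout).
grep shardfilter /opt/continuent/tungsten/tungsten-replicator/conf/static-sjc.properties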

Now we can define the remote slave services for sjc and nyc.  These are special slave services that apply transactions to another master, rather than to an ordinary slave.  We would like the slave services to error out on unknown shards as well.  Also, and this is important, we want them to enforce shard homes.  Here are the commands to create the services and start each one.

COMMON_SLAVE_OPTS="--release-directory=/opt/continuent/tungsten \
--service-type=remote --allow-bidi-unsafe=true --svc-parallelization-type=disk \
--svc-applier-filters=shardfilter \
--property=replicator.filter.shardfilter.unknownShardPolicy=error \
--property=replicator.filter.shardfilter.enforceHome=true"

# Set up sjc remote slave. 
tools/configure-service -C -a --host=logos2 \
--local-service-name=nyc --role=slave \
--datasource=logos2 --master-host=logos1 $COMMON_SLAVE_OPTS sjc
trepctl -host logos2 -service sjc start

# Set up nyc remote slave. 
tools/configure-service -C -a --host=logos1 \
--local-service-name=sjc --role=slave \
--datasource=logos1 --master-host=logos2 $COMMON_SLAVE_OPTS nyc
trepctl -host logos1 -service nyc start

The --svc-applier-filters option adds shard filtering before applying to the DBMS.  The unknownShardPolicy=error setting will cause the slaves to die if they process an undefined shard.  Finally, the enforceHome=true option means that each remote slave will drop any transaction for a shard whose home is not the service it replicates from.

At the end of this procedure, your services should be online and ready to run.  Use 'trepctl services' to make sure.
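
For example, with the host names used above, the following commands should show every replication service in the ONLINE state:

trepctl -host logos1 services
trepctl -host logos2 services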

Loading Shard Definitions

To make shard definitions take effect, you must load the shard.map contents into each replication service.  You can do this any time the replicator is running, but after loading new definitions you must put the replicator online again.  Here are the commands to load the shard map onto each of the four replication services.  For each replication service, you must delete the old definitions, load the new ones, and get the replicator to go online again.

#!/bin/bash
# Reload shard definitions on all four replication services.
MAP=shard.map
for host in logos1 logos2
do
  for service in sjc nyc
  do
    # Replace the existing shard definitions with the contents of shard.map,
    # then cycle the service so the new definitions take effect.
    trepctl -host $host -service $service shard -deleteAll
    trepctl -host $host -service $service shard -insert < $MAP
    trepctl -host $host -service $service offline
    trepctl -host $host -service $service wait -state OFFLINE
    trepctl -host $host -service $service online
  done
done

This looks a little clunky; a later release will reduce the five commands to one.  I put them in a script to make them quicker to run.  The good news is that there is just one shard map that works for all replication services, regardless of location or role.
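
For instance, if you save the loop above in a file called load-shards.sh (the file name is just an illustration), reloading every service is a single command:

# Run the reload loop shown above (file name is hypothetical).
bash load-shards.sh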

Once you finish this step, you can go to any replication service and list the shards it knows about.  Let's pick a service and demonstrate: 

$ trepctl -host logos1 -service sjc shard -list
shard_id master critical
tungsten_nyc nyc false
tungsten_sjc sjc false
acme sjc false
pinnacle nyc false

With this we are ready to start processing some transactions. 

Multi-Master Operation

At this point we have multi-master replication enabled between hosts logos1 and logos2.  You can try it out.  Let's add the acme database to the sjc master on logos1 as an example. 

mysql -utungsten -psecret -hlogos1 
mysql> create database acme;
mysql> use acme
mysql> create table foo (id int);
mysql> insert into foo values(1);

We can verify that all of these changes replicated to the logos2 server with the following command:

mysql -utungsten -psecret -hlogos2 -e 'select * from acme.foo'
+------+
| id   |
+------+
|    1 | 
+------+

That seems pretty simple.  In fact it is.  You can go over to logos2 and enter transactions for pinnacle in the same way.  Data replicate back and forth.  There are no replication loops.  There are also no conflicts.  
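
For instance, the mirror image of the acme example looks like this (the table name and values are purely illustrative):

mysql -utungsten -psecret -hlogos2
mysql> create database pinnacle;
mysql> use pinnacle
mysql> create table bar (id int);
mysql> insert into bar values(1);

mysql -utungsten -psecret -hlogos1 -e 'select * from pinnacle.bar'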

Adding a New Shard

So what happens when we add a new shard?  The simplest way to find out is to create a database using a schema name that does not exist in the shard map.  Let's try to create a database named superior on the nyc master.

mysql -utungsten -psecret -hlogos2 -e 'create database superior'

Now check the status of the nyc master replication service.  We see it has failed with an error due to the unknown shard.   (Tungsten parses the create database command and assigns it the shard ID "superior.") 

$ trepctl -host logos2 -service nyc status
Processing status command...
NAME                     VALUE
----                     -----
...
pendingError           : Stage task failed: binlog-to-q
pendingErrorCode       : NONE
pendingErrorEventId    : mysql-bin.000157:0000000000002475;1287
pendingErrorSeqno      : 8
pendingExceptionMessage: Rejected event from unknown shard: seqno=8 shard ID=superior
...
state                  : OFFLINE:ERROR...
Finished status command...

This problem is quite easy to fix.  We just open up the shard.map file and add a row for superior so that the file contents look like the following: 

shard_id     master  critical
tungsten_nyc nyc     false
tungsten_sjc sjc     false
acme         sjc     false
pinnacle     nyc     false
superior     nyc     false

Reload the shard.map file as shown previously and you will be back in business.  Incidentally, if you do not want the superior database to be replicated to other masters, you can also specify this in the rules.  Just give superior the special master name #LOCAL as in the following example and it will not replicate outside the nyc service. 

superior        #LOCAL  false

In fact, #LOCAL means that any schema named superior will not replicate outside the service in which it is defined.  You can have an unshared schema named superior on every master.  
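
Coming back to the reload step: since the failed nyc service is already offline because of the error, a minimal sketch of getting it running again looks like this (the other services can be refreshed with the loop shown earlier):

# Reload the updated shard map on the failed nyc master and bring it online.
trepctl -host logos2 -service nyc shard -deleteAll
trepctl -host logos2 -service nyc shard -insert < shard.map
trepctl -host logos2 -service nyc online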

Where to Next?

The shard support described in this article is now part of Tungsten 2.0.4 and will appear in the official build when it is finally ready.  You can try it out right now using one of our handy nightly builds.  

We plan to build out shard filtering quite a bit from the current base.  One immediate fix is to put in a check so that if an application commits shard updates on the wrong DBMS server, the master replication service on that server will detect it and fail.  This will tell you there's a problem immediately rather than letting you wallow in blissful ignorance while your data become hopelessly mixed up.  We will also simplify the commands to update shards while replicators are online. 

Longer term we will be adding features to propagate shard definitions through replication itself.  Stay tuned for more work in this area.  If you want to help fund work to enable your own applications, please get in contact with me at Continuent.  I can think of at least a dozen ways to make our multi-master support better but it's always nicer to spend the effort on features that enable real systems.  In the meantime, I hope you find multi-master with shard filtering useful and look forward to your feedback.
