Thursday 31 May 2007

On MySQL Cluster Replication

Erik Hoekstra, from Daisycon, has pointed out this problem related to Replication in general and with a specific example on MySQL Cluster and Replication in 5.1.

In the manual for 5.1 there's an entry about scripting the failover for MySQL Cluster Replication.

In this part of the manual they speak about fetching the needed variables, like the filename and position, and place them into the CHANGE MASTER TO statement.

The example is here:


CHANGE MASTER TO
MASTER_LOG_FILE='@file',
MASTER_LOG_POS=@pos;


I'm now trying to do the following:
On a slave I've created a Federated tables, 1 pointing to the current master, and 1
to the stand-in master, should the current master fail.

Federated table 1, let's say F1, is just a simple table on the current master with only one record.

Federated table 2, let's say F2, is the ndb_binlog_index table in the mysql database on the stand-in master.

I wrote a procedure which checks if F1 exists and if not (current master down) it selects the needed vars from F2. I would like to place these variables into a CHANGE MASTER TO statement so the slave can start and pickup at the right epoch, but it seems that the CHANGE MASTER TO statement doesn't accept variables.


DELIMITER //

CREATE PROCEDURE check_master()

BEGIN
  DECLARE master_dead BOOLEAN DEFAULT false;
  DECLARE latest BIGINT(20) DEFAULT 0;
  DECLARE fil VARCHAR(255);
  DECLARE pos BIGINT(20) DEFAULT 0;

  DECLARE pointer CURSOR FOR
  SELECT `id` FROM `replication`.`master_alive`;
  DECLARE CONTINUE HANDLER FOR
          SQLSTATE 'HY000' SET master_dead = true;

  OPEN pointer;
  IF( master_dead ) THEN

    STOP SLAVE;

    SELECT @latest:=MAX(`epoch`)
      FROM `mysql`.`ndb_apply_status`;

    SELECT @file:=SUBSTRING_INDEX(File, '/', -1), @pos:=Position
      FROM `replication`.`new_master_binlog_index`
     WHERE `epoch` > @latest
     ORDER BY `epoch` ASC LIMIT 1;

    CHANGE MASTER TO
           MASTER_HOST='192.168.0.1',
           MASTER_PORT=3306,
           MASTER_USER='replica',
           MASTER_PASSWORD='mypasswd',
           MASTER_LOG_FILE='@file',
           MASTER_LOG_POS=@pos;
    START SLAVE;
    ALTER EVENT `check_master_alive` DISABLE;
  END IF;
  CLOSE pointer;
END;
//
DELIMITER ;

CREATE EVENT check_master_alive
    ON SCHEDULE EVERY 5 SECOND
       ENABLE DO CALL check_master();


This is the error message:
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'CHANGE MASTER TO MASTER_HOST='10.0.33.104', MASTER_PORT=3306, MASTER_USER='repli' at line 15

I also tried the procedure with SELECT .. INTO but it doesn't seem to work at all.


I have submitted a bug - #28270 - and our tech support is working on it.

In the meantime, Erik has found a workaround to the problem, so he can implement the script. The workaround is based on the PREPARE/EXECUTE statement:

CREATE PROCEDURE check_master()
BEGIN
  DECLARE master_dead BOOLEAN DEFAULT false;
  DECLARE pointer CURSOR FOR

  SELECT `ip` FROM `replication`.`master_alive`;

  DECLARE CONTINUE HANDLER FOR
          SQLSTATE 'HY000' SET master_dead = true;

  OPEN pointer;
  IF( master_dead ) THEN

    STOP SLAVE;

    SELECT @latest:=MAX(`epoch`)
      FROM `mysql`.`ndb_apply_status`;

    SELECT @file:=SUBSTRING_INDEX(File, '/', -1), @pos:=Position
      FROM `replication`.`new_master_binlog_index`
     WHERE `epoch` > @latest
     ORDER BY `epoch` ASC LIMIT 1;

    SET @new_master_command = CONCAT( 'CHANGE MASTER TO
                 MASTER_HOST=\'192.168.0.1\',
                 MASTER_PORT=3306,
                 MASTER_USER=\'replication_user\',
                 MASTER_PASSWORD=\'mypasswd\',                  MASTER_LOG_FILE=\'',@file,'\',
                 MASTER_LOG_POS=',@pos,';' );

    PREPARE nmc FROM @new_master_command;
    EXECUTE nmc;
    DEALLOCATE PREPARE nmc;

    START SLAVE;

    ALTER EVENT `check_master_alive` DISABLE;
  END IF;
END;


In this way, the server accepts the CHANGE MASTER command.

Thanks to Erik for this great workaround, in the meantime we will check that the bug will be fixed.

Tuesday 29 May 2007

Q&A Webinar Part 4 - MySQL Cluster

Q from Olivier - Are the data nodes MySQL servers too ?
No, the data node is handled by a separate process, ndbd, that only manages data.

Q from Olivier: So, what is MySQL Cluster? A MySQL AB product ?
Technically speaking, MySQL Cluster is a storage engine, based on a network distributed database. From a commercial point of view, we refer at MySQL Cluster as a product. We also provide APIs to access to the Cluster database directly, bypassing the MySQL Server and the storage engine architecture.

Q from Ludovico: Are there load balancing mechanisms to balance sql requests between active sql server nodes?
Yes, absolutely. The SQL nodes all see the same data, and load may well be balanced between them, if necessary.
The only caveat is that transactions must be locked into the same SQL server for the duration of the transaction.
In real life, load balancing can be achieved through hardware and software load balancers, or using the capabilities provided by some connectors such as JDBC, or just handling the load balancing at http level to the web servers, assuming that on average they will generate pretty much the same workload.

Q from Ludovico: What about embedded solutions, HW load balancers or round robin dns?
Any of them should work. Keep in mind the transaction caveat though. The load balancer must be activated when the client must initiate the connection, then the transaction must be related to the same connection.

Q from Roberto: Using the API and the native access, do we lose whole or part of the HA features ?
No, not at all, Cluster, even without the SQL nodes, is fully reduntant. The HA features are at the Cluster level, below the SQL nodes.

Q from Vasilij: Does Linux support MySQL Cluster?
Absolutely, various Linux and Unices are fully supported

Q from Manuel: Is the management server is a Single Point of Failure?
No, it's not. First of all, you may have as many as you want. Secondly, it is important that the management server is on when a new node is added to the cluster or in case of failure in a small cluster with only two data nodes. During normal operations, it could be switched off.
Typically, the management node is a lightweight process and it can be colocated with other servers.

Q from Manuel: You said that the minimum number of nodes is 3. Can't we have only two nodes and the management server running on both nodes and so you have absolute redundancy with only 2 boxes?
No, having 2 nodes is not a good idea, as that might cause a split-brain situation. If you implement such configuration, MySQL Cluster will issue a warning

Q from Olivier: Where are the data stored? On MySQL servers? On the NDBD nodes ? In memory or on disk ?
Data is In the data nodes (NDB nodes) and data is in memory. With version 5.1, data can be optionally on disk, although indexes must still reside in memory.

Q from Manuel: May the management servers run on the same machine where the nodes or the SQL Servers run?
Yes, they can. There is no need for an extra physical server, provided the cluster has at least 3 physical servers.

Q from Colin: What is the simplest cluster config if all access is via C++ API?
3 nodes. 2 datanodes, one for mgm, which may be colocated with the application node, for example.

Q from John: All the examples of hostnames in the cinfig file are ipv4 addresses - does cluster support ipv6?
No, not at the moment.

Q from John: If mysql nodes and data nodes are on the same boxes can they share IP's?
Sure, they can share the same address, as they respond to different ports

Q from Manuel: What is a split-brain and how does it happen?
A split brain is a particular situation that may happen in some HA infrastructures. It may happen that in an HA solution with only two servers, in case of network fault between the servers, the two nodes cannot talk each other, but they are both active and can provide the service to their clients. In this case we have a split brain situation. MySQL cluster does not allow a split brain, since it requires a third node as arbitrator.

Q from Danilo: Is is possible to have multiple INSERT coming from different MySQL Server that require auto_increment fields?
Yes, that is absolutely possible and fully supported.

Q from Olivier: Is all the datas on every node? What happens, for example, if I have 4 nodes with 20GB of memory for each server?
First of all, you have to specify the exact number of replicas that you want. You can have only one copy of data (NoOfReplica=1), but in this scenario the system will not be redundant. You can have up to 4 replicas. WIth a classic replica of 2, you will have only two copies of the same data. In the example that you have mentioned, the total memory available will be 4 (nodes) x 20 (memory) / 2 (replica) = 40 GB in total.

Q Danilo: How many data nodes are supported?
The current limit is 48.

Q from Domenico: Can MySql Server and Data Node reside on the same physical server?
Yes, they can. For better scalability and performance, you should consider separate nodes though.

Q from Colin: If the data is not on disk which safety/backup policy is advised in case of total system failure?
There is a checkpointing and redo logging system that assure that the data are safely stored on disk. You can also used the online backup provided for disaster recovery.

Q from Ludovico: What happens if a power shortage happens and all my data nodes are turned off?
In this case you have to recover from disk up to the latest checkpoint and then you can apply everything up to the latest position in the redo log. Bear in mind you should configure the whole infrastructure in a way that is suitable for high availability.

Q from Hilmar: So in the worst case scenario, when a checkpoint was not fired after an updated, you loose power, then cluster recovers from loss of power and after the restart of the servers it has ' rolledback' to its previous checkpoint, it that correct?
First it recovers up to the latest checkpoint, then it applies the redo log.

Q from David: Does the Transaction Coordinator ensure referential integrity?
No, it does not. Foreign keys are not supported in the current release and not in 5.1

Q from Guy: Is the whole database held in memory or just the tables in use?
The whole database is in-memory. That is the attribute of an In-Memory database, to achieve deterministic performance.

Q from Ludovico: Il the FK support at SQL level coming also for NDB?
YeS, as far as the SQL interface is used.

Q from Colin: Is the FK support planned for all storage engines in future versions?
We are working on this. We are still unsure about global support for all the storage engines, but this is definitely one of the possible solutions.

Q from Manuel: If you have 2 management servers on different machines and you lose network connection between them, don't you have a split-brain situation?
Not necessarily. As long as there is an odd # of arbitration capable nodes, you don't.

Q from Jonas: Is there any estimate in time on when the 'next' MySQL version, with FK at SQL level will be available?
Not at the moment, but this is one of the hottest topics within MySQL.

Q from Manuel: Can you explain or give a URL where the use of redundant management servers is explained?
In the reference manual for version 5.0 - section 15.4.4.2 - there are examples of connection strings with multiple management nodes

Q from Guy: What happens if the DB drows significantly in a short period therefore size increases above the memory limit?
The size is fixed and preallocated in this point in time. When available memory is full it's full. 5.1 will be able to have non-indexed data on disk, which will change this.

Q from Danilo: When will you plan to introduce multiple index use in mysql select?
It is in there already, to an extent. I assume you ask about multiple indexes per table, and again, that is to an extent already implemented. The usefulness of this depends on the stoarge engine though. Falcon will support this in a very effective manner.

Q from Hilmar: How do I know for sure that a certain transaction is sent to disk asap? is possible to trigger a checkpoint using an SQL statement?
You can setup the system in order to have continuous checkpoints, although this is not the answer to the problem. The checkpoint will not be related to a single commit in any case. MySQL Cluster architecture has been designed to be an in-memory database, therefore the infrastructure underneath must be able to support the architecture. Bear in mind that MySQL cluster achieves HA through redundancy; since you can have up to 4 replicas, in the worst case scenario you have to have 4 faults on 4 servers and you may lose data executed after the last position available in the REDO LOG on disk - I am not aware of any other HA architecture that may achieve more than this, whether it is in-memory or on disk.

Q from Hilmar: But what happen if you have a power failure for the whole data center?
First of all, you should have UPS systems in place that should at least allow you to shut down the server gracefully, in order to not lose any data at all. Secondly, with version 5.1 you may use geographic replication and activate the cluster in another data center.

Q from Jonas: Is MySQL Cluster Replication is done on a node-by-node basis, or the cluster as a whole?
In 5.1, it is Cluster to Cluster.

Q from Hilmar: Has mysql cluster already been tested on sparc T1's for the SQL nodes? And are these a better option than a regular AMD box ?
SUN HW and Solaris and well supported, you have only to take care of having the same hardware type (SPARC, Intel or AMD) between nodes.

Monday 7 May 2007

Questions and Answers in the Second Session of the Online Solutions with MySQL Webinar - On Replication

Correction on the INSERT DELAYED
In slide 20 I have mentioned that the INSERT DELAYED statement can increase performance on the slave. This is wrong, since the DELAYED keyword is ignored by the SQL thread on the slave server. The INSERT DELAYED statement can increase the overall performance of an application since the control is returned to the client as soon as the row is queued into the list of inserts to execute. The INSERT DELAYED can be used with MyISAM, MEMORY and ARCHIVE.

Q from Filip: Does master & slave have to be the same db-version, and the same Operative system?
Not necessarily, you can have different versions and operating systems

Q from Danilo: Is there a way to load a backup from the master without locking tables or shutting down the master database?
Using Innodb, you can use mysqldump to produce a consistent backup without locking the tables

Q from Henk: How are primairy keys communicated back from the slave to the master
There is no need: the primary keys generated on the master are pushed "as is" on the slave / they are not regenerated
Q: How does a slave follow up an insert?
It receives the keys generated by the master, so if the slave implements foreign keys (although they would not be necessary), the data on the DB is consistent

Q from Martin: How can mysql prevent binary log corruption when link between master and slave breaks? This has happened to me, which stopped the replication thread indefinitely
The IO thread on the slave stops pumping the log. As soon as the master restart the slave restart from where it stopped

Q from Brendan: What is the latency between a transaction being committed on the master to the same transaction being committed on the slave in a replicated environment?
It depends on several factors - hardware, storage engine, workload etc. From a replication point of view, the transaction is first recorded on the relay log and then it's applied to the slave.

Q from Helen: Are all the DBs on the master server replicated? Can you selectively replicate databases?
You can selectively include or exclude DBs. This works down to the table level

Q from James: Are client-updates prohibited on a slave system when the master is active? If not then how do you prevent inconsistencies?
They are allowed. The slave can be used for any task. it is your responsibility to write only on the master. In order to make the slave read-only, you should work at a user-security level

Q from Jonas: If I understand right, I could configure the master on both servers to address them with a virtual IP, is that correct?
Yes, that is correct, you can do it in order to simplify and speed up the failover

Q from Volker: The HA solution can only work if the binlog positions on Master and HA slave are the same, right?
No, the slave can be a little delayed. For synchronous HA we use others techniques like IO replication (DRBD) or shared disk architecture

Q from Paul: How does master-master replication perform with cross continent latency? eg: UK & USA
MySQL Replication is asynchronous so it will not be affected by latency. The use of separated threads avoid that the geographical position would affect the activity on the slave DB.

Q from Danilo: What does "Non-deterministic writes" mean?
An example of it is an INSERT statement that contains the RAND() function

Q from Dave: Can replication handle 'load data infile' type operations?
Yes, the data loaded is stored in a temporary file in the tmp directory and passed to the slave through the IO thread. The SQL thread will load the data from a file available in the tmp directory of the slave server. The file are security protected from access as other.

Q from Owen: Could you use blackhole on Master 1 and split data afterwards to mutiples partioned DBs
Yes, blackhoe storage is a good way to build a relay before replication split

Q from Andrew: If the update is async, is there a possibility that an update x made after update y in the server is performed y and then x on the slave?
If done in a different transaction that commit in a reverse order it might happen. The locking mechanism of InnoDB and the default isolation level (repeatable read) make replication consistent

Q from Alessandro: I've never heard about BLACKHOLE storage engine. What is it designed for?
It is a storage engine that store nothing in the db but the binlog is generated anyway. It's used to build a relay server, reduce the use of resources and improve speed

Q from John: Can you specify a master db by host name, or must it be an IP address?
You can use either the hostname or the IP address

Q from Clive: Is there any drawback to using dual-master replication (with auto-incr offsets) to allow automatic client failover with complete safety?
Bear in mind that replication is asyncronous. For a complete safety, MySQL cluster might be the solution. or some other HA techniques : IO replication / shared disk

Q from Rob: Does master & slave have to be the same db-version, and the same Operative system. You can always replicate to a newer version right?
You can replicate to newer version and/or different OSs

Q from Rob: Is it possible to share query cache over multiple slaves?
No, the query cache are on a server basis

Q from Daniel: If I save binlogs in a NAS and net connection/write fails, will the server still work?
The master must be able to write to its current log and to generate the next one. Then, once logs are rotated, they can be pushed anywhere for safety. NAS is a solution.

Q from Carsten: Can 2 masters read from the same database files? (Implementing DFS for example) ?
This can be done using cluster-aware file systems, but there are some aspects to be carefully considered, such as the storage engine to use and the query locking.

Q from Phil: What do you think the impact of placing data and logs on different drives is?
This is an important point to optimize the IOs and it is recommended

Q from Manuel: I think multimaster replication writing on both to a table with autoincrement column will cause heavy fragmentation of data and index. How can this affect to query performance?
If load is fairly balanced between nodes, I do not see this as a problem. In other cases, it might be necessary to periodicaly optimize tables.

Q from Domenico: Is ti possible to replicate from more then one masters to fews slaves?
Currently, a slave can be linked only to one master. There are plans to change this and create multisource replication

Q from Phil: Is the use of high performance RAID important for the logs?
Absolutely, performance on the master will get benefit from this

Q from Alessandro: What is vertical partitioning?
It's a way to separate data by columns instead of rows. For example, if a large table has 10 columns, C1, C2 to C10, one can split this table into two tables: the first table will contain, for example, C1, C2 and C3, which are the most accessed columns for read and write. The second table will contain the primary key and the remaining 7 columns, that will be rarely updated. This approach can improve performance significantly, since the tables are smaller, the columns fit easier in the blocks used by the storage engine and in the caching mechanisms.

Q from Ian: When is it a good idea to choose MySQL Cluster rather than using replication for HA?
Replication and Cluster are two completely different technologies. In short terms, Replication is the right choice for scale-out solutions when there are lots of read operations, such as on web search and read. MySQL Cluster is mainly used when the ratio of read and write operations is closed to 1:1 and when queries affect small resultsets with direct key access or simple joins.

Q from Ap: What is the best method to load data into the slave, use mysqldump or use "load data from master"
LOAD DATA FROM MASTER is deprecated, the best solution is to use mysqldump

Q from Manuel: With master/slave replication you mentioned "automatic failover", how can you do that?
You can use clustering and HA software such as Linux HA to control the activity of the master. The software can switch over to the slave in case of issues. The client application will get access to the new master using a virtual IP address.

Q from Danilo: Is it possible connect master and slave with a serial link?
It's possible to connect master and slaves to any TCP/IP connection

Q from Jon: For those people who aren't lucky enough to have access to MySQL Enterprise - are there any other methods of monitoring slave status (e.g. SNMP)?
Yes, there are some monitoring tools that could help: Munin, Cacti and Ganglia might help. You can find information on these tools in the third webinar of the series (check the slides here)

Q from Jonas: If I have a 2-node cluster, with Heartbeat, each node has one IP (let's say IP1 and IP2, and share a VIP), can I configure both servers to be slaves, and the master be the VIP? Is not the active node both master and slave of himself?
Yes, in a sense, you can have a circular master replication and the master is accessible using the VIP

Q from Owen: What is best configuration for HA in a MyISAM environment with high OTW but also high muti joined selects for reporting?
IMO, the multi-joined selects are not affected by the HA solutions provided by MySQL. MyISAM can certainly help (provided that the system is generally safe from index corruption. If the requirement is to have a reporting server with multi-joined selects aside a classic OLTP server, then replication is probably the best solution.

Q from Manuel: When you configure a slave, you didn´t configure the log pos, why?
In the example, we supposed to start from a fresh installation. The log position is important when there the master and the slave must be synchronised while the system is running.