We have a setup of two mongodb shards. Each shard contains a master, a slave, a 24h slave delay slave and an arbiter. However the balancer fails to migrate any shards waiting for the delayed slave to migrate. I have tried setting _secondaryThrottle to false in the balancer config, but I still have the issue.
It seems the migration goes on for a day and then fails (A ton of waiting for slave messages in the logs). Eventually it gives up and starts a new migration. The message says waiting for 3 slaves, but the delay slave is hidden and prio 0 so it should wait for that one. And if the _secondaryThrottle worked it should not wait for any slave right?
It's been like this for a few months now so the config should have been reloaded on all mongoses. Some of the mongoses running the balancer have been restarter recently.
Does anyone have any idea how to solve the problem, we did not have these issues before starting the delayed slave, but it's just our theory.
Config:
{ "_id" : "balancer", "_secondaryThrottle" : false, "stopped" : false }
Log from shard1 master process:
[migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4fd2025ae087c37d32039a9e') } -> {shardkey: ObjectId('4fd2035ae087c37f04014a79') } waiting for: 529dc9d9:7a [migrateThread] Waiting for replication to catch up before entering critical section
Log from shard2 master process:
Tue Dec 3 14:52:25.302 [conn1369472] moveChunk data transfer progress: { active: true, ns: "xxx.xxx", from: "shard2/mongo2:27018,mongob2:27018", min: { shardkey: ObjectId('4fd2025ae087c37d32039a9e') }, max: { shardkey: ObjectId('4fd2035ae087c37f04014a79') }, shardKeyPattern: { shardkey: 1.0 }, state: "catchup", counts: { cloned: 22773, clonedBytes: 36323458, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Update: I confirmed that removing slaveDelay got the balancer working again. As soon as they got up to speed chunks moved. So the problem seems to be related to the slaveDelay. I also confirmed that the balancer runs with "secondaryThrottle" : false. It does seem to wait for slaves anyway.
Shard2:
Tue Dec 10 11:44:25.423 [migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') } waiting for: 52a6f089:81
Tue Dec 10 11:44:26.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:27.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:28.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:29.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:30.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.425 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.647 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.667 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }