In this blog post I will share an issue we had with a database configured with AlwaysOn. Before going any further, here is the environment we've got:
Each node has Windows Server 2008 R2 (with all the service packs and hotfixes recommended for AlwaysOn)
Running on top of VMware vSphere 5.1
SQL Server 2012 (SP1) Enterprise Edition
RAM: 10 GB (8 GB assigned to SQL Server).
2 vCPUs.
Availability Mode: Synchronous Commit
Issue: Every day around 5 AM, the secondary database goes into the “Not Synchronizing/Suspect” state. Until we fix it, the T-Log on the primary keeps growing, along with all the usual jazz that comes once AlwaysOn databases get out of sync.
So, what’s happening?
The app team performs a data load daily around 4:30 AM. Okay… so what's bad about that? They are loading ~30 million records daily, in a single transaction. Oops!
In the SQL Server error log, we see the following message:
Message
AlwaysOn Availability Groups data movement for database ‘Test_DB’ has been suspended for the following reason: “system” (Source ID 2; Source string: ‘SUSPEND_FROM_REDO’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
This message is always accompanied by another message (shown below):
Message
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
Ummm… this doesn't look good. If you are wondering what locks have to do with AlwaysOn secondaries, let me tell you. With both Database Mirroring and AlwaysOn, the rollback/redo thread also takes locks on the secondary side so that no other transaction can interfere with the REDO process, thus guaranteeing consistency. If for some reason SQL Server is not able to acquire locks for the redo thread, it stops synchronizing the database from that point on. (It's by design.)
In our case, SQL Server was running out of memory and was not able to acquire any further locks. (Remember, each lock structure in SQL Server needs a certain amount of memory.) Basically, it says: “Since I wasn't able to acquire a lock during the REDO, I don't know what else happened at that time and I can't guarantee the database is consistent. So I am not going to synchronize from this point. I will suspend the data movement and also take the database to the Suspect state.”
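To put rough numbers on "each lock needs memory", here is a back-of-the-envelope sketch. It assumes about 96 bytes per lock (the figure Microsoft's documentation for the locks server option cites; real overhead varies by version and platform) and a worst case where each of the ~30 million rows holds its own lock with no lock escalation. Neither assumption was measured on our servers, so treat the result as an upper-bound illustration, not a diagnosis.

```python
# Rough upper-bound estimate of lock memory for one huge transaction.
# ASSUMPTION: ~96 bytes per lock structure; actual size varies by
# SQL Server version and platform.
BYTES_PER_LOCK = 96


def estimate_lock_memory_bytes(lock_count, bytes_per_lock=BYTES_PER_LOCK):
    """Return the memory needed to hold lock_count individual locks."""
    return lock_count * bytes_per_lock


# ASSUMPTION: worst case, one lock per loaded row and no lock escalation.
rows_loaded = 30_000_000
sql_server_memory_bytes = 8 * 2**30  # the 8 GB assigned to SQL Server

lock_bytes = estimate_lock_memory_bytes(rows_loaded)
lock_gib = round(lock_bytes / 2**30, 2)
share_pct = round(lock_bytes / sql_server_memory_bytes * 100, 1)

print(f"Estimated lock memory: {lock_gib} GiB "
      f"({share_pct}% of the memory assigned to SQL Server)")
```

Even if the real number is a good deal lower, it shows why a single 30-million-row transaction is a lot to ask of a box with 8 GB for SQL Server.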
From an AlwaysOn standpoint, suspending synchronization when the REDO thread encounters any error is deliberate, by-design behavior in SQL Server.
To avoid this, all the app team has to do is optimize their load process to better manage lock acquisition. (Unfortunately, we are not being granted any more memory on these boxes.)
Bottom Line: Avoid huge transactions on tiny SQL Servers. Try to split the transactions into multiple chunks, especially when dealing with millions/billions of rows. That helps in many ways in general, not just in this particular scenario.
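As a minimal sketch of what "multiple chunks" can look like, the snippet below commits after every batch instead of once at the very end. It is not the app team's actual load process: it uses Python's built-in sqlite3 module as a stand-in for SQL Server so it runs anywhere, and the staging table, the load_in_batches helper, and the batch size of 10,000 are all hypothetical. The right batch size for a real load needs testing.

```python
import sqlite3
from itertools import islice


def load_in_batches(conn, rows, batch_size=10_000):
    """Insert rows and commit after each batch; return the number of commits.

    Hypothetical helper: committing per batch keeps every transaction
    small, so no single transaction has to cover the whole load.
    """
    row_iter = iter(rows)
    commits = 0
    while True:
        batch = list(islice(row_iter, batch_size))
        if not batch:
            break
        conn.executemany(
            "INSERT INTO staging (id, payload) VALUES (?, ?)", batch
        )
        conn.commit()  # end this small transaction before starting the next
        commits += 1
    return commits


# In-memory SQLite database used purely as a stand-in for the real target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT)")

rows = ((i, f"row-{i}") for i in range(25_000))
commits = load_in_batches(conn, rows, batch_size=10_000)
total = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(f"Loaded {total} rows in {commits} transactions")
```

The trade-off is that the load is no longer all-or-nothing, so the process needs a way to restart or clean up if it fails partway through.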
Have a safe and happy long weekend guys!
