AlwaysOn Availability Group Disk Failure But No Failover - by Dave Hughes

Status: Won't Fix

Due to several factors the product team decided to focus its efforts on other items.
A more detailed explanation for the resolution of this particular item may have been provided in the comments section.


ID: 772887
Status: Closed
Type: Bug
Repros: 5
Opened: 11/30/2012 1:24:17 AM
Access Restriction: Public

Description

The scenario is as follows:

2 x Physical Servers (Site 1: Node 1 + Node 2) with local C and D drives for the operating system and applications
2 x Hyper-V v2 Virtual Servers in a DR site (Site 2: Node 3 + Node 4, different subnet)
5 x iSCSI LUNs for data (H), logs (L), tempdb data (T), tempdb log (U) and backup (Z)

I believe that native Windows drivers are used for the networking and iSCSI.
A single cluster spans all 4 nodes.

SQL Server binaries installed to D
SQL Server root data (master db etc.) installed to H:/Program Files /Microsoft SQL Server...

Cluster node voting set to Node 1 + Node 2 + File Share Witness
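As a rough illustration of the voting setup above, here is a minimal Python sketch of a static majority check. The names (`VOTES`, `has_quorum`) are hypothetical and the model is deliberately simplified: it ignores dynamic quorum and anything else the real cluster service does. It only shows that, with these three votes, losing the iSCSI storage on Node 1 does not by itself change quorum, because Node 1 stays in the cluster.

```python
# Hypothetical model of the vote assignment described in this report.
# The DR nodes (Node 3 + Node 4) carry no vote.
VOTES = {
    "Node1": 1,
    "Node2": 1,
    "Node3": 0,
    "Node4": 0,
    "FileShareWitness": 1,
}


def has_quorum(votes, online):
    """Return True if the online members hold a strict majority of all votes."""
    total = sum(votes.values())
    present = sum(v for name, v in votes.items() if name in online)
    return 2 * present > total


everyone = set(VOTES)

# All members up: 3 of 3 votes.
print(has_quorum(VOTES, everyone))

# Node 1 completely down: 2 of 3 votes, quorum is kept.
print(has_quorum(VOTES, everyone - {"Node1"}))

# Both physical nodes down: only the witness vote remains, quorum is lost.
print(has_quorum(VOTES, everyone - {"Node1", "Node2"}))
```

In the test described here Node 1 never left the cluster, so this corresponds to the first case: quorum is unaffected and the cluster itself sees nothing wrong.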

Availability Group set to Sync + Failover on Node 1 + Node 2, Async on Node 3 + Node 4

All is currently healthy, clustering is reporting zero errors, data movement is running, Node 1 (physical) is the active server.

To simulate a SAN failure, the network adapters to the iSCSI target were disabled on the active physical server (Node 1).  At this point the iSCSI drives disappeared from Windows Explorer as expected; however, the Availability Group DID NOT fail over to the second node.

Running sp_server_diagnostics (on what I assume to be a dying server) reported no errors - all components reported 'clean' (possibly except for events, which may have been 'unknown').  Trying to connect to the databases threw errors as expected.

So my question is: why is sp_server_diagnostics reporting a healthy state when ALL the iSCSI drives have 'failed', including those holding master, tempdb and the user databases?  Is this scenario not likely to be one of the most common failures that Availability Groups (and, I guess, Failover Cluster Instances as well - they also use sp_server_diagnostics) should protect against?

I am aware that per-database errors are not currently included in sp_server_diagnostics, but even so, a complete loss of the databases, especially the system databases, should have flagged something!

Am I missing something?
Posted by Cody Konior on 8/22/2016 at 12:55 AM
I know this is old and closed as won't fix. I don't have the full answers but the behaviour of not failing over when databases die is mentioned here: https://msdn.microsoft.com/en-us/library/hh710061.aspx as "Damaged databases and suspect databases are not detected by any failure-condition level. Therefore, a database that is damaged or suspect (whether due to a hardware failure, data corruption, or other issue) never triggers an automatic failover."
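To make the quoted behaviour concrete, here is a minimal Python sketch of how I understand the failure-condition levels to map onto the sp_server_diagnostics components. It is a simplified model and not how SQL Server implements it: the function name `would_fail_over` and the dictionary layout are my own assumptions, and the level-to-component mapping is my reading of the flexible failover policy documentation (only system, resource and query_processing feed failure detection; io_subsystem and events are diagnostic only).

```python
# Lowest failure-condition level at which an 'error' state in each component
# is assumed to trigger automatic failover. io_subsystem and events are
# deliberately absent: they are treated as diagnostic only.
LEVEL_FOR_COMPONENT_ERROR = {
    "system": 3,            # critical internal errors (the default level is 3)
    "resource": 4,          # moderate internal errors
    "query_processing": 5,  # any qualified failure condition
}


def would_fail_over(level, component_states, service_up=True, responsive=True):
    """Simplified model: would this health report trigger automatic failover?"""
    if not 1 <= level <= 5:
        raise ValueError("failure-condition level must be between 1 and 5")
    if not service_up:
        return True            # service down is covered from level 1 upwards
    if not responsive:
        return level >= 2      # no diagnostics within the health-check timeout
    for component, state in component_states.items():
        needed = LEVEL_FOR_COMPONENT_ERROR.get(component)
        if needed is not None and state == "error" and level >= needed:
            return True
    return False


# Roughly the report from the original post: storage is gone, but every
# component is 'clean' (events possibly 'unknown').
lost_storage = {
    "system": "clean",
    "resource": "clean",
    "query_processing": "clean",
    "io_subsystem": "clean",
    "events": "unknown",
}

print(would_fail_over(3, lost_storage))  # default level
print(would_fail_over(5, lost_storage))  # most sensitive level
```

Under these assumptions neither call reports a failover, and even an 'error' state in io_subsystem would not change that, which matches what the original post observed.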

In SQL 2016 there's a checkbox on AGs to enable failover on database health, it's documented here (I couldn't find the details in the official documentation): https://blogs.msdn.microsoft.com/saponsqlserver/2016/05/02/sql-server-2016-alwayson-for-sap/

I haven't tested if it includes system databases but I suspect not. There were also reports that local tempdb can disappear on FCIs and it won't trigger a failover. I don't know what the best practice is in these cases.