AlwaysOn Cluster did not fail over successfully

I had a serious issue with a production AlwaysOn cluster: the service did not transition successfully to the secondary node, and I cannot find the root cause.

Some details: it is a two-node cluster (same datacenter) with a shared disk quorum, running Windows Server 2012. Both nodes are virtual machines on VMware vSphere 5.5. The SQL Server version is 2012 Enterprise SP2 CU6.

The failover occurred because of a network incident (a spanning tree recalculation caused a connection timeout between both nodes). Initial entries in the SQL Log look normal for this event, for example:

05/08/2015 11:18:06: A connection timeout has occurred on a previously established connection to availability replica 'FIN-IE-PA078' with id [6910F4A9-87E7-4836-BA79-0F41BE90266D]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

05/08/2015 11:18:06: AlwaysOn Availability Groups connection with secondary database terminated for primary database 'UserManagement' on the availability replica with Replica ID: {6910f4a9-87e7-4836-ba79-0f41be90266d}. This is an informational message only. No user action is required.

05/08/2015 11:18:07: Stopped listening on virtual network name 'FIN-IE-PA080'. No user action is required.

05/08/2015 11:18:08: AlwaysOn: The local replica of availability group 'PI-STD-AG' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

05/08/2015 11:18:08: The state of the local availability replica in availability group 'PI-STD-AG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

05/08/2015 11:18:08: AlwaysOn Availability Groups connection with secondary database terminated for primary database 'UserManagement' on the availability replica with Replica ID: {6910f4a9-87e7-4836-ba79-0f41be90266d}. This is an informational message only. No user action is required.

05/08/2015 11:18:08: The availability group database "UserManagement" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

05/08/2015 11:18:08: Nonqualified transactions are being rolled back in database UserManagement for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

At this point, there are repeated messages in the log file relating to Remote harden of transactions, all connected to GhostCleanupTask, for example:

05/08/2015 11:18:36: Nonqualified transactions are being rolled back in database UserManagement for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

This message repeats about once every 1 to 2 minutes until the point where I manually initiated a failover on the server that was originally primary. At that point, the availability group came back online and the secondary database re-synchronized.

05/08/2015 11:36:31: The state of the local availability replica in availability group 'PI-STD-AG' has changed from 'RESOLVING_NORMAL' to 'RESOLVING_PENDING_FAILOVER'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

05/08/2015 11:36:41: AlwaysOn: The local replica of availability group 'PI-STD-AG' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

05/08/2015 11:36:41: Started listening on virtual network name 'FIN-IE-PA080'. No user action is required.

05/08/2015 11:36:42: A connection for availability group 'PI-STD-AG' from availability replica 'FIN-IE-PA077' with id [98F8CD93-0C9D-44E5-BD6B-68964D391B15] to 'FIN-IE-PA078' with id [6910F4A9-87E7-4836-BA79-0F41BE90266D] has been successfully established. This is an informational message only. No user action is required.
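To put a number on the outage, the gap between the change to RESOLVING_NORMAL (11:18:08) and the transition back to the primary role (11:36:41) can be computed from the two timestamps above. This is only a minimal Python sketch; the day/month order of the log dates is an assumption, and it does not affect the result because both entries fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the SQL Server error log entries above.
# Assumption: dates are day/month/year. The difference is the same either way.
FMT = "%d/%m/%Y %H:%M:%S"

entered_resolving = datetime.strptime("05/08/2015 11:18:08", FMT)
preparing_primary = datetime.strptime("05/08/2015 11:36:41", FMT)

# Time the availability group spent outside the primary role.
outage = preparing_primary - entered_resolving
outage_seconds = int(outage.total_seconds())

print(f"Outside the primary role for {outage} ({outage_seconds} s)")
```

That is roughly 18.5 minutes, far longer than the network interruption itself.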

That is the picture from the SQL Server error log. Now for the Windows log:

05/08/2015 11:18:02: Cluster network 'Cluster Network 1' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

05/08/2015 11:18:02: Cluster network interface 'FIN-IE-PA077 - Ethernet' for cluster node 'FIN-IE-PA077' on network 'Cluster Network 1' failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

05/08/2015 11:18:03: Health check for IP interface 'IP Address 192.168.57.62' (address '192.168.57.62') failed (status is '1117'). Run the Validate a Configuration wizard to ensure that the network adapter is functioning properly.

This message repeats several times.

05/08/2015 11:18:08: Cluster resource 'PI-STD-AG_192.168.57.59' of type 'IP Address' in clustered role 'PI-STD-AG' failed.

05/08/2015 11:18:08: Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

05/08/2015 11:18:08: The Cluster service failed to bring clustered service or application 'PI-STD-AG' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

05/08/2015 11:18:08: Clustered role 'PI-STD-AG' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

My interpretation is that the cluster's failover attempts failed because the network condition still persisted. The network interruption lasted approximately 2 minutes, and I would have expected the cluster to come back online at that point, after the restart delay period mentioned in the last entry above. However, this did not happen.
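To make my reading of the "exceeded its failover threshold" entry concrete, here is a simplified Python model of how I understand the threshold check. It is not the actual WSFC algorithm. The function name, the failure timestamps, and the threshold and period values are all hypothetical; the real values would come from the `FailoverThreshold` and `FailoverPeriod` properties of the clustered role.

```python
from datetime import datetime, timedelta

def exceeds_failover_threshold(failure_times, threshold, period):
    """Simplified model: True if the number of failures inside the
    trailing window of length `period` is greater than `threshold`."""
    if not failure_times:
        return False
    latest = max(failure_times)
    recent = [t for t in failure_times if latest - t <= period]
    return len(recent) > threshold

# Hypothetical values for illustration only (not read from my cluster).
threshold = 1                 # allowed failover attempts in the period
period = timedelta(hours=6)   # length of the failover period

# Hypothetical failure times during the network incident.
failures = [
    datetime(2015, 8, 5, 11, 18, 3),
    datetime(2015, 8, 5, 11, 18, 8),
]

left_failed = exceeds_failover_threshold(failures, threshold, period)
print("Role left in failed state:", left_failed)
```

Under this model, a second failure inside the period is enough to leave the role failed, with no further automatic attempts until the period expires or someone intervenes manually. That would match what I saw, but I have not confirmed this is how the cluster actually behaved.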

Appreciate any support on this.
