Channel: SQL Server High Availability and Disaster Recovery forum

AlwaysOn Cluster did not fail over successfully

I had a serious issue with a production AlwaysOn cluster: the service did not transition successfully to the secondary node, and I cannot find the root cause.

Some details: it is a two-node cluster (same datacenter) with a shared disk quorum, running Windows Server 2012. Both nodes are virtual machines on VMware vSphere 5.5. The SQL Server version is 2012 Enterprise SP2 CU6.

The failover was triggered by a network incident (a spanning tree recalculation caused a connection timeout between the two nodes). The initial entries in the SQL Server error log look normal for this kind of event, for example:

05/08/2015 11:18:06: A connection timeout has occurred on a previously established connection to availability replica 'FIN-IE-PA078' with id [6910F4A9-87E7-4836-BA79-0F41BE90266D].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

05/08/2015 11:18:06: AlwaysOn Availability Groups connection with secondary database terminated for primary database 'UserManagement' on the availability replica with Replica ID: {6910f4a9-87e7-4836-ba79-0f41be90266d}. This is an informational message only. No user action is required.

05/08/2015 11:18:07: Stopped listening on virtual network name 'FIN-IE-PA080'. No user action is required.

05/08/2015 11:18:08: AlwaysOn: The local replica of availability group 'PI-STD-AG' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

05/08/2015 11:18:08: The state of the local availability replica in availability group 'PI-STD-AG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

05/08/2015 11:18:08: AlwaysOn Availability Groups connection with secondary database terminated for primary database 'UserManagement' on the availability replica with Replica ID: {6910f4a9-87e7-4836-ba79-0f41be90266d}. This is an informational message only. No user action is required.

05/08/2015 11:18:08: The availability group database "UserManagement" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

05/08/2015 11:18:08: Nonqualified transactions are being rolled back in database UserManagement for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

At this point, there are repeated messages in the log file relating to Remote harden of transactions, all connected to GhostCleanupTask, for example:

05/08/2015 11:18:36: Nonqualified transactions are being rolled back in database UserManagement for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

This message repeats about once every 1 to 2 minutes until I manually initiated a failover on the server that was originally primary. At that point, the availability group came back online and the secondary database re-synchronized.

05/08/2015 11:36:31: The state of the local availability replica in availability group 'PI-STD-AG' has changed from 'RESOLVING_NORMAL' to 'RESOLVING_PENDING_FAILOVER'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

05/08/2015 11:36:41: AlwaysOn: The local replica of availability group 'PI-STD-AG' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

05/08/2015 11:36:41: Started listening on virtual network name 'FIN-IE-PA080'. No user action is required.

05/08/2015 11:36:42: A connection for availability group 'PI-STD-AG' from availability replica 'FIN-IE-PA077' with id  [98F8CD93-0C9D-44E5-BD6B-68964D391B15] to 'FIN-IE-PA078' with id [6910F4A9-87E7-4836-BA79-0F41BE90266D] has been successfully established.  This is an informational message only. No user action is required.
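To put a number on the outage, the two key timestamps quoted above can be subtracted. This is only a small sketch: it assumes the log dates are day/month/year (the elapsed time is the same either way), and the variable names are illustrative, not taken from any tool.

```python
from datetime import datetime

# Assumption: log timestamps are day/month/year; both entries are on the
# same day, so the elapsed time does not depend on this choice.
FMT = "%d/%m/%Y %H:%M:%S"

# Replica state changed from PRIMARY_NORMAL to RESOLVING_NORMAL.
left_primary = datetime.strptime("05/08/2015 11:18:08", FMT)
# Listener 'FIN-IE-PA080' started listening again after the manual failover.
listener_back = datetime.strptime("05/08/2015 11:36:41", FMT)

outage = listener_back - left_primary
minutes, seconds = divmod(int(outage.total_seconds()), 60)
print(f"Availability group unavailable for {minutes} min {seconds} s")
```

So the availability group stayed in the resolving state for roughly 18.5 minutes, far longer than the network interruption itself.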

That is the picture from the SQL Server error log. Now for the Windows log:

05/08/2015 11:18:02: Cluster network 'Cluster Network 1' is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

05/08/2015 11:18:02: Cluster network interface 'FIN-IE-PA077 - Ethernet' for cluster node 'FIN-IE-PA077' on network 'Cluster Network 1' failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

05/08/2015 11:18:03: Health check for IP interface 'IP Address 192.168.57.62' (address '192.168.57.62') failed (status is '1117'). Run the Validate a Configuration wizard to ensure that the network adapter is functioning properly.

This message repeats several times.

05/08/2015 11:18:08: Cluster resource 'PI-STD-AG_192.168.57.59' of type 'IP Address' in clustered role 'PI-STD-AG' failed.

05/08/2015 11:18:08: Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

05/08/2015 11:18:08: The Cluster service failed to bring clustered service or application 'PI-STD-AG' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

05/08/2015 11:18:08: Clustered role 'PI-STD-AG' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

My interpretation is that the cluster's failover attempts failed because the network condition still persisted. The network interruption lasted approximately 2 minutes, so I would have expected the cluster to come back online once it ended, after the restart delay period mentioned in the last log entry above. However, this did not happen.
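One way to reason about the "exceeded its failover threshold" entry is to model the policy as a count of failover attempts inside a sliding time window. The sketch below is a simplified illustration, not the cluster service's actual implementation. The `FailoverPolicy` class is hypothetical, and the values used (a threshold of 1 attempt within a 6-hour period, often cited as the default for a two-node cluster) are assumptions that should be checked against the real properties of the 'PI-STD-AG' role.

```python
from datetime import datetime, timedelta

class FailoverPolicy:
    """Hypothetical, simplified model of a failover threshold within a period."""

    def __init__(self, threshold, period):
        self.threshold = threshold   # max failover attempts allowed per period
        self.period = period         # length of the sliding window
        self.attempts = []           # times at which a failover was attempted

    def on_failure(self, now):
        # Forget attempts that have aged out of the window.
        self.attempts = [t for t in self.attempts if now - t < self.period]
        if len(self.attempts) < self.threshold:
            self.attempts.append(now)
            return "failover"
        # Threshold reached: the role is left failed until the window expires
        # or someone brings it online manually.
        return "left_failed"

# Assumed values; verify against the actual role configuration.
policy = FailoverPolicy(threshold=1, period=timedelta(hours=6))
t0 = datetime(2015, 8, 5, 11, 18, 8)  # assumes day/month/year log dates

first = policy.on_failure(t0)                                # initial failure
second = policy.on_failure(t0 + timedelta(seconds=1))        # other node fails too
after_outage = policy.on_failure(t0 + timedelta(minutes=2))  # network is back
print(first, second, after_outage)
```

Under these assumed values, the first failure triggers a failover, the immediate second failure exhausts the threshold, and the role is still left failed two minutes later because the 6-hour window has not elapsed. That would be consistent with the role staying down until the manual failover, but it is only a model of the policy.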

Appreciate any support on this.

