Hi All,
I got an AlwaysOn alert from our monitoring system saying that some databases are not synchronized. I checked the SQL Server log and Event Viewer and found a pattern: the issue occurs every hour.
The error indicates a connection timeout. After googling the message, I double-checked the NT AUTHORITY\SYSTEM permissions, which look fine. Then I checked port 5022, which the endpoint uses, and found that svchost is also listening on this port. I am not sure whether that is correct (our other cluster has the same configuration, and there I can also see svchost listening on the endpoint port).
I have copied all the messages below. Has anyone experienced this issue before? (I just noticed that when I was watching SQL Server around 7:33, there was no AlwaysOn error in the SQL Server log, but Event Viewer still showed errors, which is weird.)
SQL Server log (occurs at 6:33, 5:33, 4:33, ... every hour, except at 7:33 when I was observing):
AlwaysOn Availability Groups connection with secondary database terminated for primary database 'XXXYYY' on the availability replica 'XXX3' with Replica ID: {9b3bd423-f2fc-44c5-9831-41c94c4d6de2}. This is an informational message only. No user action is required.
A connection timeout has occurred while attempting to establish a connection to availability replica 'XXX3' with id [41F87192-C846-4355-A8DC-C788EC56E93E]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
AlwaysOn Availability Groups connection with secondary database established for primary database 'XXXYYY' on the availability replica 'XXX3' with Replica ID: {41f87192-c846-4355-a8dc-c788ec56e93e}. This is an informational message only. No user action is required.
Event Viewer (occurs at 7:33, 6:33, 5:33, 4:33, ... every hour):
Cluster resource 'XXXRes' of type 'SQL Server Availability Group' in clustered role 'XXXRole' failed.
Clustered role 'XXXRole' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
Cluster resource 'XXXRes' of type 'SQL Server Availability Group' in clustered role 'XXXRole' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
The Cluster service failed to bring clustered role 'XXXRole' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
Since the endpoint port is 5022, I ran netstat and found the following port information:
netstat -anb
[svchost.exe]
TCP 0.0.0.0:5022 0.0.0.0:0 LISTENING
[svchost.exe]
TCP 10.1.16.181:5022 10.1.16.182:55377 ESTABLISHED
[sqlservr.exe]
TCP 10.1.16.181:5022 10.1.16.183:63670 ESTABLISHED
[sqlservr.exe]
TCP 10.1.16.181:5022 10.1.16.183:63670 ESTABLISHED
[sqlservr.exe]
TCP 10.1.16.181:51465 10.1.138.161:49159 TIME_WAIT
TCP 10.1.16.181:51496 10.1.16.182:5022 ESTABLISHED
[SQLAGENT.EXE]
TCP 10.1.16.181:62876 10.1.16.183:5022 ESTABLISHED
[svchost.exe]
TCP [::]:5022 [::]:0 LISTENING
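To double-check which process really owns port 5022, the netstat output above could be paired up programmatically. The following is only a rough sketch, not a verified diagnosis: it assumes that `netstat -anb` prints the owning executable in square brackets on the line after the connection it belongs to (so the pasted output may need to be read that way), and the function name `owners_of_port` is made up for illustration.

```python
import re

# A TCP row, e.g. "TCP 10.1.16.181:5022 10.1.16.183:63670 ESTABLISHED"
ROW = re.compile(r"^\s*TCP\s+(\S+)\s+(\S+)\s+(\S+)\s*$")
# An owner line, e.g. "[sqlservr.exe]"
OWNER = re.compile(r"^\s*\[(.+)\]\s*$")


def owners_of_port(netstat_text, port):
    """Return (local, remote, state, owner) tuples for TCP rows on a local port.

    Assumption: the executable name follows the row it owns. A row that is
    followed by another row before any owner line (e.g. TIME_WAIT) gets None.
    """
    results = []
    pending = None  # the most recent row still waiting for its owner line
    for line in netstat_text.splitlines():
        row = ROW.match(line)
        owner = OWNER.match(line)
        if row:
            if pending is not None:
                results.append(pending + (None,))
            pending = row.groups()
        elif owner and pending is not None:
            results.append(pending + (owner.group(1),))
            pending = None
        # Any other line (blank, component name, header) is ignored.
    if pending is not None:
        results.append(pending + (None,))

    suffix = ":" + str(port)
    return [r for r in results if r[0].endswith(suffix)]
```

Saving the full output with `netstat -anb > netstat.txt` and calling `owners_of_port(open("netstat.txt").read(), 5022)` would list every listener and connection on the local endpoint port together with the executable attributed to it under that assumption.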