| author | MashaMSFT |
|---|---|
| ms.author | mathoma |
| ms.date | 03/29/2023 |
| ms.service | virtual-machines |
| ms.topic | include |
High availability and disaster recovery (HADR) features, such as the Always On availability group and the failover cluster instance, rely on the underlying Windows Server Failover Cluster technology. Review the best practices for modifying your HADR settings to better support the cloud environment.
For your Windows cluster, consider these best practices:
- Deploy your SQL Server VMs to multiple subnets whenever possible to avoid the dependency on an Azure Load Balancer or a distributed network name (DNN) to route traffic to your HADR solution.
- Change the cluster to less aggressive parameters to avoid unexpected outages from transient network failures or Azure platform maintenance. To learn more, see heartbeat and threshold settings. For Windows Server 2012 and later, use the following recommended values:
  - SameSubnetDelay: 1 second
  - SameSubnetThreshold: 40 heartbeats
  - CrossSubnetDelay: 1 second
  - CrossSubnetThreshold: 40 heartbeats
- Place your VMs in an availability set or different availability zones. To learn more, see VM availability settings.
- Use a single NIC per cluster node.
- Configure cluster quorum voting to use an odd number of votes, three or more. Don't assign votes to DR regions.
- Carefully monitor resource limits to avoid unexpected restarts or failovers due to resource constraints.
  - Ensure your OS, drivers, and SQL Server are at the latest builds.
  - Optimize performance for SQL Server on Azure VMs. Review the other sections in this article to learn more.
  - Reduce or spread out workload to avoid resource limits.
  - Move to a VM or disk that has higher limits to avoid constraints.
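
To make the effect of the relaxed heartbeat values concrete, the following Python sketch multiplies each delay by its threshold to estimate how long a node can miss heartbeats before the cluster treats it as down. It's an illustration only: the function and dictionary names are hypothetical, the values are entered in seconds as listed above, and the result is an approximation rather than a guarantee of cluster behavior.

```python
# Recommended values from the list above (delays in seconds, thresholds in heartbeats).
# The dictionary and function names are hypothetical and used only for illustration.
RECOMMENDED = {
    "SameSubnetDelay": 1,
    "SameSubnetThreshold": 40,
    "CrossSubnetDelay": 1,
    "CrossSubnetThreshold": 40,
}


def heartbeat_tolerance_seconds(delay_seconds, threshold_heartbeats):
    """Approximate time a node can miss heartbeats before it's considered down."""
    return delay_seconds * threshold_heartbeats


same_subnet = heartbeat_tolerance_seconds(
    RECOMMENDED["SameSubnetDelay"], RECOMMENDED["SameSubnetThreshold"]
)
cross_subnet = heartbeat_tolerance_seconds(
    RECOMMENDED["CrossSubnetDelay"], RECOMMENDED["CrossSubnetThreshold"]
)

print(f"Same subnet tolerance: {same_subnet} seconds")
print(f"Cross subnet tolerance: {cross_subnet} seconds")
```

With the recommended values, both results come to 40 seconds, which is the window a transient network interruption would need to exceed before a node is removed from cluster membership.
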
For your SQL Server availability group or failover cluster instance, consider these best practices:
- If you're experiencing frequent unexpected failures, follow the performance best practices outlined in the rest of this article.
- If optimizing SQL Server VM performance doesn't resolve your unexpected failovers, consider relaxing the monitoring for the availability group or failover cluster instance. However, doing so may not address the underlying source of the issue and could mask symptoms by reducing the likelihood of failure. You may still need to investigate and address the underlying root cause. For Windows Server 2012 or higher, use the following recommended values:
  - Lease timeout: Use this equation to calculate the maximum lease time-out value:
    `Lease timeout < (2 * SameSubnetThreshold * SameSubnetDelay)`.
    Start with 40 seconds. If you're using the relaxed `SameSubnetThreshold` and `SameSubnetDelay` values recommended previously, don't exceed 80 seconds for the lease timeout value.
  - Max failures in a specified period: Set this value to 6.
- When using the virtual network name (VNN) and an Azure Load Balancer to connect to your HADR solution, specify `MultiSubnetFailover = true` in the connection string, even if your cluster only spans one subnet.
  - If the client doesn't support `MultiSubnetFailover = True`, you may need to set `RegisterAllProvidersIP = 0` and `HostRecordTTL = 300` so that clients cache the listener's DNS record for a shorter duration. However, doing so may cause additional queries to the DNS server.
- To connect to your HADR solution using the distributed network name (DNN), consider the following:
  - You must use a client driver that supports `MultiSubnetFailover = True`, and this parameter must be in the connection string.
  - Use a unique DNN port in the connection string when connecting to the DNN listener for an availability group.
- Use a database mirroring connection string for a basic availability group to bypass the need for a load balancer or DNN.
- Validate the sector size of your VHDs before deploying your high availability solution to avoid having misaligned I/Os. See KB3009974 to learn more.
- If the SQL Server database engine, Always On availability group listener, or failover cluster instance health probe is configured to use a port between 49,152 and 65,535 (the default dynamic port range for TCP/IP), add an exclusion for each port. Doing so prevents other systems from being dynamically assigned the same port. The following example creates an exclusion for port 59999:
  `netsh int ipv4 add excludedportrange tcp startport=59999 numberofports=1 store=persistent`
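
Two of the numeric rules in this list can be checked mechanically: the lease timeout bound and whether a port falls inside the dynamic port range. The Python sketch below is one possible way to do that. All function and constant names are hypothetical, the delay is assumed to be expressed in seconds, and the port range assumes the Windows default of 49152 through 65535. The sketch follows the equation's strict inequality, so it treats the bound itself as out of range.

```python
# Assumed default dynamic TCP port range on Windows (inclusive).
DYNAMIC_PORT_START = 49152
DYNAMIC_PORT_END = 65535


def max_lease_timeout_seconds(same_subnet_threshold, same_subnet_delay_seconds):
    """Upper bound from: lease timeout < 2 * SameSubnetThreshold * SameSubnetDelay."""
    return 2 * same_subnet_threshold * same_subnet_delay_seconds


def lease_timeout_is_valid(lease_timeout_seconds, same_subnet_threshold,
                           same_subnet_delay_seconds):
    """Return True if the lease timeout is strictly below the calculated bound."""
    bound = max_lease_timeout_seconds(same_subnet_threshold,
                                      same_subnet_delay_seconds)
    return lease_timeout_seconds < bound


def needs_port_exclusion(port):
    """Return True if the port is in the dynamic range and should be excluded."""
    return DYNAMIC_PORT_START <= port <= DYNAMIC_PORT_END


# Relaxed heartbeat values recommended earlier: threshold 40, delay 1 second.
print("Lease timeout bound:", max_lease_timeout_seconds(40, 1), "seconds")
print("40-second lease timeout valid:", lease_timeout_is_valid(40, 40, 1))
print("Port 59999 needs exclusion:", needs_port_exclusion(59999))
print("Port 1433 needs exclusion:", needs_port_exclusion(1433))
```

A check like this only flags values that break the stated rules. It doesn't change any cluster or SQL Server setting.
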