Hi, hopefully someone will be able to find my scenario interesting enough to comment on!
We have two instances of SQL, for this example I will call them 'L' and 'J'. We also have two data-centres, for this example I will call them 'D1' and 'D2'. We are attempting to create a new solution and our hardware budget is rather large. The directive from the company is that they want to be able to run either instance from either data centre. Preferably the primary for each will be seperated, so for example:
Instance 'L' will sit in data centre 'D1' with the ability to move to 'D2', and...
Instance 'J' will sit in data centre 'D2' with the ability to move to 'D1' on request.
My initial idea was to create a 6-node cluster - 3-nodes in each data centre. Let's name these D1-1, D1-2, D1-3 and D2-1, D2-2, D2-3 to signify which data centre they sit in.
'L' could then sit on (for example) D1-1, with the option to move to D1-2 (synchronously), D2-1,D2-2 (a-synchronously)
'J' could sit on D2-3, with D2-2 as a synchronous secondary and D1-3,D1-2 as the asynchronous secondaries.
Our asynchronous secondaries in this solution are our full DR options, our synchronous secondaries are our DR option without moving to another data centre site. The synchronous secondaries will be set up as automatic fail-over partners.
In theory, that may seen like a good approach. But when I took it to the proof of concept stage, we had issues with quorum...
Because there are three nodes at each side of the fence (3 in each data centre), then neither side has the 'majority' (the number of votes required to take control of the cluster). To get around this, we used WSFC with Node and File Share majority - with the file share sitting in the D1 data centre. Now the D1 data centre has 4 votes in total, and D2 only has 3.
This is a great setup if one of our data centres was defined as the 'primary', but the business requirement is to have two primary data centres, with the ability to fail over to one another.
In the proof of concept, i tested the theory by building the example solution and dropping the connection which divides the two data centres. It caused the data centre with the file share to stay online (as it had the majority), but the other data centre lost it's availability group listeners. SQL Server stayed online, just not via the AG listener's name - i.e. we could connect to them via their hostnames, rather than the shared 'virtual' name.
So I guess really I'm wondering, did anyone else have any experience of this type of setup? or any adjustments that can be made to the example solution, or the quorum settings in order to provide a nice outcome?