This week I had a fun one with vSAN stretched clusters. After a failover test on two stretched clusters, the witnesses of both clusters stopped working.
Let the troubleshooting begin
First, you look at the corresponding KB article: Troubleshooting vSAN Witness Node Isolation (2150433).
- A vSAN Witness Node (Virtual or Physical) is Isolated.
To confirm witness node isolation run the command: esxcli vsan cluster get
If the output of the command returns:
Sub-Cluster Member Count: 1
Local Node State: STANDALONE
Then the Witness is confirmed to be isolated from the vSAN Cluster.
- The vSAN Witness Host cannot form a Cluster with the remaining vSAN Data Nodes in a Stretched Configuration.
- Pinging the Witness Host from a vSAN ESXi fails
- Pinging an ESXi host from a Witness works, but not with a Full TCP Frame
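The isolation check from the KB article is easy to script if you need to run it against several witnesses. Below is a minimal sketch that assumes you have already captured the text output of esxcli vsan cluster get (over SSH, for example). The function names and the trimmed sample output are my own illustration; only the two field names and values come from the KB article.

```python
def parse_cluster_get(output: str) -> dict:
    """Turn 'Key: Value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_witness_isolated(output: str) -> bool:
    """Apply the KB criterion: member count of 1 and state STANDALONE."""
    fields = parse_cluster_get(output)
    return (
        fields.get("Sub-Cluster Member Count") == "1"
        and fields.get("Local Node State") == "STANDALONE"
    )


# Trimmed, hypothetical sample: real output contains more fields than this.
sample = """\
   Sub-Cluster Member Count: 1
   Local Node State: STANDALONE
"""

print(is_witness_isolated(sample))
```

For the sample above this prints True. Any other member count or node state makes the check return False.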
After running all of those tests, and all of them passing, I was still scratching my head over why the witness was isolated and the cluster was showing 2 partitions.
It formed a cluster just fine… Pinging all objects worked on the correct VMK. Routes were all there.
Unicastagent showed all hosts including the witness.
So why is it still isolated? What am I missing here? It worked before the failover…
I even deployed a new witness on the same physical witness host, and it still did not work!
Something the KB article does not mention
All the tests in the KB article on vSAN Witness Node Isolation only cover ping (ICMP) and TCP connectivity, not UDP…
The vSAN Clustering Service uses UDP!
TCP and UDP ports for the VMware vSAN network:
| Port | Protocol | Source | Destination | Description |
|---|---|---|---|---|
| 12345 | UDP | ESXi hosts | ESXi hosts | vSAN Clustering Service |
| 23451 | UDP | ESXi hosts | vSAN Witness | vSAN Clustering Service |
| 12321 | UDP | vSAN Witness | ESXi hosts | vSAN Clustering Service |
| 2233 | TCP | ESXi hosts | ESXi hosts | vSAN Transport: used for storage IO |
| 2233 | TCP | ESXi hosts | vSAN Witness | vSAN Transport: used for storage IO |
| 2233 | TCP | vSAN Witness | ESXi hosts | vSAN Transport: used for storage IO |
| 8080 | TCP | ESXi hosts | ESXi hosts | vSAN Management Service |
| 8080 | TCP | vCenter | ESXi hosts | VMware vSphere Profile-Driven Storage Service and vSAN Management Service |
| 3260 | TCP | iSCSI initiator | ESXi hosts | Default iSCSI port for vSAN iSCSI target service |
| 5001 | UDP | ESXi hosts | ESXi hosts | Vsanhealth-multicasttest: vSAN Health Proactive Network test. This port is enabled on demand when Proactive Network Test is running. |
| 8010 | TCP | Web browser | vCenter | vSAN Observer default port number for live statistics. A custom port number can also be specified for vSAN Observer. |
| 80 | TCP | ESXi hosts | ESXi hosts | vSAN Performance Service |
After I deployed a new witness in a different network and on another host, it came up instantly. That pointed me back in the direction of a network issue!
The problem turned out to be that UDP was blocked by an ACL on the vSAN witness switch as part of network hardening. It had kept working before because the existing UDP session stayed open the whole time until the failover happened. After that, UDP was blocked, and hence the vSAN Clustering Service died.
So if you run into a similar issue with a vSAN Witness, check UDP traffic!
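Since ping says nothing about UDP, one way to check it is to send a datagram across the same path and see whether an answer comes back. The sketch below is only an illustration of that idea, not a VMware tool: the function names are made up, and it assumes you can run Python on a machine at each end of the path (test VMs in the witness and data networks, for example) and pick a spare UDP port that the ACL is supposed to treat like the vSAN ports. The real clustering ports are already in use on live hosts, so this does not bind to them.

```python
import socket
import threading


def udp_echo_once(sock: socket.socket) -> None:
    """Answer a single datagram on an already-bound socket, then return."""
    data, peer = sock.recvfrom(2048)
    sock.sendto(data, peer)


def udp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send one datagram and report whether the same payload is echoed back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(b"udp-probe", (host, port))
        try:
            reply, _ = s.recvfrom(2048)
        except (socket.timeout, ConnectionError):
            # No echo in time, or the OS reported the port as unreachable.
            return False
        return reply == b"udp-probe"


def loopback_demo() -> bool:
    """Run responder and probe on 127.0.0.1 to show the round trip."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    port = listener.getsockname()[1]
    worker = threading.Thread(target=udp_echo_once, args=(listener,))
    worker.start()
    try:
        return udp_probe("127.0.0.1", port)
    finally:
        worker.join(timeout=3)
        listener.close()


print(loopback_demo())
```

In a real test the responder would run on one side of the ACL and udp_probe on the other. A False result only tells you that no echo arrived: the request or the reply may have been dropped, so a packet capture on both ends is still the way to find out which direction is blocked.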