
A Deep Dive into Oracle RAC Split-Brain (Brain Split)


This post was last edited by maclean on 2011-10-12 00:30.

Before going into how split-brain (Brain Split) is handled, it is worth introducing the working framework of Oracle RAC CSS (Cluster Synchronization Services):

[Attachment: CSS working-framework document, uploaded 2011-10-10 19:15; the Q&A below refers to that document.]

Answers to readers' questions

Question 1:

Some of the statements in the document don't seem very solid to me.

For example:

"Therefore, the claim that having just 2 voting disks is enough to guarantee redundancy and that 3 or more voting disks are unnecessary is wrong. Oracle recommends that a cluster have at least 3 voting disks."  Actually the number should simply be odd, and even a single voting disk works.

"During the split-brain check phase, the Reconfig Manager finds the nodes that have a Disk Heartbeat but no Network Heartbeat, uses the Network Heartbeat (where possible) and the Disk Heartbeat information to count the nodes in every competing subcluster, and decides which subcluster should survive based on the following 2 factors:

1. The sub-cluster with the largest number of nodes

2. If the subclusters are of equal size, the sub-cluster with the lowest node number; for example, in a 2-node RAC environment node 1 always wins."   ??

Is there an official source for this statement, or is it just your own understanding?

If points 1 and 2 hold, then what is the voting disk even for? The winning node could already be determined without the voting disk.

Answer:

1. Of course a single voting disk also works, but note that the text says this is what Oracle recommends.

2. Without the votedisk, how would these sub-clusters communicate with each other?

As for "if the subclusters are of equal size, the sub-cluster with the lowest node number survives": this is described in Oracle's internal documentation, and it is also confirmed by the experiments in the document.

Question 2:

无奈的河马 has clearly put a lot of work into this, but there are at least 2 situations you may not have run into:

1. Split-brain at the disk level, which is quite common in Exadata environments.

2. What do you do when the node being evicted has no CPU time slice left to react to the eviction signal?

Answer:

The point of the article is to lay out some of the principles behind split-brain resolution; real-world multi-node RAC situations can be far more complicated.

Question:

无奈的河马, could you explain in more detail:

1. Why must the number of voting disks be odd? If 1 voting disk counts as one vote, do 3 voting disks count as three votes?

2. Under what circumstances does the voting disk actually take part in the vote?

I read that: 1. the sub-cluster with the largest number of nodes survives; 2. if the subclusters are of equal size, the sub-cluster with the lowest node number survives, so that in a 2-node RAC environment node 1 always wins.

If points 1 and 2 above hold, when does the voting disk ever take part in the vote? Could you give an example? Thanks.

Answer:

1. To ensure that the disk heartbeat and the reading of the "kill block" described above keep working at all times, CSS requires that at least (N/2 + 1) of the voting disks remain accessible to a node; this guarantees that any 2 nodes always share at least one voting disk that both of them can access.

It was never said that the number must be odd; an odd number is simply recommended.

As far as voting disks are concerned, a node must be able to access strictly more than half of the voting disks at any time. So if you want to be able to tolerate a failure of n voting disks, you must have at least 2n+1 configured. (n=1 means 3 voting disks). You can configure up to 32 voting disks, providing protection against 15 simultaneous disk failures.

Oracle recommends that customers use 3 or more voting disks in Oracle RAC 10g Release 2. Note: For best availability, the 3 voting files should be physically separate disks. It is recommended to use an odd number as 4 disks will not be any more highly available than 3 disks, 1/2 of 3 is 1.5...rounded to 2, 1/2 of 4 is 2, once we lose 2 disks, our cluster will fail with both 4 voting disks or 3 voting disks.

Why an odd number? For example, with 3 votedisks a node only needs to be able to access 2 of them to be OK, whereas with only 2 votedisks a node must be able to access every VD.

Is 4 OK? Of course, 4 also works.
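To make the arithmetic above concrete, here is a minimal Python sketch of the majority rule (a node must be able to access strictly more than half of the configured voting disks). It is purely illustrative and not Oracle code:

def voting_disk_quorum(total_disks):
    # Return (disks a node must be able to access, disk failures tolerated).
    required = total_disks // 2 + 1      # strict majority, i.e. floor(N/2) + 1
    tolerated = total_disks - required
    return required, tolerated

for n in (1, 2, 3, 4, 5):
    req, tol = voting_disk_quorum(n)
    print(f"{n} voting disk(s): must access {req}, tolerate {tol} failure(s)")

The output shows why 4 disks are no more available than 3 (both tolerate only 1 disk failure) and why, in general, 2n+1 disks are needed to tolerate n disk failures.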

2.

"在脑裂检查阶段Reconfig Manager会找出那些没有Network Heartbeat而有Disk Heartbeat的

节点,并通过Network Heartbeat(如果可能的话)和Disk Heartbeat的信息来计算所有竞争子集

群(subcluster)内的节点数目,并依据以下2种因素决定哪个子集群应当存活下去"

See also the logs from the experiment section of the document, for example:

"

[ CSSD]2011-04-23 17:13:18.337 [3032460176] >TRACE: clssnmCheckDskInfo:

node 1, vrh1, state 5

with leader 1 has smaller cluster size 1; my cluster size 2 with leader 2"

3. Scenario 2 in the experiment illustrates exactly this point:

"In the other scenario node 1 has not joined the cluster and node 2's network fails; because node 2 has the smaller member number, it initiates the eviction of node 3 through the voting disk"

"[ CSSD]2011-04-23 17:42:32.022 [3032460176] >TRACE: clssnmCheckDskInfo:

node 3, vrh3, state 5 with leader 3

has smaller cluster size 1; my cluster size 1 with leader 2"

If this is still not clear, read the following passage:

The RM (Reconfig Manager) sends a sync message to all participating nodes.  Participating nodes respond with a sync acknowledgement.  After this the vote phase begins and the master sends a vote message to all participating nodes.  Participating nodes respond with a vote info message containing their node identifier and GM peer to peer listening endpoint.  In the split-check phase, the RM uses the voting disk to verify there is no split-brain.  It finds nodes heartbeating to disk that are not connected via the network.  If it finds these, it will determine which nodes are talking to which and the largest subcluster survives.  For example, if we have a 5 node cluster and all of the nodes are heartbeating to the voting disk but only a group of 3 can communicate via the network and a group of 2 can communicate via the network, this means we have 2 subclusters.  The largest subcluster (3) would survive while the other subcluster (2) would not.  After this the evict phase would evict nodes previously in the cluster but not considered members in this incarnation.  In this case we would send a message to evicted nodes (if possible) and write an eviction notice to a 'kill' block in the voting file.  We would wait for the node to indicate it got the eviction notice (wait for seconds).  The wait is terminated by a message or status on the voting file indicating that the node got the eviction notice.  In the update phase the master sends an update message containing the definitive cluster membership and node information for all participating nodes.  The participating nodes send update acknowledgements.  All members queue the reconfiguration event to their GM.
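The decision rule applied in the split-check phase can be summarized in a short Python sketch: among the subclusters still heartbeating to the voting disk, the largest one survives, and a tie is broken by the lowest node number. The function name and the set-of-node-numbers representation are purely illustrative assumptions; this is not Oracle's actual CSSD code:

def pick_surviving_subcluster(subclusters):
    # subclusters: list of sets, each set holding the node numbers of one
    # network-connected group of disk-heartbeating nodes.
    # Factor 1: the group with the most nodes wins.
    # Factor 2: on a tie, the group containing the lowest node number wins.
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))

# 5-node cluster split into {1,2,3} and {4,5}: the larger group survives.
print(pick_surviving_subcluster([{1, 2, 3}, {4, 5}]))   # {1, 2, 3}

# 2-node cluster split into {1} and {2}: equal size, so node 1 survives.
print(pick_surviving_subcluster([{1}, {2}]))            # {1}

This matches both the 5-node example in the passage above and the 2-node tie-break ("node 1 always wins") discussed earlier in the thread.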
