11gR2 Grid Infrastructure Redundant Interconnect aka Cluster High Availability IP (HAIP)

Preamble

We recently bought HP enclosures and blade servers, so we used HP Virtual Connect (VC) to manage the blades’ network. We wanted to implement Oracle RAC for one of our critical databases (the SAP one). The questions started as soon as we asked ourselves which configuration we planned to set up. Should we put our 2 RAC nodes in the same enclosure? Even for something as simple as this, the certification of VC for the RAC interconnect was not obvious… Our HP contacts initially told us that it was not certified (and that it could take 3-6 more months to get Oracle approval for this configuration), which would have forced us to buy extra switches (which ones?). They then retracted, saying they had been wrong and that it is fully certified…

Nevertheless, the configuration we finally wanted to set up (for maximum redundancy) was an inter-enclosure one (using 2 c7000 enclosures) with one RAC node in each enclosure. I’m not talking here about extended RAC (long-distance RAC), as our enclosures are in the same data center, less than a cable length apart…

While working with my system teammate on the Network Interface Controller (NIC) bonding/teaming configuration, I found a new (11.2.0.2) Grid Infrastructure feature called Cluster High Availability IP (HAIP). This new feature triggered more questions than it answered about the open points I had…

Before getting into the Oracle RAC configuration I will, mainly as a reminder for myself, list some Linux commands that display the system configuration.

My testing has been done on Red Hat Enterprise Linux Server release 5.5 (Tikanga) and Oracle Enterprise Edition 11.2.0.2. The RAC cluster is made of 2 nodes.

NIC bonding/teaming

NIC bonding (or teaming) is the capability to bond together 2 or more NICs to create a special NIC called a channel bonding interface. The bonded NICs act as one NIC to provide redundancy and increased bandwidth.

In my configuration:

[root@server1 ~]# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   1000baseT/Full
                               10000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
                                10000baseT/Full
        Advertised auto-negotiation: No
        Speed: 2500Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
[root@server1 ~]# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   1000baseT/Full
                               10000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
                                10000baseT/Full
        Advertised auto-negotiation: No
        Speed: 2500Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes

Remark:
The displayed speed may look strange, but with VC you can allocate any percentage of a 10Gb NIC.
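
As a small illustration of the remark above, the reported speed can be turned into a share of the physical port. This is only a sketch: vc_share is a made-up helper name, and the 10000 Mb/s default simply reflects the 10Gb NIC mentioned in the remark.

```shell
#!/bin/sh
# Sketch: express a reported link speed as a percentage of the physical port.
# vc_share is a hypothetical helper; the 10000 Mb/s default is an assumption
# based on the 10Gb NIC described above.
vc_share() {
    # $1 = reported speed in Mb/s, $2 = physical port speed in Mb/s (optional)
    echo $(( $1 * 100 / ${2:-10000} ))
}

# Pull the number out of an ethtool-style "Speed:" line, then compute the share.
line="        Speed: 2500Mb/s"
speed=$(echo "$line" | sed -n 's/.*Speed: \([0-9][0-9]*\)Mb\/s.*/\1/p')
echo "eth0 gets $(vc_share "$speed")% of the 10Gb port"
```

With the 2500Mb/s reported by ethtool, this gives a quarter of the port.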

Then, in /etc/sysconfig/network-scripts, configure the 2 NICs with something like the following (note the MASTER and SLAVE keywords):

[root@server1 network-scripts]# cat ifcfg-eth0
# ServerEngines Corp. Emulex OneConnect 10Gb NIC (be3)
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
HWADDR=00:17:a4:77:fc:08
MASTER=bond0
SLAVE=yes
USERCTL=no
[root@server1 network-scripts]# cat ifcfg-eth1
# ServerEngines Corp. Emulex OneConnect 10Gb NIC (be3)
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
HWADDR=00:17:a4:77:fc:78
MASTER=bond0
SLAVE=yes
USERCTL=no
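
A quick way to double-check the two keywords is to grep for them. The helper below is a sketch only: is_bond_slave is a hypothetical name (it is not part of the Red Hat network scripts), and the demo runs against a temporary copy of the eth0 settings rather than the real file.

```shell
#!/bin/sh
# Sketch: check that an ifcfg file declares its NIC as a slave of a given bond.
# is_bond_slave is a hypothetical helper, not a tool shipped with the OS.
is_bond_slave() {
    # $1 = path to an ifcfg file, $2 = expected bond name
    grep -q "^MASTER=$2\$" "$1" && grep -q '^SLAVE=yes$' "$1"
}

# Demo on a temporary copy of the eth0 settings shown above.
tmp=$(mktemp)
printf 'DEVICE=eth0\nBOOTPROTO=static\nONBOOT=yes\nMASTER=bond0\nSLAVE=yes\nUSERCTL=no\n' > "$tmp"
if is_bond_slave "$tmp" bond0; then
    echo "eth0 is declared as a slave of bond0"
else
    echo "eth0 is NOT declared as a slave of bond0"
fi
rm -f "$tmp"
```

On a real server you would point it at /etc/sysconfig/network-scripts/ifcfg-eth0 and ifcfg-eth1.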

Ensure you have the required kernel modules loaded:

[root@server1 network-scripts]# modprobe --list | grep bonding
/lib/modules/2.6.18-274.3.1.el5/kernel/drivers/net/bonding/bonding.ko
[root@server1 network-scripts]# modprobe --list | grep mii
/lib/modules/2.6.18-274.3.1.el5/kernel/drivers/net/mii.ko
[root@server1 network-scripts]# modinfo bonding
filename:       /lib/modules/2.6.18-274.3.1.el5/kernel/drivers/net/bonding/bonding.ko
author:         Thomas Davis, tadavis@lbl.gov and many others
description:    Ethernet Channel Bonding Driver, v3.4.0-1
version:        3.4.0-1
license:        GPL
srcversion:     91E9D75EA26B18C60985A99
depends:        ipv6
vermagic:       2.6.18-274.3.1.el5 SMP mod_unload gcc-4.1
parm:           max_bonds:Max number of bonded devices (int)
parm:           num_grat_arp:Number of gratuitous ARP packets to send on failover event (int)
parm:           num_unsol_na:Number of unsolicited IPv6 Neighbor Advertisements packets to send on failover event (int)
parm:           miimon:Link check interval in milliseconds (int)
parm:           updelay:Delay before considering link up, in milliseconds (int)
parm:           downdelay:Delay before considering link down, in milliseconds (int)
parm:           use_carrier:Use netif_carrier_ok (vs MII ioctls) in miimon; 0 for off, 1 for on (default) (int)
parm:           mode:Mode of operation : 0 for balance-rr, 1 for active-backup, 2 for balance-xor, 3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, 6 for balance-alb (charp)
parm:           primary:Primary network device to use (charp)
parm:           primary_reselect:Reselect primary slave once it comes up; 0 for always (default), 1 for only if speed of primary is better, 2 for only on active slave failure (charp)
parm:           lacp_rate:LACPDU tx rate to request from 802.3ad partner (slow/fast) (charp)
parm:           xmit_hash_policy:XOR hashing method: 0 for layer 2 (default), 1 for layer 3+4 (charp)
parm:           arp_interval:arp interval in milliseconds (int)
parm:           arp_ip_target:arp targets in n.n.n.n form (array of charp)
parm:           arp_validate:validate src/dst of ARP probes: none (default), active, backup or all (charp)
parm:           fail_over_mac:For active-backup, do not set all slaves to the same MAC.  none (default), active or follow (charp)
parm:           resend_igmp:Number of IGMP membership reports to send on link failure (int)
parm:           debug:Print debug messages; 0 for off (default), 1 for on (int)
module_sig:     883f3504e58269a94abe3920d1168f1123b820a08079cf7c9add3058728a552c7ea955a11dfffe609f439e1665f33a82ae8fc2ed4e982cc22849c8ccd

The best practice is to define bonding parameters using BONDING_OPTS in the bonding interface’s own configuration file. In /etc/modprobe.conf you should normally see only the alias below (use modprobe bonding to load the module):

[root@server1 network-scripts]# cat /etc/modprobe.conf | grep bond
alias bond0 bonding

Then define the bonding NIC with something like:

[root@server1 network-scripts]# cat ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
IPADDR=10.75.1.222
NETWORK=10.75.1.0
NETMASK=255.255.255.0
BROADCAST=10.75.1.255
GATEWAY=10.75.1.254
BONDING_OPTS="mode=4 miimon=100"

The miimon parameter controls the MII link monitoring frequency in milliseconds (MII stands for Media Independent Interface).

mode, as described in the official Red Hat documentation, can take the following values:

Mode: Description
balance-rr or 0: Sets a round-robin policy for fault tolerance and load balancing. Transmissions are received and sent out sequentially on each bonded slave interface beginning with the first one available.
active-backup or 1: Sets an active-backup policy for fault tolerance. Transmissions are received and sent out via the first available bonded slave interface. Another bonded slave interface is only used if the active bonded slave interface fails.
balance-xor or 2: Sets an XOR (exclusive-or) policy for fault tolerance and load balancing. Using this method, the interface matches up the incoming request’s MAC address with the MAC address for one of the slave NICs. Once this link is established, transmissions are sent out sequentially beginning with the first available interface.
broadcast or 3: Sets a broadcast policy for fault tolerance. All transmissions are sent on all slave interfaces.
802.3ad or 4: Sets an IEEE 802.3ad dynamic link aggregation policy. Creates aggregation groups that share the same speed and duplex settings. Transmits and receives on all slaves in the active aggregator. Requires a switch that is 802.3ad compliant.
balance-tlb or 5: Sets a Transmit Load Balancing (TLB) policy for fault tolerance and load balancing. The outgoing traffic is distributed according to the current load on each slave interface. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed slave.
balance-alb or 6: Sets an Adaptive Load Balancing (ALB) policy for fault tolerance and load balancing. Includes transmit and receive load balancing for IPv4 traffic. Receive load balancing is achieved through ARP negotiation.
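
Since BONDING_OPTS accepts either the number or the name, a small lookup can help when reading someone else’s configuration. This is a sketch: bond_mode_name is a hypothetical helper, and the mapping is simply the one from the list above (it matches the mode line printed by modinfo bonding).

```shell
#!/bin/sh
# Sketch: translate a numeric bonding mode into its name.
# bond_mode_name is a hypothetical helper built from the mode list above.
bond_mode_name() {
    case "$1" in
        0) echo balance-rr ;;
        1) echo active-backup ;;
        2) echo balance-xor ;;
        3) echo broadcast ;;
        4) echo 802.3ad ;;
        5) echo balance-tlb ;;
        6) echo balance-alb ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Extract the mode from a BONDING_OPTS value and name it.
opts="mode=4 miimon=100"
mode=$(echo "$opts" | tr ' ' '\n' | sed -n 's/^mode=//p')
echo "mode $mode is $(bond_mode_name "$mode")"
```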

The default mode is 0. I would personally have chosen mode 6, which looks the most advanced as it offers failover plus load balancing on both transmit and receive… But there are multiple references on My Oracle Support (MOS) mentioning bugs with this mode, which apparently most people have chosen…

Once everything is configured, reload the parameters with ifup bond0 (the bonding NIC must be down before you reconfigure it) or with service network restart, then check the result:

[root@server1 ~]# ifconfig bond0
bond0     Link encap:Ethernet  HWaddr 00:17:A4:77:FC:08
          inet addr:10.75.1.222  Bcast:10.75.1.255  Mask:255.255.255.0
          inet6 addr: fe80::217:a4ff:fe77:fc08/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:885011418 errors:1445 dropped:0 overruns:0 frame:1441
          TX packets:2700910551 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:209023586594 (194.6 GiB)  TX bytes:2782566207400 (2.5 TiB)
[root@server1 ~]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.4.0-1 (October 7, 2008)
 
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
 
802.3ad info
LACP rate: slow
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
        Actor Key: 9
        Partner Key: 1
        Partner Mac Address: 00:00:00:00:00:00
 
Slave Interface: eth0
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:fc:08
Aggregator ID: 1
 
Slave Interface: eth1
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:fc:0a
Aggregator ID: 2
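
In 802.3ad mode only the slaves that belong to the active aggregator carry traffic. In the output above the active aggregator has a single port, eth1 sits in a different aggregator and the partner MAC address is all zeros, which usually suggests that no LACP partner answered on the switch side. The sketch below automates that check; slaves_outside_active_aggregator is a hypothetical helper name and the demo input is an excerpt of the output above.

```shell
#!/bin/sh
# Sketch: list slaves that are not part of the active 802.3ad aggregator.
# slaves_outside_active_aggregator is a hypothetical helper name.
# Input (stdin): the text of /proc/net/bonding/<bond>.
slaves_outside_active_aggregator() {
    awk '
        /^Slave Interface:/ { slave = $3; next }
        /Aggregator ID:/ {
            if (slave == "") active = $3        # bond-level section: active aggregator
            else if ($3 != active) print slave  # per-slave section: compare IDs
        }
    '
}

# Demo on an excerpt of the output shown above.
slaves_outside_active_aggregator <<'EOF'
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
Slave Interface: eth0
MII Status: up
Aggregator ID: 1
Slave Interface: eth1
MII Status: up
Aggregator ID: 2
EOF
```

On a live system the input would be redirected from /proc/net/bonding/bond0; an empty result means all slaves share the active aggregator.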

HAIP

So why not use NIC bonding for the RAC interconnect and avoid HAIP altogether? Mainly because it is sometimes not possible to implement it (switch limitations?) or because the only authorized NIC bonding mode, failover, does not provide the expected result (according to my system teammate, in case of suspicious errors packets are reported in error only after a timeout). When looking at my 11.2.0.2 RAC cluster I also saw that the private network was not using what I initially expected…

To check whether HAIP is running (the undocumented -init option lists the core Clusterware processes, i.e. CRS, CSS, …):

[root@server1 ~]# export GRID_HOME=/oracleMWQ/GRID/112_64
[root@server1 ~]# $GRID_HOME/bin/crsctl stat res -init -w "TYPE = ora.haip.type"
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
TARGET=ONLINE
STATE=ONLINE on server1
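
If you want to script this check, the NAME=/STATE= pairs are easy to parse. The following is a sketch only: haip_state is a hypothetical helper that reads crsctl output on stdin, and the demo feeds it the exact output shown above instead of calling crsctl.

```shell
#!/bin/sh
# Sketch: extract the HAIP resource state from crsctl output read on stdin.
# haip_state is a hypothetical helper name.
haip_state() {
    awk -F= '
        $1 == "NAME"  { in_haip = ($2 == "ora.cluster_interconnect.haip") }
        in_haip && $1 == "STATE" { print $2 }
    '
}

# Demo on the output shown above.
haip_state <<'EOF'
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
TARGET=ONLINE
STATE=ONLINE on server1
EOF
```

On a cluster node the input would come from piping the crsctl stat res command above into the function.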

To display its configuration (two NICs have been specified for the cluster interconnect network):

[root@server1 ~]# $GRID_HOME/bin/oifcfg getif
eth0  10.10.10.0  global  cluster_interconnect
eth1  10.10.10.0  global  cluster_interconnect
bond0  10.75.1.0  global  public
[root@server1 ~]# $GRID_HOME/bin/oifcfg iflist -p -n
eth0  10.10.10.0  PRIVATE  255.255.255.0
eth0  169.254.0.0  UNKNOWN  255.255.128.0
eth1  10.10.10.0  PRIVATE  255.255.255.0
eth1  169.254.128.0  UNKNOWN  255.255.128.0
bond0  10.75.1.0  PRIVATE  255.255.255.0

This is where the new HAIP behavior shows up: note the 169.254.x.x IP addresses, which can also be seen at OS level:

[root@server1 ~]# ifconfig
.
.
eth0      Link encap:Ethernet  HWaddr B4:99:BA:A7:07:2A
          inet addr:10.10.10.11  Bcast:10.10.10.255  Mask:255.255.255.0
          inet6 addr: fe80::b699:baff:fea7:72a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:317590783 errors:0 dropped:0 overruns:0 frame:0
          TX packets:102992935 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:87430872027 (81.4 GiB)  TX bytes:51030279760 (47.5 GiB)
 
eth0:1    Link encap:Ethernet  HWaddr B4:99:BA:A7:07:2A
          inet addr:169.254.5.247  Bcast:169.254.127.255  Mask:255.255.128.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
.
.
[root@server1 ~]# ip addr list eth0
10: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether b4:99:ba:a7:07:2a brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.11/24 brd 10.10.10.255 scope global eth0
    inet 169.254.5.247/17 brd 169.254.127.255 scope global eth0:1
    inet6 fe80::b699:baff:fea7:72a/64 scope link
       valid_lft forever preferred_lft forever
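
With two interconnect NICs, the outputs above show the 169.254.0.0/16 range split into two /17 subnets (netmask 255.255.128.0), one per NIC. The sketch below works out which half a given address falls into; haip_subnet is a hypothetical helper, and the second address in the demo is purely illustrative.

```shell
#!/bin/sh
# Sketch: map a 169.254.x.x address to its /17 subnet (netmask 255.255.128.0).
# haip_subnet is a hypothetical helper name.
haip_subnet() {
    echo "$1" | awk -F. '
        $1 == 169 && $2 == 254 {
            # A /17 mask keeps only the top bit of the third octet.
            printf "169.254.%d.0\n", ($3 >= 128 ? 128 : 0)
            ok = 1
        }
        END { exit (ok ? 0 : 1) }
    '
}

haip_subnet 169.254.5.247   # address seen on eth0:1 above
haip_subnet 169.254.200.1   # illustrative address in the upper half
```

The first call lands in the subnet reported for eth0 by oifcfg iflist and the second in the one reported for eth1; any address outside 169.254.x.x makes the function return a non-zero status.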

These addresses can also be seen at ASM and RAC database level:

SQL> SET lines 200
SQL> col host_name FOR a15
SQL> col INSTANCE_NAME FOR a15
SQL> SELECT a.host_name, a.instance_name, b.name, b.ip_address
     FROM gv$cluster_interconnects b, gv$instance a
     WHERE a.inst_id=b.inst_id
     ORDER BY 1,2,3;
 
HOST_NAME       INSTANCE_NAME   NAME                                          IP_ADDRESS
--------------- --------------- --------------------------------------------- ------------------------------------------------
server1         +ASM1           eth0:1                                        169.254.5.247
server1         +ASM1           eth1:1                                        169.254.192.128
server2         +ASM2           eth0:1                                        169.254.90.252
server2         +ASM2           eth1:1                                        169.254.158.153

So, as specified in the documentation by the way, the IPs used for the cluster interconnect are not the private ones specified when installing the product (globally non-routable) but the ones automatically allocated by HAIP in the 169.254.x.x range.

6 thoughts on “11gR2 Grid Infrastructure Redundant Interconnect aka Cluster High Availability IP (HAIP)”

  1. Thanks for the detailed explanation Yannick and a quick question – do you have a reference to show that Virtual Connect is Oracle certified? We’re building a similar infrastructure so it would save me some time!

    • Thanks for your comment!

      Please refer to the HP note:
      Deploying an Oracle 11gR2 Real Application Cluster (RAC) database with HP Virtual Connect FlexFabric (4AA4-0227ENW.pdf). I can email it if you really can’t find it.

      The HP note relies on these Oracle notes:
      Certification Information for Oracle Database on Linux x86 [ID 1307056.1]
      Certification of Oracle Database 11.2 with Hewlett-Packard ProLiant running FCoE and Virtual Connect technologies [ID 1405274.1]

      Funny Gravatar btw, where did you get it?

      Yannick.

      • Great, thanks for the quick reply. Interesting to note that a RAC is not supported when it spans multiple chassis (at least in that support note). Have you used a single chassis?

        My gravatar was done by a sketch artist at VMworld one year – get yourself to some conferences! Keep up the blog, some good information.

        • Same as for you (I suppose), building a RAC cluster using a single chassis was NOT an option. What is not supported is using only Virtual Connect to link the two chassis; using an additional switch (Cisco 5548UP in our case, if I remember well) was accepted and certified by HP/Oracle after a long, painful discussion… So, in the meantime, we started in one chassis and then moved to two chassis using this switch for the interconnect…

  2. saminadane says:

    After bonding, I try to bring eth0 down, but in my case I get “request timed out”.

    My network goes down and I can’t connect through PuTTY. Please help me solve this issue.

    Thanks and regards.

    Here is my configuration:

    # less /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    MASTER=bond0
    USERCTL=no
    SLAVE=yes
    BOOTPROTO=none
    TYPE=Ethernet
    ONBOOT=yes

    # less /etc/sysconfig/network-scripts/ifcfg-eth1
    DEVICE=eth1
    TYPE=Ethernet
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes
    USERCTL=no

    [root@bond0 network-scripts]# cat ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=none
    ONBOOT=yes
    USERCTL=no
    IPADDR=192.168.2.100
    NETWORK=192.168.2.0
    NETMASK=255.255.255.0
    BROADCAST=192.168.2.255
    GATEWAY=192.168.2.2
    BONDING_OPTS="mode=1 miimon=100"

    • Hello,

      I ran a small test again on Oracle Enterprise Linux 7.1 under VirtualBox (Host Only Adapter):

      [root@server3 network-scripts]# cat ifcfg-bond01
      DEVICE=bond01
      TYPE=Bond
      BONDING_MASTER=yes
      BOOTPROTO=none
      DEFROUTE=no
      IPV4_FAILURE_FATAL=no
      IPV6INIT=no
      NAME=bond01
      UUID=8dc16315-6625-4f4b-863a-19aeb7ed3761
      ONBOOT=yes
      BONDING_OPTS="updelay=0 resend_igmp=1 use_carrier=1 miimon=100 downdelay=0 xmit_hash_policy=0 primary_reselect=0 fail_over_mac=0 arp_validate=0 mode=balance-rr lacp_rate=0 arp_interval=0 ad_select=0"
      IPADDR=192.168.56.104
      PREFIX=24
      [root@server3 network-scripts]# cat ifcfg-enp0s9
      TYPE=Ethernet
      NAME=enp0s9
      UUID=7879ca45-ab7e-41cc-a4c5-30f1da286022
      DEVICE=enp0s9
      ONBOOT=yes
      MASTER=8dc16315-6625-4f4b-863a-19aeb7ed3761
      SLAVE=yes
      [root@server3 network-scripts]# cat ifcfg-enp0s10
      TYPE=Ethernet
      NAME=enp0s10
      UUID=65b647a2-98a0-4bea-a059-2b9e6daf6459
      DEVICE=enp0s10
      ONBOOT=yes
      MASTER=8dc16315-6625-4f4b-863a-19aeb7ed3761
      SLAVE=yes

      If I deactivate a slave with something like:

      ip link set dev enp0s10 down

      It then continues to work until I deactivate the second slave NIC, so I would say this is the expected behavior.

      For your problem, have a close look at the network routes and make sure the bonding module is loaded into your kernel…

      Thanks, Yannick.
