1). The InfiniBand network connects the database servers
and Exadata Storage Servers through the InfiniBand switches in the rack. It is a
private network between the database servers and the Exadata Storage Servers.
2). An Exadata rack contains at least two InfiniBand switches, called leaf switches.
Half and full rack database machines include a third switch, called the spine switch,
which connects to both leaf switches.
3). The spine switch is also used to connect multiple racks together to
form a single, larger database machine environment.
4). Each server (storage server and database server) contains two
InfiniBand ports, which are bonded together in active/passive mode (up to X3) or
active/active mode (X4).
5). The active and passive connections are spread across both
leaf switches using a fat-tree switched fabric network topology.
6). InfiniBand switches run CentOS.
7). MONITOR SWITCH HARDWARE: To check for failed switch
hardware and for sensors that exceed preset thresholds, run
these commands every 1-2 minutes.
Log in to the switch as root and run:
# showunhealthy
OK - No unhealthy sensors
# checkpower
PSU 0 present OK
PSU 1 present OK
All PSUs OK
7.1). If either of the above commands reports a problem,
use the env_test command to get more detail.
Log in to the IB switch as root and run:
# env_test
Environment test started:
Starting Environment Daemon test:
Environment daemon running
Environment Daemon test returned OK
Starting Voltage test:
Voltage ECB OK
Measured 3.3V Main = 3.27 V
Measured 3.3V Standby = 3.39 V
Measured 12V = 11.97 V
Measured 5V = 4.99 V
Measured VBAT = 3.09 V
Measured 2.5V = 2.49 V
Measured 1.8V = 1.78 V
Measured I4 1.2V = 1.22 V
Voltage test returned OK
Starting PSU test:
PSU 0 present OK
PSU 1 present OK
PSU test returned OK
Starting Temperature test:
Back temperature 29
Front temperature 30
SP temperature 48
Switch temperature 43, maxtemperature 45
Temperature test returned OK
Starting FAN test:
Fan 0 not present
Fan 1 running at rpm 12099
Fan 2 running at rpm 11881
Fan 3 running at rpm 12208
Fan 4 not present
FAN test returned OK
Starting Connector test:
Connector test returned OK
Starting Onboard ibdevice test:
Switch OK
All Internal ibdevices OK
Onboard ibdevice test returned OK
Starting SSD test:
SSD test returned OK
Environment test PASSED
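A periodic check can scan the env_test output instead of reading it by eye. The following is a minimal sketch and not part of the switch tooling: env_test_failures is a hypothetical helper name, and it assumes that every sub-test ends with a line of the form "<name> test returned <status>" and that a healthy run ends with "Environment test PASSED", as in the sample output above. What a failing sub-test prints is an assumption here; any status other than OK is treated as a failure.

```shell
# Sketch: read env_test output on stdin and print the name of every
# sub-test whose result line does not end in "OK".
# env_test_failures is a hypothetical helper, not a switch command.
# Exit status: 0 if all sub-tests returned OK and the final
# "Environment test PASSED" line was seen, 1 otherwise.
env_test_failures() {
    awk '
        / test returned / && $NF != "OK" {
            name = $0
            sub(/^[ \t]+/, "", name)            # drop leading whitespace
            sub(/ test returned .*/, "", name)  # keep only the test name
            print name
            failed = 1
        }
        /Environment test PASSED/ { passed = 1 }
        END { exit ((failed || !passed) ? 1 : 0) }
    '
}
```

On the switch this could be used as: env_test | env_test_failures || echo "env_test reported a problem"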
8). MONITOR IB SWITCH PORTS: Use ibqueryerrors.pl on any
database node or switch. Storage servers do not need to be checked separately,
because the Exadata cell software (part of MS) checks them automatically.
Log in as root to a database node or IB switch and run:
# ibqueryerrors.pl -s RcvSwRelayErrors,RcvRemotePhysErrors,XmtDiscards,XmtContraintErrors,RcvContraintErrors,ExcBufOverrunErrors,Vl15Dropped
The -s option suppresses the listed counters in the report.
Run this every 1 or 2 minutes to see whether any value is increasing.
Check in particular for SymbolErrors, RcvErrors and LinkIntegrityErrors.
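Repeating a command every minute or two and spotting a change can be scripted. The sketch below is one possible approach under stated assumptions: watch_for_change is a hypothetical helper name, it does not parse the report at all, and it simply flags any round in which the command output differs from the previous round.

```shell
# Sketch: run a command repeatedly and report when its output changes.
# watch_for_change is a hypothetical helper, not part of the IB tooling.
# Usage: watch_for_change <interval-seconds> <rounds> <command> [args...]
watch_for_change() {
    interval="$1"
    rounds="$2"
    shift 2
    prev="$("$@" 2>&1)"          # round 1 is the baseline
    round=1
    while [ "$round" -lt "$rounds" ]; do
        sleep "$interval"
        round=$((round + 1))
        cur="$("$@" 2>&1)"
        if [ "$cur" != "$prev" ]; then
            printf 'CHANGED in round %s\n' "$round"
            printf '%s\n' "$cur"
        fi
        prev="$cur"
    done
}
```

For example, watch_for_change 120 30 ibqueryerrors.pl -s RcvSwRelayErrors,XmtDiscards would take 30 samples two minutes apart. A changed report still has to be read by a person to decide whether an error counter is really climbing.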
9). To check the InfiniBand firmware version:
Log in to the InfiniBand switch as root and run:
# version | head -1 | cut -d" " -f5
10). Monitor Database Node IB Ports:
Log in to the database server as root and run:
ibstatus => Check that every port shows up in the
output (2 per node).
Sample output:
Infiniband device 'mlx4_0' port 1
status:
default gid: fe80:0000:0000:0000:0021:2800:01ce:d28b
base lid: 0x26
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: IB
Infiniband device 'mlx4_0' port 2
status:
default gid: fe80:0000:0000:0000:0021:2800:01ce:d28c
base lid: 0x27
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: IB
perfquery => Check for SymbolErrors, RcvErrors and LinkIntegrityErrors.
ifconfig => Check that bondib0 (ib0 and ib1) is up.
ping => Check for connectivity.
rds-ping => Check for connectivity.
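The ibstatus check can be automated by counting the ports that report both ACTIVE and LinkUp. This is a minimal sketch: ib_ports_check is a hypothetical helper name, and the parsing assumes the field labels shown in the sample output above ("Infiniband device", "state:", "phys state:").

```shell
# Sketch: read ibstatus output on stdin, print one summary line per port,
# and succeed only if the expected number of ports is ACTIVE and LinkUp.
# ib_ports_check is a hypothetical helper; expected count defaults to 2.
ib_ports_check() {
    awk -v expected="${1:-2}" '
        /Infiniband device/ { dev = $3; port = $5; state = ""; phys = "" }
        /^[ \t]*state:/     { state = $NF }
        /^[ \t]*phys state:/ {
            phys = $NF
            printf "%s port %s: %s/%s\n", dev, port, state, phys
            if (state == "ACTIVE" && phys == "LinkUp") good++
        }
        END { exit ((good + 0 == expected + 0) ? 0 : 1) }
    '
}
```

On a database node this could be used as: ibstatus | ib_ports_check 2 || echo "an IB port is missing or down"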
11). Monitor the InfiniBand fabric. These commands can be run from either
a database node or one of the InfiniBand switches.
To locate the running Subnet Manager (SM),
log in as root on a DB node or IB switch and run:
# sminfo
sminfo: sm lid 3 sm guid 0x2128469156a0a0, activity count 55495849 priority 14 state 3 SMINFO_MASTER
and then
# ibswitches
Switch : 0x002128469156a0a0 ports 36 "SUN DCS 36P QDR aeldb3sw-ibs0 10.146.28.50" enhanced port 0 lid 3 lmc 0
Switch : 0x00212846914ba0a0 ports 36 "SUN DCS 36P QDR aeldb3sw-ibb0 10.146.28.52" enhanced port 0 lid 2 lmc 0
Switch : 0x002128469157a0a0 ports 36 "SUN DCS 36P QDR aeldb3sw-iba0 10.146.28.51" enhanced port 0 lid 1 lmc 0
In the ibswitches output, 0x002128469156a0a0 is the switch where the SM is running
(compare its GUID with the sm guid reported by sminfo).
Alternatively, log in to one of the IB switches and simply run:
# getmaster
Local SM not enabled
20140131 09:55:06 Master SubnetManager on sm lid 3 sm guid 0x2128469156a0a0 : SUN DCS 36P QDR aeldb3sw-ibs0 10.146.28.50
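The GUID comparison between sminfo and ibswitches can be scripted. Note that sminfo prints the GUID without leading zeros (0x2128469156a0a0) while ibswitches pads it (0x002128469156a0a0), so a plain string match fails. The sketch below strips the leading zeros before matching. find_sm_switch is a hypothetical helper name, and it assumes ibswitches prints one switch per line.

```shell
# Sketch: given sminfo output ($1) and ibswitches output ($2), print the
# ibswitches line of the switch that hosts the master Subnet Manager.
# find_sm_switch is a hypothetical helper, not part of the IB tooling.
find_sm_switch() {
    # Extract the SM GUID without the 0x prefix and without leading zeros.
    sm_guid=$(printf '%s\n' "$1" |
        sed -n 's/.*sm guid 0x0*\([0-9a-fA-F]*\).*/\1/p')
    [ -n "$sm_guid" ] || return 1
    # Match the same GUID in ibswitches, allowing zero padding after 0x.
    printf '%s\n' "$2" | grep -i -E "0x0*${sm_guid}([^0-9a-f]|\$)"
}
```

On a database node or switch this could be used as: find_sm_switch "$(sminfo)" "$(ibswitches)"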
12). On a full or half rack, a spine switch is present,
and that is where the SM should be running.
To identify the spine switch, run:
# ibnetdiscover -p | awk '/^SW +[0-9]+ +[0-9]+ +0x[0-9a-f]+ +[0-9]+x .DR - (SW|CA)/ {if (spine[$4] == "") spine[$4] = "yes"; if ($8 == "CA") spine[$4] = "no"} END {for (val in spine) if (spine[val] == "yes") print val}'
This prints the GUID of every switch that has no host (CA) ports attached
and links only to other switches, which is the spine switch.
13). InfiniBand cables are not as robust as Ethernet (RJ45)
cables. InfiniBand copper cables have strict specifications
that define the minimum bend radius they can tolerate.