Hi,Marcin
I have a Lustre over InfiniBand setup consisting of an MGS, an MDS, and two OSSs; each OSS has two OSTs.
Each server has two IPoIB interfaces which provide multipath redundancy to the SAN block devices.
I'm using the crm configuration generated by the make-lustre-crm-config.py script
available at https://github.com/gc3-uzh-ch/schroedinger-lustre-ha
After some changes (hostnames, IPs, and the fact that in my setup I have two IPoIB interfaces
instead of just one), the script creates the attached crm.txt.
I'm familiar with https://ourobengr.com/ha/ , which says:
"If a stop (umount of the Lustre filesystem in this case) fails,
the node will be fenced/STONITHd because this is the only safe thing to do".
I have working STONITH, with corosync communicating over the eth0 interface.
Let's take the example of server-02, which mounts Lustre's mdt.
server-02 is powered off if I disable the eth0 interface on it,
and the mdt moves to server-01 as expected.
However, if both IPoIB interfaces go down on server-02 instead,
the mdt is moved to server-01, but no STONITH is performed on server-02.
This is expected, because nothing in the configuration covers that case;
only a Filesystem mount/umount failure triggers STONITH:
rsc_template lustre-target-template ocf:heartbeat:Filesystem \
    op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
    op start interval=0 timeout=300 on-fail=fence \
    op stop interval=0 timeout=300 on-fail=fence
How can I make umount/mount of Lustre mgt/mdt/ost fail in order to test STONITH action in these cases?
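
For reference, the two interface failures described above can be simulated roughly as in the sketch below. The IPoIB interface names ib0 and ib1 are assumptions (only eth0 is named above), and the helper functions are hypothetical, written just for this illustration. By default the script only prints the commands; setting DRY_RUN=0 and running it as root on the node under test would actually take the links down.

```shell
#!/bin/sh
# Sketch only: simulate the two interface failures on the node under test.
# ASSUMPTION: the IPoIB interfaces are called ib0 and ib1; adjust as needed.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Take each named interface down, or print the command in dry-run mode.
down() {
    for ifc in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ip link set dev $ifc down"
        else
            ip link set dev "$ifc" down
        fi
    done
}

# corosync: lose the cluster communication link (STONITH is expected).
# ipoib:    lose both IPoIB links (the resource moves, no STONITH).
scenario() {
    case "$1" in
        corosync) down eth0 ;;
        ipoib)    down ib0 ib1 ;;
        *) echo "usage: scenario corosync|ipoib" >&2; return 1 ;;
    esac
}
```

For example, calling "scenario ipoib" on server-02 corresponds to the second test, where the mdt moves but the node is not fenced.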