crm.txt attached
On Wed, Aug 19, 2015 at 12:22 PM, Marcin Dulak <marcin.dulak(a)gmail.com>
wrote:
Hi,
I have a Lustre over InfiniBand setup consisting of an mgs, an mds, and two oss;
each oss has two osts.
Each server has two IPoIB interfaces which provide multipath redundancy to
the SAN block devices.
I'm using the crm configuration generated by the make-lustre-crm-config.py
script
available at
https://github.com/gc3-uzh-ch/schroedinger-lustre-ha
After some changes (hostnames, IPs, and the fact that in my setup I have
two IPoIB interfaces
instead of just one), the script creates the attached crm.txt.
I'm familiar with
https://ourobengr.com/ha/ , which says:
"If a stop (umount of the Lustre filesystem in this case) fails,
the node will be fenced/STONITHd because this is the only safe thing to
do".
I have a working STONITH, with corosync communicating over the eth0 interface.
Let's take the example of server-02, which mounts Lustre's mdt.
server-02 is powered off if I disable the eth0 interface on it,
and the mdt moves to server-01 as expected.
However, if both IPoIB interfaces go down on server-02 instead,
the mdt is moved to server-01, but no STONITH is performed on server-02.
This is expected, because nothing in the configuration covers that case;
only a Filesystem mount/umount failure triggers STONITH:
rsc_template lustre-target-template ocf:heartbeat:Filesystem \
    op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
    op start interval=0 timeout=300 on-fail=fence \
    op stop interval=0 timeout=300 on-fail=fence
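For illustration, a target that inherits these operations from the template
could be declared roughly as follows. This is only a sketch: the resource
name, device path, and mount point are hypothetical and are not taken from
the attached crm.txt.
# hypothetical resource; device and directory are placeholders
primitive lustre-mdt @lustre-target-template \
    params device="/dev/mapper/mdt" directory="/lustre/mdt" fstype="lustre"
A resource declared this way picks up on-fail=fence for start and stop from
the template, so a failed mount or umount of that target is what would lead
to the node being fenced.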
How can I make umount/mount of Lustre mgt/mdt/ost fail in order to test
STONITH action in these cases?
Marcin