On 2013/12/17 9:37 AM, "Sten Wolf" <sten(a)checkpalm.com> wrote:
This is my situation:
I have two nodes, MDS1 and MDS2 (10.0.0.22 and 10.0.0.23), that I wish to use
as a failover MGS and an active/active MDT pair with ZFS.
I have a JBOD shelf with 12 disks, seen by both nodes as DAS (the shelf
has 2 SAS ports, one connected to a SAS HBA on each node), and I am using
Lustre 2.4 on CentOS 6.4 x64.
If you are using ZFS + DNE (multiple MDTs), I'd strongly recommend using
Lustre 2.5 instead of 2.4. There were quite a few fixes in 2.5 for both of
those features (which are both new in 2.4). Also, Lustre 2.5 is the new
long-term maintenance stream, so there will be regular updates for that
version.
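(A quick way to confirm which version is actually installed and loaded,
assuming an RPM-based install and that the Lustre modules are loaded, is:
# rpm -qa | grep lustre
# lctl get_param version
where the second command reports the version of the running modules.)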
I have to admit that the combination of those two features has been tested
less than either ZFS + 1 MDT or ldiskfs + 2+ MDTs separately. There are
also a couple of known performance issues with the interaction of these
features that are not yet fixed.
I do expect this combination to work, but there will likely be some issues
that haven't been seen before.
Cheers, Andreas
I have created 3 ZFS pools:
1. mgs:
# zpool create -f -o ashift=12 -O canmount=off lustre-mgs mirror
/dev/disk/by-id/wwn-0x50000c0f012306fc
/dev/disk/by-id/wwn-0x50000c0f01233aec
# mkfs.lustre --mgs --servicenode=mds1@tcp0 --servicenode=mds2@tcp0
--param sys.timeout=5000 --backfstype=zfs lustre-mgs/mgs
Permanent disk data:
Target: MGS
Index: unassigned
Lustre FS:
Mount type: zfs
Flags: 0x1064
(MGS first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.22@tcp failover.node=10.0.0.23@tcp
sys.timeout=5000
2. mdt0:
# zpool create -f -o ashift=12 -O canmount=off lustre-mdt0 mirror
/dev/disk/by-id/wwn-0x50000c0f01d07a34
/dev/disk/by-id/wwn-0x50000c0f01d110c8
# mkfs.lustre --mdt --fsname=fs0 --servicenode=mds1@tcp0
--servicenode=mds2@tcp0 --param sys.timeout=5000 --backfstype=zfs
--mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 lustre-mdt0/mdt0
warning: lustre-mdt0/mdt0: for Lustre 2.4 and later, the target index
must be specified with --index
Permanent disk data:
Target: fs0:MDT0000
Index: 0
Lustre FS: fs0
Mount type: zfs
Flags: 0x1061
(MDT first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.22@tcp failover.node=10.0.0.23@tcp
sys.timeout=5000 mgsnode=10.0.0.22@tcp mgsnode=10.0.0.23@tcp
checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa lustre-mdt0/mdt0
Writing lustre-mdt0/mdt0 properties
lustre:version=1
lustre:flags=4193
lustre:index=0
lustre:fsname=fs0
lustre:svname=fs0:MDT0000
lustre:failover.node=10.0.0.22@tcp
lustre:failover.node=10.0.0.23@tcp
lustre:sys.timeout=5000
lustre:mgsnode=10.0.0.22@tcp
lustre:mgsnode=10.0.0.23@tcp
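Since mkfs.lustre warned that the target index must be given explicitly for
Lustre 2.4 and later, the same command with --index=0 added (mirroring the
--index=1 used for mdt1 below) would avoid the warning:
# mkfs.lustre --mdt --fsname=fs0 --servicenode=mds1@tcp0 \
    --servicenode=mds2@tcp0 --param sys.timeout=5000 --backfstype=zfs \
    --index=0 --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 lustre-mdt0/mdt0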
3. mdt1:
# zpool create -f -o ashift=12 -O canmount=off lustre-mdt1 mirror
/dev/disk/by-id/wwn-0x50000c0f01d113e0
/dev/disk/by-id/wwn-0x50000c0f01d116fc
# mkfs.lustre --mdt --fsname=fs0 --servicenode=mds2@tcp0
--servicenode=mds1@tcp0 --param sys.timeout=5000 --backfstype=zfs
--index=1 --mgsnode=mds1@tcp0 --mgsnode=mds2@tcp0 lustre-mdt1/mdt1
Permanent disk data:
Target: fs0:MDT0001
Index: 1
Lustre FS: fs0
Mount type: zfs
Flags: 0x1061
(MDT first_time update no_primnode )
Persistent mount opts:
Parameters: failover.node=10.0.0.23@tcp failover.node=10.0.0.22@tcp
sys.timeout=5000 mgsnode=10.0.0.22@tcp mgsnode=10.0.0.23@tcp
checking for existing Lustre data: not found
mkfs_cmd = zfs create -o canmount=off -o xattr=sa lustre-mdt1/mdt1
Writing lustre-mdt1/mdt1 properties
lustre:version=1
lustre:flags=4193
lustre:index=1
lustre:fsname=fs0
lustre:svname=fs0:MDT0001
lustre:failover.node=10.0.0.23@tcp
lustre:failover.node=10.0.0.22@tcp
lustre:sys.timeout=5000
lustre:mgsnode=10.0.0.22@tcp
lustre:mgsnode=10.0.0.23@tcp
A few basic sanity checks:
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
lustre-mdt0        824K  3.57T   136K  /lustre-mdt0
lustre-mdt0/mdt0   136K  3.57T   136K  /lustre-mdt0/mdt0
lustre-mdt1        716K  3.57T   136K  /lustre-mdt1
lustre-mdt1/mdt1   136K  3.57T   136K  /lustre-mdt1/mdt1
lustre-mgs        4.78M  3.57T   136K  /lustre-mgs
lustre-mgs/mgs    4.18M  3.57T  4.18M  /lustre-mgs/mgs
# zpool list
NAME          SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
lustre-mdt0  3.62T  1.00M  3.62T   0%  1.00x  ONLINE  -
lustre-mdt1  3.62T   800K  3.62T   0%  1.00x  ONLINE  -
lustre-mgs   3.62T  4.86M  3.62T   0%  1.00x  ONLINE  -
# zpool status
  pool: lustre-mdt0
 state: ONLINE
  scan: none requested
config:
        NAME                        STATE     READ WRITE CKSUM
        lustre-mdt0                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000c0f01d07a34  ONLINE       0     0     0
            wwn-0x50000c0f01d110c8  ONLINE       0     0     0
errors: No known data errors
  pool: lustre-mdt1
 state: ONLINE
  scan: none requested
config:
        NAME                        STATE     READ WRITE CKSUM
        lustre-mdt1                 ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000c0f01d113e0  ONLINE       0     0     0
            wwn-0x50000c0f01d116fc  ONLINE       0     0     0
errors: No known data errors
  pool: lustre-mgs
 state: ONLINE
  scan: none requested
config:
        NAME                        STATE     READ WRITE CKSUM
        lustre-mgs                  ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x50000c0f012306fc  ONLINE       0     0     0
            wwn-0x50000c0f01233aec  ONLINE       0     0     0
errors: No known data errors
# zfs get lustre:svname lustre-mgs/mgs
NAME PROPERTY VALUE SOURCE
lustre-mgs/mgs lustre:svname MGS local
# zfs get lustre:svname lustre-mdt0/mdt0
NAME PROPERTY VALUE SOURCE
lustre-mdt0/mdt0 lustre:svname fs0:MDT0000 local
# zfs get lustre:svname lustre-mdt1/mdt1
NAME PROPERTY VALUE SOURCE
lustre-mdt1/mdt1 lustre:svname fs0:MDT0001 local
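All of the Lustre properties written by mkfs.lustre can also be listed in one
go with the standard ZFS tooling, e.g.:
# zfs get all lustre-mdt0/mdt0 | grep lustre: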
So far, so good.
My /etc/ldev.conf:
mds1 mds2 MGS zfs:lustre-mgs/mgs
mds1 mds2 fs0-MDT0000 zfs:lustre-mdt0/mdt0
mds2 mds1 fs0-MDT0001 zfs:lustre-mdt1/mdt1
My /etc/modprobe.d/lustre.conf:
# options lnet networks=tcp0(em1)
options lnet ip2nets="tcp0 10.0.0.[22,23]; tcp0 10.0.0.*;"
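Before mounting anything, it can be worth confirming that each node resolves
the expected NIDs and can reach the other over LNet; assuming the lnet module
is loaded, something like the following, run from mds1 (and the reverse from
mds2), should show the tcp NIDs above:
# lctl list_nids
# lctl ping 10.0.0.23@tcp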
-----------------------------------------------------------------------------
Now, when starting the services, I get strange errors:
# service lustre start local
Mounting lustre-mgs/mgs on /mnt/lustre/local/MGS
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
mount.lustre: mount lustre-mdt0/mdt0 at /mnt/lustre/local/fs0-MDT0000
failed: Input/output error
Is the MGS running?
# service lustre status local
running
attached lctl-dk.local01
If I run the same command again, I get a different error:
# service lustre start local
Mounting lustre-mgs/mgs on /mnt/lustre/local/MGS
mount.lustre: according to /etc/mtab lustre-mgs/mgs is already mounted
on /mnt/lustre/local/MGS
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
mount.lustre: mount lustre-mdt0/mdt0 at /mnt/lustre/local/fs0-MDT0000
failed: File exists
attached lctl-dk.local02
What am I doing wrong?
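One way to narrow this down might be to take the init script out of the
picture and mount the targets by hand in the order the MDT expects (MGS
first), using the same mount points as above:
# mkdir -p /mnt/lustre/local/MGS /mnt/lustre/local/fs0-MDT0000
# mount -t lustre lustre-mgs/mgs /mnt/lustre/local/MGS
# mount -t lustre lustre-mdt0/mdt0 /mnt/lustre/local/fs0-MDT0000
If the MDT mount still fails with an I/O error while the MGS is mounted
locally, that would point at the LNet/MGS NID configuration rather than the
init script itself.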
I have tested LNet self-test as well, using the following script:
# cat lnet-selftest.sh
#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 10.0.0.[22,23]@tcp
lst add_group readers 10.0.0.[22,23]@tcp
lst add_group writers 10.0.0.[22,23]@tcp
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
brw write check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session
and it seemed OK:
# modprobe lnet-selftest && ssh mds2 modprobe lnet-selftest
# ./lnet-selftest.sh
SESSION: read/write FEATURES: 0 TIMEOUT: 300 FORCE: No
10.0.0.[22,23]@tcp are added to session
10.0.0.[22,23]@tcp are added to session
10.0.0.[22,23]@tcp are added to session
Test was added successfully
Test was added successfully
bulk_rw is running now
[LNet Rates of servers]
[R] Avg: 19486 RPC/s Min: 19234 RPC/s Max: 19739 RPC/s
[W] Avg: 19486 RPC/s Min: 19234 RPC/s Max: 19738 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1737.60 MB/s Min: 1680.70 MB/s Max: 1794.51 MB/s
[W] Avg: 1737.60 MB/s Min: 1680.70 MB/s Max: 1794.51 MB/s
[LNet Rates of servers]
[R] Avg: 19510 RPC/s Min: 19182 RPC/s Max: 19838 RPC/s
[W] Avg: 19510 RPC/s Min: 19182 RPC/s Max: 19838 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1741.67 MB/s Min: 1679.51 MB/s Max: 1803.83 MB/s
[W] Avg: 1741.67 MB/s Min: 1679.51 MB/s Max: 1803.83 MB/s
[LNet Rates of servers]
[R] Avg: 19458 RPC/s Min: 19237 RPC/s Max: 19679 RPC/s
[W] Avg: 19458 RPC/s Min: 19237 RPC/s Max: 19679 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1738.87 MB/s Min: 1687.28 MB/s Max: 1790.45 MB/s
[W] Avg: 1738.87 MB/s Min: 1687.28 MB/s Max: 1790.45 MB/s
[LNet Rates of servers]
[R] Avg: 19587 RPC/s Min: 19293 RPC/s Max: 19880 RPC/s
[W] Avg: 19586 RPC/s Min: 19293 RPC/s Max: 19880 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1752.62 MB/s Min: 1695.38 MB/s Max: 1809.85 MB/s
[W] Avg: 1752.62 MB/s Min: 1695.38 MB/s Max: 1809.85 MB/s
[LNet Rates of servers]
[R] Avg: 19528 RPC/s Min: 19232 RPC/s Max: 19823 RPC/s
[W] Avg: 19528 RPC/s Min: 19232 RPC/s Max: 19824 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 1741.63 MB/s Min: 1682.29 MB/s Max: 1800.98 MB/s
[W] Avg: 1741.63 MB/s Min: 1682.29 MB/s Max: 1800.98 MB/s
session is ended
./lnet-selftest.sh: line 17: 8835 Terminated lst stat servers
Addendum: I can start the MGS service on the second node, and then start the
mdt0 service on the local node:
# ssh mds2 service lustre start MGS
Mounting lustre-mgs/mgs on /mnt/lustre/foreign/MGS
# service lustre start fs0-MDT0000
Mounting lustre-mdt0/mdt0 on /mnt/lustre/local/fs0-MDT0000
# service lustre status
unhealthy
# service lustre status local
running
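To see why the overall status is reported as unhealthy, the aggregate health
state can be read directly (assuming the Lustre modules are loaded), e.g.:
# lctl get_param -n health_check
which should either report healthy or name the device(s) that are not.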
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division