Hi all,
TL;DR: We are having major problems getting lustre back up and running.
Versions: (Server) CentOS 6.5 + lustre 2.5.3 hpdd rpms
Order of operations tried:
- Clean CentOS install
- install lustre rpms + e2fsprogs from hpdd site
- mkfs.lustre --mgs /path/to/mgt
- mkfs.lustre --reformat --writeconf --mdt --index=01 --fsname=fs01 \
    --mgsnode=10.56.56.201@o2ib --network=o2ib /path/to/mdt
- service lnet start on all servers
- on OSS (the second --writeconf in the original command was a typo and
  is dropped here):
  for i in {01..05}; do
    mkfs.lustre --writeconf --reformat --ost --fsname=fs01 \
      --mgsnode=10.56.56.201@o2ib --net=o2ib --index=$i \
      --failnode=10.56.56.203@o2ib \
      --mkfsoptions='-E stride=128,stripe-width=1024' \
      /dev/mapper/ost$i
  done
- service lustre start on MGS/MDS
- service lustre start on OSS
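One detail in the loop above worth flagging: bash keeps the zero padding when it expands {01..05}, so the literal strings "01" through "05" are what get passed to --index (just as the MDT command above passes "01" rather than "0"). A quick illustration:

```shell
# Bash brace expansion preserves leading zeros.
for i in {01..05}; do echo "index=$i"; done
# prints index=01, index=02, ... index=05
```
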
We seem to have two distinct problems:
1. Some OSTs fail to mount with a "device busy" error, and we can't for
the life of us figure out what is holding them busy.
2. One OSS appears to do nothing at all: no "mounting ..." output, no
failure messages; the mount command just returns immediately.
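For problem 1, here is a generic checklist (not Lustre-specific) for finding out what holds a block device busy; the device name /dev/mapper/ost01 below is just an example, substitute the OST that fails. Note also that on ldiskfs targets the multi-mount protection (MMP) feature can report the device busy if it believes another node still has the target mounted.

```shell
# Hypothetical device name; substitute the OST that reports "device busy".
dev=/dev/mapper/ost01

# Already mounted, perhaps from an earlier half-finished attempt?
grep "$dev" /proc/mounts || echo "not in /proc/mounts"

# Is any process holding the device open?
command -v fuser >/dev/null && fuser -vm "$dev" 2>&1 || true

# Is device-mapper stacking anything on top of it (LVM, kpartx partitions)?
command -v dmsetup >/dev/null && dmsetup table 2>/dev/null || true

# Kernel-level holders of block devices (e.g. a dm device built on top).
for h in /sys/class/block/*/holders/*; do
    [ -e "$h" ] && echo "holder: $h"
done
```
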
Do you guys see something I am missing here?
Thanks,
Eli
Some more background:
We recently decided to rebuild our Lustre setup, but instead of it being
a walk in the park we have now been struggling for a week to get it back
up. The reason for the rebuild was adding disks to our disk arrays,
which, for reasons beyond reason, had been purchased with fewer disks
than they could hold, and as a result the RAID virtual disks for the
OSTs were bad.
Instead of sticking with CentOS 6.4 + Lustre 2.4.3 we figured we'd take
the servers straight to 6.5 + 2.5.3 (the clients are Debian and were
already running 2.5.3). However, we are still not getting Lustre back up.
Our topology is as follows:
host1  MGS/MDS ----- MGT/MDT
host2  OSS1 ----- OST01-OST05
host3  OSS2 ----- OST06-OST10 (and also *all* other OSTs as failover)
host4  OSS3 ----- OST11-OST15 (and failover for OST06-OST10)
The (3) disk arrays are connected over SAS, and multipathd creates the
device nodes for the individual disks; we have 5 virtual disks in each
enclosure.
To set up the system we installed the vanilla rpms.
I am happy to say we figured out our problems; they were:
- the MDT had an index of 1 instead of 0
- and it seems the switch from --servicenode to --failnode may also have
helped.
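For the record, the MDT format command we should have used (same paths and NIDs as in the original mail, only the index changed):

```shell
# Corrected: the first (and only) MDT must be index 0.
mkfs.lustre --reformat --writeconf --mdt --index=0 --fsname=fs01 \
    --mgsnode=10.56.56.201@o2ib --network=o2ib /path/to/mdt
```

On --failnode vs --servicenode: as we understand it, --failnode declares only the partner node (the node the target is formatted and normally mounted on is the implicit primary), while --servicenode lists every node, the local one included, that may serve the target, and the two styles should not be mixed on the same target.
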
Sorry for the disturbance.
---------- Forwarded message ----------
From: E.S. Rosenberg <esr+hpdd-discuss(a)mail.hebrew.edu>
Date: Sun, Sep 21, 2014 at 1:39 PM
Subject: Problems getting lustre back up.
To: "hpdd-discuss(a)lists.01.org" <hpdd-discuss(a)lists.01.org>