It looks like you specified multiple interface NIDs for the MGS, but you didn't
specify the backup NID for the MDT and OST. You need to use the --servicenode or
--failnode option to register the backup server in advance, so that clients know to
check there when the primary is gone.
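As a sketch (reusing the NIDs and device paths from your commands below - verify the options against your Lustre version), the MDT could be formatted with both servers registered as service nodes, and already-formatted targets can be updated with tunefs.lustre instead of reformatting:

```shell
# Register both MDS NIDs as service nodes at format time, so clients
# and OSTs know they may find MDT0000 on either server
# (device path is the one from your mail; adjust as needed):
mkfs.lustre --mdt --fsname=nyx1 --index=0 \
    --mgsnode=10.20.0.25@o2ib1 --mgsnode=10.20.0.20@o2ib1 \
    --servicenode=10.20.0.25@o2ib1 --servicenode=10.20.0.20@o2ib1 \
    /dev/mapper/mpathd

# For a target that is already formatted, add the failover
# configuration in place with tunefs.lustre:
tunefs.lustre --servicenode=10.20.0.25@o2ib1 \
    --servicenode=10.20.0.20@o2ib1 /dev/mapper/mpathd
```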
Cheers, Andreas
On Feb 18, 2015, at 12:18, Thomas Roth <t.roth(a)gsi.de> wrote:
Hi all,
running Lustre 2.5.3, we have two MDSes at 10.20.0.25@o2ib1 and 10.20.0.20@o2ib1,
accessing some shared storage.
On 10.20.0.25, I have formatted an MGS and MDT with
> mkfs.lustre --mgs /dev/mapper/mpathb
and
> mkfs.lustre --mdt --fsname=nyx1 --index=0 --mgsnode=10.20.0.25@o2ib1 \
>     --mgsnode=10.20.0.20@o2ib1 /dev/mapper/mpathd
Both mounted cleanly.
On an OSS, I formatted an OST
> mkfs.lustre --reformat --ost --backfstype=zfs --fsname=nyx1 \
>     --mgsnode=10.20.0.25@o2ib1:10.20.0.20@o2ib1 --index=$IND \
>     osspool0/ost0 raidz2 /dev/mapper/...
Mounted cleanly, as did the fs on a client.
Then I umounted both MDT and MGS on 10.20.0.25 and mounted them on the failover
10.20.0.20.
This seems to have worked, although ptlrpc_expire_one_request() keeps complaining
about network errors which only mention the 'dead' nid 10.20.0.25@o2ib1. But the log
also states
> Lustre: nyx1-MDT0000: used disk, loading
and the listing of /proc/fs/lustre/devices is complete.
That is, the OSS also reconnected; its log says
> Evicted from MGS (at MGC10.20.0.25@o2ib1_1) after server handle changed
> MGC10.20.0.25@o2ib1: Connection restored to MGS (at 10.20.0.20@o2ib1)
This seems to be o.k.
However, the client is stuck. Its log shows the same two messages,
> Evicted from MGS (at MGC10.20.0.25@o2ib1_1) after server handle changed
> MGC10.20.0.25@o2ib1: Connection restored to MGS (at 10.20.0.20@o2ib1)
but followed by
> mgc: cannot find uuid by nid 10.20.0.20@o2ib1
> Process recover log nyx1-cliir error -2
Correspondingly, there is no Lustre access on this client,
> nyx1-MDT0000-mdc-ffff880536afec00: check error: Resource temporarily unavailable
I have of course played around with the specification of NIDs on the mkfs command
line: colon-separated lists of mgsnodes, adding both NIDs as servicenodes when
formatting the MGS - nada.
The only Jira ticket I could find for this error message, "mgc: cannot find uuid by
nid", is LU-5950. It has "Fix Version/s: Lustre 2.7.0" - no failover prior to that? -
hard to believe ;-)
Any idea where I messed up?
Thanks,
Thomas
--
--------------------------------------------------------------------
Thomas Roth IT-HPC-Linux
Location: SB3 1.262 Phone: +49-6159-71 1453
http://twitter.com/gsi_it
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss