Ok, I finally figured it out. Apparently I somehow got the wrong addresses registered in
the server configuration logs. Shutting everything down, running --writeconf, and
restarting solved the problem.
Here's an example of what was in the config logs, in case anyone else stumbles across
this (via llog_reader):
#70 (088)add_uuid nid=192.168.129.252@o2ib(0x50000c0a881fc) 0: 1:192.168.129.252@o2ib
#71 (088)add_uuid nid=192.168.129.252@tcp(0x20000c0a881fc) 0: 1:192.168.129.252@o2ib
#72 (128)attach 0:lustre_1-OST0000-osc 1:osc 2:lustre_1-clilov_UUID
#73 (144)setup 0:lustre_1-OST0000-osc 1:lustre_1-OST0000_UUID 2:192.168.129.252@o2ib
Note that the tcp NID carries the same IP as the o2ib NID (192.168.129.252), which is
definitely not correct; that server's tcp NID should be 10.100.129.252@tcp.
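In case it helps anyone reading the hex values in parentheses, here is a minimal sketch of how they can be decoded. It assumes the usual 64-bit LNet NID layout (upper 32 bits are the network, split into LND type and network number; lower 32 bits are the IPv4 address) and that LND type 2 is tcp and type 5 is o2ib. The function name decode_nid and the LND_NAMES table are my own, not part of any Lustre tool.

```python
import ipaddress

# Assumed LND type numbers (only the two seen in this thread).
LND_NAMES = {2: "tcp", 5: "o2ib"}


def decode_nid(hex_nid: str) -> str:
    """Turn a hex NID as printed by llog_reader into 'addr@net' form.

    Assumes: upper 32 bits = (LND type << 16) | network number,
    lower 32 bits = IPv4 address.
    """
    value = int(hex_nid, 16)
    net = value >> 32              # network part
    addr = value & 0xFFFFFFFF      # IPv4 address part
    lnd_type = net >> 16
    net_num = net & 0xFFFF
    name = LND_NAMES.get(lnd_type, f"lnd{lnd_type}")
    suffix = str(net_num) if net_num else ""  # tcp0 is normally written "tcp"
    return f"{ipaddress.IPv4Address(addr)}@{name}{suffix}"


if __name__ == "__main__":
    for raw in ("0x50000c0a881fc", "0x20000c0a881fc"):
        print(raw, "->", decode_nid(raw))
```

Both values decode to the same IP, 192.168.129.252, and differ only in the network type, which matches what the text form of the log shows.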
And from lctl peer_list on a tcp-only client, I was seeing these:
12345-192.168.129.249@tcp [1]0.0.0.0->g21-oss3-ib.deepthought.umd.edu:988 #0
12345-192.168.129.251@tcp [1]0.0.0.0->g21-oss1-ib.deepthought.umd.edu:988 #0
12345-192.168.129.252@tcp [1]0.0.0.0->g21-oss2-ib.deepthought.umd.edu:988 #0
Which were also the wrong addresses.
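A rough way to spot this kind of mismatch is to pull every NID out of the llog_reader or peer_list output and compare it against what the servers report from lctl list_nids. The sketch below does that. The SERVER_NIDS set is copied from the list_nids output quoted further down in this thread, and the regex and the unknown_nids helper are my own assumptions, not anything Lustre provides.

```python
import re

# NIDs reported by `lctl list_nids` on the MDS/MGS and OSSes in this thread.
SERVER_NIDS = {
    "192.168.129.250@o2ib", "10.100.129.250@tcp",
    "192.168.129.252@o2ib", "10.100.129.252@tcp",
    "192.168.129.249@o2ib", "10.100.129.249@tcp",
    "192.168.129.251@o2ib", "10.100.129.251@tcp",
}

# Matches IPv4-style NIDs such as 192.168.129.252@o2ib (hypothetical pattern).
NID_RE = re.compile(r"\d+\.\d+\.\d+\.\d+@\w+")


def unknown_nids(text, known=SERVER_NIDS):
    """Return the sorted NIDs found in text that no server actually owns."""
    return sorted(set(NID_RE.findall(text)) - set(known))


if __name__ == "__main__":
    sample = (
        "#70 (088)add_uuid nid=192.168.129.252@o2ib(0x50000c0a881fc) 0: 1:192.168.129.252@o2ib\n"
        "#71 (088)add_uuid nid=192.168.129.252@tcp(0x20000c0a881fc) 0: 1:192.168.129.252@o2ib\n"
    )
    for nid in unknown_nids(sample):
        print("not a real server NID:", nid)
```

Run against the config log excerpt above, this flags only 192.168.129.252@tcp; run against the peer_list lines, it flags all three entries.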
Kevin
-----Original Message-----
From: Patrick Farrell [mailto:paf@cray.com]
Sent: Thursday, January 22, 2015 12:56 PM
To: Kevin M. Hildebrand; Mohr Jr, Richard Frank (Rick Mohr)
Cc: hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Lustre networking issues, multi-homed servers
That's normal for an upgraded system. (Sorry, don't have further thoughts
on the network issue.)
On 1/22/15, 9:41 AM, "Kevin M. Hildebrand" <kevin(a)umd.edu> wrote:
The only message I see occurs on the MDS at the time I'm mounting the
client:
Lustre: MGS: non-config logname received: params
Kevin
-----Original Message-----
From: Mohr Jr, Richard Frank (Rick Mohr) [mailto:rmohr@utk.edu]
Sent: Thursday, January 22, 2015 12:37 PM
To: Kevin M. Hildebrand
Cc: hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] Lustre networking issues, multi-homed servers
Are there any Lustre error messages on the server side?
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
On Jan 22, 2015, at 11:03 AM, Kevin M. Hildebrand <kevin(a)umd.edu> wrote:
> Hello, I just upgraded a Lustre 1.8.7 installation to version 2.5.3.
>
> The Lustre servers are connected via IB and Ethernet, some of the
>clients have both networks, and some of the clients have Ethernet only.
>
> I'm having a problem where the Ethernet-only clients appear to be
>attempting to contact the servers via their IB addresses, and are
>failing to do so. As far as I can tell the NIDs are correct on servers
>and clients, so I'm not sure where things are going wrong. The Ethernet
>networks are 10.100.* and the IB network is 192.168.* below.
>
> MDS/MGS:
> # lctl list_nids
> 192.168.129.250@o2ib
> 10.100.129.250@tcp
>
> OSSes:
> # lctl list_nids
> 192.168.129.252@o2ib
> 10.100.129.252@tcp
> # lctl list_nids
> 192.168.129.249@o2ib
> 10.100.129.249@tcp
> # lctl list_nids
> 192.168.129.251@o2ib
> 10.100.129.251@tcp
>
> Client:
> # lctl list_nids
> 10.100.135.131@tcp
>
> On the client:
> # mount -t lustre 10.100.129.250@tcp:/lustre_1 /lustre
> # df
> <HANGS>
>
> Jan 22 10:49:01 compute-f09-1 kernel: Lustre: Lustre: Build Version:
>2.5.3-RC1--PRISTINE-2.6.32-504.3.3.el6.x86_64
> Jan 22 10:49:01 compute-f09-1 kernel: Lustre: client wants to enable
>acl, but mdt not!
> Jan 22 10:49:01 compute-f09-1 kernel: Lustre: Layout lock feature
>supported.
> Jan 22 10:49:01 compute-f09-1 kernel: Lustre: Mounted lustre_1-client
> Jan 22 10:49:06 compute-f09-1 kernel: Lustre:
>5545:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for sent delay: [sent 1421941741/real 0] req@ffff8808042bac00
>x1491013983010920/t0(0)
>o8->lustre_1-OST000a-osc-ffff880823397400@192.168.129.249@tcp:28/4 lens
>400/544 e 0 to 1 dl 1421941746 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
> Jan 22 10:49:14 compute-f09-1 kernel: LustreError:
>5630:0:(llite_lib.c:1624:ll_statfs_internal()) obd_statfs fails: rc = -5
>
> For some reason the client appears to be trying to connect to the
>192.168 (IB) address, even though it's not one of its networks.
>
> Can someone please shed some light as to what I'm missing?
>
> Thanks,
> Kevin
> ---
> Kevin Hildebrand
> University of Maryland Division of IT
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss(a)lists.01.org
>
https://lists.01.org/mailman/listinfo/hpdd-discuss