If you run lctl ping 10.10.X.XX@tcp from both sides, it should bring the
route up.
So even with all of those routes down, everything is happy?
Hmm.
On Thu, Sep 5, 2013 at 1:14 PM, Bob Ball <ball(a)umich.edu> wrote:
That is an interesting mix. Nothing shows up at all on the clients, even
on those 3 that route to a second NIC. On the OSS, it is quite the mix of
up/down on the 3 routers, with no obvious pattern.
Most of our traffic is on the 10.10 network, with the 3 machines shown
below routing to a small number of clients on a more public network.
FYI, the current situation is one in which all machines are happy, as far
as I can tell.
bob
Running lctl show_route on all machines in lustre_fss.txt
On umdist05.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umfs06.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
net tcp2 hops 1 gw 10.10.1.52@tcp up
Succeeded
On umdist01.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist02.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist03.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp up
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist04.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist07.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
Succeeded
On umdist08.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
Succeeded
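
For anyone wanting to spot a pattern in a listing like the one above, here is
a minimal sketch that collects the "down" gateways per host. It assumes the
exact layout pasted here: an "On <host>" line from the wrapper script,
followed by "net ... hops ... gw ... up|down" lines. The function names
(parse_routes, down_gateways) and the SAMPLE text are made up for
illustration; SAMPLE just copies two hosts from the listing.

```python
import re
from collections import defaultdict

# Matches one route line as it appears in the listing, e.g.
#   net tcp2 hops 1 gw 10.10.1.52@tcp down
ROUTE_RE = re.compile(r"^net\s+(\S+)\s+hops\s+(\d+)\s+gw\s+(\S+)\s+(up|down)$")


def parse_routes(text):
    """Return {host: [(net, gateway, state), ...]} from the wrapper output.

    Assumes each host section starts with a line "On <host>" (an artifact
    of the wrapper script used in this thread, not of lctl itself).
    """
    routes = defaultdict(list)
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("On "):
            host = line[3:]
            continue
        match = ROUTE_RE.match(line)
        if host is not None and match:
            net, _hops, gateway, state = match.groups()
            routes[host].append((net, gateway, state))
    return dict(routes)


def down_gateways(routes):
    """Return {host: sorted list of gateways whose route is marked down}."""
    return {
        host: sorted(gw for _net, gw, state in entries if state == "down")
        for host, entries in routes.items()
    }


# Two hosts copied from the listing in this message.
SAMPLE = """\
On umfs06.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
net tcp2 hops 1 gw 10.10.1.52@tcp up
Succeeded
On umdist07.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
Succeeded
"""

for host, gateways in down_gateways(parse_routes(SAMPLE)).items():
    for gw in gateways:
        # Only prints a suggestion; nothing is executed on any machine.
        print(f"{host}: route via {gw} is down, consider 'lctl ping {gw}'")
```

Feeding it the full listing instead of SAMPLE gives one line per down route,
which makes it easier to see which gateway is down on which OSS.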
On 9/5/2013 4:01 PM, Kris Howard wrote:
Might check lctl show_route and look for downed routes.
On Thu, Sep 5, 2013 at 12:56 PM, Bob Ball <ball(a)umich.edu> wrote:
> We are running Lustre 2.1.6 on Scientific Linux 6.4, kernel
> 2.6.32-358.11.1.el6.x86_64. This was an upgrade from Lustre 1.8.4 on SL5.
>
> We have had a few situations lately where a client stops talking to some
> subset of the OSTs (about 58 of these in total on 8 OSSes, nearly 500TB
> altogether). I have a couple of questions.
>
> 1. "lctl dl" on the OSS shows a smaller count on the affected servers;
> on the client, all OSS showed UP in "lctl dl". Today, I first tried
> rebooting this OSS, but that did not change the situation. I ended up
> rebooting the client before I could get full connectivity. Is there any
> way from the client to get the reconnect, short of rebooting that client?
>
> 2. It used to be the case under Lustre 1.8.4 that I could run "lfs df -h"
> on the client and see all OSTs, even those where the connection was not
> working, for whatever reason. That is no longer the case; now the lfs
> command stops at the first non-talking OST. This seems more like a bug
> than a feature. Is there some other way to see a list of non-communicating
> OSTs on a client?
>
> Thanks in advance for any help offered.
>
> bob
>
>
>
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss(a)lists.01.org
> https://lists.01.org/mailman/listinfo/hpdd-discuss
>