Aug 18 13:53:29 prod-0064 kernel: [ 151.261120] LNetError: 485:0:(o2iblnd.c:869:kiblnd_create_conn()) Can't create QP: -12, send_wr: 16191, recv_wr: 254
Aug 18 13:54:05 prod-0064 kernel: [ 187.241154] LNetError: 6:0:(o2iblnd.c:869:kiblnd_create_conn()) Can't create QP: -12, send_wr: 16191, recv_wr: 254
Aug 18 13:54:05 prod-0064 kernel: [ 187.241161] LNetError: 6:0:(o2iblnd.c:869:kiblnd_create_conn()) Skipped 3 previous similar messages
Aug 18 13:54:41 prod-0064 kernel: [ 223.220728] LNetError: 6:0:(o2iblnd.c:869:kiblnd_create_conn()) Can't create QP: -12, send_wr: 16191, recv_wr: 254
On Aug 19, 2015, at 11:17 AM, Chris Horn <hornc@cray.com> wrote:
The o2iblnd driver code forces peer_credits and concurrent_sends to be in a reasonable range of each other:
    if (*kiblnd_tunables.kib_concurrent_sends > *kiblnd_tunables.kib_peertxcredits * 2)
            *kiblnd_tunables.kib_concurrent_sends = *kiblnd_tunables.kib_peertxcredits * 2;
    if (*kiblnd_tunables.kib_concurrent_sends < *kiblnd_tunables.kib_peertxcredits / 2)
            *kiblnd_tunables.kib_concurrent_sends = *kiblnd_tunables.kib_peertxcredits / 2;
The code above ensures that concurrent_sends cannot be larger than 2*peer_credits or smaller than peer_credits/2. I’m not really sure why it allows concurrent_sends to be less than peer_credits.
By changing the value of concurrent_sends after the module has loaded you’re circumventing the above logic.
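To make the effect of those two checks concrete, here is a minimal standalone sketch of the same clamping using plain ints instead of the kiblnd_tunables pointers. The function name clamp_concurrent_sends is hypothetical and does not exist in o2iblnd; the sketch assumes no adjustment happens other than the two checks quoted above.

```c
/*
 * Sketch only: mirrors the two range checks quoted above.
 * clamp_concurrent_sends is a hypothetical helper, not an o2iblnd function.
 */
int clamp_concurrent_sends(int concurrent_sends, int peer_credits)
{
    /* Upper bound: at most 2 * peer_credits. */
    if (concurrent_sends > peer_credits * 2)
        concurrent_sends = peer_credits * 2;

    /* Lower bound: at least peer_credits / 2 (integer division). */
    if (concurrent_sends < peer_credits / 2)
        concurrent_sends = peer_credits / 2;

    return concurrent_sends;
}
```

With peer_credits=63, a requested concurrent_sends=16 falls below 63/2 = 31 and would be raised to 31 under these checks, while the defaults (8 and 8) pass through unchanged.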
Chris Horn
On Aug 19, 2015, at 8:01 AM, Ken Jeffries <jeffries@cray.com> wrote:
Hi Martin and Craig,
This seems to be a problem only on mlx5 and not on mlx4. As Craig says, the default values (peer_credits=8 concurrent_sends=8) do work. The values peer_credits=63 concurrent_sends=16 also work, but concurrent_sends=16 cannot be set via the normal .conf file in modprobe.d/. After the modprobe of ko2iblnd but before the module is used, it is possible to chmod /sys/module/ko2iblnd/parameters/concurrent_sends to writable and then echo 16 into the parameter.
These values are still well short of some generally recommended values and that is concerning. As Martin says, it may be possible to increase other parameters to go beyond these values.
Regards,
Ken
From: Martin Hecht <hecht@hlrs.de>
Date: Wednesday, August 19, 2015 at 6:52 AM
To: "Prescott,Craig P" <prescott@rc.ufl.edu>, Kenneth Jeffries <jeffries@cray.com>, "hpdd-discuss@lists.01.org" <hpdd-discuss@ml01.01.org>
Subject: Re: [HPDD-discuss] o2iblnd peer_credits and concurrent_sends
Hi,
we stumbled over the peer_credits as well. It must be set to the same value on all clients and servers.
I also heard from Cray that 63 was the maximum that works. Maybe apart from the limitation of the lnet protocol there are further restrictions, or you have to increase other parameters as well, in order to go beyond 63.
Martin
On 08/19/2015 03:14 AM, Prescott,Craig P wrote:
Hi Ken,
No, I never got any answers to that old post. We ended up going with the default values back then - those have actually been ok for our scale/use case. FWIW, I have a hunch that the problem may have been due to limitations of the Connect-IB driver we were using at the time on the clients. Kind of timely that you bring this issue up now, though, as we are bringing up a new file system and already had it on our list to revisit.
Cheers,
Craig
________________________________
From: HPDD-discuss <hpdd-discuss-bounces@ml01.01.org> on behalf of Ken Jeffries <jeffries@cray.com>
Sent: Monday, August 17, 2015 10:01 PM
To: hpdd-discuss@lists.01.org
Subject: Re: [HPDD-discuss] o2iblnd peer_credits and concurrent_sends
Craig, did you ever get an answer to your question? Or pick values that worked?
https://lists.01.org/pipermail/hpdd-discuss/2013-July/000358.html
Ken
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss