I don’t know that there’s a good one-size-fits-all solution for how to configure the credits. The recommendations laid out in Cray’s paper respond to an acute problem seen at large scale: ping storms created by the Lustre pinger. If your system doesn’t experience that problem, then the default values may be sufficient. If you have evidence that you’re leaving performance on the table, and that LNet is your bottleneck, then experimenting with credits may be worthwhile.

Chris Horn

On Aug 19, 2015, at 11:17 AM, Chris Horn <hornc@cray.com> wrote:

The o2iblnd driver code forces peer_credits and concurrent_sends to be within a reasonable range of each other:

        if (*kiblnd_tunables.kib_concurrent_sends > *kiblnd_tunables.kib_peertxcredits * 2)
                *kiblnd_tunables.kib_concurrent_sends = *kiblnd_tunables.kib_peertxcredits * 2;

        if (*kiblnd_tunables.kib_concurrent_sends < *kiblnd_tunables.kib_peertxcredits / 2)
                *kiblnd_tunables.kib_concurrent_sends = *kiblnd_tunables.kib_peertxcredits / 2;

The code above ensures that concurrent_sends cannot be larger than 2*peer_credits or smaller than peer_credits/2. I’m not really sure why it allows concurrent_sends to be less than peer_credits.

By changing the value of concurrent_sends after the module has loaded, you’re circumventing the above logic.

Chris Horn

On Aug 19, 2015, at 8:01 AM, Ken Jeffries <jeffries@cray.com> wrote:

Hi Martin and Craig,

This seems to be a problem only on mlx5, not on mlx4. As Craig says, the default values (peer_credits=8, concurrent_sends=8) do work. The values peer_credits=63, concurrent_sends=16 also work, but concurrent_sends=16 cannot be set via the normal .conf file in modprobe.d/. After running modprobe ko2iblnd, but before the module is used, it is possible to chmod /sys/module/ko2iblnd/parameters/concurrent_sends to make it writable and then echo 16 into the parameter.

These values are still well short of some generally recommended values, and that is concerning. As Martin says, it may be possible to go beyond these values by increasing other parameters as well.

Regards,
Ken

From: Martin Hecht <hecht@hlrs.de>
Date: Wednesday, August 19, 2015 at 6:52 AM
To: "Prescott,Craig P" <prescott@rc.ufl.edu>, Kenneth Jeffries <jeffries@cray.com>, "hpdd-discuss@lists.01.org" <hpdd-discuss@ml01.01.org>
Subject: Re: [HPDD-discuss] o2iblnd peer_credits and concurrent_sends

Hi,

We ran into the peer_credits issue as well. It must be set to the same value on all clients and servers.
I also heard from Cray that 63 is the maximum value that works. Perhaps, apart from the limitation of the LNet protocol, there are further restrictions, or other parameters have to be increased as well in order to go beyond 63.

Martin

On 08/19/2015 03:14 AM, Prescott,Craig P wrote:
Hi Ken,


No, I never got any answers to that old post. We ended up going with the default values back then, and those have actually been OK for our scale and use case. FWIW, I suspect the problem may have been due to limitations of the Connect-IB driver we were using on the clients at the time.


Kind of timely that you bring this issue up now, though, as we are bringing up a new file system and already had it on our list to revisit.


Cheers,

Craig



________________________________
From: HPDD-discuss <hpdd-discuss-bounces@ml01.01.org> on behalf of Ken Jeffries <jeffries@cray.com>
Sent: Monday, August 17, 2015 10:01 PM
To: hpdd-discuss@lists.01.org
Subject: Re: [HPDD-discuss] o2iblnd peer_credits and concurrent_sends

Craig,

Did you ever get an answer to your question, or pick values that worked?

https://lists.01.org/pipermail/hpdd-discuss/2013-July/000358.html

Ken



_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss