On Aug 14, 2015, at 5:42 PM, Kumar, Amit <ahkumar(a)mail.smu.edu> wrote:
Hi Rick,
I have about 1100 clients total but only about 800 are mounting it for now.
Looking at /proc/slabinfo on the MDS, I see large numbers, but as I interpret it,
it is not taking up a huge amount of memory:
ldlm_locks 2980551 3301620 576 7 1 : tunables 54 27 8 : slabdata 471660 471660 0
ldlm_resources 2903161 3120192 320 12 1 : tunables 54 27 8 : slabdata 260016 260016 0
That looks like it is only a couple of GB’s so it shouldn’t be a problem.
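As a rough sanity check (a sketch, not part of the original mail), the slab footprint can be estimated from the `<num_objs>` and `<objsize>` columns of /proc/slabinfo:

```python
# Estimate slab cache memory as num_objs * objsize, using the columns
# <active_objs> <num_objs> <objsize> ... from /proc/slabinfo.
def slab_bytes(num_objs, objsize):
    """Approximate memory held by a slab cache, in bytes."""
    return num_objs * objsize

# MDS numbers from the slabinfo output above
ldlm_locks = slab_bytes(3301620, 576)       # ldlm_locks cache, ~1.8 GB
ldlm_resources = slab_bytes(3120192, 320)   # ldlm_resources cache, ~0.95 GB

total_gb = (ldlm_locks + ldlm_resources) / 2**30
print(f"{total_gb:.2f} GiB")  # → 2.70 GiB
```

So the two ldlm caches together account for under 3 GiB, consistent with "only a couple of GB" on the MDS.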
Also, I see that all of my OSSes except 2 have used up 100% of their memory,
although I cannot conclude from the following numbers whether locks are really
what is using it:
ldlm_locks 68298 75096 576 7 1 : tunables 54 27 8 : slabdata 10728 10728 0
ldlm_resources 64722 74844 320 12 1 : tunables 54 27 8 : slabdata 6237 6237 0
Do you have Lustre read/write caching enabled? That could be consuming the memory. You
can check the values by running
lctl get_param obdfilter.*.read_cache_enable
lctl get_param obdfilter.*.writethrough_cache_enable
If they are enabled, you could try disabling them to see how that affects memory usage
(and if it has any effect on your problem).
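For reference, a sketch of how the toggle might look on an OSS (using the same `lctl` parameters quoted above; 0/1 are the disable/enable values):

```shell
# Disable the OSS read caches to test whether they account for the memory use
lctl set_param obdfilter.*.read_cache_enable=0
lctl set_param obdfilter.*.writethrough_cache_enable=0

# Re-enable them afterwards if it makes no difference
lctl set_param obdfilter.*.read_cache_enable=1
lctl set_param obdfilter.*.writethrough_cache_enable=1
```

Note that `set_param` changes are not persistent across a remount, so this is a safe way to experiment.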
Also, under cat /proc/fs/lustre/ldlm/namespaces/*/* I see lru_size = 36000000
on one namespace, while the others are all "0". I am not sure whether these
numbers need to be tuned?
I think the “0” just means the limit is dynamically determined.
On our clients, we set lru_size=10000 and lru_max_age=172800 which is more than enough.
(The dynamically determined values are crazy big…)
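A sketch of how the client-side settings Rick describes could be applied with `lctl set_param` (values are the ones quoted above; setting a nonzero lru_size switches that namespace from dynamic to static LRU sizing):

```shell
# On each client: cap the lock LRU at 10000 locks per namespace
lctl set_param ldlm.namespaces.*.lru_size=10000

# Age out idle locks (172800 is the value quoted above; check your
# Lustre version's documentation for the units of lru_max_age)
lctl set_param ldlm.namespaces.*.lru_max_age=172800
```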
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu