Hi Rick,
I have about 1100 clients total but only about 800 are mounting it for now.
Looking at /proc/slabinfo on the MDS, I see large numbers, but as I interpret it, the locks are not taking up huge amounts of memory:
ldlm_locks      2980551 3301620 576 7 1 : tunables 54 27 8 : slabdata 471660 471660 0
ldlm_resources  2903161 3120192 320 12 1 : tunables 54 27 8 : slabdata 260016 260016 0
Also, I see that all of my OSSes except two have used up 100% of their memory, although I cannot come to a conclusion about whether locks are really what is using it, based on the following numbers:
ldlm_locks      68298 75096 576 7 1 : tunables 54 27 8 : slabdata 10728 10728 0
ldlm_resources  64722 74844 320 12 1 : tunables 54 27 8 : slabdata 6237 6237 0
Also, under cat /proc/fs/lustre/ldlm/namespaces/*/* I see lru_size = 36000000 on one namespace, while the others are all "0". Do these numbers need to be tuned?
And: # cat /proc/fs/lustre/ldlm/namespaces/scratch-MDT0000-lwp-OST0048/pool/limit
3313022
Are these numbers too high, causing the high memory usage?
Thank you,
Amit H. Kumar
-----Original Message-----
From: Mohr Jr, Richard Frank (Rick Mohr) [mailto:rmohr@utk.edu]
Sent: Friday, August 14, 2015 2:30 PM
To: Kumar, Amit
Cc: hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] (ost_handler.c:1764:ost_blocking_ast()) Error -2
syncing data on lock cancel
> On Aug 13, 2015, at 1:26 PM, Kumar, Amit <ahkumar(a)mail.smu.edu> wrote:
>
> Although I did not see any ENOSPC errors on the OSTs as reported in one of
> the earlier requests (that issue is solved), this past week I had to bring
> down the entire file system to recover from long lockups. I use the ltop
> utility to monitor IOPS, and I noticed that the locks held by each OST were
> in the range of 6,000-10,000.
How many clients do you have using the file system? If there are quite a few,
you might need to look into limiting the number of locks each client will
cache. By default, Lustre comes up with a number based on the amount of
memory on the system, but this can potentially be a big number for each
client. If there are lots of clients, then the locks on the server side can start
using up quite a bit of memory.
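As a concrete sketch of that tuning (assuming the standard ldlm tunables; the value below is purely illustrative, not a recommendation):

```shell
# On a client: see how many locks each namespace currently caches
lctl get_param ldlm.namespaces.*.lock_count

# Cap the client-side lock cache. Setting lru_size to a nonzero value
# disables the dynamic (memory-based) LRU sizing and fixes the cache
# at that many locks per namespace.
lctl set_param ldlm.namespaces.*.lru_size=1200
```

Note that set_param changes are not persistent across remounts, so you would need to reapply them (or set them via your mount/startup scripts) if they help.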
I don’t know if this is actually causing the issue you are seeing, but I have had a couple of cases of crashing (or slow-performing) MDS servers caused by large numbers of locks, so this is usually one of the first things I check. If you want to get an idea of how much memory is being used by locks, just look at /proc/slabinfo on the server and find any lines with “ldlm” in them.
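One way to turn those slabinfo lines into an approximate memory figure (a sketch assuming the usual /proc/slabinfo columns, where field 2 is the number of active objects and field 4 is the object size in bytes):

```shell
# Estimate memory held by the LDLM slab caches on a server.
# active_objs (field 2) * objsize (field 4) approximates the live
# memory per cache; sum it up and report in MB.
awk '/^ldlm/ { mb = $2 * $4 / 1048576; total += mb;
               printf "%-16s %8.1f MB\n", $1, mb }
     END     { printf "%-16s %8.1f MB\n", "total", total }' /proc/slabinfo
```

This only counts object payload, not slab overhead, so treat it as a lower bound.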
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu