Hi Brock,
Once the anonymous memory was allocated, we were not able to release the
memory without a reboot.
When we were attempting to reproduce the problem, we rebooted the system
in order to ensure a clean system for easy confirmation. I am not sure
if it is required to implement the fix.
On Fri, 27 Feb 2015, Brock Palen wrote:
It does apply to us.
We have now disabled job stats.
In our testing the result is not obvious: even though we turned this off, the
memory did not free up. Will the nodes need a reboot? drop_caches?
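
One way to check whether the anonymous memory is actually released is to
compare two snapshots of /proc/meminfo, taken before and after a change such
as disabling job stats or writing to drop_caches. The following is a minimal
sketch, not a tested tool; the function names (parse_meminfo, read_meminfo,
anon_delta_kb) are hypothetical helpers and not part of Lustre or any
standard package. Note that drop_caches only targets clean page cache and
reclaimable slab objects, so it would not be expected to shrink AnonPages.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Values are in kB for fields that carry a 'kB' unit; unitless
    fields (e.g. HugePages_Total) are stored as plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


def anon_delta_kb(before, after):
    """Change in AnonPages between two snapshots; negative means freed."""
    return after.get("AnonPages", 0) - before.get("AnonPages", 0)
```

Usage would be along the lines of: before = read_meminfo(), apply the change
on the node, after = read_meminfo(), then print anon_delta_kb(before, after).
A value near zero would suggest the anonymous pages were not released.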
Brock Palen
www.umich.edu/~brockp
Assoc. Director Advanced Research Computing - TS
XSEDE Campus Champion
brockp(a)umich.edu
(734)936-1985
> On Feb 27, 2015, at 4:09 PM, Herbert Yeung <hyeung(a)nas.nasa.gov> wrote:
>
> Hi Brock,
>
> We ran into a problem similar to yours that was fixed in Lustre 2.5.3 for
> the clients. Take a look at LU-5179 to see if it applies to you.
>
> On Fri, 27 Feb 2015, Brock Palen wrote:
>
>> We think we have narrowed down a problem related to our lustre system.
>>
>> We see the following problem on our clients, but only for some codes. We
>> slowly watch systems, both compute and login, run out of memory. No
>> processes are using a significant amount of memory.
>>
>> One user's code (starx) can run the entire system out of memory in a
>> single run, but only when reading its data from our Lustre mount. When
>> using NFS, the issue doesn't happen.
>>
>> Client: 2.5.2
>> Server: 2.5.19 (ddn exascaler 2.1)
>> RHEL6 kernel 2.6.32-431.40.2.el6.x86_64
>>
>> We see all the memory disappear, and this is NOT cache memory; this is
>> real memory, and it never comes back. The pages all show up as anonymous
>> pages, we start swapping, and eventually the nodes OOM with no process
>> using a large amount of memory.
>>
>> Has anyone seen an issue like this? We have nodes all over our cluster
>> where the memory footprint (ignoring cache) gets smaller and smaller.
>>
>> E.g., here is one of our login nodes:
>> Mem: 49416180k total, 48890424k used, 525756k free, 18300k buffers
>> Swap: 12582908k total, 12582908k used, 0k free, 727252k cached
>>
>> Here is the result of ps -eo pid,args,pmem --sort pmem
>> http://pastebin.com/tDcBq1HF
>>
>> Here is the result of
>> slabtop -o and cat /proc/meminfo
>> http://pastebin.com/RAGrHNzd
>>
>> We can reproduce the problem on demand with the specific user's code, but
>> as noted we think it affects other nodes.
>>
>> Thanks!
>>
>> _______________________________________________
>> HPDD-discuss mailing list
>> HPDD-discuss(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/hpdd-discuss
>
> --
> Herbert DM Yeung
> System Administrator
> Email: Herbert.Yeung(a)nasa.gov
> Phone #: (650) 604-2246