On a lot of our compute nodes, something is causing ldlm_poold to start chewing 100% CPU.
The nodes are fine for a while, and then at some point something goes wrong and
ldlm_poold goes nuts. Sometimes a bunch of ldlm_bl_ processes are consuming large
amounts of CPU as well.
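
One way to confirm which of these threads are busy is to read their cumulative CPU time from /proc. The following is a minimal sketch, not something from the original report: it assumes a Linux /proc layout, and the helper names `parse_stat` and `ldlm_cpu_ticks` are hypothetical.

```python
import os

def parse_stat(line):
    """Return (comm, cpu_ticks) parsed from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so locate it using the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    comm = line[start + 1:end]
    fields = line[end + 1:].split()
    # fields[0] is the state (stat field 3), so utime and stime
    # (stat fields 14 and 15) sit at indexes 11 and 12.
    utime, stime = int(fields[11]), int(fields[12])
    return comm, utime + stime

def ldlm_cpu_ticks(proc="/proc"):
    """Map (pid, comm) -> cumulative CPU ticks for threads named ldlm_*."""
    result = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, ticks = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if comm.startswith("ldlm_"):
            result[(int(entry), comm)] = ticks
    return result

if __name__ == "__main__":
    for (pid, comm), ticks in sorted(ldlm_cpu_ticks().items()):
        print(pid, comm, ticks)
```

The values are cumulative clock ticks, so calling `ldlm_cpu_ticks()` twice a few seconds apart and comparing the results shows which threads are actively spinning.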
Once the node gets into this state, it has to be rebooted to fix it. The node won't
even shut down cleanly, as umount of /lustre ends up hanging.
There are no Lustre-related messages in syslog while the erroneous behavior is taking
place.
The clients and servers have all been running Lustre 2.5.3 for the last few months on
Red Hat 6 (2.6.32-504.3.3.el6.x86_64).
I saw the bug report LU-5415, which looks similar, but that report says that 2.5.3
includes a fix for that particular problem.
Has anyone else seen this behavior?
Thanks,
Kevin
---
Kevin Hildebrand
Division of IT
University of Maryland, College Park
Interestingly, I just happened to come across an old bug report, LU-416, which mentions
that flushing the page cache alleviates the problem. I just tried that on some of my
affected machines, and it does indeed seem to return ldlm_poold to normal operation.
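
The message does not say which command was used to flush the page cache. As a minimal sketch, assuming the standard Linux `/proc/sys/vm/drop_caches` interface (the function name `drop_page_cache` is hypothetical), it could look like this:

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"  # standard Linux sysctl file

def drop_page_cache(path=DROP_CACHES, mode=1):
    """Ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    Writing to the real file requires root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    os.sync()  # flush dirty pages first so more of the cache can be dropped
    with open(path, "w") as f:
        f.write(f"{mode}\n")

if __name__ == "__main__":
    drop_page_cache()
```

This only discards clean cached data, so it is non-destructive, but whether it clears the ldlm_poold condition in every case is not established by this thread.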
Kevin
From: HPDD-discuss [mailto:hpdd-discuss-bounces@lists.01.org] On Behalf Of Kevin M.
Hildebrand
Sent: Friday, April 17, 2015 7:55 AM
To: hpdd-discuss(a)lists.01.org
Subject: [HPDD-discuss] ldlm_poold causing high client load