We have seen MDSes (admittedly with smallish memory) OOM'ing while testing
2.5.3 whereas there was no problem with 2.5.0. It turns out the problem is
that, even though we have lru_size=800 everywhere, the client LDLM LRUs
grow huge, so the unreclaimable ldlm slabs on the MDS fill memory.
It looks like the root cause is the change to ldlm_cancel_aged_policy() in
commit 0a6c6fcd46 on the 2.5 branch (LU-4786 osc: to not pick busy pages
for ELC) - it has changed the lru_size != 0 behaviour. Prior to that, the
non-lru_resize behaviour (at least through the early_lock_cancel path which
is what we see being hit) was effectively:
    cancel lock if (too many in lru cache || lock unused too long)
In 2.5.3, it's:
    cancel lock if (too many in lru cache && lock unused too long)
Disabling early_lock_cancel doesn't seem to help.
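To make the difference between the two conditions concrete, here is a minimal sketch in C. It is not the actual ldlm code: the names cancel_or, cancel_and and locks_left, the function signatures, and the max-age figure used in the note below are all made up for illustration, and "too many in lru cache" is assumed to mean "more unused locks than lru_size".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical signature: should the oldest unused lock be cancelled? */
typedef bool (*cancel_policy_t)(unsigned in_lru, unsigned lru_size,
                                unsigned idle_secs, unsigned max_age_secs);

/* Pre-change behaviour as described above: either condition is enough. */
static bool cancel_or(unsigned in_lru, unsigned lru_size,
                      unsigned idle_secs, unsigned max_age_secs)
{
    return in_lru > lru_size || idle_secs > max_age_secs;
}

/* 2.5.3 behaviour as described above: both conditions must hold. */
static bool cancel_and(unsigned in_lru, unsigned lru_size,
                       unsigned idle_secs, unsigned max_age_secs)
{
    return in_lru > lru_size && idle_secs > max_age_secs;
}

/*
 * Simulate one cancel pass over an LRU in which every lock has been idle
 * for idle_secs (a simplification), and return how many locks remain.
 */
static unsigned locks_left(cancel_policy_t policy, unsigned in_lru,
                           unsigned lru_size, unsigned idle_secs,
                           unsigned max_age_secs)
{
    assert(policy != NULL);
    while (in_lru > 0 &&
           policy(in_lru, lru_size, idle_secs, max_age_secs))
        in_lru--;
    return in_lru;
}
```

With 100000 locks that were each used 10 seconds ago, lru_size=800 and an assumed max age of 36000 seconds, locks_left(cancel_or, ...) trims the LRU to 800, whereas locks_left(cancel_and, ...) cancels nothing and leaves all 100000 in place. That would match the unbounded LRU growth we are seeing.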
It might be arguable which of the two behaviours is correct but the
lru_size documentation suggests the former - the latter makes lru_size != 0
ineffective in practice. It also looks like the change was not actually
necessary for LU-4300?
Cheers
David