On Tue, Oct 07, 2014 at 02:00:16PM +0200, Michael Kluge wrote:
> we recently reactivated an OST and now the 450 nodes send this OST
> many more I/O requests than they send to the other osts. We have 4
> servers and 48 osts. The other oss have a load of about 100. The
> server with this ost has 1500 and many "filter_commitrw_write())
> scratch-OST002f: slow i_mutex 30s" messages. Thus, I set
> qos_prio_free to 0 in
> /proc/fs/lustre/lov/scratch-MDT0000-mdtlov
> Does anyone have an expectation how long it will take until the load
> on this server will go down? Hours? Days?

perhaps the reactivated OST is truly (physically) slow? it may have a
slow disk in it. it happens :)
another explanation for anomalous load is that the other OSTs are
almost full (of data, see 'lfs df', or of inodes, see 'lfs df -i'), so
the reactivated OST is being preferentially chosen for i/o by the load
balancing algorithm. that is tweakable.
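
as a rough illustration of when that preference kicks in: as far as I
know, the allocator moves from round-robin to free-space weighting once
OSTs are imbalanced by more than qos_threshold_rr (17% by default). the
sketch below is plain Python, not Lustre code. the function name, the
sample numbers, and the way imbalance is measured here (gap between the
emptiest and fullest OST, relative to the emptiest one's free space) are
assumptions for illustration only.

```python
def imbalance_exceeds_threshold(free_kb, threshold_pct=17):
    """Simplified check: is the free-space gap between OSTs large enough
    that a weighted (non-round-robin) allocator would plausibly be used?

    free_kb       -- available space per OST, e.g. the "Available" column
                     of 'lfs df' (hypothetical input, any consistent unit)
    threshold_pct -- stand-in for qos_threshold_rr, in percent

    ASSUMPTION: imbalance = (most_free - least_free) / most_free.
    The real allocator may measure this differently.
    """
    most_free = max(free_kb)
    least_free = min(free_kb)
    if most_free == 0:
        # Nothing free anywhere, so there is no imbalance to act on.
        return False
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return (most_free - least_free) * 100 > threshold_pct * most_free


if __name__ == "__main__":
    # Hypothetical numbers: three nearly full OSTs and one freshly
    # reactivated, mostly empty one.
    sample = [500, 450, 480, 9000]
    print(imbalance_exceeds_threshold(sample))
    # With the threshold at 100 percent the check can never trigger.
    print(imbalance_exceeds_threshold(sample, threshold_pct=100))
```

feeding it the available-space numbers from 'lfs df' gives a quick idea
of whether free-space weighting is likely to be in play at all.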
cheers,
robin