On Apr 8, 2015, at 5:05 PM, Kumar, Amit <ahkumar(a)mail.smu.edu> wrote:
We had a power outage and recovered perfectly fine, except that 2 of
the OSS servers mounting OSTs over IB from the DDN storage seem to be dead slow. Reads
are fine: I get pretty good read performance, about 1200 MB/s. But writes run at about
4 MB/s, whereas OSTs on other OSSs are doing perfectly fine at about 350 MB/s.
There are no hardware errors on the OSS servers, storage controllers, etc. The storage
controllers connected to these two problem OSSs also serve two other OSSs, and their
performance is perfectly fine.
Any help or direction for debugging this would be very helpful. I am running out of ideas on
what could cause this. Could it be that it takes a while to recover since the file system
crashed?
Are all OSTs on those two OSS servers slow? Have you looked at the IB counters to see if
there are any errors?
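For the IB counters, something like the following might be a starting point. It is only
a minimal sketch, assuming a Linux OSS that exposes the standard per-port counters under
/sys/class/infiniband. The function name and the choice of which counters to check are
illustrative assumptions on my part, not part of any vendor tool. It reports every error
counter that is non-zero, which you could then compare between a slow OSS and a healthy one.

```python
from pathlib import Path

# Error counters that normally stay at (or near) zero on a healthy link.
# Which ones to watch is an assumption; adjust for your fabric.
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_xmit_discards",
    "local_link_integrity_errors",
    "excessive_buffer_overrun_errors",
)


def nonzero_ib_errors(root="/sys/class/infiniband"):
    """Return {(device, port, counter): value} for each non-zero error counter."""
    found = {}
    # Layout: <root>/<device>/ports/<port>/counters/<counter_name>
    for counters_dir in sorted(Path(root).glob("*/ports/*/counters")):
        device = counters_dir.parts[-4]
        port = counters_dir.parts[-2]
        for name in ERROR_COUNTERS:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this HCA; skip it.
                continue
            if value:
                found[(device, port, name)] = value
    return found


if __name__ == "__main__":
    for (device, port, name), value in nonzero_ib_errors().items():
        print(f"{device} port {port}: {name} = {value}")
```

These counters are cumulative, so a non-zero value only matters if it keeps climbing
while a write test is running. Taking two readings a minute apart is more telling than a
single snapshot.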
Another thing you could try would be to look at the performance counters on the DDN
controllers to see if there is anything out of the ordinary like unusually long write
latencies or IO sizes that are smaller than you are expecting.
Have you tried restarting the servers and/or the DDN controllers to see if that clears
anything up?
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu