We are running Lustre 2.5.3 with 1 combined MDS/MDT and 44 OSTs. The filesystem
currently contains 120TB of data in over 35M files.
On the weekend, our MDS server crashed due to an IO hang. After restarting the
server, we started hitting the LU-5040 bug during recovery:
kernel BUG at fs/jbd2/transaction.c:1033!
kernel: invalid opcode: 0000 [#1] SMP
I attempted a restart of all OST and MDT mounts with abort_recov. The
filesystem was then able to mount on a client, with all OSTs connected.
However, the first access to any files or metadata caused the MDS to panic, and
it also showed indications of LU-5392.
Does this indicate a corrupted quota subsystem? I was trying to find a means
of rebuilding the quota records. However, "lfs quotacheck" is no longer
supported as it states "since space accounting is always enabled".
If the quotas are corrupted, how can I recover them? Likewise, how can I
recover from the two bugs mentioned above? I have some time flexibility to
resolve it, if that would assist in getting the bugs addressed and my filesystem
back online.
Any assistance would be appreciated.
Gary.
--
Gary Molenkamp SHARCNET
Systems Administrator University of Western Ontario
Compute/Calcul Canada
http://www.computecanada.org
gary(a)sharcnet.ca
http://www.sharcnet.ca
(519) 661-2111 x88429 (519) 661-4000