If you have done a file-level backup and restore before your upgrade, it sounds like the
MDS is rebuilding the Object Index files. I thought from your original email that you had
only changed the NIDs and maybe updated the MDS node.
It would have been much faster to do a device-level backup/restore using "dd" in
this case, since the OI scrub wouldn't be needed.
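For future reference, a device-level copy is roughly the following (the device names would be your real MDT block device; the sketch below uses a small scratch image file instead, so it is safe to run anywhere):

```shell
# Hypothetical: in real use, if= would be the old MDT device (e.g. /dev/sdX)
# and of= the new one. A scratch image file stands in for both here.
dd if=/dev/zero of=old_mdt.img bs=1M count=4 2>/dev/null   # stand-in "device"
dd if=old_mdt.img of=new_mdt.img bs=1M 2>/dev/null         # raw block copy; OI files survive intact
cmp -s old_mdt.img new_mdt.img && echo "copy verified"
```

Because the copy is bit-for-bit, the Object Index files come across unchanged and no scrub is triggered.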
You can check progress via /proc/fs/lustre/osd-ldiskfs/{MDT name}/scrub (I think). It
should tell you the current progress and scanning rate. It should be able to run at tens
of thousands of files per second. That said, few people have as many small files as you
do.
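If the proc file isn't obvious to read, something like this pulls out the rate. The path and field names here are from memory and vary between versions, so the sample output captured in the variable below is only illustrative, not what your system will print:

```shell
# On the MDS you would read the live file, e.g.:
#   cat /proc/fs/lustre/osd-ldiskfs/<fsname>-MDT0000/oi_scrub
# Sample output (illustrative values only) stands in for the live file here.
scrub_status='name: OI_scrub
status: scanning
checked: 12000000
run_time: 3600 seconds
speed: 3333 objects/sec'
echo "$scrub_status" | awk '/^speed:/ {print "scanning at", $2, $3}'
```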
I would be interested to see what your scrub statistics are.
Cheers, Andreas
On 2013-10-18, at 4:06, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
Thanks for the quick reply.
What's preventing use of the system is that for some reason the file content doesn't
seem to match its metadata. The client systems hang the connection at login (which I
believe is working as designed), and when I list the first-level directories of the
mount (/home), it very quickly brings up what content it has, but what it has grows very
slowly. Yesterday, ls /home/* hung, but it no longer does today; user logins still hang,
probably because the contents of ~/.login and ~/.bashrc don't come up. Indeed, I can see
the entries in the home directories, and some subdirectories, though not all, and I
cannot cat all files. I conjecture that since directories are just files of a sort, the
metadata/content issue affects all 1.5*10^9 files.
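If whatever is being rebuilt scans at, say, 10,000 files/sec (purely an assumed rate on my part, not a measured one), the back-of-the-envelope arithmetic looks like this:

```shell
files=1500000000      # ~1.5*10^9 files, as above
rate=10000            # assumed sustained scan rate, files/sec
secs=$((files / rate))
echo "$((secs / 86400)) days $(((secs % 86400) / 3600)) hours"
# → 1 days 17 hours
```

At the rate we actually seem to be seeing, it would of course take far longer.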
Looking with iostat, the MDS is averaging some 0.1 TPS at most and writing maybe a block
a second. As mentioned, there's 13 GB of free RAM (i.e. buffers) on the MDS and the
system is 99.9% idle. Plenty of resources and nothing happening. Any ideas on how to
start tracking down the problem? (NB: see also the zfs issue below.)
Yes, I switched the hardware under the MDS, but CentOS 6.x tar seems to handle --xattrs,
so in principle the slow progress in rebuilding (whatever is being rebuilt) remains
unexplained. The MDS is a quad-core Opteron with 32 GB RAM; the OSSes are the same as
earlier, dual Xeon 5130s with 8 GB RAM, which seems sufficient. The disk units are
SAS-attached shelves of up to 24 disks. The SAS controllers are standard LSI ones, and
I've seen them perform at or in excess of 100 MB/s.
I have seen similar behaviour earlier with zfs, where writing simply does not happen at
any reasonable speed after about 20 TiB, but I had unfortunately turned on confounding
factors like compression and dedup, which are known to be broken. Hence I did not follow
it up, especially since it seems a longstanding, nontrivial issue, and since the zfs
developers seem busier integrating into Lustre (and yes, the latest Lustre 2.3 didn't
compile cleanly with the zfs stuff turned off). I did suspect some combination of write
throttling and wait-for-flush/commit that explodes after an unusually large dataset
(i.e. 20+ TiB), but no tunable fixed anything, and eventually it seemed the better
option to just give up on zfs. We now have ldiskfs. And yes, our dataset will no doubt
exceed 70 TiB before the year is out.
The major reason for 2.3 was that 2.4 did not yet exist, and 2.3 was the first release
to allow big OST slices. With modern disks, and with nobody wanting to fund the required
computing hardware (we do consider ours an HPC cluster), running 4-disk RAID-6s was
deemed an unacceptable waste. In theory, and especially if it's deemed necessary, I
could upgrade to 2.4, but our informaticians have been out of work for more than a week
now, and a week or two more for the upgrade is really not a good idea.
Thankfully yours,
Olli
Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
> On 2013/10/17 5:34 AM, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
>
>> Hi,
>>
>> We run four-node Lustre 2.3, and I needed to both change hardware
>> under MGS/MDS and reassign an OSS ip. Just the same, I added a brand
>> new 10GE network to the system, which was the reason for MDS hardware
>> change.
>
> Note that in Lustre 2.4 there is a "lctl replace_nids" command that
> allows you to change the NIDs without running --writeconf. That doesn't
> help you now, but possibly in the future.
>
>> I ran tunefs.lustre --writeconf as per chapter 14.4 in Lustre Manual,
>> and everything mounts fine. Log regeneration apparently works, since
>> it seems to do something, but exceedingly slowly. Disks show all but
>> no activity, CPU utilization is zero across the board, and memory
>> should be no issue. I believe it works, but currently it seems the
>> 1,5*10^9 files (some 55 TiB of data) won't be indexed in a week. My
>> boss isn't happy when I can't even predict how long this will take, or
>> even say for sure that it really works.
>
> The --writeconf information is at most a few kB and should only take
> seconds to complete. What "reindexing" operation are you referencing?
> It should be possible to mount the filesystem immediately (MGS first,
> then MDS and OSSes) after running --writeconf.
>
> You didn't really explain what is preventing you from using the filesystem,
> since you said it mounted properly?
>
>> Two questions: is there a way to know how fast it is progressing
>> and/or where it is at, or even that it really works, and is there a
>> way to speed up whatever is slowing it down? Seems all diagnostic
>> /proc entries have been removed from 2.3. I have tried mounting the
>> Lustre partitions with -o nobarrier (yes, I know it's dangerous, but
>> I'd really need to speed things up) but I don't know if that does
>> anything at all.
>
> I doubt that the "-o nobarrier" is helping you much.
>
>> We run Centos 6.x in Lustre servers, where Lustre has been installed
>> from rpm's from Whamcloud/Intel build bot, and Ubuntu 10.04 in clients
>> with hand compiled kernel and Lustre. One MGC/MGS with twelve 15k-RPM
>> SAS disks in RAID-10 as MDT that is all but empty, and six variously
>> built RAID-6's in SAS-attached shelves in three OSS's.
--
Olli Lounela
IT specialist and administrator
DNA sequencing and genomics
Institute of Biotechnology
University of Helsinki