Andreas, Yong, apologies for resend, but I wasn't subscribed to
hpdd-discuss. Resending to fix this.
Quoting Olli Lounela <olli.lounela(a)helsinki.fi>:
First, my apologies for not noticing that the mail client dropped the
CC's.
The system still isn't up; it's excruciatingly slow to recover.
Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
> If you have done a file-level backup and restore before your
> upgrade it sounds like the MDS is rebuilding the Object Index
> files. I thought from your original email that you had only changed
> the NIDs and maybe updated the MDS node.
No, I had to switch the host. The previous one was not compatible
with the 10GE hardware. Also, the new one is much faster, larger and
more reliable in multiple ways, so I needed to do it eventually
anyway. I did expect a recovery period, but nothing in excess of
several days.
> It would have been much faster to do a device level backup/restore
> using "dd" in this case, since the OI scrub wouldn't be needed.
OK, good to know. Well, I should be able to go back since I still
have the original MDS untouched, but have been loath to do so since
at least something is happening. Should I still do that?
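For reference, a device-level copy along the lines Andreas describes might look like the sketch below. The filenames here are stand-ins for the real block devices; on the live system one would stop Lustre on the MDS first and dd the actual MDT device:

```shell
# Demonstrate a raw device-level copy using ordinary files as
# stand-ins for the MDT block devices (in production this would be
# the old and new MDT devices, with Lustre stopped on the MDS first)
dd if=/dev/urandom of=old_mdt.img bs=1M count=4 2>/dev/null

# A dd copy preserves the ldiskfs Object Index files byte-for-byte,
# which is why no OI scrub is needed after a device-level restore
dd if=old_mdt.img of=new_mdt.img bs=1M conv=fsync 2>/dev/null

# Verify the copy bit-for-bit before mounting the new device
cmp old_mdt.img new_mdt.img && echo "copy verified"
```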
> You can check progress via /proc/fs/lustre/osd-ldiskfs/{MDT
> name}/scrub (I think). It should tell you the current progress and
> scanning rate. It should be able to run at tens of thousands of
> files per second. That said, few people have so many small files as
> you do.
>
> I would be interested to see what your scrub statistics are.
AFAICT, it should have been complete in a bit over nine hours, and
it's now been nearly a week:
[root@dna-lustre-mds mdd]# cat /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/filestotal
1497235456
[root@dna-lustre-mds mdd]# cat /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/oi_scrub
name: OI scrub
magic: 0x4c5fd252
oi_files: 64
status: completed
flags:
param:
time_since_last_completed: 591570 seconds
time_since_latest_start: 592140 seconds
time_since_last_checkpoint: 591570 seconds
latest_start_position: 11
last_checkpoint_position: 1497235457
first_failure_position: N/A
checked: 25871866
updated: 25871728
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 1
run_time: 569 seconds
average_speed: 45469 objects/sec
real-time_speed: N/A
current_position: N/A
[root@dna-lustre-mds mdd]# echo '1497235457/45469/60^2'|bc -l
9.14686353461821363028
If 'updated' is how far it has gotten, and if it must update them
all, doesn't that mean it's just 1.73% done in 6 days?!? Oops,
"only" a year to go..?
This seems to be the typical top speed based on iostat:
10/21/2013 11:09:01 AM
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.10 0.00 1.20 0 24
The disk subsystem is pretty fast, though:
[root@dna-lustre-mds mdd]# dd if=/dev/sda of=/dev/null bs=1024k count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 10.5425 s, 1.0 GB/s
I dare not write there unless told how; as is the case with most
HPC/bioinformatics labs, we cannot make backups of significant
amounts of data, since there is simply too much.
MDS is 99.9-100% idle and no memory pressure:
[root@dna-lustre-mds mdd]# free
total used free shared buffers cached
Mem: 32864824 32527208 337616 0 13218200 114504
-/+ buffers/cache: 19194504 13670320
Swap: 124999672 0 124999672
I cannot explain the slowness in any way, for all practical purposes
there's nothing happening at all. If the system was physically hard
pressed to cope, I would be much happier, at least I'd know what to
do...
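If something is still scanning, one way to pin down the effective rate would be to sample the 'checked' counter from oi_scrub twice and divide by the interval. A sketch, parsing saved copies of the output; on the live system one would cat /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/oi_scrub directly, and the sample values here are made up:

```shell
# Two snapshots of the oi_scrub 'checked' counter, taken 60 s apart
# (file contents are illustrative stand-ins for the /proc output)
cat > scrub_t0.txt <<'EOF'
checked: 25871866
EOF
cat > scrub_t1.txt <<'EOF'
checked: 25872466
EOF

c0=$(awk '/^checked:/ {print $2}' scrub_t0.txt)
c1=$(awk '/^checked:/ {print $2}' scrub_t1.txt)
interval=60   # seconds between the two snapshots

# On the sample data this prints "effective rate: 10 objects/sec"
echo "effective rate: $(( (c1 - c0) / interval )) objects/sec"
```

A rate near zero would confirm that, whatever the MDS is doing, it is not the OI scrub making progress.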
Thanks again,
Olli
> On 2013-10-18, at 4:06, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
>
>> Thanks for the quick reply.
>>
>> What's preventing use of the system is that for some reason the
>> file content doesn't seem to match its metadata. The client systems
>> hang the connection at login (which I believe is working as
>> designed), and when I try listing the mount's (/home) first-level
>> directories, it very quickly brings up what content it has, but
>> what it has grows very slowly. Yesterday, ls /home/* hung, but no
>> longer today, and user logins hang, probably because ~/.login and
>> ~/.bashrc contents don't come up. Indeed, I can see the entries in
>> the home directories, and some subdirectories, though not all of
>> them, and I cannot cat all files. I conjecture that since
>> directories are just files of a sort, the metadata/content issue
>> affects all 1.5*10^9 files.
>>
>> Looking with iostat, the MDS is averaging some 0.1 TPS at most and
>> writing maybe a block a second. As mentioned, there's 13 GB free
>> RAM (i.e. buffers) in MDS and the system is 99.9% idle. Plenty of
>> resources and nothing happening. Any ideas how to start tracking
>> the problem? (NB: see also the zfs issue below.)
>>
>> Yes, I switched the hardware under MDS, but Centos 6.x tar seems
>> to handle --xattrs, so in principle the slow progress in
>> rebuilding (whatever is being rebuilt) remains unexplained. The
>> MDS is a quad-core Opteron with 32 GB RAM; the OSS's are the same as
>> earlier, dual Xeon 5130's with 8 GB RAM, which seems sufficient.
>> The disk units are SAS-attached shelves of up to 24 disks.
>> SAS-controllers are standard LSI ones, and I've seen them
>> performing at or in excess of 100 MBps.
>>
>> I have seen similar behaviour earlier with zfs, where writing just
>> does not happen at any reasonable speed after about 20 TiB, but I
>> had unfortunately turned on confounding factors like compression
>> and dedup, which are known to be broken. Hence I did not follow it
>> up, especially since it seems a longstanding/nontrivial issue, and
>> since it seems the zfs developers are busy integrating into Lustre
>> (and yes, the latest Lustre 2.3 didn't compile cleanly with the
>> zfs stuff turned off). I did suspect that there is some sort of
>> combination of write throttling and wait-for-flush/commit that
>> explodes after an unusually large dataset (i.e. 20+ TiB), but no
>> tunable fixed anything, and eventually it seemed the better option
>> just to give up on zfs. We now have ldiskfs. And yes, our dataset
>> will no doubt exceed 70 TiB before the year is out.
>>
>> The major reason for 2.3 was that 2.4 did not yet exist and 2.3
>> was the first to allow for big OST slices. With modern disks and
>> nobody wanting to fund required computing hardware (we do consider
>> ours an HPC cluster), running 4-disk RAID-6's was deemed an
>> unacceptable waste. In theory, and especially if it's deemed
>> necessary, I could upgrade to 2.4, but our informaticians have
>> been out of work for more than a week now, and a week or two more
>> for the upgrade is really not a good idea.
>>
>> Thankfully yours,
>> Olli
>>
>> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
>>
>>> On 2013/10/17 5:34 AM, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
>>>
>>>> Hi,
>>>>
>>>> We run four-node Lustre 2.3, and I needed to both change hardware
>>>> under MGS/MDS and reassign an OSS ip. Just the same, I added a brand
>>>> new 10GE network to the system, which was the reason for MDS hardware
>>>> change.
>>>
>>> Note that in Lustre 2.4 there is a "lctl replace_nids" command that
>>> allows you to change the NIDs without running --writeconf. That doesn't
>>> help you now, but possibly in the future.
>>>
>>>> I ran tunefs.lustre --writeconf as per chapter 14.4 in Lustre Manual,
>>>> and everything mounts fine. Log regeneration apparently works, since
>>>> it seems to do something, but exceedingly slowly. Disks show all but
>>>> no activity, CPU utilization is zero across the board, and memory
>>>> should be no issue. I believe it works, but currently it seems the
>>>> 1.5*10^9 files (some 55 TiB of data) won't be indexed in a week. My
>>>> boss isn't happy when I can't even predict how long this will take, or
>>>> even say for sure that it really works.
>>>
>>> The --writeconf information is at most a few kB and should only take
>>> seconds to complete. What "reindexing" operation are you referencing?
>>> It should be possible to mount the filesystem immediately (MGS first,
>>> then MDS and OSSes) after running --writeconf.
>>>
>>> You didn't really explain what is preventing you from using the
>>> filesystem,
>>> since you said it mounted properly?
>>>
>>>> Two questions: is there a way to know how fast it is progressing
>>>> and/or where it is at, or even that it really works, and is there a
>>>> way to speed up whatever is slowing it down? Seems all diagnostic
>>>> /proc entries have been removed from 2.3. I have tried mounting the
>>>> Lustre partitions with -o nobarrier (yes, I know it's dangerous, but
>>>> I'd really need to speed things up) but I don't know if that does
>>>> anything at all.
>>>
>>> I doubt that the "-o nobarrier" is helping you much.
>>>
>>>> We run Centos 6.x in Lustre servers, where Lustre has been installed
>>>> from rpm's from Whamcloud/Intel build bot, and Ubuntu 10.04 in clients
>>>> with hand-compiled kernel and Lustre. One combined MGS/MDS with
>>>> twelve 15k-RPM SAS disks in RAID-10 as the MDT, which is all but
>>>> empty, and six variously built RAID-6's in SAS-attached shelves
>>>> in three OSS's.
>>
>> --
>> Olli Lounela
>> IT specialist and administrator
>> DNA sequencing and genomics
>> Institute of Biotechnology
>> University of Helsinki
--
Olli Lounela
IT specialist and administrator
DNA sequencing and genomics
Institute of Biotechnology
University of Helsinki