All clients were likewise shut down, and the filesystem was mounted
only once the servers were up -- apart, of course, from the snail-pace
recovery issue.
I do understand that FID-in-dirent might be the issue, but really,
what is causing the stalls and this extreme slowness? Why is the
system not doing recovery at full speed?
Now, your mail seems a bit short on suggestions for how to pursue the
problem. However, this is the second time it has been suggested to me
that 2.4 has desirable features. It was a months-long process to get
2.3 to compile properly on Ubuntu 10.04, but can I realistically
expect 2.4 to be all good and a piece of cake? I do realize that the
zfs inclusion is a big part of the problem in 2.3, but then, I have no
idea where 2.4 currently stands.
Maybe I should mention again that I have observed similar
idle-but-stalled-writing behaviour in zfs. The potential implications
seem unsettling, knowing that zfs is (was?) being integrated into
Lustre 2.3. Also, it may be worth emphasizing that for the servers I'm
using the latest 2.3 rpm's from Jenkins, and for the clients (which
AFAICT are not the problem) I use the same source, hand-compiled,
since the .deb's don't work properly on Ubuntu without zfs (kernel
2.6.32.60+drm33.26, i.e. pristine source just patched with Lustre.)
Provided this issue should be traced -- and please let me know if this
is not so -- I have questions: how do I find out what's happening,
what's being recovered, and how do I get any kind of diagnostics to
help solve the issue? Up until now, I've presumed that people want it
solved, since unknown problems have a way of popping up in unexpected
places and contaminating new versions too. I'm trying to read section
28.2 of the manual for hints, knowing what the thread is, but the
learning curve seems rather steep, as I'm a sysadmin and not a
professional codewright:
[root@dna-lustre-mds mdd]# ps axf | grep 1478
1478 ? S 0:04 \_ [ll_mgs_0002]
I note that if this is the stalling thread, then 4 seconds of CPU
time really does tell that in the week it has done essentially nothing.
Then again, I do presume the problem is in the MDS/MGS, since the log
on the OSS is as follows (10.10.10.230 is currently the only active
client), and AFAICT there's no indication of anything wrong:
Lustre: g4data-OST0000: Bulk IO read error with
f4f71ac3-133b-4b3a-b072-61160c86e688 (at 10.10.10.230@tcp), client
will retry: rc -110
Lustre: g4data-OST0002: Client f4f71ac3-133b-4b3a-b072-61160c86e688
(at 10.10.10.230@tcp) reconnecting
Lustre: Skipped 14 previous similar messages
LustreError: 1956:0:(events.c:443:server_bulk_callback()) event type
5, status -5, desc ffff880222c06800
Lustre: g4data-OST0000: Client f4f71ac3-133b-4b3a-b072-61160c86e688
(at 10.10.10.230@tcp) refused reconnection, still busy with 1 active
RPCs
LustreError: 2003:0:(ldlm_lib.c:2784:target_bulk_io()) @@@ Reconnect
on bulk PUT req@ffff88021a768400 x1448977372949418/t0(0)
o3->f4f71ac3-133b-4b3a-b072-61160c86e688@10.10.10.230@tcp:0/0 lens
488/432 e 0 to 0 dl 1382432574 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: g4data-OST0000: Bulk IO read error with
f4f71ac3-133b-4b3a-b072-61160c86e688 (at 10.10.10.230@tcp), client
will retry: rc -110
LustreError: 1956:0:(events.c:443:server_bulk_callback()) event type
5, status -5, desc ffff880229233e00
The client still stalls/hangs.
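For whoever reads these logs later: the negative rc/status values above are negated Linux kernel errno codes, so they can be decoded mechanically. A small sketch in plain Python (nothing Lustre-specific; the helper name is mine):

```python
# Decode negative Lustre rc/status values from server logs: Lustre
# returns kernel errno values negated, so the standard errno/os modules
# can name them.
import errno
import os

def decode_rc(rc: int) -> str:
    """Map a negative Lustre rc to its errno name and description."""
    err = -rc
    name = errno.errorcode.get(err, "unknown")
    return f"rc {rc} = -{name} ({os.strerror(err)})"

# The two codes from the OSS log above: on Linux, -110 is ETIMEDOUT
# (the bulk transfer timed out, hence "client will retry") and -5 is
# EIO from the LNet bulk callback.
for rc in (-110, -5):
    print(decode_rc(rc))
```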
Any pointers or help appreciated,
Yours,
Olli
Quoting "Yong, Fan" <fan.yong(a)intel.com>:
Another possible cause of very slow directory traversal is that the
FID-in-dirent information was lost during your file-level
backup/restore. LFSCK-1.5 can rebuild such attributes, but it is only
available since Lustre-2.4.
An old client mounted to the system before the backup may have cached
some stale FID information, which may cause some strange behavior
after the restore. But for a client newly mounted after the OI scrub
finished, operations (such as "ls", "du", "wc", etc.) on that client
can be slow, but should not hang. As for the lock race and recovery,
we need to analyze the Lustre debug log case by case.
--
Cheers,
Nasf
> -----Original Message-----
> From: Olli Lounela [mailto:olli.lounela@helsinki.fi]
> Sent: Monday, October 21, 2013 11:03 PM
> To: Yong, Fan
> Cc: Dilger, Andreas; hpdd-discuss(a)lists.01.org
> Subject: Re: [Lustre-discuss] Speeding up configuration log regeneration?
>
> Hi, and thanks again!
>
> Here you are:
> [root@dna-lustre-mds mdd]# lctl get_param -n mdt.g4data-
> MDT0000.identity_upcall
> /usr/sbin/l_getidentity
> [root@dna-lustre-mds mdd]# ls -l /usr/sbin/l_getidentity -rwxr-xr-x 1 root
> root 24384 Oct 20 2012 /usr/sbin/l_getidentity
>
> This is what OSS seems to say in this kind of situation:
>
> Lustre: g4data-OST0002: Client f4f71ac3-133b-4b3a-b072-61160c86e688
> (at 10.10.10.230@tcp) reconnecting
> Lustre: Skipped 6 previous similar messages
> Lustre: g4data-OST0000: haven't heard from client
> f4f71ac3-133b-4b3a-b072-61160c86e688 (at 10.10.10.230@tcp) in 244
> seconds. I think it's dead, and I am evicting it. exp ffff88021a822800, cur
> 1382364252 expire 1382364102 last 1382364008
> Lustre: g4data-OST0002: Client f4f71ac3-133b-4b3a-b072-61160c86e688
> (at 10.10.10.230@tcp) reconnecting
> Lustre: Skipped 7 previous similar messages
> Lustre: g4data-OST0002: Client f4f71ac3-133b-4b3a-b072-61160c86e688
> (at 10.10.10.230@tcp) refused reconnection, still busy with 1 active RPCs
> LustreError: 2003:0:(ldlm_lib.c:2784:target_bulk_io()) @@@ Reconnect on
> bulk PUT req@ffff8801bc0b0800 x1448977372213663/t0(0)
> o3->f4f71ac3-133b-4b3a-b072-61160c86e688@10.10.10.230@tcp:0/0 lens
> 488/432 e 0 to 0 dl 1382364740 ref 1 fl Interpret:/2/0 rc 0/0
> Lustre: g4data-OST0002: Bulk IO read error with
> f4f71ac3-133b-4b3a-b072-61160c86e688 (at 10.10.10.230@tcp), client will
> retry: rc -110
> Lustre: g4data-OST0000: Client f4f71ac3-133b-4b3a-b072-61160c86e688
> (at 10.10.10.230@tcp) reconnecting
> Lustre: g4data-OST0002: Client f4f71ac3-133b-4b3a-b072-61160c86e688
> (at 10.10.10.230@tcp) reconnecting
>
> I seem to have two messages possibly related to this on the MDS, and
> I notice that the first one is missing its trace. However, the task
> name is tgt_recov, which seems meaningful:
>
> Oct 20 00:41:49 dna-lustre-mds kernel: LNet: Service thread pid 1478 was
> inactive for 0.00s. The thread might be hung, or it might only be
> slow and will
> resume later. Dumping the stack trace for debugging
> purposes:
> Oct 20 00:41:49 dna-lustre-mds kernel: Pid: 1478, comm: ll_mgs_0002 Oct 20
> 00:41:49 dna-lustre-mds kernel:
> Oct 20 00:41:49 dna-lustre-mds kernel: Call Trace:
> Oct 20 00:41:49 dna-lustre-mds kernel: LNet: Service thread pid 1478 was
> inactive for 0.00s. The thread might be hung, or it might only be
> slow and will
> resume later. Dumping the stack trace for debugging
> purposes:
> Oct 20 00:41:49 dna-lustre-mds kernel: Pid: 1478, comm: ll_mgs_0002 Oct 20
> 00:41:49 dna-lustre-mds kernel:
> Oct 20 00:41:49 dna-lustre-mds kernel: Call Trace:
> Oct 20 00:41:49 dna-lustre-mds kernel: LNet: Service thread pid 1478
> completed after 0.00s. This indicates the system was overloaded (too many
> service threads, or there were not enough hardware resources).
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffffa057a8fc>] ?
> lustre_msg_get_transno+0x8c/0x100 [ptlrpc] Oct 20 00:41:49 dna-lustre-
> mds kernel: [<ffffffffa02b55f1>] ?
> libcfs_debug_msg+0x41/0x50 [libcfs]
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffffa02a577e>] ?
> cfs_waitq_wait+0xe/0x10 [libcfs]
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffffa0581127>] ?
> ptlrpc_wait_event+0x297/0x2a0 [ptlrpc]
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffff81060250>] ?
> default_wake_function+0x0/0x20
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffffa058ad1b>] ?
> ptlrpc_main+0x7fb/0x19e0 [ptlrpc]
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffffa058a520>] ?
> ptlrpc_main+0x0/0x19e0 [ptlrpc]
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffff8100c14a>] ?
> child_rip+0xa/0x20
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffffa058a520>] ?
> ptlrpc_main+0x0/0x19e0 [ptlrpc]
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffffa058a520>] ?
> ptlrpc_main+0x0/0x19e0 [ptlrpc]
> Oct 20 00:41:49 dna-lustre-mds kernel: [<ffffffff8100c140>] ?
> child_rip+0x0/0x20
>
> I also have plenty of blocked thread messages in MDS, but these are nothing
> new:
>
> Oct 15 16:24:47 dna-lustre-mds kernel: INFO: task tgt_recov:1497 blocked
> for more than 120 seconds.
> Oct 15 16:24:47 dna-lustre-mds kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 15 16:24:47 dna-lustre-mds kernel: tgt_recov D
> 0000000000000001 0 1497 2 0x00000080
> Oct 15 16:24:47 dna-lustre-mds kernel: ffff880806795e00
> 0000000000000046 0000000000000000 ffffffff810305de Oct 15 16:24:47
> dna-lustre-mds kernel: ffff880806795d70
> ffffffff8102a8b9 ffff880806795d80 ffffffff8104f218 Oct 15 16:24:47 dna-
> lustre-mds kernel: ffff880808a2a5f8
> ffff880806795fd8 000000000000fb88 ffff880808a2a5f8 Oct 15 16:24:47
> dna-lustre-mds kernel: Call Trace:
> Oct 15 16:24:47 dna-lustre-mds kernel: [<ffffffff810305de>] ?
> physflat_send_IPI_mask+0xe/0x10
> Oct 15 16:24:47 dna-lustre-mds kernel: [<ffffffff8102a8b9>] ?
> native_smp_send_reschedule+0x49/0x60
> Oct 15 16:24:47 dna-lustre-mds kernel: [<ffffffff8104f218>] ?
> resched_task+0x68/0x80
> Oct 15 16:24:47 dna-lustre-mds kernel: [<ffffffffa053a050>] ?
> check_for_clients+0x0/0x90 [ptlrpc]
> Oct 15 16:24:47 dna-lustre-mds kernel: [<ffffffffa053ba0d>]
> target_recovery_overseer+0x9d/0x230 [ptlrpc] Oct 15 16:24:47 dna-lustre-
> mds kernel: [<ffffffffa0539e50>] ?
> exp_connect_healthy+0x0/0x20 [ptlrpc]
> Oct 15 16:24:47 dna-lustre-mds kernel: [<ffffffff810920d0>] ?
> autoremove_wake_function+0x0/0x40
> Oct 15 16:24:47 dna-lustre-mds kernel: [<ffffffffa0542b8e>]
> target_recovery_thread+0x58e/0x1990 [ptlrpc] Oct 15 16:24:47 dna-lustre-
> mds kernel: [<ffffffffa0542600>] ?
> target_recovery_thread+0x0/0x1990 [ptlrpc] Oct 15 16:24:47 dna-lustre-mds
> kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 Oct 15 16:24:47 dna-lustre-
> mds kernel: [<ffffffffa0542600>] ?
> target_recovery_thread+0x0/0x1990 [ptlrpc] Oct 15 16:24:47 dna-lustre-mds
> kernel: [<ffffffffa0542600>] ?
> target_recovery_thread+0x0/0x1990 [ptlrpc] Oct 15 16:24:47 dna-lustre-mds
> kernel: [<ffffffff8100c140>] ?
> child_rip+0x0/0x20
>
> We have a pretty vanilla system, where most tunables were never
> changed, and those that were -- well, those changes should be lost in
> tunefs --writeconf anyway. I care not about that, since performance
> would be good enough without tuning, provided the servers actually
> served.
>
> The difference from my earlier message is that root has its home dir
> in the root partition; all others reside in Lustre. This might be
> related to the metadata not meeting the object. I have understood that
> it is normal to block a process that's trying to access some file from
> an OST that isn't readily available.
>
> Some symptoms:
> * directory contents may not be there (the process hangs for a long,
> indeterminate time)
> * file contents may not be there even if the directory entry is (ditto)
> * the amount that works seems to increase over time, although extremely
> slowly, i.e. contents magically appear some time later
>
> Case in point: before Thursday, I could not list the home directories
> at all. After that, I intermittently got them, and on Friday I could
> consistently list them. Same with subdirs and files: there are many
> files where trying to look inside hangs, but others work fine (this is
> on a client; I successfully got in after about five hours, once
> something returned an error to the client)
>
> olounela@dna-analyzer:~$ wc -l sij.txt
> 13 sij.txt
> olounela@dna-analyzer:~$ wc -l .bashrc
> (hangs)
>
> Similarly:
>
> olounela@dna-analyzer:~$ lfs df -h
> UUID bytes Used Available Use% Mounted on
> g4data-MDT0000_UUID 2.1T 4.0G 1.9T 0% /home[MDT:0]
> g4data-OST0000_UUID : Input/output error
> g4data-OST0001_UUID 13.6T 12.5T 469.0G 96% /home[OST:1]
> g4data-OST0002_UUID 13.6T 12.3T 710.5G 95% /home[OST:2]
> g4data-OST0003_UUID 13.6T 12.4T 561.7G 96% /home[OST:3]
> g4data-OST0004_UUID 32.7T 530.4M 31.1T 0% /home[OST:4]
> g4data-OST0005_UUID 32.7T 504.2M 31.1T 0% /home[OST:5]
>
> filesystem summary: 106.4T 37.2T 63.9T 37% /home
>
> I waited for over 20 min for OST0002; OST0000 came up in a few
> minutes. All the rest came fast. On the next attempt 30 min later,
> OST0000 responded without delay, but OST0002 still hangs.
>
> However, MDC/MDS thinks all are up:
>
> [root@dna-lustre-mds mdd]# lctl dl
> 0 UP mgs MGS MGS 15
> 1 UP mgc MGC10.10.10.240@tcp 813de5f1-ea4a-6254-766a-
> 242e83ca9e58 5
> 2 UP lov g4data-MDT0000-mdtlov g4data-MDT0000-mdtlov_UUID 4
> 3 UP mdt g4data-MDT0000 g4data-MDT0000_UUID 7
> 4 UP mds mdd_obd-g4data-MDT0000 mdd_obd_uuid-g4data-MDT0000 3
> 5 UP osc g4data-OST0000-osc-MDT0000 g4data-MDT0000-mdtlov_UUID 5
> 6 UP osc g4data-OST0001-osc-MDT0000 g4data-MDT0000-mdtlov_UUID 5
> 7 UP osc g4data-OST0004-osc-MDT0000 g4data-MDT0000-mdtlov_UUID 5
> 8 UP osc g4data-OST0002-osc-MDT0000 g4data-MDT0000-mdtlov_UUID 5
> 9 UP osc g4data-OST0003-osc-MDT0000 g4data-MDT0000-mdtlov_UUID 5
> 10 UP osc g4data-OST0005-osc-MDT0000 g4data-MDT0000-mdtlov_UUID
> 5
>
> And the OSS in question agrees:
>
> [root@dna-lustre-01 src]# lctl dl
> 0 UP mgc MGC10.10.10.240@tcp a2df980a-343a-bf66-7159-
> b460cb520ab9 5
> 1 UP ost OSS OSS_uuid 3
> 2 UP obdfilter g4data-OST0000 g4data-OST0000_UUID 9
> 3 UP obdfilter g4data-OST0002 g4data-OST0002_UUID 9
>
> Just in case, I probably should mention somewhere that SELinux has
> been disabled, as are the firewalls, since we're in a closed network.
> Also, the hardware seems fine, and the RAID packs report no errors.
>
> It feels as if there's some write/locking or race issue in recovery
> that blocks until a lengthy timeout. I have no idea what it is or what
> recovery might be running, and finding out is not easy since nothing
> seems to be happening (idle, no memory pressure, little disk activity,
> network unburdened.)
>
> If there's anything at all I can tell, check, or test, please let me know.
>
> Thanks,
> Olli
>
> Quoting "Yong, Fan" <fan.yong(a)intel.com>:
>
> > Hi Olli,
> >
> > I am not sure whether I understand your issues clearly. What does
> > your "Login" mean? You want to say that the client can mount the
> > new system as the "root" user, and operate successfully as "root".
> > But when you operate as a non-root user, then the operation will
> > hang there, right?
> >
> > Do you have any error message/dmesg when you hit the trouble? What is
> > the output for " lctl get_param -n
> > mdt.${FSNAME}-MDT0000.identity_upcall"? More related information and
> > descriptions are helpful.
> >
> > --
> > Cheers,
> > Nasf
> >
> >> -----Original Message-----
> >> From: Olli Lounela [mailto:olli.lounela@helsinki.fi]
> >> Sent: Monday, October 21, 2013 6:14 PM
> >> To: Yong, Fan
> >> Cc: Dilger, Andreas; hpdd-discuss(a)lists.01.org
> >> Subject: Re: [Lustre-discuss] Speeding up configuration log regeneration?
> >>
> >> Hi,
> >>
> >> Okay, my bad: I presumed MDT filestotal means how many files we got.
> >> Again presuming, if each OST's filestotal-filesfree gives a close
> >> estimate, I get just 25.5 million files in four OSTs (the two others
> >> are recently added and all but empty.) This makes the almost a week
> >> spent in still-incomplete recovery even more baffling.
> >>
> >> Login still hangs for all but root.
> >>
> >> Yours, Olli
> >>
> >> Quoting "Yong, Fan" <fan.yong(a)intel.com>:
> >>
> >> >> -----Original Message-----
> >> >> From: Olli Lounela [mailto:olli.lounela@helsinki.fi]
> >> >> Sent: Monday, October 21, 2013 4:59 PM
> >> >> To: Dilger, Andreas
> >> >> Cc: Yong, Fan; hpdd-discuss(a)lists.01.org
> >> >> Subject: Re: [Lustre-discuss] Speeding up configuration log
> regeneration?
> >> >>
> >> >> First, my apologies for not noticing that the mail client
> >> >> dropped the CC's.
> >> >>
> >> >> The system still isn't up, it's excruciatingly slow to recover.
> >> >>
> >> >> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
> >> >>
> >> >> > If you have done a file-level backup and restore before your
> >> >> > upgrade it sounds like the MDS is rebuilding the Object Index
> >> >> > files. I thought from your original email that you had only
> >> >> > changed the NIDs and maybe updated the MDS node.
> >> >>
> >> >> No, I had to switch the host. The previous one was not compatible
> >> >> with the 10GE hardware. Also, the new one is much faster, larger
> >> >> and more reliable in multiple ways, so I needed to do it
> >> >> eventually anyway. I did expect a recovery period, but nothing in
> >> >> excess of several days.
> >> >>
> >> >> > It would have been much faster to do a device level
> >> >> > backup/restore using "dd" in this case, since the OI scrub
> >> >> > wouldn't be needed.
> >> >>
> >> >> OK, good to know. Well, I should be able to go back since I still
> >> >> have the original MDS untouched, but have been loath to do so
> >> >> since at least something is happening. Should I still do that?
> >> >>
> >> >> > You can check progress via /proc/fs/lustre/osd-ldiskfs/{MDT
> >> >> > name}/scrub (I think). It should tell you the current progress
> >> >> > and scanning rate. It should be able to run at tens of
> >> >> > thousands of files per second. That said, few people have so
> >> >> > many small files as you do.
> >> >> >
> >> >> > I would be interested to see what your scrub statistics are.
> >> >>
> >> >> AFAICT, it should have been complete in a bit over nine hours, and
> >> >> it's now been nearly a week:
> >> >>
> >> >> [root@dna-lustre-mds mdd]# cat
> >> >> /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/filestotal
> >> >> 1497235456
> >> >> [root@dna-lustre-mds mdd]# cat
> >> >> /proc/fs/lustre/osd-ldiskfs/g4data-MDT0000/oi_scrub
> >> >> name: OI scrub
> >> >> magic: 0x4c5fd252
> >> >> oi_files: 64
> >> >> status: completed
> >> >> flags:
> >> >> param:
> >> >> time_since_last_completed: 591570 seconds
> >> >> time_since_latest_start: 592140 seconds
> >> >> time_since_last_checkpoint: 591570 seconds
> >> >> latest_start_position: 11
> >> >> last_checkpoint_position: 1497235457
> >> >> first_failure_position: N/A
> >> >> checked: 25871866
> >> >> updated: 25871728
> >> >> failed: 0
> >> >> prior_updated: 0
> >> >> noscrub: 0
> >> >> igif: 0
> >> >> success_count: 1
> >> >> run_time: 569 seconds
> >> >> average_speed: 45469 objects/sec
> >> >> real-time_speed: N/A
> >> >> current_position: N/A
> >> >> [root@dna-lustre-mds mdd]# echo '1497235457/45469/60^2'|bc -l
> >> >> 9.14686353461821363028
> >> >>
> >> >> If 'updated' is where it's at, and if it updates all, doesn't
> >> >> that mean it's just 1.73% done in 6 days?!? Oops, "only" a year
> >> >> to go..?
> >> >>
> >> > [Nasf] The OI scrub has already rebuilt your OI files successfully.
> >> > It took 569 seconds ("run_time"), and checked 25871866 ("checked")
> >> > objects (or inodes), and rebuilt 25871728 ("updated") items within
> >> > those 25871866 items. The 1497235457 ("last_checkpoint_position")
> >> > stands for the last object's index (or say, it is the last inode's
> >> > ino#), which does NOT mean you have 1497235457 objects (or inodes).
> >> >
> >> > --
> >> > Cheers,
> >> > Nasf
> >> >
> >> >> This seems to be the typical top speed based on iostat:
> >> >>
> >> >> 10/21/2013 11:09:01 AM
> >> >> Device: tps Blk_read/s Blk_wrtn/s Blk_read
Blk_wrtn
> >> >> sda 0.10 0.00 1.20 0
24
> >> >>
> >> >> The disk subsystem is pretty fast, though:
> >> >>
> >> >> [root@dna-lustre-mds mdd]# dd if=/dev/sda of=/dev/null bs=1024k
> >> >> count=10240
> >> >> 10240+0 records in
> >> >> 10240+0 records out
> >> >> 10737418240 bytes (11 GB) copied, 10.5425 s, 1.0 GB/s
> >> >>
> >> >> I dare not write there unless told how; as is the case with most
> >> >> HPC/bioinformatics labs, we cannot make backups of significant
> >> >> amount of data since there is just too much.
> >> >>
> >> >> MDS is 99.9-100% idle and no memory pressure:
> >> >>
> >> >> [root@dna-lustre-mds mdd]# free
> >> >> total used free shared buffers
> >> cached
> >> >> Mem: 32864824 32527208 337616 0
> 13218200 114504
> >> >> -/+ buffers/cache: 19194504 13670320
> >> >> Swap: 124999672 0 124999672
> >> >>
> >> >> I cannot explain the slowness in any way; for all practical
> >> >> purposes there's nothing happening at all. If the system were
> >> >> physically hard pressed to cope, I would be much happier; at
> >> >> least I'd know what to do...
> >> >>
> >> >> Thanks again,
> >> >> Olli
> >> >>
> >> >> > On 2013-10-18, at 4:06, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
> >> >> >
> >> >> >> Thanks for the quick reply.
> >> >> >>
> >> >> >> What's preventing the system use is that for some reason the
> >> >> >> file content doesn't seem to meet its metadata. The client
> >> >> >> systems hang connection at login (which I believe is working as
> >> >> >> designed,) and when I try listing the mount (/home) first level
> >> >> >> directories, it very fast brings up what content it has, but
> >> >> >> what it has grows very slowly. Yesterday, ls /home/* hanged but
> >> >> >> no longer today, and user logins hang, probably because
> >> >> >> ~/.login and ~/.bashrc contents don't come up. Indeed, I can
> >> >> >> see the entries in the home directories, and some
> >> >> >> subdirectories, though not all, and not cat all files. I
> >> >> >> conjecture that since directories are just files of a sort, the
> >> >> >> metadata/content issue affects all 1,5*10^9 files.
> >> >> >>
> >> >> >> Looking with iostat, the MDS is averaging some 0.1 TPS at most
> >> >> >> and writing maybe a block a second. As mentioned, there's 13 GB
> >> >> >> free RAM (ie. buffers) in MDS and the system is 99.9% idle.
> >> >> >> Plenty of resources and nothing happening. Any ideas how to
> >> >> >> start tracking the problem? (NB: see also the zfs issue below.)
> >> >> >>
> >> >> >> Yes, I switched the hardware under MDS, but Centos 6.x tar
> >> >> >> seems to handle --xattrs, so in principle the slow progress in
> >> >> >> rebuilding (whatever is being rebuilt) remains unexplained. The
> >> >> >> MDS is a quad-core Opteron with 32 GB RAM, OSS's are the same
> >> >> >> as earlier, dual Xeon 5130's with 8 GB RAM, which seems
> >> >> >> sufficient. The disk units are SAS-attached shelves of up to
> >> >> >> 24 disks. SAS-controllers are standard LSI ones, and I've seen
> >> >> >> them performing at or in excess of 100 MBps.
> >> >> >>
> >> >> >> I have seen similar behaviour earlier with zfs, where writing
> >> >> >> just does not happen at any reasonable speed after about 20
> >> >> >> TiB, but I had unfortunately turned on confounding factors like
> >> >> >> compression and dedup, which are known to be borken. Hence I
> >> >> >> did not follow it up, especially since it seems a
> >> >> >> longstanding/nontrivial issue, and since it seems zfs
> >> >> >> developers are busier integrating into Lustre (and yes, Lustre
> >> >> >> 2.3 latest didn't compile cleanly with the zfs stuff turned
> >> >> >> off.) I did suspect that there is some sort of combination of
> >> >> >> write throttling and wait-for-flush/commit that explodes after
> >> >> >> an unusually large dataset (ie. 20+ TiB,) but no tunable fixed
> >> >> >> anything, and eventually it seemed the better option just to
> >> >> >> give up zfs. We now have ldiskfs. And yes, our dataset will no
> >> >> >> doubt exceed 70 TiB before the year is out.
> >> >> >>
> >> >> >> The major reason for 2.3 was that 2.4 did not yet exist and
> >> >> >> 2.3 was the first to allow for big OST slices. With modern
> >> >> >> disks and nobody wanting to fund required computing hardware
> >> >> >> (we do consider ours an HPC cluster,) running 4-disk RAID-6's
> >> >> >> was deemed unacceptable waste.
> >> >> >> In theory, and especially if it's deemed necessary, I could
> >> >> >> upgrade to 2.4, but our informaticians have been out of work
> >> >> >> for more than a week now, and a week or two more for the
> >> >> >> upgrade is really not a good idea.
> >> >> >>
> >> >> >> Thankfully yours,
> >> >> >> Olli
> >> >> >>
> >> >> >> Quoting "Dilger, Andreas" <andreas.dilger(a)intel.com>:
> >> >> >>
> >> >> >>> On 2013/10/17 5:34 AM, "Olli Lounela" <olli.lounela(a)helsinki.fi> wrote:
> >> >> >>>
> >> >> >>>> Hi,
> >> >> >>>>
> >> >> >>>> We run four-node Lustre 2.3, and I needed to both change
> >> >> >>>> hardware under MGS/MDS and reassign an OSS ip. Just the
> >> >> >>>> same, I added a brand new 10GE network to the system, which
> >> >> >>>> was the reason for the MDS hardware change.
> >> >> >>>
> >> >> >>> Note that in Lustre 2.4 there is a "lctl replace_nids"
> >> >> >>> command that allows you to change the NIDs without running
> >> >> >>> --writeconf. That doesn't help you now, but possibly in the
> >> >> >>> future.
> >> >> >>>
> >> >> >>>> I ran tunefs.lustre --writeconf as per chapter 14.4 in the
> >> >> >>>> Lustre Manual, and everything mounts fine. Log regeneration
> >> >> >>>> apparently works, since it seems to do something, but
> >> >> >>>> exceedingly slowly. Disks show all but no activity, CPU
> >> >> >>>> utilization is zero across the board, and memory should be
> >> >> >>>> no issue. I believe it works, but currently it seems the
> >> >> >>>> 1,5*10^9 files (some 55 TiB of data) won't be indexed in a
> >> >> >>>> week. My boss isn't happy when I can't even predict how long
> >> >> >>>> this will take, or even say for sure that it really works.
> >> >> >>>
> >> >> >>> The --writeconf information is at most a few kB and should
> >> >> >>> only take seconds to complete. What "reindexing" operation
> >> >> >>> are you referencing? It should be possible to mount the
> >> >> >>> filesystem immediately (MGS first, then MDS and OSSes) after
> >> >> >>> running --writeconf.
> >> >> >>>
> >> >> >>> You didn't really explain what is preventing you from using
> >> >> >>> the filesystem, since you said it mounted properly?
> >> >> >>>
> >> >> >>>> Two questions: is there a way to know how fast it is
> >> >> >>>> progressing and/or where it is at, or even that it really
> >> >> >>>> works, and is there a way to speed up whatever is slowing
> >> >> >>>> it down? It seems all diagnostic /proc entries have been
> >> >> >>>> removed from 2.3. I have tried mounting the Lustre
> >> >> >>>> partitions with -o nobarrier (yes, I know it's dangerous,
> >> >> >>>> but I'd really need to speed things up) but I don't know if
> >> >> >>>> that does anything at all.
> >> >> >>>
> >> >> >>> I doubt that the "-o nobarrier" is helping you much.
> >> >> >>>
> >> >> >>>> We run Centos 6.x on the Lustre servers, where Lustre has
> >> >> >>>> been installed from rpm's from the Whamcloud/Intel build
> >> >> >>>> bot, and Ubuntu 10.04 on the clients with a hand-compiled
> >> >> >>>> kernel and Lustre. One MGC/MGS with twelve 15k-RPM SAS
> >> >> >>>> disks in RAID-10 as an MDT that is all but empty, and six
> >> >> >>>> variously built RAID-6's in SAS-attached shelves in three
> >> >> >>>> OSS's.
> >> >> >>
> >> >> >> --
> >> >> >> Olli Lounela
> >> >> >> IT specialist and administrator
> >> >> >> DNA sequencing and genomics
> >> >> >> Institute of Biotechnology
> >> >> >> University of Helsinki
> >> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Olli Lounela
> >> >> IT specialist and administrator
> >> >> DNA sequencing and genomics
> >> >> Institute of Biotechnology
> >> >> University of Helsinki
> >>
> >>
> >> --
> >> Olli Lounela
> >> IT specialist and administrator
> >> DNA sequencing and genomics
> >> Institute of Biotechnology
> >> University of Helsinki
>
>
> --
> Olli Lounela
> IT specialist and administrator
> DNA sequencing and genomics
> Institute of Biotechnology
> University of Helsinki
--
Olli Lounela
IT specialist and administrator
DNA sequencing and genomics
Institute of Biotechnology
University of Helsinki