Hi Mi,
Yes, there are tunables for thread counts, among other factors. Lustre, however, auto-tunes
its OSS/MDS thread counts, and from what I know it does a pretty good job.
It would be helpful to know the CPU core configuration on the OSSes, as well as
the thread counts currently in use:
lctl get_param {service}.threads_{min,max,started}
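For example, on the OSS the ost_io service is usually the one of interest; something along these lines should show it (service names can vary a bit by Lustre version, so verify with "lctl list_param ost.OSS.*" first):

```shell
# On the OSS: current thread limits and count for the bulk I/O service.
# "ost.OSS.ost_io" is the usual service name; adjust if your version differs.
lctl get_param ost.OSS.ost_io.threads_min
lctl get_param ost.OSS.ost_io.threads_max
lctl get_param ost.OSS.ost_io.threads_started

# If you later decide to cap the auto-tuning, threads_max is settable, e.g.:
# lctl set_param ost.OSS.ost_io.threads_max=128
```

Note that a set_param change like the one commented above is not persistent across a remount unless it is also recorded with "lctl conf_param" (or the equivalent on your version).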
You know, "Lustre" often seems to have the finger pointed at it first, but in
actuality Lustre typically runs atop a complex environment (nodes, network
infrastructure, etc.) and is driven by demanding applications.
Based on the emails from yesterday and today, we're seeing client/OST connections come
and go. And we're seeing at least one "Bulk IO write" error on an OSS.
Trying to piece these two events together, without knowing the exact cause of either,
leaves lots of room for opinions and possibilities. AFAIK, we cannot yet rule out
intermittent network disruptions, which would be an easy (and common) explanation for
the problems seen. Nor can we rule out the application leaving the client so resource
constrained that it cannot effectively "keep alive" with the OST, deliver
RPCs, or respond to the OSSes; that is another common condition when running HPC
applications.
Lots of possibilities and few data points. Coming back to the network: yes, again, is
there monitoring in place that would track disruptions? On the client node, is there
monitoring that would indicate the network stack (or memory) was resource starved
(OOMs, timeouts, etc.)? And the same question for the OSS.
Lastly, are the OSTs in question all on the same OSS? It looks like they were 0009, 000a,
and 000b.
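From a client, something like the following should show which server each of those OST connections points at (the osc parameter name is from memory, so verify against your version with "lctl list_param osc.*"):

```shell
# On a client: show the server (OSS) each problem OST is attached to.
# osc.<target>.ost_server_uuid reports the import's current server; if the
# UUIDs match, the OSTs share an OSS.
lctl get_param osc.scratch-OST0009*.ost_server_uuid
lctl get_param osc.scratch-OST000a*.ost_server_uuid
lctl get_param osc.scratch-OST000b*.ost_server_uuid
```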
--
Brett Lee
Sr. Systems Engineer
Intel High Performance Data Division
-----Original Message-----
From: hpdd-discuss-bounces(a)lists.01.org [mailto:hpdd-discuss-
bounces(a)lists.01.org] On Behalf Of Mi Zhou
Sent: Tuesday, May 07, 2013 9:49 AM
To: hpdd-discuss(a)lists.01.org
Subject: Re: [HPDD-discuss] OST refused connection from client
Hi,
Thanks everyone for the input.
I do see "connection to ... was lost" on the client side, but I did not see
messages like "waiting_locks_callback()".
Below is another instance:
Error on client:
May 7 01:33:18 nodem14 kernel: Lustre: 2519:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1367908387/real 0] req@ffff880b5d73ac00 x1434028094681259/t0(0) o101->scratch-OST000a-osc-ffff880c43e21400@192.168.100.4@o2ib:28/4 lens 296/352 e 0 to 1 dl 1367908398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 7 01:33:18 nodem14 kernel: Lustre: scratch-OST000a-osc-ffff880c43e21400: Connection to scratch-OST000a (at 192.168.100.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 7 01:33:18 nodem14 kernel: Lustre: 2519:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1367908387/real 0] req@ffff8809a750b000 x1434028094681262/t0(0) o101->scratch-OST000b-osc-ffff880c43e21400@192.168.100.4@o2ib:28/4 lens 296/352 e 0 to 1 dl 1367908398 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
May 7 01:33:18 nodem14 kernel: Lustre: scratch-OST000b-osc-ffff880c43e21400: Connection to scratch-OST000b (at 192.168.100.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
May 7 01:33:20 nodem14 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.4@o2ib. The ost_connect operation failed with -16
May 7 01:33:20 nodem14 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.100.4@o2ib. The ost_connect operation failed with -16
May 7 01:33:43 nodem14 kernel: Lustre: scratch-OST0009-osc-ffff880c43e21400: Connection restored to scratch-OST0009 (at 192.168.100.4@o2ib)
Error on OSS:
May 7 01:33:20 lustre-oss04 kernel: Lustre: scratch-OST000a: Bulk IO write error with 77b5db75-5d82-1976-0116-5ef24f9febee (at 192.168.102.14@o2ib), client will retry: rc -110
May 7 01:33:43 lustre-oss04 kernel: Lustre: scratch-OST0009: Client 77b5db75-5d82-1976-0116-5ef24f9febee (at 192.168.102.14@o2ib) reconnecting
I agree it is caused, at least in part, by some I/O-intensive application. I
wonder if there is anything we can do on the Lustre side to alleviate the problem,
e.g., lowering the number of threads.
Thanks
Mi
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss(a)lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss