Are you sure that the machine with address 192.168.1.101 is using interface p1p2 for
accessing the network?
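One quick way to confirm on the OSS itself, assuming the iproute2 tools are installed; the sample output line below is illustrative, and on the real host you would pipe the live `ip -o -4 addr show` instead:

```shell
# Find which interface actually owns 192.168.1.101.
# Sample line standing in for real "ip -o -4 addr show" output:
sample='5: p1p2    inet 192.168.1.101/24 brd 192.168.1.255 scope global p1p2'
# Field 4 is the address/prefix; field 2 is the interface name.
printf '%s\n' "$sample" | awk '$4 ~ /^192\.168\.1\.101\// { print $2 }'
```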
Doug
On Jul 3, 2015, at 6:22 AM, Sean Caron
<scaron@umich.edu> wrote:
Hi Jerome,
Regular TCP connectivity between the MGS and OSS machines appears to work fine; you can
ping, traceroute, SSH and so forth between the MGS server and the OSS servers, no problem.
The machines are all on the same L2 broadcast domain and there's no software firewall
(i.e. iptables) running on any of the MGS or OSS machines. SELinux has been completely
disabled and all the machines have been rebooted after that change was made ...
I haven't yet looked at things with tcpdump, but only because I suspect it's more a
problem of configuration, or that I'm missing a module or something ... I don't really
think any Lustre traffic is actually hitting the network ... but I can go check. I was
hoping someone would just recognize it as a silly error and come back and say, oh,
you just need to load this module, or you're missing this in your LNET configuration
:O
Best,
Sean
On Fri, Jul 3, 2015 at 4:00 AM, Jérôme BECOT
<jerome.becot@inserm.fr> wrote:
Hello,
Can the servers ping each other via the system ping command?
Have you checked that no firewall is running on the machines?
Have you checked that SELinux is disabled or properly configured on the machines?
Have you tried running tcpdump on both machines to see whether any Lustre traffic passes
through the network interface?
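A minimal capture along those lines, assuming the default LNET acceptor port (988, the one that also shows up in the dmesg output below) and the p1p2 interface; the snippet just assembles the command so you can review it before running it as root on the OSS while retrying the ping from the MGS:

```shell
IFACE=p1p2      # assumption: the interface carrying the 192.168.1.x network
LNET_PORT=988   # default acceptor port for the Lustre TCP LND
# Print the capture command to run (as root) on the OSS:
echo "tcpdump -ni $IFACE tcp port $LNET_PORT"
```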
Le 02/07/2015 23:03, Sean Caron a écrit :
OK, I did a little more research and I found that I could increase the verbosity of the
LNET debugging output by doing the following:
echo +neterror > /proc/sys/lnet/printk
So, I did that and tried one of the failing "lctl ping" commands again:
[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
[root@lustre-mgs ~]#
Here's what I see now in dmesg:
[174224.584669] LNet: 3900:0:(lib-socket.c:626:lnet_sock_connect()) Error -113 connecting
0.0.0.0/1023 ->
192.168.1.101/988
[174224.584692] LNet: 3900:0:(acceptor.c:114:lnet_connect_console_error()) Connection to
192.168.1.101@tcp at host 192.168.1.101 was unreachable: the network or that node may be
down, or Lustre may be misconfigured.
[174224.584711] LNet: 3900:0:(socklnd_cb.c:424:ksocknal_txlist_done()) Deleting packet
type 2 len 0 192.168.1.100@tcp->192.168.1.101@tcp
I understand tcp is just a synonym for tcp0, so I think that's okay ... Network
configuration on each of these machines is very simple; only one interface on any of them
is up and running, one port on an Intel X520 10 Gig NIC. I have LNET configured in
/etc/modprobe.d/lustre.conf on, e.g., the MGS as so:
options lnet networks=tcp0(p1p2)
That's correct, yes? In this case, p1p1 and p1p2 are the two 10 Gig NIC ports ... I
don't know why RHEL uses such funky names ... But it's very basic: no routing, not even
multiple interfaces ...
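As a sanity sketch of how that option decomposes (pure POSIX shell, using the exact line from lustre.conf above): the networks option names a network type, tcp0, bound to one interface, p1p2.

```shell
# Parse "options lnet networks=tcp0(p1p2)": net type, then interface.
line='options lnet networks=tcp0(p1p2)'
spec=${line#*networks=}    # -> tcp0(p1p2)
net=${spec%%\(*}           # -> tcp0
iface=${spec#*\(}          # -> p1p2)
iface=${iface%\)}          # -> p1p2
echo "net=$net iface=$iface"
```

If `iface` here does not match the interface that actually holds the 192.168.1.x address, LNET will bind the wrong port.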
Continuing to research ... I assume error -113 in this case is just a generic
"connection failure" type error, although if something could be deduced from it, that
would certainly be great :O
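For what it's worth, -113 is not generic: kernel networking code returns negated errno values, and on Linux errno 113 is EHOSTUNREACH ("No route to host"), which matches the "was unreachable" console message above, i.e. the connect() itself fails before any Lustre handshake. A tiny decode sketch for the codes LNet commonly logs:

```shell
# Map a (negated) kernel errno from the LNet log to its symbolic name.
err=113   # from "Error -113 connecting ..." in dmesg
case $err in
  101) name=ENETUNREACH ;;   # network unreachable
  110) name=ETIMEDOUT ;;     # connection timed out
  111) name=ECONNREFUSED ;;  # host up, but nothing listening on the port
  113) name=EHOSTUNREACH ;;  # no route to host
  *)   name=UNKNOWN ;;
esac
echo "$name"
```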
Thanks,
Sean
On Thu, Jul 2, 2015 at 4:47 PM, Sean Caron
<scaron@umich.edu> wrote:
Thanks, Rick; I'm just starting out getting my bearings with Lustre, so it's not yet
clear to me what diagnostic tools and troubleshooting mechanisms are available, and it's
helpful that you mentioned "lctl". I tried that on the MDS, and it shows LNET as up,
consistent with my configuration:
[root@lustre-mgs ~]# lctl list_nids
192.168.1.100@tcp
[root@lustre-mgs ~]#
If I do kind of a loop-back Lustre ping on the MDS, it appears to work ... it doesn't
give me an error message back:
[root@lustre-mgs ~]# lctl ping 192.168.1.100@tcp0
12345-0@lo
12345-192.168.1.100@tcp
[root@lustre-mgs ~]#
Now, on the OSS machines, "lctl" also shows Lustre networking as up and running,
consistent with how I have it configured:
[root@lustre-oss1 ~]# lctl list_nids
192.168.1.101@tcp
[root@lustre-oss1 ~]#
I can do the same loop-back ping on the OSS and it seems to "work":
[root@lustre-oss1 log]# lctl ping 192.168.1.101@tcp0
12345-0@lo
12345-192.168.1.101@tcp
[root@lustre-oss1 log]#
However, if I try to ping the MGS from the OSS, it gives me an I/O error!
[root@lustre-oss1 ~]# lctl ping 192.168.1.100@tcp0
failed to ping 192.168.1.100@tcp: Input/output error
[root@lustre-oss1 ~]#
It seems to fail consistently in both directions with the same error message; I tried it
from the MGS as well:
[root@lustre-mgs ~]# lctl ping 192.168.1.101@tcp0
failed to ping 192.168.1.101@tcp: Input/output error
[root@lustre-mgs ~]#
Am I missing a module somewhere that I need to be loading? I don't see any messages in
dmesg or /var/log/messages corresponding to my "lctl ping" attempts that might help
point to what's going wrong.
Of course, normal TCP ping between the hosts works fine; they're on the same switch so
same L2 broadcast domain, etc.
Nothing in /etc/hosts to go awry; there's just the one entry for localhost.localdomain
at 127.0.0.1.
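On the missing-module question: "lctl ping" over tcp needs both the core lnet module and the TCP LND, ksocklnd, loaded. A sketch of the check, run here against a sample lsmod listing so it's self-contained; on the real machines, pipe lsmod itself instead:

```shell
# Sample standing in for real "lsmod" output:
sample='Module                  Size  Used by
ksocklnd              123456  1
lnet                  654321  4 ksocklnd'
# Keep only the two modules LNET-over-TCP depends on.
printf '%s\n' "$sample" | awk '$1 == "lnet" || $1 == "ksocklnd" { print $1 }'
```

If ksocklnd is absent from the live listing, the TCP LND never came up and every non-loopback lctl ping will fail.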
Any thoughts?
Thanks,
Sean
On Tue, Jun 30, 2015 at 11:51 PM, Mohr Jr, Richard Frank (Rick Mohr)
<rmohr@utk.edu> wrote:
On Jun 30, 2015, at 5:07 PM, Sean Caron
<scaron@umich.edu> wrote:
So that all seems okay, but then I go over to my first OSS node ... I first try to run
mkfs.lustre, which seems to complete okay:
mkfs.lustre --fsname=lustre --mgsnode=192.168.1.100@tcp0 --ost --index=1 --reformat
/dev/md2
But then when I try to actually mount it, it pauses for a moment and then gives me a
timeout error:
Have you tried running “lctl ping 192.168.1.100@tcp0” from the OSS node to make sure it
has LNet connectivity to the MDS node? You can also try running “lctl list_nids” on the
MDS node to make sure that it has the 192.168.1.100@tcp0 nid configured.
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss
--
Jérome BECOT
Administrateur Systèmes et Réseaux
Molécules à visée Thérapeutique par des approches in Silico (MTi)
Univ Paris Diderot, UMRS973 Inserm
Case 013
Bât. Lamarck A, porte 412
35, rue Hélène Brion 75205 Paris Cedex 13
France
Tel : 01 57 27 83 82