I see, that is indeed a problem. For me it ran at 4x QDR, i.e. 40 Gbps. I can only suggest checking and trying to fix it with the distribution's openib stack if you can't get the Mellanox one to compile against your kernel.

 

Also be sure to check the changelogs on https://wiki.hpdd.intel.com/display/PUB/HPDD+Wiki+Front+Page to find out which version of OFED should work with your Lustre kernel. If it is a custom kernel, this will not help, though. For the client kernel you have to check the Mellanox OFED release notes.

 

Regards

Enrico

 

On Tuesday 07 May 2013 16:59:47 linux freaker wrote:

Just to let you know: when I use the default openib stack that comes with RHEL 6.3, the link shows only 10 Gbps and not the 40 Gbps that QDR should show.
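For reference, the rate reported for an InfiniBand port is the per-lane signalling rate multiplied by the link width, so 10 Gbps is what a 1x QDR or a 4x SDR link would show, while 4x QDR gives 40 Gbps. Here is a minimal sketch of that arithmetic. The helper name ib_rate_gbps is made up for illustration and is not part of any OFED tool; the per-lane figures are the nominal SDR/DDR/QDR signalling rates.

```shell
#!/bin/sh
# Nominal InfiniBand signalling rate = lanes x per-lane rate.
# ib_rate_gbps is a hypothetical helper, not an OFED command.
ib_rate_gbps() {
    width=$1   # number of lanes, e.g. 1, 4 or 12
    speed=$2   # SDR, DDR or QDR
    case "$speed" in
        SDR) per_lane_x10=25 ;;    # 2.5 Gbps per lane
        DDR) per_lane_x10=50 ;;    # 5 Gbps per lane
        QDR) per_lane_x10=100 ;;   # 10 Gbps per lane
        *) echo "unknown speed: $speed" >&2; return 1 ;;
    esac
    # Work in tenths of a Gbps so that 2.5 survives integer arithmetic.
    total=$(( width * per_lane_x10 ))
    if [ $(( total % 10 )) -eq 0 ]; then
        echo $(( total / 10 ))
    else
        echo "$(( total / 10 )).$(( total % 10 ))"
    fi
}

ib_rate_gbps 4 QDR
ib_rate_gbps 1 QDR
ib_rate_gbps 4 SDR
```

If the port really reports 10 Gbps, it may be worth checking the negotiated width and speed with the InfiniBand diagnostic tools (for example ibstat or ibstatus, assuming the infiniband-diags package is installed) before blaming the software stack.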





On Tue, May 7, 2013 at 4:57 PM, linux freaker <linuxfreaker@gmail.com> wrote:

Hi Enrico,


I did run mlnx_add_kernel_support.sh first, but it still didn't work.

Then I tried recompiling lustre-modules with o2ib, but had no luck.


I can try it once more, but I really find it painful.


Can't I get 40 Gbps with openib (which comes by default with RHEL 6.3)?



On Tue, May 7, 2013 at 4:46 PM, Enrico Tagliavini <enrico.tagliavini@ichec.ie> wrote:

Hi,

your procedure is correct, but you first need to recompile Mellanox OFED against your kernel using mlnx_add_kernel_support.sh from the .iso.

 

This will generate a new ISO which will install correctly against the Lustre kernel. After that you have to recompile (at least) lustre-modules using the --with-o2ib=/usr/src/ofa_kernel configure switch, as you did.

 

I don't want to question your choice, but since this procedure is a pain (you have to redo everything for every security update), you should reconsider whether you can work with the stock IB stack. On my side I never saw a performance difference between the two, but if someone did, feel free to share :).

 

Kind regards

Enrico Tagliavini

 

On Tuesday 07 May 2013 16:34:41 linux freaker wrote:

Hi,


I installed RHEL 6.3, which by default comes with openib (the InfiniBand support software). I needed to install MLNX_OFED_LINUX-1.5.3-3.1.0-rhel6.3-x86_64.iso, so I extracted it and ran the ./mlnxofedinstall script. It spent a while removing the old Mellanox packages and drivers, but then couldn't install because of the Lustre kernel. Out of curiosity, I installed it on the default RHEL kernel, where it went fine. When I then rebooted and tried to bring Lustre up with lctl network up, it threw errors on the MDS:


[root@oss2 ~]# modprobe lustre

WARNING: Error inserting fld (/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/fld.ko): Input/output error

WARNING: Error inserting fid (/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/fid.ko): Input/output error

WARNING: Error inserting mdc (/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/mdc.ko): Input/output error

WARNING: Error inserting osc (/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/osc.ko): Input/output error

WARNING: Error inserting lov (/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/lov.ko): Input/output error

FATAL: Error inserting lustre (/lib/modules/2.6.32-279.14.1.el6_lustre.x86_64/updates/kernel/fs/lustre/lustre.ko): Input/output error

[root@oss2 ~]# modprobe lnet

[root@oss2 ~]# lctl network up

LNET configure error 100: Network is down
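One possible cause of these "Input/output error" messages, and the one the thread quoted further down points at, is that the Lustre modules were built against a different InfiniBand stack than the one now installed, so their symbols no longer match. A minimal sketch for checking this, assuming the kernel log contains the usual symbol-mismatch messages; the helper name symbol_errors is made up for illustration:

```shell
#!/bin/sh
# Filter kernel log lines that indicate a module was built against
# different symbols than the ones the running kernel/OFED provides.
# symbol_errors is a hypothetical helper; it reads log text on stdin.
symbol_errors() {
    grep -E 'disagrees about version of symbol|Unknown symbol'
}

# Typical use, right after the failing modprobe:
#   dmesg | symbol_errors
```

If this prints lines naming InfiniBand symbols, rebuilding the Lustre modules against the installed OFED tree is the likely fix. If it prints nothing, the cause is probably elsewhere.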


It seemed that I needed to build OFED for Lustre, and I found this thread: http://thr3ads.net/lustre-discuss/2012/12/2164117-problem-with-installing-lustre-and-ofed


It suggested the following steps, and I followed them line by line:


2.   boot into the lustre kernel
3.   in our /usr/src/lustre-2.1.2 directory built lustre against the Mellanox "Module.symvers" information (which is why you see the "Input/Output" errors on fid.ko, mdc.ko, osc.ko, lov.ko and, because of the aforementioned items, the lustre.ko). The MLNX version 1.8.5 that we needed was in the /usr/src/ofa_kernel directory (with the Module.symvers etc.). We used the defaults other than the o2ib, so our command in the /usr/src/lustre-2.1.2 directory looked like "./configure --with-o2ib=/usr/src/ofa_kernel"
4.   next we issued "make"
5.   next we chose to run a "make rpms" command so that we could have rpms for our system for cluster re-building 
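Steps 3 to 5 above can be sketched as a small script. The paths are the ones quoted in that thread and will likely differ on other systems. The run wrapper and the DRY_RUN variable are additions for illustration only, so that the commands can be reviewed before anything is built.

```shell
#!/bin/sh
# Steps 3-5 from the quoted thread. Paths are taken from that thread;
# adjust them to your own source trees.
# By default the commands are only printed; set DRY_RUN=0 to run them.
LUSTRE_SRC=${LUSTRE_SRC:-/usr/src/lustre-2.1.2}
OFA_KERNEL=${OFA_KERNEL:-/usr/src/ofa_kernel}
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_lustre_o2ib() {
    run cd "$LUSTRE_SRC" || return 1
    # Build Lustre against the Mellanox OFED kernel sources (o2ib).
    run ./configure --with-o2ib="$OFA_KERNEL" || return 1
    run make || return 1
    # Optional: produce RPMs for rebuilding other nodes.
    run make rpms
}

build_lustre_o2ib
```

This has to be run while booted into the Lustre kernel (step 2), and only after OFED itself has been installed for that kernel.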

But even this failed to bring my Lustre up. modprobe lnet works, but lctl network up doesn't.

 lctl list_nids
IOC_LIBCFS_GET_NI error 100: Network is down



My ifconfig shows:


ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00

          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0

          inet6 addr: fe80::202:c903:b:8b85/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1

          RX packets:2628 errors:0 dropped:0 overruns:0 frame:0

          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1024

          RX bytes:155819 (152.1 KiB)  TX bytes:3503 (3.4 KiB)


uname -arn

Linux oss2 2.6.32-279.14.1.el6_lustre.x86_64 #1 SMP Fri Dec 14 23:22:17 PST 2012 x86_64 x86_64 x86_64 GNU/Linux




I did try running the kernel support script in the MLNX directory, and it did install the RPM, but still no luck with lctl list_nids. Can anyone suggest how to fix it?