Thanks for all your answers, they were really helpful. I think I now
understand how the communication for an LNetGet should work. At least,
an LNetGet no longer causes LNet to crash.
However, the selftest still does not work: it stops with an unknown RPC
error. As I said before, LNetPut messages work without problems.
The selftest works this far: set up a session, add two nodes as a group
to the session, and add a test batch. Then, when a test is added to the
batch, the first LNetGet goes over the wire and the RPC error occurs.
I suspect that I copy the wrong memory ranges, but I am not sure, so
maybe someone here can help me.
Our network can do RDMA. The kernel API for the network takes physical
addresses to describe where it should start reading from or writing to
memory.
On the initiator, lnd_send is called with an LNet message that has only
a memory descriptor attached. The LND checks whether the descriptor
points to an iov or a kiov. For an iov it attaches iov.iov_base; for a
kiov it maps the page and attaches the physical address of the page (and
checks that offsets etc. are handled correctly).
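To make the offset handling concrete, here is a minimal user-space
sketch of the address calculation as I understand it. It is not the
actual LND code: struct frag and frag_resolve are made-up names that
only stand in for a kiov entry and for the lookup of the first byte to
transfer, and the physical addresses are plain numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one kiov entry: a page plus the range used in it. */
struct frag {
    uint64_t page_phys; /* physical address of the start of the page (made up) */
    uint32_t offset;    /* start of the valid data inside the page */
    uint32_t len;       /* number of valid bytes in this fragment */
};

/*
 * Skip `offset` bytes into the fragment list and return the physical
 * address of the first byte to transfer. *contig receives the number of
 * bytes that are contiguous from that address within the same fragment.
 * Returns 0 (and sets *contig to 0) if the offset is past the end.
 */
static uint64_t frag_resolve(const struct frag *f, size_t nfrags,
                             size_t offset, size_t *contig)
{
    size_t i;

    assert(f != NULL && contig != NULL);

    for (i = 0; i < nfrags; i++) {
        if (offset < f[i].len) {
            /* The start lies in this fragment: add BOTH offsets. */
            *contig = f[i].len - offset;
            return f[i].page_phys + f[i].offset + offset;
        }
        /* Whole fragment is skipped; carry the rest to the next one. */
        offset -= f[i].len;
    }

    *contig = 0;
    return 0;
}
```

The point of the sketch is that two offsets are involved: the offset of
the fragment inside its page, and the message offset into the fragment
list. Dropping either one gives a valid-looking but wrong address.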
On the target, lnd_recv is called with a kiov attached. The LND then
copies the data from the address of the mapped kiov page on the target
to the address it got from the initiator.
After the copy is done, both nodes get a notification from the network
device, and the LND calls lnet_finalize for the lnet messages on both nodes.
That's what I got from reading the o2iblnd code. Did I miss anything? At
the moment, I think that the LND reads from or writes to the wrong
address, but I don't see where I go wrong, so maybe someone is able to
tell me where I mess up.
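The whole sequence described above can be modelled in user space as
follows. This is only a sketch of my understanding, not LNet or LND
code: struct node, struct sink_desc, get_request and get_reply are
invented names, the "physical memory" is a byte array, the RDMA write is
a memcpy, and the finalized flag stands in for the point where
lnet_finalize would be called. Truncating to the shorter of the two
lengths is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_MEM 256

/* Hypothetical model of one node: flat memory plus completion state. */
struct node {
    uint8_t mem[NODE_MEM];
    int finalized; /* set where the message would be finalized */
    int status;    /* 0 on success, -1 on error */
};

/* What the initiator sends with the GET: where the reply data must go. */
struct sink_desc {
    size_t addr;
    size_t len;
};

/* Initiator side: describe the sink buffer taken from the memory descriptor. */
static struct sink_desc get_request(size_t sink_addr, size_t sink_len)
{
    struct sink_desc d = { sink_addr, sink_len };
    return d;
}

/*
 * Target side: copy the reply data into the initiator's sink buffer.
 * Returns the number of bytes moved, or -1 if a range is out of bounds.
 * Both nodes are marked finalized in either case.
 */
static int get_reply(struct node *target, size_t src_addr, size_t src_len,
                     struct node *initiator, struct sink_desc sink)
{
    size_t n = src_len < sink.len ? src_len : sink.len;
    int ok = src_addr <= NODE_MEM && n <= NODE_MEM - src_addr &&
             sink.addr <= NODE_MEM && n <= NODE_MEM - sink.addr;

    assert(target != NULL && initiator != NULL);

    if (ok)
        memcpy(initiator->mem + sink.addr, target->mem + src_addr, n);

    /* Completion is reported on both sides after the transfer. */
    target->status = initiator->status = ok ? 0 : -1;
    target->finalized = initiator->finalized = 1;

    return ok ? (int)n : -1;
}
```

If my LND differs from this model in the order of the steps or in which
side's address is used as source and destination, that is probably where
the error is.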
Thanks again for your help and kind regards
Tobias Groschup