[mOS configuration]
by Min-Woo Ahn
Hello users,
I tried to configure mOS on an "Intel Xeon CPU E7-8890 v4, 2.20GHz" machine,
which has 96 cores (4 NUMA nodes), but I am unable to use yod.
As in my previous configuration (on an Intel Xeon CPU E5-2640 with 16 cores),
I isolate core 0 and use it as the destination of syscall delegation, and the
remaining cores are used as LWK CPUs.
When I booted mOS and tried to run a simple command with yod, it only printed
"Killed". How can I solve this problem?
My GRUB_CMDLINE_LINUX is as follows:
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 selinux=0
rd.lvm.lv=cl/root
rd.lvm.lv=cl/swap intel_idle.max_cstate=1 intel_pstate=disable
nmi_watchdog=0 mce=ignore_ce tsc=reliable transparent_hugepage=never
isolcpus=0 lwkcpus=0.1-95 lwkmem=0:32G,1:32G,2:32G,3:32G lwkmem_debug=0"
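In case my reading of the syntax is part of the problem, this is how I
understand the LWK parameters above (my own interpretation, so please correct
me if the format is different):

# lwkcpus=<syscall CPU>.<LWK CPU list>
#   core 0 is isolated and receives delegated syscalls,
#   cores 1-95 run as LWK CPUs
lwkcpus=0.1-95
# lwkmem=<NUMA node>:<size>[,<NUMA node>:<size>...]
#   reserve 32G of LWK memory on each of the four NUMA nodes
lwkmem=0:32G,1:32G,2:32G,3:32G
# after rebooting, the parameters can be double-checked with standard tools:
$ cat /proc/cmdline
$ dmesg | grep -i lwk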
Thank you.
Minwoo.
---------------------------------------------
Minwoo Ahn
Researcher/M.S. Candidate
Computer Systems Laboratory
Sungkyunkwan University
More information: http://csl.skku.edu/People/MWAhn
---------------------------------------------
[Does mOS work with MVAPICH2?]
by Min-Woo Ahn
Hello users,
I am a researcher at the Computer Systems Laboratory (CSL), Sungkyunkwan
University, in South Korea.
Since August 2017 I have been running MPI+OpenMP HPC benchmarks (e.g. Linpack)
on mOS (v0.4, not the most recent release), and I have some questions about
it.
First of all, my experimental environment is:
------Hardware------
4 servers (Intel Xeon CPU E5-2640 v3, 2.60GHz), 64 cores in total.
------Configuration------
Each server is configured with "isolcpus=0, lwkcpus=0.1-15,
lwkmem=0:32G,1:32G".
------MPI Library------
MVAPICH2, correctly configured to use RoCE (I am confident of this because the
benchmarks run correctly on Linux kernel 4.13).
My question: when I try to run Linpack across the 4 servers, it crashes with
segmentation faults. How can I fix this? Does mOS work with MVAPICH2? I am not
sure, because the thread mapping information for each rank looks strange. When
I run Linpack on a single node (which works), I set the number of threads per
process to 5 and the number of processes to 3 (15 contexts in total, the same
as the number of LWK cores), and MVAPICH2 reports the following mapping:
-------------CPU AFFINITY-------------
RANK: 0 CPU_SET: 11 12 13 14 15
RANK: 1 CPU_SET: 5 7 9
RANK: 2 CPU_SET: 10
-------------------------------------
Why is the number of threads different for each rank?
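For reference, my single-node launch looks roughly like the following. The
MV2_* affinity variable is taken from the MVAPICH2 documentation as I
understand it; I have not verified how it interacts with yod on mOS, so please
treat this as a sketch:

# 3 ranks x 5 OpenMP threads = 15 contexts, matching the 15 LWK CPUs
$ export OMP_NUM_THREADS=5
# documented MVAPICH2 switch that hands thread placement back to the OpenMP
# runtime instead of MVAPICH2 itself (behaviour on mOS unverified)
$ export MV2_ENABLE_AFFINITY=0
$ mpiexec -np 3 yod ./xhpl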
Anyway, the error messages from the multi-node run are as follows:
[nfs1:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
(signal 11)
[nfs3:mpi_rank_2][error_sighandler] Caught error: Segmentation fault
(signal 11)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 6195 RUNNING AT [IP ADDRESS OF NFS3]
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@nfs1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909):
assert (!closed) failed
[proxy:0:0@nfs1] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@nfs1] main (pm/pmiserv/pmip.c:206): demux engine error waiting
for event
[proxy:0:1@nfs2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909):
assert (!closed) failed
[proxy:0:1@nfs2] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@nfs2] main (pm/pmiserv/pmip.c:206): demux engine error waiting
for event
[proxy:0:3@nfs4] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909):
assert (!closed) failed
[proxy:0:3@nfs4] HYDT_dmxu_poll_wait_for_event
(tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3@nfs4] main (pm/pmiserv/pmip.c:206): demux engine error waiting
for event
[mpiexec@nfs1] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec@nfs1] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec@nfs1] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
completion
[mpiexec@nfs1] main (ui/mpich/mpiexec.c:344): process manager error waiting
for completion
nfs1-nfs4 are the hostnames of my servers. The command I execute is (no -R
option should be needed since there is 1 process per node, am I right?):
$mpiexec -ppn 1 -np 4 -f [hostfile] yod ./xhpl
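For completeness, [hostfile] simply lists the four nodes, one hostname per
line, so that -ppn 1 -np 4 starts exactly one rank on each of them:

# [hostfile]
nfs1
nfs2
nfs3
nfs4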
Moreover, the mOS dmesg output repeatedly shows sequences of the message "(!)
deallocate_block addr=[some value] is not 2m aligned".
Is there any solution for this problem?
Thank you,
Minwoo.
---------------------------------------------
Minwoo Ahn
Researcher/M.S. Candidate
Computer Systems Laboratory
Sungkyunkwan University
More information: http://csl.skku.edu/People/MWAhn
---------------------------------------------