Fwd: Execution time breakdowns
by Vincent Cavé
Hi,
I'm forwarding Nicholas' email since I don't remember the admin password to approve it myself.
One concern we have is that although the timing gets bad, most of the time is spent trying to steal work, which would indicate there isn't enough work to go around. Could it be a granularity problem?
Another thing is that every run seems to break around 16 cores/workers; we're wondering if there could be something hard-coded in ocrInit or somewhere else.
Nicholas, can you send us the code you're using for both the pthread and OCR versions so that we can have a look?
Best,
Vincent
Begin forwarded message:
> From: ocr-dev-owner(a)lists.01.org
> Subject: OCR-dev post from nicholas.p.carter(a)intel.com requires approval
> Date: January 29, 2013 9:45:27 AM CST
> To: ocr-dev-owner(a)lists.01.org
>
> As list administrator, your authorization is requested for the
> following mailing list posting:
>
> List: OCR-dev(a)lists.01.org
> From: nicholas.p.carter(a)intel.com
> Subject: Execution time breakdowns
> Reason: Message body is too big: 1549440 bytes with a limit of 1024 KB
>
> At your convenience, visit:
>
> https://lists.01.org/mailman/admindb/ocr-dev
>
> to approve or deny the request.
>
> From: "Carter, Nicholas P" <nicholas.p.carter(a)intel.com>
> Subject: Execution time breakdowns
> Date: January 29, 2013 4:03:52 PM CST
> To: "ocr-dev(a)lists.01.org" <ocr-dev(a)lists.01.org>
>
>
> Benoit’s question about execution time breakdowns got me thinking about how to script VTune to generate the sort of data he was looking for, and it turned out not to be too hard. (Meaning that it took a while to figure out but is pretty easy once you know how.)
>
> I wrote some scripts to sweep over the different array and chunk sizes, generating execution times by function, and other scripts to process the data and plot the fraction of execution time spent in each of the 10 functions that were the biggest contributors to execution time across the sweep. Hopefully, they’ll provide some data about where to look when the time comes for performance tuning. Also, these scripts should be pretty easily portable to other programs.
>
> -Nick
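The scripts themselves aren't reproduced in the archive, but a sweep driver along those lines might look roughly like the sketch below. It assumes VTune's amplxe-cl command-line collector from that era, and a benchmark (hypothetically ./parallel_mergesort) that takes the array size and chunk size as its two arguments; check the flag names against the installed VTune version.

/* Sketch of a parameter-sweep driver for VTune's command-line collector.
 * Assumptions: amplxe-cl is on PATH, and ./parallel_mergesort takes the
 * array size and chunk size as its two arguments (hypothetical names). */
#include <stdio.h>
#include <stdlib.h>

#define NSIZES 3

int main(void) {
    const long arrays[NSIZES] = {1L << 16, 1L << 18, 1L << 20};
    const long chunks[NSIZES] = {1L << 8, 1L << 10, 1L << 12};
    char cmd[512];

    for (int i = 0; i < NSIZES; i++) {
        for (int j = 0; j < NSIZES; j++) {
            /* Collect a hotspots profile into a per-run result directory. */
            snprintf(cmd, sizeof cmd,
                     "amplxe-cl -collect hotspots -result-dir r_%ld_%ld "
                     "-- ./parallel_mergesort %ld %ld",
                     arrays[i], chunks[j], arrays[i], chunks[j]);
            if (system(cmd) != 0)
                fprintf(stderr, "collect failed: %ld/%ld\n",
                        arrays[i], chunks[j]);

            /* Dump per-function CPU times as CSV for the plotting side. */
            snprintf(cmd, sizeof cmd,
                     "amplxe-cl -report hotspots -result-dir r_%ld_%ld "
                     "-format csv -csv-delimiter comma > times_%ld_%ld.csv",
                     arrays[i], chunks[j], arrays[i], chunks[j]);
            if (system(cmd) != 0)
                fprintf(stderr, "report failed: %ld/%ld\n",
                        arrays[i], chunks[j]);
        }
    }
    return 0;
}

The per-function fraction of execution time then falls out of the CSVs: divide each function's CPU time by the run's total and track the ten largest contributors across the sweep.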
Limits on how much memory can be allocated using ocrDbCreate?
by Carter, Nicholas P
I've been modifying my mergesort code to use datablocks to pass data and arguments rather than shared memory, and have run into some problems when I try to run larger tests. Everything works fine with smaller arrays, but when I increase the array size above 512K elements, ocrDbCreate starts returning error codes, followed shortly by segfaults. The same-sized tests ran fine when I was using C++ new and delete to allocate memory, so I suspect it's something in the datablock code. I've done some quick checking that my code destroys its datablocks when it's done with them, but I could have missed something.
So, question #1 is "is there a limit on the amount of memory ocrDbCreate can allocate other than the system RAM?" The crashes definitely seem to be linked to the total amount of memory I'm allocating, as reducing the number of arguments passed to my EDTs let me run bigger tests.
If there is a limit, is there any way to change it? I tried changing the size of the DRAM region in the machine XML file, but that doesn't seem to have any effect.
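For concreteness, the allocation pattern under discussion looks roughly like the sketch below. It is written against the early OCR C API; the DB_PROP_NONE flag, the NULL location, and the NO_ALLOC allocator argument are assumptions about that signature rather than anything this thread confirms.

#include <stdio.h>
#include "ocr.h"   /* ocrGuid_t, u8, u64, ocrDbCreate, ocrDbDestroy */

/* Unlike C++ new, ocrDbCreate reports failure through its u8 return
 * value, so every allocation needs an explicit check. */
static void *allocBlock(u64 bytes, ocrGuid_t *guid) {
    void *ptr = NULL;
    /* Flags/location/allocator values are assumptions (see above). */
    u8 status = ocrDbCreate(guid, &ptr, bytes, DB_PROP_NONE, NULL, NO_ALLOC);
    if (status != 0) {
        fprintf(stderr, "ocrDbCreate(%llu bytes) failed with code %u\n",
                (unsigned long long)bytes, (unsigned)status);
        return NULL;    /* caller can back off to a smaller chunk */
    }
    return ptr;
}

Every successful create then has to be paired with an ocrDbDestroy on the returned GUID once the block is dead, or the runtime's datablock pool never gets the space back.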
I talked to management, and everyone agreed that the best way to let you guys look at the code I'm writing was to check it into the X-Stack repository. If you check out the repository, you'll see an "ocr" directory at the top level, with examples/parallel_mergesort underneath, which is where my code is. It's not tremendously self-documenting, but the top level of that directory is the OCR version, with sub-directories containing other versions. Romain should be able to set up repository access for any of you who need it.
Finally, I ran my program under Valgrind to see if it could identify any memory leaks for me. It didn't find anything that looked big enough to be causing this problem, but it did find a number of leaks coming from within hc_event_register_if_not_ready, assuming I'm reading the trace correctly. Here's an example leak message:
==22435== 470,272 bytes in 29,392 blocks are definitely lost in loss record 64 of 89
==22435== at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22435== by 0x402FB5E: hc_event_register_if_not_ready (ocr-edf-hc.c:132)
==22435== by 0x402FAFC: hc_task_iterate_waiting_frontier (ocr-edf-hc.c:217)
==22435== by 0x402FBB6: hc_task_schedule (ocr-edf-hc.c:225)
==22435== by 0x402B2CB: ocrEdtSchedule (ocr-edt.c:65)
==22435== by 0x401721: splitter(unsigned int, unsigned long*, void**, unsigned int, ocrEdtDep_t*) (parallel_mergesort.cpp:251)
==22435== by 0x402FA43: hc_task_execute (ocr-edf-hc.c:258)
==22435== by 0x402DBD5: worker_computation_routine (ocr-low-workers-hc.c:181)
==22435== by 0x54F8E99: start_thread (pthread_create.c:308)
==22435== by 0x5225CBC: clone (clone.S:112)
==22435==
-Nick
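For anyone decoding the trace above: valgrind is reporting that the records malloc'd at ocr-edf-hc.c:132, when a task registers on a not-yet-ready event, are unreachable by the time the process exits. A hypothetical sketch of that kind of leak (made-up names, not the actual OCR code):

#include <stdlib.h>

/* Hypothetical per-registration record, loosely shaped like what an
 * event might allocate when a task registers before the event fires. */
typedef struct waiter {
    void *task;              /* who to wake when the event is satisfied */
    struct waiter *next;
} waiter_t;

void register_waiter(waiter_t **head, void *task) {
    waiter_t *w = malloc(sizeof *w);   /* the allocation valgrind flags */
    if (w == NULL)
        abort();
    w->task = task;
    w->next = *head;
    *head = w;
}

void satisfy(waiter_t **head) {
    for (waiter_t *w = *head; w != NULL; w = w->next) {
        /* scheduling w->task would go here; the bug is that w is never
         * freed before moving on. */
    }
    *head = NULL;   /* head dropped: every node is now "definitely lost" */
}

int main(void) {
    waiter_t *event = NULL;
    int t1, t2;                   /* stand-ins for real task handles */
    register_waiter(&event, &t1);
    register_waiter(&event, &t2);
    satisfy(&event);              /* valgrind would report 2 blocks lost */
    return 0;
}

Whether that is what ocr-edf-hc.c actually does is exactly the open question; the sketch only shows why per-registration mallocs would dominate the loss records.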
Fwd: Execution time breakdowns
by Vincent Cavé
Hi,
I'm forwarding Nicholas' email since I don't remember the admin password to approve it myself, and my previous forward with the result has been blocked too… Can someone approve it?
One concern we have is that although the timing gets bad, most of the time is spent trying to steal work, which would indicate there isn't enough work to go around. Could it be a granularity problem?
Another thing is that every run seems to break around 16 cores/workers; we're wondering if there could be something hard-coded in ocrInit or somewhere else.
Nicholas, can you send us the code you're using for both the pthread and OCR versions so that we can have a look?
Best,
Vincent
> Benoit’s question about execution time breakdowns got me thinking about how to script VTune to generate the sort of data he was looking for, and it turned out not to be too hard. (Meaning that it took a while to figure out but is pretty easy once you know how.)
>
> I wrote some scripts to sweep over the different array and chunk sizes, generating execution times by function, and other scripts to process the data and plot the fraction of execution time spent in each of the 10 functions that were the biggest contributors to execution time across the sweep. Hopefully, they’ll provide some data about where to look when the time comes for performance tuning. Also, these scripts should be pretty easily portable to other programs.
>
> -Nick