Ah, that makes sense. Malloc is always high on my list of potential performance culprits,
so it was one of the first things started thinking about when looking at this code.
One thing that seemed a bit odd is that, since paramv is defined as a **, an EDT has to do
two mallocs for each EDT it creates, one for the paramv array, and one for the pointer to
the paramv array. Having not been part of the design discussions, it wasn't clear to
me what advantage you get by passing a pointer to a pointer to paramv, but this might be
someplace where a custom allocator might speed things up. There are certainly things I
could have done to tune that part of my code (re-use the input paramv for each EDT, do one
malloc of all the paramv space each EDT needs and hand-generate the sub-pointers, etc.),
but this seems like something users aren't going to be excited about doing
Figuring out how to implement recursion in EDTs wasn't a big deal. It took a bit of
thinking to figure out how to do it and a bit more playing with the dependencies to get
them passing around right, but now that I've done it once, it wouldn't take long
to do again. Some sort of template or example in the documentation would be good,
The competitive performance was definitely encouraging, particularly since this is a good
problem for Pthreads and since the OCR-code overhead was a significant percentage of
execution time. Once you've had some time to tune it, I'm betting that OCR will
turn in some impressive numbers.
From: ocr-dev-bounces(a)lists.01.org [mailto:email@example.com] On Behalf Of
Sent: Monday, January 21, 2013 11:36 AM
To: Technical discussion about OCR
Subject: Re: [OCR-dev] Initial Eperiences with OCR
This is really helpful. We had to resort to malloc from our in-house memory allocator
(heavily influenced by Cilk allocator) because of licensing issues for the SC release. We
intend to ship OCR with a scalable allocator for the next release.
There will an allocation for each EDT, as the task has to live in the heap. We also plan
to cut the number of allocations down for the next release. We implicitly build a class
hierarchy (remnants of our initial object-oriented design) and incur allocations on the
way. However the base classes are static interfaces and we will restructure that code to
avoid unnecessary allocation/replication.
Regarding recursion, I can cite the usual cop-out and say you have the luxury to build the
reduction tree as you like. Nevertheless, this point was also raised by CnC users and
probably most papers which criticize data-flow models. We may consider delivering some
prepackaged way to build a recursion tree to flatten the learning curve.
I am personally glad the performance is competitive though we did not have the time to
tune for performance. This is rather encouraging.
Thanks for your time,
 For those who recall the Chord-CnC effort with Intel's Mayur Naik
On Jan 21, 2013, at 12:11 PM, "Cledat, Romain E"
Thanks for the feedback. This is very detailed and useful. I think some take-aways are:
- definitely have to focus on some of the memory overhead and see if some things
can be optimized
- there seems to be a scaling issue as the number of cores go up. We should
probably investigate this.
[mailto:firstname.lastname@example.org<mailto:email@example.com>] On Behalf Of
Carter, Nicholas P
Sent: Friday, January 18, 2013 5:35 PM
Subject: [OCR-dev] Initial Eperiences with OCR
(Not sure if this list or OCR-discuss would be better, but I suspect there's very
little difference in readership at the moment)
I've been starting to look into OCR as a framework for future research,
and wrote a parallel mergesort in order to learn how to code with EDTs. I also did some
performance analysis, which is in the attached slides. Given how new OCR is, the OCR
version of mergesort was impressively close to a pthreads version as long as I didn't
try to use tasks that were too small, and managed to beat the pthreads version in some
cases. It's a promising start, particularly given that mergesort is a good program
for a pthreads-style implementation (very regular, only log2(num processes) barriers).
One thing that turned up in the analysis is that there's fairly high
malloc/free overhead in OCR programs. Based on the examples, my code was calling malloc
twice per EDT I spawned, and it looks like OCR is calling malloc/free a fair amount
internally. Not surprisingly, the overhead really starts to show up when you try to use
very fine-grained tasks.
As an experiment, I tried using Google's tcmalloc instead of the base
malloc/free that came with my Linux distributions, and there were some significant
improvements. For extremely fine-grained tasks, performance improved by greater than 20x.
For more reasonable task sizes, the impact of task size on performance got much smaller
with tcmalloc. Tcmalloc didn't help much for the larger task sizes, which isn't
At any rate, I thought I'd share the results with everyone and see if
it starts some discussion.
OCR-dev mailing list