Hi Nick,

This is really helpful. We had to fall back from our in-house memory allocator (heavily influenced by the Cilk allocator) to plain malloc because of licensing issues for the SC release. We intend to ship OCR with a scalable allocator in the next release.

There will be an allocation for each EDT, as the task has to live on the heap. We also plan to cut down the number of allocations for the next release. We implicitly build a class hierarchy (a remnant of our initial object-oriented design) and incur allocations along the way. However, the base classes are static interfaces, and we will restructure that code to avoid unnecessary allocation/replication.
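
To make the per-EDT allocation cost concrete, here is a minimal sketch of one common way to cut malloc/free traffic: a per-worker free list that recycles fixed-size task descriptors. This is not OCR code. The names task_t, task_pool_t and the pool_* functions are hypothetical and exist only for illustration, and the sketch assumes each worker thread owns its own pool, so no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical task descriptor; NOT OCR's actual EDT layout. */
typedef struct task {
    struct task *next_free;   /* link used only while on the free list */
    void (*fn)(void *);       /* task body */
    void *arg;                /* task argument */
} task_t;

/* Assumed to be owned by a single worker thread, so no locking. */
typedef struct {
    task_t *free_list;        /* recycled descriptors */
    size_t  heap_allocs;      /* how often we fell through to malloc */
} task_pool_t;

static void pool_init(task_pool_t *p) {
    p->free_list = NULL;
    p->heap_allocs = 0;
}

/* Reuse a recycled descriptor if one exists, otherwise call malloc. */
static task_t *pool_alloc(task_pool_t *p) {
    task_t *t = p->free_list;
    if (t != NULL) {
        p->free_list = t->next_free;
        return t;
    }
    p->heap_allocs++;
    return malloc(sizeof *t);   /* may return NULL; caller must check */
}

/* Return a descriptor to the pool instead of calling free. */
static void pool_free(task_pool_t *p, task_t *t) {
    assert(t != NULL);
    t->next_free = p->free_list;
    p->free_list = t;
}

/* Release everything still held on the free list. */
static void pool_destroy(task_pool_t *p) {
    while (p->free_list != NULL) {
        task_t *t = p->free_list;
        p->free_list = t->next_free;
        free(t);
    }
}
```

With a pool like this, a worker that repeatedly creates and retires short-lived tasks only reaches malloc while the pool is still growing. After that, allocation and release are a few pointer operations each.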

Regarding recursion, I can cite the usual cop-out and say you have the luxury of building the reduction tree as you like. Nevertheless, this point was also raised by CnC users[1], and probably by most papers that criticize data-flow models. We may consider delivering some prepackaged way to build a recursion tree to flatten the learning curve.
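
As an illustration of what "building the reduction tree yourself" can look like for mergesort, here is a minimal sequential sketch in plain C. It uses no OCR API. The names tree_sort, merge_runs and the grain parameter are hypothetical. Leaves of at most grain elements are sorted first, then runs are merged pairwise, level by level. In a task-based version, each leaf sort and each merge within a level would be an independent task, and grain would set the task size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Merge sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst,
                       size_t lo, size_t mid, size_t hi) {
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort data[0..n) bottom-up. Returns 0 on success, -1 if the
 * scratch buffer cannot be allocated. */
static int tree_sort(int *data, size_t n, size_t grain) {
    assert(grain > 0);
    int *tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (tmp == NULL) return -1;

    /* Leaves of the tree: sort each chunk of at most `grain` elements. */
    for (size_t lo = 0; lo < n; lo += grain) {
        size_t len = (n - lo < grain) ? n - lo : grain;
        qsort(data + lo, len, sizeof *data, cmp_int);
    }

    /* Inner levels: merge neighbouring runs, doubling the run width. */
    int *src = data, *dst = tmp;
    for (size_t width = grain; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi);
        }
        int *swap = src; src = dst; dst = swap;
    }

    /* The sorted result may have ended up in the scratch buffer. */
    if (src != data) memcpy(data, src, n * sizeof *data);
    free(tmp);
    return 0;
}
```

The tree has about log2(n / grain) merge levels, so a larger grain means fewer, coarser tasks and correspondingly fewer task allocations.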

I am personally glad the performance is competitive, even though we did not have time to tune for it. This is rather encouraging.
Thanks for your time,

[1] For those who recall the Chord-CnC effort with Intel's Mayur Naik

On Jan 21, 2013, at 12:11 PM, "Cledat, Romain E" <romain.e.cledat@intel.com> wrote:

Thanks for the feedback. This is very detailed and useful. I think some take-aways are:
- We definitely have to focus on the memory overhead and see whether some things can be optimized.
- There seems to be a scaling issue as the number of cores goes up. We should probably investigate this.
From: ocr-dev-bounces@lists.01.org [mailto:ocr-dev-bounces@lists.01.org] On Behalf Of Carter, Nicholas P
Sent: Friday, January 18, 2013 5:35 PM
To: ocr-dev@lists.01.org
Subject: [OCR-dev] Initial Experiences with OCR
(Not sure if this list or OCR-discuss would be better, but I suspect there's very little difference in readership at the moment)
Hello all,
I've been starting to look into OCR as a framework for future research, and wrote a parallel mergesort in order to learn how to code with EDTs.  I also did some performance analysis, which is in the attached slides.  Given how new OCR is, the OCR version of mergesort was impressively close to a pthreads version as long as I didn't try to use tasks that were too small, and managed to beat the pthreads version in some cases.  It's a promising start, particularly given that mergesort is a good program for a pthreads-style implementation (very regular, only log2(num processes) barriers).
One thing that turned up in the analysis is that there's fairly high malloc/free overhead in OCR programs.  Based on the examples, my code was calling malloc twice per EDT I spawned, and it looks like OCR is calling malloc/free a fair amount internally.  Not surprisingly, the overhead really starts to show up when you try to use very fine-grained tasks.
As an experiment, I tried using Google's tcmalloc instead of the base malloc/free that came with my Linux distributions, and there were some significant improvements.  For extremely fine-grained tasks, performance improved by greater than 20x.  For more reasonable task sizes, the impact of task size on performance got much smaller with tcmalloc.  Tcmalloc didn't help much for the larger task sizes, which isn't surprising.
At any rate, I thought I'd share the results with everyone and see if it starts some discussion.