Hi Nick,
You can add flags to the variable "OCR_CFLAGS" defined in src/Makefile.am.
It should be visible to all OCR submodules.
If you've already built ocr, you can do the following:
cd compileTree
make clean
make OCR_CFLAGS="your flags"
This just allows you to override flags from the command line.
Let me know if it does the trick.
Best,
Vincent
On Jan 22, 2013, at 5:34 PM, "Carter, Nicholas P"
<nicholas.p.carter(a)intel.com> wrote:
I didn’t use the data blocks in this version. I allocated one array
for the data to be sorted and a second array of temporary space for use in merging, and
then passed the indices of the regions each splitter/merger was responsible for.
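As a sketch of that scheme in C (the names here are my own invention, not from Nick's actual code): each merger gets only index bounds into the two shared arrays, `data` holding the values and `tmp` the scratch space, so no per-task data blocks are needed.

```c
#include <string.h>

/* Merge the sorted runs data[lo..mid) and data[mid..hi) through the
 * shared scratch array, then copy the result back in place. Only the
 * three indices need to be passed to the task. */
static void merge_range(int *data, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (data[i] <= data[j]) ? data[i++] : data[j++];
    while (i < mid) tmp[k++] = data[i++];
    while (j < hi)  tmp[k++] = data[j++];
    memcpy(data + lo, tmp + lo, (size_t)(hi - lo) * sizeof *data);
}
```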
My first pass at an OCR version was similar to what you’re suggesting. I replaced each
pthread with an EDT, and used pthreads barriers for synchronization. I didn’t do much
analysis of this version, because it wasn’t really in the OCR philosophy. You’re right
that I could build the merger tree explicitly rather than using the recursive splitters,
though this also seems somewhat not in the OCR philosophy. The splitters were such a
small component of the overall execution time that I’m not convinced it would make much
difference, though.
Generating time breakdowns for all of the different runs I did would be somewhat of a
pain, since I use VTune for that and currently have to go through a GUI for each run. I
might take a look at ways to script VTune, since having some sort of generic VTune-based
analyzer for OCR would be useful. At the moment, I can’t run the versions of the code
that use tcmalloc through VTune, because loading tcmalloc via LD_PRELOAD causes VTune to
crash. Romain, Sagnak, would it be possible for one of you to give me some help figuring
out how to pass -ltcmalloc to the compiler in the OCR build process?
For the timing numbers, I only included the parallel portion of the program, not any of
the init/tear-down. The sequence was:
1) Start OCR, allocate and initialize data arrays.
2) Start = gettimeofday()
3) Create and schedule first EDT.
4) Last EDT calls end = gettimeofday(), prints end-start as run time.
5) Check result, call ocr_cleanup
-Nick
From: Benoit Meister [mailto:meister@reservoir.com]
Sent: Monday, January 21, 2013 9:46 AM
To: Carter, Nicholas P
Subject: Re: [OCR-dev] Initial Experiences with OCR
Hi Nick,
thanks for sharing! I also have mixed performance results on my end (programs
generated by R-Stream), but I haven't done this kind of analysis yet.
I have a few questions & comments (I didn't know if the rest of the list would
want to see this):
- when splitting the input array, did you make a data block of each element, or did you
just use pointers to the input array ?
- If you used the input array directly, it looks like the same code structure as the
pthread programs could be implemented in OCR, meaning that you may not need explicit
splitters (only pointers into the array and size parameters), with your events playing
the role of the arrows you drew on slide 4.
- typo on slide 16: Excuction
- It would probably be more tedious, but if you broke down the run times on the graphs of
slides 8-9-10 and 15-16-17, we'd see the cause of the not-so-good scaling more
directly.
- Did you count the init and finish times of the OCR runtime in your execution times ?
Thanks,
- Benoit
On Fri, Jan 18, 2013 at 8:34 PM, Carter, Nicholas P <nicholas.p.carter(a)intel.com>
wrote:
(Not sure if this list or OCR-discuss would be better, but I suspect there’s very little
difference in readership at the moment)
Hello all,
I’ve been starting to look into OCR as a framework for future research,
and wrote a parallel mergesort in order to learn how to code with EDTs. I also did some
performance analysis, which is in the attached slides. Given how new OCR is, the OCR
version of mergesort was impressively close to a pthreads version as long as I didn’t try
to use tasks that were too small, and managed to beat the pthreads version in some cases.
It’s a promising start, particularly given that mergesort is a good program for a
pthreads-style implementation (very regular, only log2(num processes) barriers).
One thing that turned up in the analysis is that there’s fairly high
malloc/free overhead in OCR programs. Based on the examples, my code was calling malloc
twice per EDT I spawned, and it looks like OCR is calling malloc/free a fair amount
internally. Not surprisingly, the overhead really starts to show up when you try to use
very fine-grained tasks.
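A rough way to see why two mallocs per task hurt fine-grained runs is a micro-benchmark like the one below; the allocation sizes are arbitrary stand-ins, not OCR's real internal allocations:

```c
#include <stdlib.h>
#include <sys/time.h>

/* If every spawned task costs two small heap allocations, allocator
 * time grows linearly with task count, so it dominates exactly when
 * tasks are fine-grained. Returns allocator seconds per task. */
static double alloc_cost_per_task(long ntasks)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long i = 0; i < ntasks; i++) {
        void *params = malloc(64);    /* e.g. an EDT parameter block */
        void *deps   = malloc(32);    /* e.g. a dependence list */
        free(deps);
        free(params);
    }
    gettimeofday(&t1, NULL);
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
    return secs / (double)ntasks;
}
```

Running this once with the stock allocator and once with tcmalloc via LD_PRELOAD gives a quick estimate of the per-task savings.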
As an experiment, I tried using Google’s tcmalloc instead of the base
malloc/free that came with my Linux distribution, and there were some significant
improvements. For extremely fine-grained tasks, performance improved by more than 20x.
For more reasonable task sizes, the impact of task size on performance got much smaller
with tcmalloc. Tcmalloc didn’t help much for the larger task sizes, which isn’t
surprising.
At any rate, I thought I’d share the results with everyone and see if it
starts some discussion.
-Nick
_______________________________________________
OCR-dev mailing list
OCR-dev(a)lists.01.org
https://lists.01.org/mailman/listinfo/ocr-dev