(Not sure if this list or OCR-discuss would be better, but I suspect there’s very little difference in readership at the moment)
I’ve been starting to look into OCR as a framework for future research, and wrote a parallel mergesort in order to learn how to code with EDTs. I also did some performance analysis, which is in the attached slides. Given how new OCR is, the OCR version of mergesort was impressively close to a pthreads version as long as I didn’t try to use tasks that were too small, and managed to beat the pthreads version in some cases. It’s a promising start, particularly given that mergesort is a good program for a pthreads-style implementation (very regular, only log2(num processes) barriers).
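To make the task-size tradeoff concrete, here is a minimal fork-join mergesort sketch in C with pthreads. It is not the code behind the slides, and it is not the barrier-based pthreads version described above: it spawns a thread per left half instead, purely for illustration. All names (sort_args, sort_range, parallel_mergesort, min_task) are hypothetical. The min_task parameter is the granularity knob: ranges shorter than it are sorted without spawning another thread.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; an illustrative sketch only. */
typedef struct {
    int *data;       /* array being sorted */
    int *scratch;    /* temporary buffer of the same length */
    size_t lo, hi;   /* half-open range [lo, hi) */
    size_t min_task; /* ranges shorter than this never spawn a thread */
} sort_args;

/* Merge the sorted halves [lo, mid) and [mid, hi) of data. */
static void merge_halves(int *data, int *scratch,
                         size_t lo, size_t mid, size_t hi) {
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        scratch[k++] = (data[j] < data[i]) ? data[j++] : data[i++];
    while (i < mid) scratch[k++] = data[i++];
    while (j < hi)  scratch[k++] = data[j++];
    memcpy(data + lo, scratch + lo, (hi - lo) * sizeof(int));
}

/* Sort one range; the left half runs in a new thread if the range is big enough. */
static void *sort_range(void *p) {
    sort_args *a = p;
    if (a->hi - a->lo < 2) return NULL;

    size_t mid = a->lo + (a->hi - a->lo) / 2;
    sort_args left = *a, right = *a;
    left.hi = mid;
    right.lo = mid;

    pthread_t t;
    int spawned = 0;
    if (a->min_task > 1 && a->hi - a->lo >= a->min_task)
        spawned = (pthread_create(&t, NULL, sort_range, &left) == 0);
    if (!spawned) sort_range(&left);   /* too small, or thread creation failed */

    sort_range(&right);
    if (spawned) pthread_join(t, NULL);

    merge_halves(a->data, a->scratch, a->lo, mid, a->hi);
    return NULL;
}

/* Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int parallel_mergesort(int *data, size_t n, size_t min_task) {
    int *scratch = malloc((n ? n : 1) * sizeof(int));
    if (!scratch) return -1;
    sort_args a = { data, scratch, 0, n, min_task };
    sort_range(&a);
    free(scratch);
    return 0;
}
```

Calling parallel_mergesort(v, n, 128) on a 1000-element array spawns only a handful of threads, while a very small min_task spawns one for almost every subrange, which is the regime where per-task overhead dominates.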
One thing that turned up in the analysis is that there’s fairly high malloc/free overhead in OCR programs. Following the pattern in the examples, my code called malloc twice for every EDT I spawned, and it looks like OCR itself calls malloc/free a fair amount internally. Not surprisingly, the overhead really starts to show up when you try to use very fine-grained tasks.
As an experiment, I tried using Google’s tcmalloc instead of the base malloc/free that came with my Linux distributions, and there were some significant improvements. For extremely fine-grained tasks, performance improved by more than 20x. For more reasonable task sizes, performance became much less sensitive to task size with tcmalloc. tcmalloc didn’t help much for the larger task sizes, which isn’t surprising.
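For readers wondering why a thread-caching allocator would matter most for tiny tasks, here is a minimal C sketch of the general idea: each thread keeps its own free list of fixed-size blocks, so a release followed by an allocation on the same thread never reaches the underlying malloc. This is not tcmalloc's actual implementation, which is considerably more elaborate; BLOCK_SIZE and the block_* function names are hypothetical, and the sketch assumes a C11 compiler for _Thread_local.

```c
#include <stdlib.h>

#define BLOCK_SIZE 64  /* hypothetical fixed size, e.g. one small task descriptor */

/* A free block stores the link to the next free block in its own storage. */
typedef union block {
    union block *next;
    unsigned char payload[BLOCK_SIZE];
} block;

/* One free list and one counter per thread, so no locking is needed. */
static _Thread_local block *free_list = NULL;
static _Thread_local size_t malloc_calls = 0;

/* Take a block from this thread's free list, falling back to malloc. */
void *block_alloc(void) {
    block *b = free_list;
    if (b) {
        free_list = b->next;
        return b;
    }
    malloc_calls++;
    return malloc(sizeof(block));
}

/* Return a block to this thread's free list instead of calling free. */
void block_release(void *p) {
    block *b = p;
    b->next = free_list;
    free_list = b;
}

/* Number of times this thread actually fell through to malloc. */
size_t block_malloc_calls(void) {
    return malloc_calls;
}
```

In this sketch, blocks are never handed back to the system and a block released on one thread stays on that thread's list, both of which a real allocator has to handle.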
At any rate, I thought I’d share the results with everyone and see if it starts some discussion.