What I don't understand is why OCR scalability worsens past 16 cores, whereas pthread
scalability more or less plateaus beyond 30 cores.
I'm wondering if it could be a side effect of the stealing protocol locking the bus.
So yes, it would be great to get the code and see what we can find out on our end.
I don't think I have access to the X-Stack repo yet, but we're part of the effort,
so we should be able to arrange that.
Rob, Romain, can you contact me privately regarding that matter?
On Jan 31, 2013, at 3:14 PM, "Carter, Nicholas P" wrote:
Actually, I think that the scaling issues are due to the algorithm at
this point. After a conversation with Doug yesterday, I started thinking more about the
way the pthread version of the code works, and realized that there's effectively an
O(N) serial section involved in merging the sub-arrays that each pthread sorts. In a
4-core run, core 0 merges the N/4-length sorted lists that cores 0 and 1 generate, and
then merges the N/2-length result with the one produced on core 3, so it does O(N + N/2)
serial work. As you add more cores, the merge tree gets deeper, and the length of the
serial section converges to O(2N). That limits your run time to approximately O(2N) +
O(N(log N - log P)/P), so as P becomes comparable to log2(N), the serial section will
dominate. The sweeps I did have N between 2^20 and 2^30, so it makes sense that the
serial section becomes a real limiter at around 16 cores.
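To make that arithmetic concrete, here's a tiny standalone program (illustrative only,
not the benchmark code; N and the loop structure are just assumptions for the example)
that adds up the merge work on the serial critical path as the core count grows:

    #include <stdio.h>

    int main(void) {
        long N = 1L << 20;              /* total elements (example value) */
        for (int P = 2; P <= 64; P *= 2) {
            long serial = 0;            /* merge work on the critical path */
            for (long len = N / P; len < N; len *= 2)
                serial += 2 * len;      /* one merge of two len-length runs */
            printf("P=%2d  critical-path merge work = %.2fN\n",
                   P, (double)serial / N);
        }
        return 0;
    }

For P=4 it prints 1.50N (the N/2 + N above), and the number climbs toward 2N as P
grows, which is where the O(2N) figure comes from.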
The OCR version has basically the same problem once you get high enough in the merge tree
that there are fewer ready tasks than processors. I could easily believe that the growth
in cycles in the steal code is due to cores spinning as they try to find work. It might
be worth looking into ways to eliminate or reduce that spinning, because it has the
potential to waste a lot of energy. Some hardware support might be a big help there.
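As a software strawman (I haven't seen OCR's actual steal path, so try_steal() below is
just a stub I made up), the usual fix is exponential backoff between failed steal
attempts, so an idle worker yields and sleeps instead of hammering the other queues:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <sched.h>
    #include <time.h>

    static int attempts;

    /* Stub that fails a few times before "finding" work; stands in
       for the runtime's real steal path, which I'm only guessing at. */
    static void *try_steal(void) { return (++attempts < 8) ? NULL : &attempts; }

    static void *find_work(void) {
        long backoff_ns = 1000;             /* start at 1 us */
        const long max_ns = 1000000;        /* cap at 1 ms */
        void *task;
        while ((task = try_steal()) == NULL) {
            sched_yield();                  /* give up the core... */
            struct timespec ts = { 0, backoff_ns };
            nanosleep(&ts, NULL);           /* ...and sleep instead of spinning */
            if (backoff_ns < max_ns)
                backoff_ns *= 2;            /* exponential backoff */
        }
        return task;
    }

    int main(void) {
        find_work();
        printf("found work after %d steal attempts\n", attempts);
        return 0;
    }

Hardware support along the lines of monitor/mwait on the queues could do even better,
but even this simple backoff trades a little steal latency for far less wasted spinning.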
I've been thinking a bit about better parallel merge algorithms, and should have some
time next week to put one together. I'll re-run the analysis once I have that working
and let you guys know how it went.
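For reference, the standard trick I have in mind is to parallelize the merge itself:
binary-search for the "co-rank" that splits the two sorted runs at a given output
position, so P workers can each merge a disjoint slice independently, turning the final
O(N) merge into O(N/P) per worker plus an O(log N) search. A sketch (hypothetical; not
necessarily what I'll end up implementing):

    #include <stdio.h>

    /* Find i such that taking A[0..i) and B[0..k-i) gives the first k
       elements of the merged output. */
    static long co_rank(long k, const int *A, long m, const int *B, long n) {
        long lo = (k > n) ? k - n : 0;   /* at least k-n must come from A */
        long hi = (k < m) ? k : m;       /* at most min(k, m) can come from A */
        while (lo < hi) {
            long i = lo + (hi - lo) / 2; /* candidate count taken from A */
            long j = k - i;              /* count taken from B */
            if (j > 0 && i < m && B[j - 1] > A[i])
                lo = i + 1;              /* need more elements from A */
            else if (i > 0 && j < n && A[i - 1] > B[j])
                hi = i;                  /* need fewer elements from A */
            else
                return i;                /* valid split point */
        }
        return lo;
    }

    int main(void) {
        int A[] = {1, 3, 5, 7, 9, 11};
        int B[] = {2, 4, 6, 8, 10, 12};
        long m = 6, n = 6, P = 3;        /* P workers share one merge */
        for (long p = 0; p <= P; p++) {
            long k = p * (m + n) / P;    /* output position of split p */
            long i = co_rank(k, A, m, B, n);
            printf("output pos %2ld -> %ld from A, %ld from B\n", k, i, k - i);
        }
        return 0;
    }

Each worker p would then merge A[i(p)..i(p+1)) with the matching slice of B directly
into output positions [k(p)..k(p+1)), with no serial bottleneck at the root of the tree.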
As far as sending the code, I'm happy to share it but need to make sure I comply with
Intel's rules about releasing code. Do you (Rice team) have access to the X-Stack
repository? If I could put the code there and have it covered by whatever NDA exists for
X-Stack, it'd save me the effort of getting it stamped as publicly releasable. Rob,
Romain, would that work?
From: ocr-dev-bounces@lists.01.org [mailto:firstname.lastname@example.org] On Behalf Of
Sent: Thursday, January 31, 2013 12:08 PM
To: Technical discussion about OCR
Subject: [OCR-dev] Fwd: Execution time breakdowns
I'm forwarding Nicholas' email since I don't remember the admin password to
approve it myself, and my previous forward with the results was blocked too. Can someone
approve it?
One concern we have is that when the timing gets bad, most of the time is spent
trying to steal, which would indicate there's not enough work. Could it be a
granularity problem?
Another thing is that every run seems to break down around 16 cores/workers; we're
wondering if there could be something hard-coded in ocrInit or somewhere else.
Nicholas, can you send us the code you're using for both pthread and OCR so that we
can have a look?
> Benoit's question about execution time breakdowns got me thinking about how to
script VTune to generate the sort of data he was looking for, and it turned out not to
be too hard. (Meaning that it took a while to figure out, but is pretty easy once you
know how.)
> I wrote some scripts to sweep over the different array and chunk sizes, generating
execution times by function, and other scripts to process the data and plot the fraction
of execution time spent in each of the 10 functions that were the biggest contributors
across the sweep. Hopefully, they'll provide some data about where to look when the time
comes for performance tuning. These scripts should also be fairly easy to port to other
programs.