|
|||||
|
|
#1 |
|
|
I wonder if there is a tool out there that I could use to spot bottlenecks in a C-program except gprof? ***ume that I have a media type of application and I wanna build an Application Specific Processor for that application. So I wanna find out how to speed up these programs by SIMD, VLIW or operation fusion techniques. Can anyone suggest me some useful tools for this task? Cheers, Patrick |
|
|
#2 |
|
|
> Hi > > I wonder if there is a tool out there that I could use to spot > bottlenecks in a C-program except gprof? Sure there are thousands of them: http://www.google.com/search?hl=en&q=C+profiler > ***ume that I have a media type of application and I wanna build an > Application Specific > Processor for that application. So I wanna find out how to speed up > these programs > by SIMD, VLIW or operation fusion techniques. Can anyone suggest me > some useful > tools for this task? Improving the algorithm is a lot more important than cheesy little tricks. After you have profiled to find the slow spots, improve the algorithm or use a better one instead. That is the right way to get a large performance gain. |
|
|
#3 |
|
|
> Improving the algorithm is a lot more important than cheesy little > tricks. After you have profiled to find the slow spots, improve the > algorithm or use a better one instead. That is the right way to get a > large performance gain. Just changing the algorithm isn't the only solution. Changing data structures sometimes helps more, e.g. from an unsorted array to a tree-based set. If you use some datatype abstraction in the form of accessor functions, then if they turn out to be major bottlenecks you either they have slow algorithms (improving them may require changing the data structure), or the data is being accessed too much. (E.g. The best heap implementation in the world won't save you if you are malloc()ing and free()ing in the middle of your tightest loop when you could do so a few loops further out). The other note is to *measure* your optimisations. Run them several times in a loop and time it, if nothing else. If your program 10% goes faster but the code has turned to spaghetti then it is probably not a worthwhile optimisation. Finally, do not insult your compiler and library writers. They (usually) know far more about the task their code does than you do. Example insult for almost any compiler you will use: y = (x << 4) + (x << 2); /* faster than y = x * 10 */ -- Simon Richard Clarkstone: s.r.cl?rkst?n?@durham.ac.uk/s?m?n.cl?rkst?n?@hotmail.com "August 9 - I just made my signature file. Its only 6 pages long. I will have to work on it some more." -- _Diary of an AOL User_ |
|
|
#4 |
|
|
Simon Richard Clarkstone wrote:
> .... snip ... > > Finally, do not insult your compiler and library writers. They > (usually) know far more about the task their code does than you > do. Example insult for almost any compiler you will use: > y = (x << 4) + (x << 2); /* faster than y = x * 10 */ And, if intended to implement y = x * 10; wrong. :-) -- <http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt> <http://www.securityfocus.com/columnists/423> <http://www.aaxnet.com/editor/edit043.html> cbfalconer at maineline dot net -- Posted via a free Usenet account from http://www.teranews.com |
|
|
#5 |
|
|
On Jun 14, 5:54 pm, Patrick <Patrick.Batema...@gmx.at> wrote:
> Hi > > I wonder if there is a tool out there that I could use to spot > bottlenecks in a C-program except gprof? > ***ume that I have a media type of application and I wanna build an > Application Specific > Processor for that application. So I wanna find out how to speed up > these programs > by SIMD, VLIW or operation fusion techniques. Can anyone suggest me > some useful > tools for this task? If you're looking for low-level insight into your codes performance, Intel's V-Tune is hard to beat, although it doesn't much like AMD processors (it works, but only in basic mode), and it's a bit pricy. You might want to look at Intel's Integrated Performance Primitives library, which has very fast implementations of many basic media functions. |
|
|
#6 |
|
|
On Jun 14, 6:12 pm, Simon Richard Clarkstone
<s.r.clarkst...@durham.ac.uk> wrote: > user923005 wrote: > > Improving the algorithm is a lot more important than cheesy little > > tricks. After you have profiled to find the slow spots, improve the > > algorithm or use a better one instead. That is the right way to get a > > large performance gain. > > Just changing the algorithm isn't the only solution. Changing data > structures sometimes helps more, e.g. from an unsorted array to a > tree-based set. If you use some datatype abstraction in the form of > accessor functions, then if they turn out to be major bottlenecks you > either they have slow algorithms (improving them may require changing > the data structure), or the data is being accessed too much. (E.g. The > best heap implementation in the world won't save you if you are > malloc()ing and free()ing in the middle of your tightest loop when you > could do so a few loops further out). I should have been more specific. I meant combination of new data structures and /or new algorithms. E.g. changing from btree to hash table. > The other note is to *measure* your optimisations. Run them several > times in a loop and time it, if nothing else. If your program 10% goes > faster but the code has turned to spaghetti then it is probably not a > worthwhile optimisation. > > Finally, do not insult your compiler and library writers. They > (usually) know far more about the task their code does than you do. > Example insult for almost any compiler you will use: > y = (x << 4) + (x << 2); /* faster than y = x * 10 */ My compiler emits: ; 10 : return (x << 3) + (x << 1); /* faster than y = x * 10 */ mov eax, DWORD PTR _x$[esp-4] lea eax, DWORD PTR [eax+eax*4] add eax, eax .... ; 15 : return x*10; /* y = x * 10 */ mov eax, DWORD PTR _x$[esp-4] lea eax, DWORD PTR [eax+eax*4] add eax, eax |
|
|
#7 |
|
|
Thanks for the comments guys. But perhaps I should mention once
more that I dont wanna optimize the code on a algorithmic level. I wanna speed up some established algorithms with dedicated hardware support, e.g. SIMD or VLIW or combined operations. So the tool should in some way help me where operations could be executed in parallel or that it would make sense to add a custom instruction to the hardware architecture. Cheers, Patrick |
|
|
#8 |
|
|
user923005 wrote:
> On Jun 14, 6:12 pm, Simon Richard Clarkstone > <s.r.clarkst...@durham.ac.uk> wrote: > >>Finally, do not insult your compiler and library writers. They >>(usually) know far more about the task their code does than you do. What I meant by this, in case it was unclear, is that the compiler will usually be far better at micro-optimisations than you are, because the micro-optimiser was written by experts on the processor. Silly tricks will make your code less clear, and may actually make the compiler less able to optimise code, by adding extra ***umptions or being unidiomatic. [snip] > My compiler emits: > > ; 10 : return (x << 3) + (x << 1); /* faster than y = x * 10 */ > > mov eax, DWORD PTR _x$[esp-4] > lea eax, DWORD PTR [eax+eax*4] > add eax, eax > > ; 15 : return x*10; /* y = x * 10 */ > > mov eax, DWORD PTR _x$[esp-4] > lea eax, DWORD PTR [eax+eax*4] > add eax, eax Thank you for this excellent example of compiler intelligence; I will file it away for future reference. AFAICT, they are roughly like this C (***uming that sizeof(long) == 4): register int t = x; t = (int)(t + (long*)t); /* pointer arithmetic does: t = 4*t + t */ t = t + t; return t; The compiler knows best indeed. -- Simon Richard Clarkstone: s.r.cl?rkst?n?@durham.ac.uk/s?m?n.cl?rkst?n?@hotmail.com "August 9 - I just made my signature file. Its only 6 pages long. I will have to work on it some more." -- _Diary of an AOL User_ |
|
|
#9 |
|
|
On Jun 15, 9:12 am, Simon Richard Clarkstone
<s.r.clarkst...@durham.ac.uk> wrote: > user923005 wrote: > > On Jun 14, 6:12 pm, Simon Richard Clarkstone > > <s.r.clarkst...@durham.ac.uk> wrote: > > >>Finally, do not insult your compiler and library writers. They > >>(usually) know far more about the task their code does than you do. > > What I meant by this, in case it was unclear, is that the compiler will > usually be far better at micro-optimisations than you are, because the > micro-optimiser was written by experts on the processor. Silly tricks > will make your code less clear, and may actually make the compiler less > able to optimise code, by adding extra ***umptions or being unidiomatic. Right, that's why I posted the example. > [snip] > > > My compiler emits: > > > ; 10 : return (x << 3) + (x << 1); /* faster than y = x * 10 */ > > > mov eax, DWORD PTR _x$[esp-4] > > lea eax, DWORD PTR [eax+eax*4] > > add eax, eax > > > ; 15 : return x*10; /* y = x * 10 */ > > > mov eax, DWORD PTR _x$[esp-4] > > lea eax, DWORD PTR [eax+eax*4] > > add eax, eax > > Thank you for this excellent example of compiler intelligence; I will > file it away for future reference. AFAICT, they are roughly like this C > (***uming that sizeof(long) == 4): > > register int t = x; > t = (int)(t + (long*)t); /* pointer arithmetic does: t = 4*t + t */ > t = t + t; > return t; > > The compiler knows best indeed. I agree. I went to an ***embly optimization site and they had a long discussion about the best possible way to multiply by 10 across a broad range of processors. The final solution was exactly what was emitted by MS VC++. > -- > Simon Richard Clarkstone: > s.r.cl?rkst...@durham.ac.uk/s?m?n.cl?rkst...@hotmail.com > "August 9 - I just made my signature file. Its only 6 pages long. > I will have to work on it some more." -- _Diary of an AOL User_ |
|
|
#10 |
|
|
On Jun 15, 2:47 am, Patrick <Patrick.Batema...@gmx.at> wrote:
> Thanks for the comments guys. But perhaps I should mention once > more that I dont wanna optimize the code on a algorithmic level. > I wanna speed up some established algorithms with dedicated hardware > support, e.g. SIMD or VLIW or combined operations. So the tool should > in some way help me where operations could be executed in parallel > or that it would make sense to add a custom instruction to the > hardware > architecture. My compiler has a little dropdown that says: "Enable Enhanced Instruction Set": 'Not Set' 'Streaming SIMD Extensions (/arch:SSE)' 'Streaming SIMD Extensions 2 (/arch:SSE2)' The outcome of coosing one of those options besides 'Not Set' is a program that will run 5% faster on chips that support the extension, and that fail completely on those that don't. I think you are barking up the wrong tree if you want to improve your program. Now, that is not to say that there is no benefit from inline ***embly using SIMD or VLIW instructions. Just that these cases are very rare and will be uncovered by profiling. Since you are not interested in improving the algorithm, you have no hope at all for big gains. Any sort of gain you get will be very modest (by a small, linear factor) and come at a price of portabilty. IMO-YMMV. |