AMD SSE2 scalar performance
#1 |
|
Guest
Posts: n/a
|
Hi all
I came across a rather interesting set of numbers a little while back and was wondering if anyone had any thoughts on the matter. To start with, here are the benchmarks (part of a Prescott review):

http://techreport.com/reviews/2004q...t/index.x?pg=14

In particular, have a look at the BLAS DGEMM numbers for double-precision floating point. The P4 scores are mostly what you might expect: compiled C x87 code is fairly slow, assembly code for x87 is a fair bit faster, and SSE2 vectorized code is a lot faster again.

Looking at the Athlon64's performance, things are mostly fairly normal as well. SSE2 vector performance is lower than the P4's, but given that the Athlon64 and the P4 have the exact same maximum theoretical performance per clock and the P4 runs at much higher clock speeds, this is to be expected. The Athlon64 shows rather impressive compiled C code (significantly faster than the P4 here), but again this isn't too surprising, especially given that the tests are compiled with Microsoft's Visual Studio .NET compiler rather than Intel's C compiler.

Where things really get a bit odd though is with the SSE2 scalar code. The simple expectation would be that SSE2 scalar code should perform at roughly half the speed of SSE2 vector code, give or take a bit for memory subsystem issues. However, the numbers are REALLY different here. On the P4 system, SSE2 scalar code performs at only about 1/3 of the speed of the SSE2 vector code, and it's even slower than x87 C code. On the Athlon64 it's a TOTALLY different story. Here SSE2 scalar code performs exactly on par with the SSE2 vector code. x87 assembly code also offers essentially identical performance.

This doesn't seem to be a case of the thing being bandwidth limited, as the Athlon64 3400+ (64-bit memory interface) is within 3% of the performance of the equal-clock-speed Athlon64 FX 51 (128-bit memory interface). It might be some sort of cache bandwidth limit, though my understanding of the test is that the working set is 18MB in size, so that should blow all caches out of the picture (someone feel free to correct me if I'm wrong on this one).

So, basically the question for you all to ponder is simply: what is the difference between AMD's and Intel's SSE2 scalar implementations? This could have some VERY interesting implications for one important reason: in AMD64 Long Mode (64-bit code), AMD has specifically stated that x87, 3DNow! and MMX are deprecated in favor of SSE2 code. According to AMD's vision of things, ALL FPU code should always be handled by the SSE2 unit. Given that the bulk of x86 floating point code in existence today is scalar code (for better or for worse), that could mean that SSE2 scalar performance could have a MAJOR impact on a lot of applications.

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
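To put the 18MB working-set remark in perspective, here is a rough back-of-the-envelope sketch. It assumes the benchmark multiplies square matrices of doubles (three n x n matrices for C = A * B), reads "18MB" as 18 MiB, and uses a hypothetical 1 MiB L2 cache purely for illustration; none of these details are confirmed by the review.

```python
import math

BYTES_PER_DOUBLE = 8


def dgemm_working_set_bytes(n):
    """Bytes touched by C = A * B when A, B and C are all n x n doubles."""
    return 3 * n * n * BYTES_PER_DOUBLE


def largest_n_for_working_set(total_bytes):
    """Largest square dimension n whose three matrices fit in total_bytes."""
    return math.isqrt(total_bytes // (3 * BYTES_PER_DOUBLE))


WORKING_SET = 18 * 1024 * 1024  # the 18MB figure from the post, read as MiB (assumption)
ASSUMED_L2 = 1 * 1024 * 1024    # hypothetical 1 MiB L2 cache, for illustration only

n = largest_n_for_working_set(WORKING_SET)
cache_ratio = WORKING_SET / ASSUMED_L2

# A square DGEMM performs about 2*n^3 floating point operations in total,
# so the work done per byte of matrix data grows linearly with n.
flops_per_byte = (2 * n ** 3) / dgemm_working_set_bytes(n)

print(f"n = {n}, working set is {cache_ratio:.0f}x the assumed L2")
print(f"about {flops_per_byte:.1f} FLOPs per byte of matrix data")
```

Under these assumptions the data is far too large to sit in cache, as the post says. On the other hand, the high ratio of arithmetic to data means a blocked kernel can reuse each cached tile many times, which might be one reason the wider memory interface of the FX 51 made so little difference.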
|
|
|
#2 |
|
Guest
Posts: n/a
|
"Tony Hill" <hilla_nospam_20@yahoo.ca> wrote in message
news:3e19cc8fef1a06fb32a561fd28ef6f34@news.1usenet.com...
> So, basically the question for you all to ponder is simply what is the
> difference between AMD and Intel's SSE2 scalar implementation? This
> could have some VERY interesting implications for one important
> reason: In AMD64 Long Mode (64-bit code), AMD has specifically stated
> that x87, 3DNow! and MMX are deprecated in favor of SSE2 code.
> According to AMD's vision of things, ALL FPU code should always be
> handled by the SSE2 unit. Given that the bulk of x86 floating point code
> in existence today is scalar code (for better or for worse) that could
> mean that SSE2 scalar performance could have a MAJOR impact on a lot
> of applications.

It's likely that in AMD's floating point implementation, all floating point calcs (regardless of whether they are x87, 3DNow! or SSE) go through the same pipeline, whereas in the P4, with its funky micro-ops conversion mechanism, each type of instruction goes through a separate pipeline for at least a part of its journey.

Yousuf Khan
|
|
|
#3 |
|
Guest
Posts: n/a
|
On Mon, 23 Feb 2004 21:42:16 GMT, "Yousuf Khan" <news.20.bbbl67@spamgourmet.com> wrote:
>"Tony Hill" <hilla_nospam_20@yahoo.ca> wrote in message
>news:3e19cc8fef1a06fb32a561fd28ef6f34@news.1usenet.com...
>> So, basically the question for you all to ponder is simply what is the
>> difference between AMD and Intel's SSE2 scalar implementation? This
>> could have some VERY interesting implications for one important
>> reason: In AMD64 Long Mode (64-bit code), AMD has specifically stated
>> that x87, 3DNow! and MMX are deprecated in favor of SSE2 code.
>> According to AMD's vision of things, ALL FPU code should always be
>> handled by the SSE2 unit. Given that the bulk of x86 floating point code
>> in existence today is scalar code (for better or for worse) that could
>> mean that SSE2 scalar performance could have a MAJOR impact on a lot
>> of applications.
>
>It's likely that in AMD's floating point implementation, all floating point
>calcs (regardless of whether they are x87, 3DNow! or SSE) go through the same
>pipeline, whereas in the P4, with its funky micro-ops conversion mechanism,
>each type of instruction goes through a separate pipeline for at least a
>part of its journey.

I did a bit more research and found the following on Ace's Hardware message forum, posted by Gipsel:

<quoting>
The Athlon uses the FPU units for SSE2, which is the reason the performance is the same with vector and scalar instructions (and theoretically also x87): you can always do 2 FLOPs per cycle. You have to realize the Athlon64 can't do vector SSE2 directly. A vector SSE2 instruction is broken down into 2 scalar MacroOps.

The P4 is different: the core has only one issue port for FP calculations (the Athlon has two). This port has to be shared between all x87 and SSE/2 instructions. That means you have a different limit for scalar SSE2/x87 and vector SSE2 instructions, of 1 and 2 FLOPs per cycle respectively.
<end quote>

So basically yes, what you were saying is more or less right on. This does explain some things, in particular why AMD's SSE2 scalar and SSE2 vector code should offer pretty similar performance in a BLAS-type application.

It still doesn't quite explain the P4's relatively poor scalar SSE2 performance. From the above, its scalar SSE2 BLAS algorithm should perform at roughly half the speed of its SSE2 vector algorithm. However, in this test the vector code ended up at least 2.7 times faster on the Northwood and up to 3.3 times faster on the Prescott.

Anyway, I guess the big question will end up being whether or not x86-64 long mode ends up ONLY using SSE2 for floating point.

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
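The gap between the quoted peak-rate explanation and the measured numbers can be made explicit with a small sketch. The per-cycle figures below are taken from the Gipsel quote and the 2.7x and 3.3x ratios from the post; the function and variable names are made up for illustration, and the model ignores memory stalls, latencies and everything else that affects a real run.

```python
# Peak double-precision FLOPs per cycle, as described in the quoted post.
PEAK_FLOPS_PER_CYCLE = {
    "Athlon64": {"scalar": 2, "vector": 2},
    "P4": {"scalar": 1, "vector": 2},
}

# Vector-over-scalar speedups reported in the post for the two P4 cores.
OBSERVED_P4_SPEEDUP = {"Northwood": 2.7, "Prescott": 3.3}


def predicted_vector_speedup(cpu):
    """Vector-over-scalar speedup implied by the peak-rate model alone."""
    peak = PEAK_FLOPS_PER_CYCLE[cpu]
    return peak["vector"] / peak["scalar"]


def unexplained_factor(observed, cpu="P4"):
    """How much larger the observed speedup is than the model predicts."""
    return observed / predicted_vector_speedup(cpu)


for core, observed in OBSERVED_P4_SPEEDUP.items():
    print(
        f"{core}: observed {observed:.1f}x, "
        f"model predicts {predicted_vector_speedup('P4'):.1f}x, "
        f"unexplained factor {unexplained_factor(observed):.2f}x"
    )
```

The model reproduces the Athlon64 result (a predicted ratio of 1.0) but leaves a sizeable factor unaccounted for on both P4 cores, which is the puzzle the rest of the thread tries to explain.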
|
|
|
#4 |
|
Guest
Posts: n/a
|
"Tony Hill" <hilla_nospam_20@yahoo.ca> wrote in message
news:08a62443942e6ddf09f8ee9d48b77126@news.1usenet.com...
> So basically yes, what you were saying is more or less right-on. This
> does explain some things, in particular how AMD's SSE2 scalar and SSE2
> vector performance should offer pretty similar performance in a BLAS
> type application. It still doesn't quite explain the P4's relatively
> poor scalar SSE2 performance. From the above its scalar SSE2 BLAS
> algorithm should perform at roughly half the speed of its SSE2 vector
> algorithm. However in this test it ended up that the vector code was
> at least 2.7 times faster for the Northwood and up to 3.3 times faster
> with the Prescott.

Sounds like it has something to do with the pipeline stages in the P4. Namely, there's likely a stage where all SSE vector operations are pipelined together into a single continuous sequence. Not just one vector, but as many vectors as are coming in a row. A Northwood could only have a small number of these vector operations in flight at once, whereas Prescott, with its roughly 50% longer pipeline, may be able to string together a larger number of the operations in a row.

> Anyway, I guess the big question will end up being whether or not
> x86-64 long mode ends up ONLY using SSE2 for floating point or not.

Sounds like Microsoft's OS is turning off the x87 unit, allowing only SSE through. The Linux boys seem not to have enforced this, and so they allow everything to go through. Not sure why Microsoft is doing this, as the instructions to save and restore everything from the x87 through to the SSE3 registers are the same FXSAVE and FXRSTOR instructions. It doesn't take any more instructions to save x87 state than it does to save SSE state.

Anyways, I'll be out of town for the next month. I'll try and see what discussions are playing here off and on again, but if I don't then I'll see you guys after next month.

Yousuf Khan
|
|
|
#5 |
|
Guest
Posts: n/a
|
On Tue, 24 Feb 2004 07:05:44 GMT, "Yousuf Khan" <news.20.bbbl67@spamgourmet.com> wrote:
>"Tony Hill" <hilla_nospam_20@yahoo.ca> wrote in message
>news:08a62443942e6ddf09f8ee9d48b77126@news.1usenet.com...
>> So basically yes, what you were saying is more or less right-on. This
>> does explain some things, in particular how AMD's SSE2 scalar and SSE2
>> vector performance should offer pretty similar performance in a BLAS
>> type application. It still doesn't quite explain the P4's relatively
>> poor scalar SSE2 performance. From the above its scalar SSE2 BLAS
>> algorithm should perform at roughly half the speed of its SSE2 vector
>> algorithm. However in this test it ended up that the vector code was
>> at least 2.7 times faster for the Northwood and up to 3.3 times faster
>> with the Prescott.
>
>Sounds like it has something to do with the pipeline stages in the P4.
>Namely, there's likely a stage where all SSE vector operations are pipelined
>together into a single continuous sequence. Not just one vector, but as many
>vectors as are coming in a row. A Northwood could only have a small number
>of these vector operations in flight at once, whereas Prescott, with its
>roughly 50% longer pipeline, may be able to string together a larger number
>of the operations in a row.

That could help explain why the Prescott did better than the Northwood in SSE2 vector operations, but it doesn't suggest at all why it does WORSE on SSE2 scalar operations.

I did a quick search through the Intel optimization guide but didn't come up with much. The only thing that struck me as possible is that in the SSE2 vector code they might be doing everything as add -> multiply -> add -> multiply, etc., while in the SSE2 scalar code they are doing add -> add -> multiply -> multiply. This seems like a bit of a trivial optimization, and the author of the benchmark (Tim Wilkens) seems like a fairly smart cookie, so I would guess that he would have thought of this.

Everything else in Intel's optimization guides seems to point to SSE2 scalar being able to run at half the speed of SSE2 vector for this sort of application, and at the very least it should be faster than x87 compiled code. Several times in their guide they suggest using SSE2 scalar code instead of x87 code, as the former should be faster unless you need specific x87 functionality (which a basic BLAS implementation wouldn't, AFAIK).

>> Anyway, I guess the big question will end up being whether or not
>> x86-64 long mode ends up ONLY using SSE2 for floating point or not.
>
>Sounds like Microsoft's OS is turning off the x87 unit, allowing only SSE
>through. The Linux boys seem not to have enforced this, and so they allow
>everything to go through. Not sure why Microsoft is doing this, as the
>instructions to save and restore everything from the x87 through to the SSE3
>registers are the same FXSAVE and FXRSTOR instructions. It doesn't take any
>more instructions to save x87 state than it does to save SSE state.

I don't know quite what's going on here. I also wonder if Intel's new x86-64 implementation will throw a monkey wrench into any of this. As best as I can tell, Intel is NOT specifying that SSE2 is the one true floating point unit in 64-bit mode. In fact, they don't seem to make ANY mention of this possibility at all.

>Anyways, I'll be out of town for the next month. I'll try and see what
>discussions are playing here off and on again, but if I don't then I'll see
>you guys after next month.

Have a good trip!

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
|
|
|
#6 |
|
Guest
Posts: n/a
|
Tony Hill <hilla_nospam_20@yahoo.ca> wrote in message news:<15b9449c7391107836c70dc91cd92ad7@news.1usenet.com>...
> That could help explain why the Prescott did better than the Northwood
> in SSE2 vector operations, but it doesn't suggest at all why it does
> WORSE on SSE2 scalar operations.
>
> I did a quick search through the Intel optimization guide but didn't
> come up with much. The only thing that struck me as possible is that
> in SSE2 vector code they might be doing everything as: add -> multiply
> -> add -> multiply, etc. while in SSE2 scalar they are doing: add ->
> add -> multiply -> multiply. This seems like a bit of a trivial
> optimization, and the author of the benchmark (Tim Wilkens) seems like
> a fairly smart cookie, so I would guess that he would have thought of
> this.

If vector operations are just multiple back-to-back scalar operations (or, if you prefer, if scalar operations are just one-dimensional vector operations), then a lot more micro-ops can remain in flight when doing vector rather than scalar work. So it does you good to have a lot of instructions in flight on a P4 of any kind, but even more so on Prescott with its longer pipeline.

> >Sounds like Microsoft's OS is turning off the x87 unit, allowing only SSE
> >through. The Linux boys seem not to have enforced this, and so they allow
> >everything to go through. Not sure why Microsoft is doing this, as the
> >instructions to save and restore everything from the x87 through to the SSE3
> >registers are the same FXSAVE and FXRSTOR instructions. It doesn't take any
> >more instructions to save x87 state than it does to save SSE state.
>
> I don't know quite what's going on here. I also wonder if Intel's new
> x86-64 implementation will throw a monkey wrench into any of this. As
> best as I can tell Intel is NOT specifying that SSE2 is the one true
> floating point unit in 64-bit mode. In fact, they don't seem to make
> ANY mention of this possibility at all.

Intel isn't saying it, but Microsoft is. AMD's x87 was great, but Intel's wasn't. AMD's SSE was great, and so was Intel's. So the lowest common denominator rules here. What language runs great on both machines?

> >Anyways, I'll be out of town for the next month. I'll try and see what
> >discussions are playing here off and on again, but if I don't then I'll see
> >you guys after next month.
>
> Have a good trip!

Thanks, already am. It's warmer than Canada here. Greetings from Bangladesh.

Yousuf Khan
|
|
|
#7 |
|
Guest
Posts: n/a
|
Black Jack wrote:
>>Have a good trip!
>
> Thanks, already am. It's warmer than Canada here.

Probably not snowing either. Again.

> Greetings from Bangladesh.

Now why did I think you had gone to Pakistan?

>
> Yousuf Khan
|
|
|
#8 |
|
Guest
Posts: n/a
|
On 3 Mar 2004 22:20:41 -0800, news.yaya.bbbl67@spamgourmet.com (Black Jack) wrote:
>Tony Hill <hilla_nospam_20@yahoo.ca> wrote in message news:<15b9449c7391107836c70dc91cd92ad7@news.1usenet.com>...
>> That could help explain why the Prescott did better than the Northwood
>> in SSE2 vector operations, but it doesn't suggest at all why it does
>> WORSE on SSE2 scalar operations.
>>
>> I did a quick search through the Intel optimization guide but didn't
>> come up with much. The only thing that struck me as possible is that
>> in SSE2 vector code they might be doing everything as: add -> multiply
>> -> add -> multiply, etc. while in SSE2 scalar they are doing: add ->
>> add -> multiply -> multiply. This seems like a bit of a trivial
>> optimization, and the author of the benchmark (Tim Wilkens) seems like
>> a fairly smart cookie, so I would guess that he would have thought of
>> this.
>
>If vector operations are just multiple back-to-back scalar operations
>(or, if you prefer, if scalar operations are just one-dimensional
>vector operations), then a lot more micro-ops can remain in flight
>when doing vector rather than scalar work. So it does you good to have
>a lot of instructions in flight on a P4 of any kind, but even more so
>on Prescott with its longer pipeline.

Regardless of whether you're using vectors or scalars, BLAS should always have a fairly constant stream of FP adds and FP mults, at least according to my understanding of the benchmark. The test is just a large 2D matrix that is being solved, so things like branches should be more or less non-existent.

I suppose it could be an issue with getting the data from memory, and this is causing more stalls in scalar mode than in vector mode. I dunno.

>> I don't know quite what's going on here. I also wonder if Intel's new
>> x86-64 implementation will throw a monkey wrench into any of this. As
>> best as I can tell Intel is NOT specifying that SSE2 is the one true
>> floating point unit in 64-bit mode. In fact, they don't seem to make
>> ANY mention of this possibility at all.
>
>Intel isn't saying it, but Microsoft is.

Microsoft's line is directly along with what AMD is saying as well though; Intel is the odd man out here. Of course, as is usually the case, I would imagine that MS will end up with the last word here.

> AMD's x87 was great, but
>Intel's wasn't. AMD's SSE was great, and so was Intel's. So the lowest
>common denominator rules here. What language runs great on both
>machines?

Ahh, but that's the question I was getting at here. It seems that Intel's SSE2 implementation is NOT great, at least not when dealing with scalar operations.

>> >Anyways, I'll be out of town for the next month. I'll try and see what
>> >discussions are playing here off and on again, but if I don't then I'll see
>> >you guys after next month.
>>
>> Have a good trip!
>
>Thanks, already am. It's warmer than Canada here. Greetings from
>Bangladesh.

Sounds nice! It has been reasonably warm here in Ottawa though... mind you, it's also been cloudy and rainy. Reminds me of last winter when I was in Ireland :>

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
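The "constant stream of FP adds and mults" point in the post above can be illustrated with a naive matrix multiply, which is what BLAS DGEMM computes (C = A * B in its simplest form, leaving out the alpha/beta scaling of the full routine). This is a plain-Python sketch for counting operations, not the benchmark's actual kernel; the function name and the counters are hypothetical.

```python
def dgemm_naive(a, b):
    """Multiply matrices a (n x k) and b (k x m), counting FP operations."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    mults = 0
    adds = 0
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # One multiply and one add per inner iteration; the only
                # branches are the loop counters, never the data itself.
                acc += a[i][p] * b[p][j]
                mults += 1
                adds += 1
            c[i][j] = acc
    return c, mults, adds


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c, mults, adds = dgemm_naive(a, b)
print(c, mults, adds)
```

Every inner iteration issues exactly one multiply and one add, so the mix is an even 50/50 whatever the matrix contents are. How those operations are ordered and packed into scalar or vector instructions is up to the kernel author, which is where the add/multiply interleaving question raised earlier in the thread comes in.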
|
|
|
#9 |
|
Guest
Posts: n/a
|
Tony Hill <hilla_nospam_20@yahoo.ca> wrote in message news:<a7me40tf6k5at822mtn6au1u6m62amdi2u@4ax.com>...
> Regardless of whether you're using vectors or scalars BLAS should
> always have a fairly constant stream of FP Adds and FP Mults, at least
> according to my understanding of the benchmark. The test is just a
> large 2D matrix that is being solved, so things like branches should
> be more or less non-existent.
>
> I suppose it could be an issue with getting the data from memory and
> this is causing more stalls in scalar mode than vector mode. I dunno.

It's likely that the Intel implementation is always operating on the full 128-bit register width whether it is using scalars or vectors, so if you use scalars, half of the register is wasted.

> >> I don't know quite what's going on here. I also wonder if Intel's new
> >> x86-64 implementation will throw a monkey wrench into any of this. As
> >> best as I can tell Intel is NOT specifying that SSE2 is the one true
> >> floating point unit in 64-bit mode. In fact, they don't seem to make
> >> ANY mention of this possibility at all.
> >
> >Intel isn't saying it, but Microsoft is.
>
> Microsoft's line is directly along with what AMD is saying as well
> though, Intel is the odd-man out here. Of course, as is usually the
> case, I would imagine that MS will end up with the last word here.

One would think that Microsoft is trying to aid Intel here by preferring SSE over x87. I certainly can't see AMD objecting one way or the other if Microsoft decided to prefer x87 over SSE -- it's got an answer on either front.

> > AMD's x87 was great, but
> >Intel's wasn't. AMD's SSE was great, and so was Intel's. So the lowest
> >common denominator rules here. What language runs great on both
> >machines?
>
> Ahh, but that's the question I was getting at here. It seems that
> Intel's SSE2 implementation is NOT great, at least not when dealing
> with scalar operations.

Well, okay, Intel's SSE implementation isn't consistently good, but it's still better than its x87 implementation.

Yousuf Khan
|
|
|
#10 |
|
Guest
Posts: n/a
|
On 7 Mar 2004 06:05:07 -0800, news.yaya.bbbl67@spamgourmet.com (Black Jack) wrote:
>Tony Hill <hilla_nospam_20@yahoo.ca> wrote in message news:<a7me40tf6k5at822mtn6au1u6m62amdi2u@4ax.com>...
>> Regardless of whether you're using vectors or scalars BLAS should
>> always have a fairly constant stream of FP Adds and FP Mults, at least
>> according to my understanding of the benchmark. The test is just a
>> large 2D matrix that is being solved, so things like branches should
>> be more or less non-existent.
>>
>> I suppose it could be an issue with getting the data from memory and
>> this is causing more stalls in scalar mode than vector mode. I dunno.
>
>It's likely that the Intel implementation is always operating on the
>full 128-bit register width whether it is using scalars or vectors, so
>if you use scalars half of the register is wasted.

Yup, that is the case for AMD's implementation as well. SSE/SSE2 will always tend to operate better with vector operations than with scalar stuff, as long as your code is decently written for both.

>> Microsoft's line is directly along with what AMD is saying as well
>> though, Intel is the odd-man out here. Of course, as is usually the
>> case, I would imagine that MS will end up with the last word here.
>
>One would think that Microsoft is trying to aid Intel here by
>preferring SSE over x87. I certainly can't see AMD objecting one way
>or the other if Microsoft decided to prefer x87 over SSE -- it's
>got an answer on either front.

Well, if this one BLAS test is any indication, AMD's SSE2 scalar performance is head and shoulders ahead of Intel's.

>> Ahh, but that's the question I was getting at here. It seems that
>> Intel's SSE2 implementation is NOT great, at least not when dealing
>> with scalar operations.
>
>Well, okay Intel's SSE implementation isn't consistently good, but
>it's still better than its x87 implementation.

In this particular test it wasn't as good as x87; that's the problem. The SSE2 vector code was great and super-fast, but its SSE2 scalar code was quite slow, slower than even compiled x87 code and definitely slower than x87 optimized assembly code. I can see no good reason for this to be the case, and in fact Intel does say in all of their optimization guides that SSE2 scalar code SHOULD be faster than x87 code.

In any case, it may simply be that this test is a bit of an anomaly.

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
|
