RDTSC performance on different x86 archs
Not sure where to go with this question. I saw your listing in comp.sys.intel.

I'm having strange behavior with the RDTSC instruction.

On Jul 10 1994, 12:19 am, g...@ichips.intel.com (Andy Glew) wrote:
> (3) RDMSR(MSR=10h) versus RDTSC: yes, indeed, MSR=10h is the TimeStamp
> Counter (TSC). However, accessing this via RDMSR and WRMSR is *not*
> portable.
> RDTSC is the *portable*, architectural, way of accessing the
> timestamp counter. It's faster, and it has certain other conveniences.
> Please avoid using RDMSR(MSR=10).
> There is no portable way of writing the TSC. WRMSR(MSR=10h) works
> to a degree, but is non-portable. Moreover, arbitrary writeability is
> *not* guaranteed - it may not be possible to write any arbitrary bit
> pattern to the counter.

OK, I am having some unusual results from rdtsc.

I have a small C program with some inline assembly (in GNU style):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long int t0, t1;
        int result;
        unsigned int ret0[2];
        unsigned int ret1[2];
        __asm__ __volatile__("rdtsc" : "=a"(ret0[0]), "=d"(ret0[1]));
        __asm__ ("xorl %ecx, %ecx \n\t"
                 "L1: \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "rdtsc \n\t"
                 "addl $16, %ecx \n\t"
                 "cmpl $8192, %ecx \n\t"
                 "jne L1");
        __asm__ __volatile__("rdtsc" : "=a"(ret1[0]), "=d"(ret1[1]));
        t0 = *(unsigned long long int*)ret0;
        t1 = *(unsigned long long int*)ret1;
        result = (t1-t0)/8192;
        printf("ticks per rdtsc %d \n",result);
        return result;
    }

This compiles and runs fine with both Intel and GNU compilers 3.3, 4.0, etc.

When I compile this and execute it under Cygwin (running on Windows XP) on an AMD 4200+ I get

    ./a.exe
    ticks per rdtsc 6

which isn't 1 or 2, but I can live with 6 clock ticks to process a seldom-called op.

If I compile and run this under Mac OS X (new Apple MacBook Pro, Intel Core 2) I get 65 ?!?!

If I compile and run this on SUSE Linux on a Xeon processor, I get 85 ?!?! (Intel or GNU compiler)

I'm not even putting in serializing. Does that look right to anyone? Can anyone verify they get the same results on their x86 machines?

Brian VS
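
A note for anyone re-running this test: the middle asm block in the program above overwrites EAX, EDX and ECX without declaring them to the compiler, and it is not marked volatile, so an optimizing compiler may move or drop it. Below is a minimal sketch of the same measurement with the clobbers declared, assuming GCC or Clang on x86. The names `read_tsc`, `ticks_per_rdtsc` and the `RDTSC4` macro are illustrative and not part of the original program.

```c
#include <assert.h>
#include <stdint.h>

/* Read the time-stamp counter; RDTSC returns the value in EDX:EAX. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Four RDTSC instructions as one string, to keep the template short. */
#define RDTSC4 "rdtsc\n\trdtsc\n\trdtsc\n\trdtsc\n\t"

/* Hypothetical helper: average ticks per RDTSC over `count`
   back-to-back executions. `count` must be a positive multiple of 16
   because the loop body is unrolled 16 times, as in the original. */
static uint64_t ticks_per_rdtsc(uint32_t count)
{
    assert(count > 0 && count % 16 == 0);
    uint32_t i = 0;
    uint64_t t0 = read_tsc();
    __asm__ __volatile__(
        "1:\n\t"
        RDTSC4 RDTSC4 RDTSC4 RDTSC4
        "addl $16, %0\n\t"
        "cmpl %1, %0\n\t"
        "jne 1b"
        : "+r"(i)               /* loop counter, any free register  */
        : "r"(count)
        : "eax", "edx", "cc");  /* RDTSC writes EAX and EDX         */
    uint64_t t1 = read_tsc();
    return (t1 - t0) / count;
}
```

Calling `ticks_per_rdtsc(8192)` corresponds to the loop in the original post. The local label `1:` avoids a duplicate-symbol error if the compiler inlines the function more than once, and combining the two halves with shifts avoids the pointer cast on `ret0` and `ret1`.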
Re: RDTSC performance on different x86 archs
Yes, similar results. 6 ticks on Athlon 64, 11 on Athlon, around 80 on Intel P4 and 30 ticks on Intel P3. Looks OK. rdtsc isn't serializing, so it cannot be used to time very short instruction sequences anyway, and this does not matter much.

Daniel

On Tue, 14 Aug 2007 04:36:10 +0200, <meta.x.gdb@gmail.com> wrote:
> I'm not even putting in serializing. does that look right to
> anyone ? Can anyone verify they get the same results on their x86
> machines ?
>
> Brian VS
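
To make the serialization point concrete: RDTSC can execute before earlier instructions have finished, so a common workaround is to put a serializing instruction such as CPUID in front of it. The sketch below assumes GCC or Clang on x86; `read_tsc_serialized` and `min_gap` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Plain RDTSC: not serializing, so it may execute before earlier
   instructions have completed. */
static uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* CPUID is a serializing instruction, so every earlier instruction
   completes before the RDTSC that follows it. CPUID overwrites
   EAX, EBX, ECX and EDX, hence the clobber list. */
static uint64_t read_tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: smallest gap, in ticks, between two
   back-to-back reads taken with the given read function. */
static uint64_t min_gap(uint64_t (*read_fn)(void), int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t0 = read_fn();
        uint64_t t1 = read_fn();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Comparing `min_gap(read_tsc, 1000)` with `min_gap(read_tsc_serialized, 1000)` gives a rough idea of what the serialization costs on a given machine. CPUID is itself slow, so the serialized read only makes sense when the timed region is long enough to absorb it.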
Re: RDTSC performance on different x86 archs
On Aug 14, 12:13 am, Daniel Spångberg <dani...@mkem.uu.se> wrote:
> Yes, similar results. 6 ticks on athlon64, 11 on athlon, around 80 on
> intel p4 and 30 ticks on intel p3. Looks ok. Since rdtsc cannot be used to
> time very short instruction sequences anyway, since it isn't serializing,
> this does not matter much anyway.
> Daniel

At issue for me is the overall cost to the code of having it instrumented with profilers. There are lots of examples on the web of people using rdtsc to time routines that are on the order of 200 cycles. In cases like this, an instrumented version of the code will be significantly impacted by the act of measuring. It seems the P4 is a particularly pokey implementation of this instruction.

It's not a show stopper for me. It would be nice if someone with access to an IA64 chip could make a similar test. The Intel IA64 compiler doesn't support inline assembly, but it does have a built-in intrinsic __rdtsc() that should do the same thing.

It would also be nice if future Intel designs were more aware of measuring needs. It is frustrating that I need to be running in privileged mode to read the hardware counters. Folks in national labs don't get to patch their kernels or run in privilege = 0 that often.

The IBM Power chip manages a similar instruction in 2 cycles. I was quite impressed.

Brian Van Straalen
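
One way to reduce the impact of the measurement on short routines is to calibrate the cost of an empty start/stop pair and subtract it. The sketch below is only that, assuming an x86 target with GCC or Clang, where `__rdtsc()` is available from `<x86intrin.h>`. The names `timer_overhead`, `time_routine` and `short_routine` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang, x86 only */

/* Smallest observed cost of an empty start/stop pair, in TSC ticks. */
static uint64_t timer_overhead(int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Hypothetical short routine standing in for the code being profiled.
   The volatile sink keeps the compiler from deleting the loop. */
static volatile uint32_t sink;
static void short_routine(void)
{
    uint32_t x = 1;
    for (int i = 0; i < 50; i++)
        x = x * 1664525u + 1013904223u;
    sink = x;
}

/* Minimum ticks spent in fn() over `trials` runs, with the calibrated
   overhead subtracted and the result clamped at zero. */
static uint64_t time_routine(void (*fn)(void), int trials, uint64_t overhead)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t0 = __rdtsc();
        fn();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best > overhead ? best - overhead : 0;
}
```

A typical call would be `time_routine(short_routine, 1000, timer_overhead(1000))`. Taking the minimum filters out interrupts and cache misses, but subtracting the overhead does not undo how the extra instructions perturb the pipeline, so results for routines of a few hundred cycles remain approximate.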