Fast Copy and Clear loop for Mode 13H 320x200 256 color.
by Charlie Wallace
I've started to add timing information for various memory copies and
clears. All routines were tested using two 64 KB buffers on a Pentium 90.
I hope to have at least two sets of figures for each routine: main RAM to
main RAM, and main RAM to video RAM. I still need to check the source
below; I think there may be a couple of typos. You'll also notice I don't
use modify and parm in the #pragma aux snippets. Feel free to add them if
you want; they're not necessary for the way I use them.
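A minimal sketch of the kind of Main2Main timing run described above, assuming a plain memcpy/memset baseline over two 64 KB buffers. The names BUF_SIZE, copy_and_clear and time_copy_and_clear are made up for illustration, and clock() is only a portable stand-in; on a Pentium the rdtsc wrapper from the source below would give much finer resolution.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE 65536u                 /* two 64 KB buffers, as in the text */

static unsigned char src_buf[BUF_SIZE];
static unsigned char dst_buf[BUF_SIZE];

/* Baseline copy and clear: copy src to dst, then zero src. */
static void copy_and_clear(unsigned char *dst, unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
    memset(src, 0, n);
}

/* Run the baseline `passes` times and return the elapsed CPU time in
   seconds. After the first pass the source holds zeros, which costs the
   same to copy as any other data. */
static double time_copy_and_clear(unsigned passes)
{
    clock_t start, stop;
    unsigned i;

    assert(passes > 0);
    start = clock();
    for (i = 0; i < passes; i++)
        copy_and_clear(dst_buf, src_buf, BUF_SIZE);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Dividing the result of something like time_copy_and_clear(10000) by the pass count gives a rough per-frame cost; a single pass is usually too short for clock() to resolve.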
Type of Copy and Clear | Main2Main | Mem2Vid | Other Info
Watcom C++ 10.6 using a memcpy and memset, /otexan /5r | Click here | Click here | Profiler Information
My original copy and clear routine (cpu) | - | Click here | -
FPU copy and CPU clear, version 1 | - | - | -
FPU copy and CPU clear, version 2 | - | - | -
FPU copy and CPU clear, version 3 | - | - | -
** These figures aren't concrete yet. I think the Main2Main Watcom
routine may only be the memcpy; interestingly, Watcom doesn't seem to have
an inline version of memset.
I've pulled this out of a .h file and translated it to HTML via Netscape
Gold. It's not pretty, but everything should be OK; check it yourself if
you're not sure.
These are in Watcom #pragma aux format, which is easily translatable to
other implementations. For MSVC it's:
_asm {
mov ecx,dword ptr abuf
..etc
}
I believe Borland C is similar. For a plain .asm file, just remove the
pragma overhead. Watch out for register overwriting: I prefer to control
my own registers and not use modify exact [] in the aux pragma, so that I
can chain #pragmas together without extra overhead.
Use:
fpu_setupcopy();
fpu_copyscr();
or
setupcopy();
copyscr();
or, for the rep movsd version (which does its own setup):
copyscr();
I've tested these out. The FPU version came out fastest on a Pentium 90
Compaq LTE 5100 laptop with a Cirrus SVGA chip in vanilla mode 0x13
(320x200x256); your mileage may vary. Next fastest is the 4-dwords-at-a-time
version, and then the rep movsd version (slow).
The FPU version was based on Intel's version. I don't believe it's optimal
yet, but it is faster than my last one. I need to spend some more time
with it, as I only just wrote it. abuf is a linear buffer somewhere; I use
a 64000-byte buffer. Everything is fixed for that size currently; perhaps
I'll make it cleaner.
Later on I'll post the bench results from a sample session for each
of these.
Remember to align abuf to a decent block; I usually use a paragraph (16 bytes).
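One way to get a paragraph-aligned abuf, sketched in portable C. This is an assumption about how the buffer might be set up, not how the original header does it: raw_buf, align_up and get_abuf are hypothetical names, and uintptr_t needs a C99 compiler (newer than Watcom 10.6).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCREEN_BYTES 64000u   /* 320 x 200, one byte per pixel */
#define PARAGRAPH    16u      /* one paragraph = 16 bytes */

/* Over-allocate by one paragraph so an aligned start always fits. */
static unsigned char raw_buf[SCREEN_BYTES + PARAGRAPH];

/* Round p up to the next multiple of align (align must be a power of two). */
static unsigned char *align_up(unsigned char *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t aligned;

    assert(align != 0 && (align & (align - 1)) == 0);
    aligned = (addr + (align - 1)) & ~(uintptr_t)(align - 1);
    return p + (aligned - addr);
}

/* Hypothetical helper: the paragraph-aligned 64000-byte back buffer. */
static unsigned char *get_abuf(void)
{
    return align_up(raw_buf, PARAGRAPH);
}
```

The extra 16 bytes guarantee that the aligned pointer plus 64000 bytes still lies inside raw_buf, whatever address the compiler picks for it.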
Comments, bug fixes and improvements are always welcome at cwallace@dreamworks.com
I'll apologise now for the crap HTML.
void setupcopy(void);
void copyscr(void);
void fpu_copyscr(void);
void fpu_setupcopy(void);
unsigned long rdtsc(void);
/* just a simple rdtsc call; only the low 32 bits (eax) of the counter are returned. */
/* use: unsigned long counter = rdtsc(); */
#pragma aux rdtsc = \
".586P" \
"rdtsc";
/* this is the FPU 4 quadword move and 8 dword clear (32 bytes per pass) */
/* unrolling this once or twice will increase the speed */
/* I should think of a better way of clearing; it needs to be benched properly */
#pragma aux fpu_setupcopy = \
".586P" \
"mov ecx,dword ptr abuf" \
"mov eax,0xa0000" \
"add ecx,32" \
"xor edx,edx" ;
/* the cmp at the bottom is the loop end: destination base + length. */
/* change 0xaff00 to your buffer end; an exact 64000-byte copy to 0xa0000 ends at 0xafa00 */
#pragma aux fpu_copyscr = \
".586P" \
"fpu_copy:" \
"add eax,32" \
"fld qword ptr [ecx - 32]" \
"mov [ecx-28],edx" \
"mov [ecx-32],edx" \
"fld qword ptr [ecx - 24]" \
"mov [ecx-20],edx" \
"mov [ecx-24],edx" \
"fld qword ptr [ecx - 16]" \
"mov [ecx-12],edx" \
"mov [ecx-16],edx" \
"fld qword ptr [ecx - 8]" \
"mov [ecx-4],edx" \
"mov [ecx-8],edx" \
"fstp qword ptr [eax - 8]" \
"fstp qword ptr [eax - 16]" \
"fstp qword ptr [eax - 24]" \
"fstp qword ptr [eax - 32]" \
"add ecx,32" \
"cmp eax,0xaff00" \
"jb fpu_copy" ;
/* this is the 4 dword copy and clear, unrolled */
#pragma aux setupcopy = \
".586P"\
"mov esi,abuf" \
"mov edi,0xa0000" \
"mov ecx,2000" \
"xor ebp,ebp";
#pragma aux copyscr = \
".586P" \
"lop1: mov eax,[esi]"\
"mov ebx,[esi+4]"\
"mov edx,[esi+8]"\
"mov [edi],eax"\
"mov eax,[esi+12]"\
"mov dword ptr [esi],ebp"\
"mov dword ptr [esi+4],ebp"\
"mov dword ptr [esi+8],ebp"\
"mov dword ptr [esi+12],ebp"\
"mov [edi+4],ebx"\
"mov [edi+8],edx"\
"mov [edi+12],eax"\
"mov eax,[esi+16]"\
"mov ebx,[esi+4+16]"\
"mov edx,[esi+8+16]"\
"mov [edi+16],eax"\
"mov eax,[esi+12+16]"\
"mov dword ptr [esi+16],ebp"\
"mov dword ptr [esi+4+16],ebp"\
"mov dword ptr [esi+8+16],ebp"\
"mov dword ptr [esi+12+16],ebp"\
"mov [edi+4+16],ebx"\
"mov [edi+8+16],edx"\
"mov [edi+12+16],eax"\
"add esi,32"\
"add edi,32"\
"dec ecx"\
"jnz lop1"\
modify[edi esi eax ebx ecx edx] ; /* edx is written too; ebp (zeroed in setupcopy) is the clear value */
/* this is a simple rep movsd style copy and clear; it is an alternative */
/* copyscr, so use it instead of the version above, not alongside it */
#pragma aux copyscr = \
".586P" \
"mov esi,abuf" \
"mov edi,0xa0000" \
"mov ecx,16000" \
"rep movsd" \
"xor eax,eax"\
"mov edi,abuf"\
"mov ecx,16000"\
"rep stosd" \
modify[edi esi ecx eax] ;
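For readers porting these routines, here is a rough C model of what the unrolled 4 dword copy and clear does on each pass, plus the end-address arithmetic for mode 13h. It is a sketch for checking behaviour, not a replacement for the assembly; copy_clear_unrolled and mode13h_end are illustrative names, and uint32_t assumes a C99 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `bytes` from src to dst in groups of four dwords, zeroing the
   source as it goes. bytes must be a multiple of 16. */
static void copy_clear_unrolled(uint32_t *dst, uint32_t *src, size_t bytes)
{
    size_t groups;

    assert(bytes % 16 == 0);
    for (groups = bytes / 16; groups != 0; groups--) {
        /* load four dwords, clear the source, then store to the target */
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        src[0] = 0; src[1] = 0; src[2] = 0; src[3] = 0;
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        src += 4;
        dst += 4;
    }
}

/* Destination end address for an exact 64000-byte copy to 0xa0000. */
static unsigned long mode13h_end(void)
{
    return 0xA0000ul + 64000ul;
}
```

The loop counts in the listings agree with this: 2000 passes of 32 bytes and 16000 dwords of 4 bytes both come to 64000 bytes, and mode13h_end() evaluates to 0xafa00.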
James Shaw sent me this one; I haven't profiled it yet. Thanks, Jim.
It needs more work to get it into a #pragma: the fild/fistp notation is
incorrect for that, it should be fild qword [esi] etc.
copy_screen:
push esi edi ecx
mov esi,[source_screen] ;work screen address
mov edi,0a0000h ;visible screen address
mov ecx,64000
@@1:
mov eax,4
@@2:
fild [QWORD esi]
fild [QWORD esi+020h]
fild [QWORD esi+040h]
fild [QWORD esi+060h]
fild [QWORD esi+080h]
fild [QWORD esi+0a0h]
fild [QWORD esi+0c0h]
fild [QWORD esi+0e0h]
fxch
fistp [QWORD edi+0c0h]
fistp [QWORD edi+0e0h]
fistp [QWORD edi+0a0h]
fistp [QWORD edi+080h]
fistp [QWORD edi+060h]
fistp [QWORD edi+040h]
fistp [QWORD edi+020h]
fistp [QWORD edi]
add esi,8
add edi,8
dec eax
jne @@2
add esi,224
add edi,224
sub ecx,256
jne @@1
pop ecx edi esi
ret
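The addressing in this routine is interleaved (eight qwords at a 32-byte stride, four passes, then a 224-byte skip), so it is worth checking that it touches every byte of the screen exactly once. The sketch below replays the offsets in C and counts the hits; hits and shaw_pattern_covers_screen are names invented for this check, and it models only the addressing, not the FPU moves.

```c
#include <assert.h>
#include <string.h>

#define SCREEN_BYTES 64000u

/* Number of times each byte of the screen is copied. */
static unsigned char hits[SCREEN_BYTES];

/* Returns 1 if the routine's addressing writes every byte exactly once. */
static int shaw_pattern_covers_screen(void)
{
    unsigned long base = 0;            /* models esi/edi as an offset */
    unsigned long remaining = 64000;   /* models ecx */
    unsigned long i;

    memset(hits, 0, sizeof hits);
    while (remaining != 0) {
        unsigned pass;
        for (pass = 0; pass < 4; pass++) {        /* mov eax,4 .. dec eax */
            unsigned k;
            for (k = 0; k < 8; k++) {             /* offsets 0, 20h .. 0e0h */
                unsigned long off = base + 0x20ul * k;
                unsigned b;
                for (b = 0; b < 8; b++) {         /* one qword = 8 bytes */
                    assert(off + b < SCREEN_BYTES);
                    hits[off + b]++;
                }
            }
            base += 8;                            /* add esi,8 / add edi,8 */
        }
        base += 224;                              /* add esi,224 / add edi,224 */
        remaining -= 256;                         /* sub ecx,256 */
    }
    for (i = 0; i < SCREEN_BYTES; i++)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

Each outer pass covers one 256-byte block (4 x 8 x 8 bytes), and 64000 / 256 = 250 blocks fill the screen with no gaps or overlaps.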
History:
1.0 Created HTML version, includes change suggested by Fabio Bizzetti
(hope I spelt it correctly)
2.0 Added copy sent by James Shaw, jim@curved-logic.com
3.0 Started adding some of the timing info, added pentprof info
Charlie Wallace - November 14 1996