Fast Copy and Clear loop for Mode 13H 320x200 256 color.

by Charlie Wallace

I've started to add timing information for various memory copies and clears. All routines were tested using two 64 KB buffers on a Pentium 90. I hope to have at least two sets of figures for each routine: main RAM to main RAM, and main RAM to video RAM. I still need to check the source below, as I think there may be a couple of typos. You'll also notice I mostly don't use modify and parm in the #pragma aux snippets; feel free to add them if you want. They aren't necessary for the way I use them.

Type of Copy and Clear                                  Main2Main    Mem2Vid      Other Info
Watcom C++ 10.6 using memcpy and memset, /otexan /5r    Click here   Click here   Profiler Information
My original copy and clear routine (CPU)                             Click here
FPU copy and CPU clear, version 1
FPU copy and CPU clear, version 2
FPU copy and CPU clear, version 3

** These figures aren't firm yet. I think the Main2Main Watcom figure may cover only the memcpy; interestingly, Watcom doesn't seem to have an inline version of memset.
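For anyone reproducing figures like these, here is a rough C sketch of the arithmetic around two counter readings. It assumes each reading is the low 32 bits of the time stamp counter, which is what the rdtsc wrapper in the listing further down hands back as an unsigned long. The names elapsed_cycles and cycles_per_byte are made up for illustration and are not part of the original header.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two 32 bit counter readings.
   Unsigned subtraction works modulo 2^32, so the result is still
   right if the counter wrapped once between the two readings. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Average cost per byte moved, handy for comparing routines
   that were timed over the same buffer size. */
static double cycles_per_byte(uint32_t cycles, uint32_t bytes)
{
    assert(bytes != 0);
    return (double)cycles / (double)bytes;
}
```

At 90 MHz a 32 bit cycle count wraps roughly every 47.7 seconds, so the subtraction is only meaningful for runs shorter than that.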

I've pulled this out of a .h file and translated it to HTML via Netscape Gold. It's not pretty, but everything should be OK; check it yourself if you're not sure.

These are in Watcom #pragma aux format, which is easily translatable to other implementations. For MSVC it's:

_asm {
    mov ecx,dword ptr abuf
    ..etc
}

I believe Borland C is similar. Obviously, for .asm just remove the overhead. Watch out for register overwriting: I prefer to control my own registers and not use modify exact [] in the aux pragma, so that I can chain #pragmas together without extra overhead.

Use, for the FPU version:

fpu_setupcopy();
fpu_copyscr();

for the 4 dword unrolled version:

setupcopy();
copyscr();

or, for the rep movsd version:

copyscr();

I've tested these out. The FPU version came out fastest on a Pentium 90 Compaq LTE 5100 laptop with a Cirrus SVGA chip in vanilla mode 0x13 (320x200x256); your mileage may vary. Next fastest is the 4-at-a-time version, and the rep movsd version comes last (slow).

The FPU version was based on Intel's version. I don't believe it's optimal yet, but it is faster than my last one; I need to spend some more time with it, as I only just wrote it. abuf is a linear buffer somewhere; I use a 64000 byte buffer. Everything is fixed for that size currently; perhaps I'll make it cleaner.
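As a baseline for checking any of the faster routines, here is a plain C sketch of what "copy and clear" means: copy the work buffer to the destination, then zero the work buffer for the next frame. It is not one of the routines from this page, and copy_and_clear and SCREEN_BYTES are names invented for the example. It takes an ordinary destination pointer instead of hard-wiring 0xa0000, so it can be run against a second RAM buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCREEN_BYTES 64000u   /* 320 * 200 pixels, one byte each */

/* Copy len bytes from the work buffer to dst, then zero the work
   buffer. Same end result as the tuned routines, with no tuning. */
static void copy_and_clear(unsigned char *dst, unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);   /* show the finished frame */
    memset(src, 0, len);     /* start the next frame from black */
}
```

Running a tuned routine and this one on identical input and comparing the two destinations is a cheap way to catch an off-by-32 in a loop bound.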

Later on I'll post the bench results from a sample session for each of these.

Remember to align abuf to a decent boundary; I usually use a paragraph (16 bytes).
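One way to get a paragraph-aligned buffer from a plain malloc is to over-allocate and round the pointer up. The sketch below assumes a compiler with stdint.h (so not Watcom 10.6 as shipped); align_up, alloc_paragraph_aligned and PARAGRAPH are hypothetical names, not something from the original header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PARAGRAPH 16u   /* one paragraph = 16 bytes */

/* Round p up to the next multiple of boundary (a power of two). */
static unsigned char *align_up(unsigned char *p, uintptr_t boundary)
{
    uintptr_t addr = (uintptr_t)p;
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (unsigned char *)((addr + boundary - 1) & ~(boundary - 1));
}

/* Allocate len usable bytes starting on a paragraph boundary.
   *raw receives the pointer that must later be passed to free(). */
static unsigned char *alloc_paragraph_aligned(size_t len, void **raw)
{
    *raw = malloc(len + PARAGRAPH - 1);   /* slack for rounding up */
    if (*raw == NULL)
        return NULL;
    return align_up((unsigned char *)*raw, PARAGRAPH);
}
```

Keep the raw pointer around: the aligned pointer is what you copy from, but only the raw one may be freed.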

Comments/Bug fixes/Improvements are always welcome at cwallace@dreamworks.com

I'll apologise now for the crap HTML.

void setupcopy(void);
void copyscr(void);
void fpu_copyscr(void);
void fpu_setupcopy(void);

unsigned long rdtsc(void);

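/* just a simple rdtsc call; use unsigned long counter = rdtsc(); */
/* rdtsc leaves the 64 bit count in edx:eax; only the low 32 bits (eax) are returned here */
#pragma aux rdtsc = \
".586P" \
"rdtsc" \
value [eax] modify [edx];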

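/* this is the fpu 4 quadword move and 8 dword clear (32 bytes per pass) */
/* fpu_setupcopy must run first: it leaves ecx = abuf+32, eax = 0xa0000 and edx = 0 for fpu_copyscr */
/* unrolling this once or twice will increase the speed */
/* i should think of a better way of clearing, needs to be benched properly */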

#pragma aux fpu_setupcopy = \
".586P" \
"mov ecx,dword ptr abuf" \
"mov eax,0xa0000" \
"add ecx,32" \
"xor edx,edx" ;

#pragma aux fpu_copyscr = \
".586P" \
"fpu_copy:" \
"add eax,32" \
"fld qword ptr [ecx - 32]" \
"mov [ecx-28],edx" \
"mov [ecx-32],edx" \
"fld qword ptr [ecx - 24]" \
"mov [ecx-20],edx" \
"mov [ecx-24],edx" \
"fld qword ptr [ecx - 16]" \
"mov [ecx-12],edx" \
"mov [ecx-16],edx" \
"fld qword ptr [ecx - 8]" \
"mov [ecx-4],edx" \
"mov [ecx-8],edx" \
"fstp qword ptr [eax - 8]" \
"fstp qword ptr [eax - 16]" \
"fstp qword ptr [eax - 24]" \
"fstp qword ptr [eax - 32]" \
"add ecx,32" \
"cmp eax,0xaff00" /* change this to your buffer end: destination base + length. 0xaff00 probably isn't right for you; a 64000 byte screen at 0xa0000 ends at 0xafa00 */ \
"jb fpu_copy" ;

/* this is the 4 dword copy and clear, unrolled */

#pragma aux setupcopy = \
".586P"\
"mov esi,abuf" \
"mov edi,0xa0000" \
"mov ecx,2000" \
"xor ebp,ebp";

#pragma aux copyscr = \
".586P" \
"lop1: mov eax,[esi]"\
"mov ebx,[esi+4]"\
"mov edx,[esi+8]"\
"mov [edi],eax"\
"mov eax,[esi+12]"\
"mov dword ptr [esi],ebp"\
"mov dword ptr [esi+4],ebp"\
"mov dword ptr [esi+8],ebp"\
"mov dword ptr [esi+12],ebp"\
"mov [edi+4],ebx"\
"mov [edi+8],edx"\
"mov [edi+12],eax"\
"mov eax,[esi+16]"\
"mov ebx,[esi+4+16]"\
"mov edx,[esi+8+16]"\
"mov [edi+16],eax"\
"mov eax,[esi+12+16]"\
"mov dword ptr [esi+16],ebp"\
"mov dword ptr [esi+4+16],ebp"\
"mov dword ptr [esi+8+16],ebp"\
"mov dword ptr [esi+12+16],ebp"\
"mov [edi+4+16],ebx"\
"mov [edi+8+16],edx"\
"mov [edi+12+16],eax"\
"add esi,32"\
"add edi,32"\
"dec ecx"\
"jnz lop1"\
modify [eax ebx ecx edx esi edi] ;

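/* this is a simple rep movsd style copy and clear */
/* it is an alternative definition of copyscr: use it instead of the unrolled one above, not alongside it; it needs no setupcopy() call */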

#pragma aux copyscr = \
".586P" \
"mov esi,abuf" \
"mov edi,0xa0000" \
"mov ecx,16000" \
"rep movsd" \
"xor eax,eax"\
"mov edi,abuf"\
"mov ecx,16000"\
"rep stosd" \
modify [eax ecx esi edi] ;
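The constants hard-wired into the listing above all follow from the buffer size, so if your buffer is not 64000 bytes they need recomputing. A small C sketch of that arithmetic follows; loop_end_address and loop_iterations are names made up for this example.

```c
#include <assert.h>

#define VGA_BASE     0xA0000ul   /* start of the mode 13h frame buffer */
#define SCREEN_BYTES 64000ul     /* 320 * 200 */

/* Value to compare the destination pointer against in the FPU loop. */
static unsigned long loop_end_address(unsigned long dest_base, unsigned long len)
{
    return dest_base + len;
}

/* Loop count for a routine that moves bytes_per_pass bytes per pass.
   The routines have no tail handling, so len must divide evenly. */
static unsigned long loop_iterations(unsigned long len, unsigned long bytes_per_pass)
{
    assert(bytes_per_pass != 0 && len % bytes_per_pass == 0);
    return len / bytes_per_pass;
}
```

For a 64000 byte screen this gives an end address of 0xafa00, 2000 passes for the 32 byte unrolled loop and 16000 dwords for rep movsd and rep stosd. The 0xaff00 in the FPU loop corresponds to a length of 0xff00 (65280) bytes.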

James Shaw sent me this one; I haven't profiled it yet. Thanks, Jim.

It needs more work to get it into a #pragma: the fild/fistp notation is incorrect for that, and should be fild qword [esi] etc.

copy_screen:
push esi edi ecx
mov esi,[source_screen] ;work screen address
mov edi,0a0000h ;visible screen address
mov ecx,64000
@@1:
mov eax,4
@@2:
fild [QWORD esi]
fild [QWORD esi+020h]
fild [QWORD esi+040h]
fild [QWORD esi+060h]
fild [QWORD esi+080h]
fild [QWORD esi+0a0h]
fild [QWORD esi+0c0h]
fild [QWORD esi+0e0h]
fxch
fistp [QWORD edi+0c0h]
fistp [QWORD edi+0e0h]
fistp [QWORD edi+0a0h]
fistp [QWORD edi+080h]
fistp [QWORD edi+060h]
fistp [QWORD edi+040h]
fistp [QWORD edi+020h]
fistp [QWORD edi]
add esi,8
add edi,8
dec eax
jne @@2
add esi,224
add edi,224
sub ecx,256
jne @@1
pop ecx edi esi
ret
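The addressing in this routine is interleaved (eight qwords spaced 0x20 apart, stepped four times by 8, then a jump of 224), so it is not obvious at a glance that every byte gets copied. The C sketch below is only a model of the index arithmetic, not a translation of the routine. It counts how often each offset is touched; interleaved_loop_covers_screen and hits are names invented here.

```c
#include <assert.h>
#include <string.h>

#define SCREEN_BYTES 64000

static unsigned char hits[SCREEN_BYTES];   /* touches per byte offset */

/* Returns 1 if the modeled loop touches every byte exactly once. */
static int interleaved_loop_covers_screen(void)
{
    long remaining = SCREEN_BYTES;   /* models ecx */
    long base = 0;                   /* models the offset in esi / edi */
    long i;

    assert(SCREEN_BYTES % 256 == 0); /* sub ecx,256 must land on zero */
    memset(hits, 0, sizeof hits);

    while (remaining != 0) {
        int pass, slot, b;
        for (pass = 0; pass < 4; pass++) {       /* mov eax,4 .. dec eax */
            for (slot = 0; slot < 8; slot++)     /* eight fild/fistp pairs */
                for (b = 0; b < 8; b++)          /* each moves one qword */
                    hits[base + slot * 0x20 + b]++;
            base += 8;                           /* add esi,8 / add edi,8 */
        }
        base += 224;                             /* add esi,224 / add edi,224 */
        remaining -= 256;                        /* sub ecx,256 */
    }

    for (i = 0; i < SCREEN_BYTES; i++)
        if (hits[i] != 1)
            return 0;
    return 1;
}
```

Each outer pass accounts for one 256 byte block, which is why the routine only terminates cleanly when the length is a multiple of 256; 64000 is exactly 250 blocks.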

History:

1.0 Created HTML version, includes change suggested by Fabio Bizzetti (hope I spelt it correctly)
2.0 Added copy sent by James Shaw, jim@curved-logic.com
3.0 Started adding some of the timing info, added pentprof info

Charlie Wallace - November 14 1996