destA = ((srcA >> rightShift) << leftShift) & mask;
That is, we shift component to the "right" position within the (d)word and mask it. There are maaaany combinations of shifts/masks, depending on source/destination format. Possible methods of conversion are:
Now here's yet another idea -- runtime-compiled blitters. It combines the best features of first two approaches and minimalizes their flaws.
In the practice it's better to combine stages 2&3 (unpack/pack) together.
; __cdecl assumed! Arguments passed via stack! _ASM_StdBlit_Prologue: push ebp mov ebp, esp pushad mov edi, [ebp + 8] ; dst mov esi, [ebp + 12] ; src mov ecx, [ebp + 16] ; # of pixels _ASM_StdBlit_Prologue_End: _ASM_StdBlit_Load32: loop32: mov eax, [esi] _ASM_StdBlit_Load32_End: ASM_StdBlit_UnpackPackA: mov ebx, eax ; EBX = ARGB oshr: shr eax, FORCE8 mov edx, ebx ; EDX = ARGB oandr: and eax, FORCE32 ; EAX = R oshg: shr ebx, FORCE8 ; mov ebp, edx ; EBP = ARGB oshb: shr edx, FORCE8 ; oandg: and ebx, FORCE32 ; EBX = G osha: shr ebp, FORCE8 oandb: and edx, FORCE32 ; EDX = B or eax, ebx ; EAX = RG oanda: and ebp, FORCE32 ; EBP = A or eax, edx ; EAX = RGB or eax, ebp ; EAX = ARGB _ASM_StdBlit_UnpackPackA_End: _ASM_StdBlit_Store32: mov [edi], eax _ASM_StdBlit_Store32_End: _ASM_StdBlit_IncPtrs: oincsrc: add esi, FORCE8 oincdst: add edi, FORCE8 _ASM_StdBlit_IncPtrs_End: _ASM_StdBlit_EndLoop: dec ecx oloop: jnz NEAR loop32 _ASM_StdBlit_EndLoop_End: _ASM_StdBlit_Epilogue: popad pop ebp ret _ASM_StdBlit_Epilogue_End:
This are stages for converting multiple pixels, so we additionally have increasing src/dst pointers, looping and function prologue/epilogue. Now, as you can see all the shifts, masks, pixel-sizes are "hardcoded" to FORCE8/FORCE16/FORCE32. These are conversion "attributes", which should be modified during blitter compilation. To make this possible we need to export information about them. Below for example you can find configuration table for all attributes of UnpackPackA (A = with alpha component, I use special code paths for pixelformats without alpha, it's faster) stage:
_ASM_StdBlit_UnpackPackA_Attr: dd SHIFT_B_OPCODE, oshb + 1, 1 dd SHIFT_R_OPCODE, oshr + 1, 1 dd SHIFT_G_OPCODE, oshg + 1, 1 dd SHIFT_A_OPCODE, osha + 1, 1 dd SHIFT_R_VALUE, oshr + 2, 1 dd SHIFT_G_VALUE, oshg + 2, 1 dd SHIFT_B_VALUE, oshb + 2, 1 dd SHIFT_A_VALUE, osha + 2, 1 dd AND_R_VALUE, oandr + 1, 4 dd AND_G_VALUE, oandg + 2, 4 dd AND_B_VALUE, oandb + 2, 4 dd AND_A_VALUE, oanda + 2, 4 dd NONE, 0, 0 ; Loop is quite interesting, so I put it here also. _ASM_StdBlit_EndLoop_Attr: dd GO_JUMP0, oloop + 2, 4 dd NONE, 0, 0
NONE marks end of attribute table. Getting valid offsets (second entry) and sizes (third) may be a little tricky, especially for jumps/loops, good CPU/assembler manual helps tremendously here. I also found Hacker's View program very valuable. I just dumped generated code to binary file and examined under HV. This way I could check if all the attributtes fall into right positions, if the loop addresses are OK and so on. For generic blitter optimized for 80x86 architecture we have the following attributes:
struct Attr { enum { // Max number of jumps in one code fragment MAX_JUMPS = 4, }; enum Code { NONE, // NOTE: There are MAX_JUMPS possible jump 'slots'. SET_JUMP0, // Set address of jump #0 // Go to address of jump #0 GO_JUMP0 = SET_JUMP0 + MAX_JUMPS, SHIFT_R_OPCODE = GO_JUMP0 + MAX_JUMPS, SHIFT_G_OPCODE, SHIFT_B_OPCODE, SHIFT_A_OPCODE, SHIFT_R_VALUE, SHIFT_G_VALUE, SHIFT_B_VALUE, SHIFT_A_VALUE, AND_R_VALUE, AND_G_VALUE, AND_B_VALUE, AND_A_VALUE, SRCPIX_SIZE, DSTPIX_SIZE, ATTRIBUTE_MAX }; mu32 code; mu32 offs; mu32 bytes; };
struct Stage { const mu8* pStart; const mu8* pEnd; const Attr* pAttrs; // Copies code to given address (pCompiled, copying is only done // when this is not NULL). pJumps is an array of at least // Attr::MAX_JUMPS items -- jump adresses are stored there. mu32 Copy(mu8*& pCompiled, const Compiler::Config& cfg, mu32* pJumps); };
I also use some macros to make declarations easier:
#define STAGE_COMMON(X, ATTR) \ extern "C" mu8 ASM_StdBlit_##X; \ extern "C" mu8 ASM_StdBlit_##X##_End; \ static struct Stage s_##X##_stage = \ { \ &ASM_StdBlit_##X, \ &ASM_StdBlit_##X##_End, \ ATTR \ } #define STAGE(X) STAGE_COMMON(X, 0) //----------------------------------------------------------------------------- // STAGE_ATTR - declaration for stage containing attributes //----------------------------------------------------------------------------- #define STAGE_ATTR(X) \ extern "C" Attr ASM_StdBlit_##X##_Attr; \ STAGE_COMMON(X, &ASM_StdBlit_##X##_Attr)
Included declarations are:
STAGE(Prologue); STAGE_ATTR(Load32); #if (RT_LITTLE_ENDIAN) STAGE_ATTR(Load24Little); #else STAGE_ATTR(Load24Big); #endif STAGE_ATTR(Load16); STAGE_ATTR(UnpackDepack); STAGE_ATTR(UnpackDepackA); // Version for alpha STAGE(Store32); #if (RT_LITTLE_ENDIAN) STAGE(Store24Little); #else STAGE(Store24Big); #endif STAGE(Store16); STAGE_ATTR(IncPtrs); STAGE_ATTR(EndLoop); STAGE_ATTR(Copy16); STAGE(Copy32); STAGE(Epilogue); #undef STAGE_COMMON #undef STAGE #undef STAGE_ATTR
Compilation process takes 2 loops (this idea was found in Qube's Software QDraw). During the first we only calculate the size of conversion routine, during the second code is finally generated and put into buffer (of size calculated during first pass, possibly with some alignment). Below you can find main compiling function.
//----------------------------------------------------------------------------- // mu32 Stage::Copy(mu8*& pCompiled, const Compiler::Config& cfg, // mu32* pJumps) // // Desc: Copy code to given address (pCompiled, copying is only done // when this is not NULL). pJumps is an array of at least // Attr::MAX_JUMPS items -- jump adresses are stored there. // It will also update pCompiled address, so that it points // at the byte just past the end of copied code. // // Out: Size of this stage (code) in bytes //----------------------------------------------------------------------------- mu32 Stage::Copy(mu8*& pCompiled, const Compiler::Config& cfg, mu32* pJumps) { // Right/left shift? Returns positive one. #define GET_SHIFT(C) \ (info.leftShifts[C] > 0 ? info.leftShifts[C] : info.rightShifts[C]) // Generate opcode (SHL or SHR) // // Note: Please make sure opcodes are OK (they vary together with // registers, so beware!) // Each macro is in the following form: // OPCODE_FOO where FOO is name of register that's shifted // left or right. // For example 0xE0 is the same as SHL EAX and 0xE8 - SHR EAX. // Basically you add 0x8 to get SHR from SHL. #define OPCODE_SH_EAX (info.leftShifts[PixConvInfo::R] > 0 ? 0xE0 : 0xE8) #define OPCODE_SH_EBX (info.leftShifts[PixConvInfo::G] > 0 ? 0xE3 : 0xEB) #define OPCODE_SH_EDX (info.leftShifts[PixConvInfo::B] > 0 ? 0xE2 : 0xEA) #define OPCODE_SH_EBP (info.leftShifts[PixConvInfo::A] > 0 ? 0xE5 : 0xED) RT_ASSERT(pStart != 0); RT_ASSERT(pEnd != 0); RT_ASSERT(pJumps != 0); RT_ASSERT(pEnd - pStart > 0); const PixConvInfo& info = cfg.info; const ms32 codeSize = pEnd - pStart; RT_ASSERT(codeSize >= 0); if (codeSize == 0) return 0; // If pCompiled == 0 then it's only the first loop -- we only need to // calculate code size, return it. if (pCompiled == 0) return codeSize; // Copy opcodes rtMemCpy(pCompiled, pStart, codeSize); // No attributes -- return if (pAttrs == 0) { pCompiled += codeSize; return codeSize; } // Set attributes if needed... for (const Attr* pAttr = pAttrs; pAttr->code != Attr::NONE; ++pAttr) { mu32 val = 0; mu32 jumpIdx = 0; // Relative offset -- offset from the start of the code. const mu32 relOffs = pAttr->offs - (mu32)pStart; switch (pAttr->code) { // Set jump address case Attr::SET_JUMP0: case Attr::SET_JUMP0 + 1: case Attr::SET_JUMP0 + 2: case Attr::SET_JUMP0 + 3: jumpIdx = pAttr->code - Attr::SET_JUMP0; pJumps[jumpIdx] = (mu32)pCompiled + relOffs; break; // Execute jump case Attr::GO_JUMP0: case Attr::GO_JUMP0 + 1: case Attr::GO_JUMP0 + 2: case Attr::GO_JUMP0 + 3: jumpIdx = pAttr->code - Attr::GO_JUMP0; // +4 -> opcode val = pJumps[jumpIdx] - ((mu32)pCompiled + relOffs + 4); break; case Attr::SHIFT_R_OPCODE: val = OPCODE_SH_EAX; break; case Attr::SHIFT_G_OPCODE: val = OPCODE_SH_EBX; break; case Attr::SHIFT_B_OPCODE: val = OPCODE_SH_EDX; break; case Attr::SHIFT_A_OPCODE: val = OPCODE_SH_EBP; break; case Attr::SHIFT_R_VALUE: val = GET_SHIFT(PixConvInfo::R); break; case Attr::SHIFT_G_VALUE: val = GET_SHIFT(PixConvInfo::G); break; case Attr::SHIFT_B_VALUE: val = GET_SHIFT(PixConvInfo::B); break; case Attr::SHIFT_A_VALUE: val = GET_SHIFT(PixConvInfo::A); break; case Attr::AND_R_VALUE: val = info.masks[PixConvInfo::R]; break; case Attr::AND_G_VALUE: val = info.masks[PixConvInfo::G]; break; case Attr::AND_B_VALUE: val = info.masks[PixConvInfo::B]; break; case Attr::AND_A_VALUE: val = info.masks[PixConvInfo::A]; break; case Attr::SRCPIX_SIZE: val = cfg.srcFmt.GetByteDepth(); break; case Attr::DSTPIX_SIZE: val = cfg.dstFmt.GetByteDepth(); break; default: RT_ASSERT(false); break; } // Now put the value in the code if (pAttr->bytes > 0) { RT_ASSERT(pAttr->bytes <= 4); const mu8* pVal = RT_REINTERPRET_CAST(mu8*)(&val); for (mu32 i = 0; i < pAttr->bytes; ++i) pCompiled[relOffs + i] = *pVal++; } } pCompiled += codeSize; return codeSize; #undef OPCODE_SH_EDX #undef OPCODE_SH_EBX #undef OPCODE_SH_EAX #undef GET_SHIFT }
One tricky part here is performing the jump. We get the absolute jump address (stored as SET_JUMPx attribute) and put it just after the 'jnz' instruction.
Another good idea is to add some compiler interface. All the converting routines need to get is pointer to blitter compiler. This allows for specialized implementations, like compilers optimizing for MMX/SSE and so on. Really cool. In my engine I have compiler interface + generic 80x86 implementation only, but adding MMX blitters wouldn't even require recompilation!
Please also remember that runtime-compiled code doesn't have to be limited to blitters (in fact I'd first try HW only solution in this case, nowadays :). It's just an example, you can most probably think of many other applications for this technique.
MMX blitter is blitter optimized for MMX (NOT runtime-compiled, hardcoded only for the most common formats) taken from TinyPTC by Glenn "Gaffer" Fiedler. Now, you may think this example is a little rusty. And you're right. Here's something newer -- my zoomer application (sample from RDE), running in 640x480xARGB888 on Celeron 1.2+GeForce2MX. Compiled blitter - 20fps, HW screen quad - 15fps. Much more interesting, right?
Addition -- 26-12-2002: DirectX9 has been released just a few days ago with new StretchRect method of the device interface. I didn't make any tests yet, but it sounds like it may be the ultimate solution of our problem.
Maciej Sinilo,
maciejsinilo@o2.pl