If the float4's are rows, then building a float4x4 is really easy:
float4x4 CreateMatrixFromRows(float4 r0, float4 r1, float4 r2, float4 r3) {
    return float4x4(r0, r1, r2, r3);
}
float4x4 has a constructor that takes vectors as arguments. However, as you can see above, it assumes these are rows. I find this a bit odd, since HLSL packs matrices in column-major order by default. My first guess was that, behind the scenes, the compiler would have to do a lot of swizzling and copying.
If the float4s are columns, building the float4x4 gets a bit more icky to read, since we have to manually pick off each element and pass it to the full 16-scalar float4x4 constructor. However, I suspect that behind the scenes the compiler will know better.
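To make the "rows" point concrete, here is a minimal sketch of how the vector constructor lines up with matrix indexing. The function name RowConstructorDemo is made up for illustration and is not part of the shader in this post; it only uses the standard m[row][col] and _mRC accessors.

```hlsl
// Hypothetical helper, for illustration only.
// Returns 0 if the vector constructor really does treat its arguments as rows.
float RowConstructorDemo(float4 r0, float4 r1, float4 r2, float4 r3)
{
    float4x4 m = float4x4(r0, r1, r2, r3);

    float4 secondRow = m[1];     // m[i] selects row i, so this holds r1
    float  viaIndex  = m[1][2];  // row 1, column 2, i.e. r1.z
    float  viaMember = m._m12;   // same element, zero-based _m<row><col> form

    return (secondRow.z - viaIndex) + (viaIndex - viaMember);
}
```

Indexing and constructor order are logical and do not change with the packing order, which only affects how the matrix is laid out in memory.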
float4x4 CreateMatrixFromCols(float4 c0, float4 c1, float4 c2, float4 c3) {
    return float4x4(c0.x, c1.x, c2.x, c3.x,
                    c0.y, c1.y, c2.y, c3.y,
                    c0.z, c1.z, c2.z, c3.z,
                    c0.w, c1.w, c2.w, c3.w);
}
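Another way to spell the column version, as a sketch: build the matrix as if the vectors were rows and then use the transpose intrinsic. The name CreateMatrixFromColsTransposed is made up here, and I have not disassembled this variant, so treat any claim about the generated ops as a guess.

```hlsl
// Hypothetical alternative to CreateMatrixFromCols.
// float4x4(c0, c1, c2, c3) places c0..c3 in the rows; transpose() then
// turns those rows into columns, giving the same matrix as the
// element-by-element constructor.
float4x4 CreateMatrixFromColsTransposed(float4 c0, float4 c1, float4 c2, float4 c3)
{
    return transpose(float4x4(c0, c1, c2, c3));
}
```

It is shorter to read and harder to get wrong than typing out sixteen swizzles, which is the main reason to consider it.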
I wasn't happy with just guessing what the compiler would and wouldn't do, so I created a simple HLSL vertex shader to see how many OPs each function produced. hlsl_util.hlsli contains the two functions defined above. (Yes, I know the position isn't being transformed to clip space. It's just a trivial shader.)
#include "hlsl_util.hlsli"

cbuffer cbPerObject : register(b1) {
    uint gStartVector;
    uint gNumVectorsPerInstance;
};

StructuredBuffer<float4> gInstanceBuffer : register(t0);

float4 main(float3 pos : POSITION, uint instanceId : SV_INSTANCEID) : SV_POSITION {
    uint worldMatrixOffset = instanceId * gNumVectorsPerInstance + gStartVector;

    float4 c0 = gInstanceBuffer[worldMatrixOffset];
    float4 c1 = gInstanceBuffer[worldMatrixOffset + 1];
    float4 c2 = gInstanceBuffer[worldMatrixOffset + 2];
    float4 c3 = gInstanceBuffer[worldMatrixOffset + 3];

    float4x4 instanceWorldCol = CreateMatrixFromCols(c0, c1, c2, c3);
    //float4x4 instanceWorldRow = CreateMatrixFromRows(c0, c1, c2, c3);

    return mul(float4(pos, 1.0f), instanceWorldCol);
}
I compiled the shader as normal and then used the following command to disassemble the compiled byte code:
fxc.exe /dumpbin /Fc <outputfile.txt> <compiledshader.cso>
HUGE DISCLAIMER: This is the intermediate asm that fxc creates. The final number/form of OPs will depend on the final compile done by the graphics driver. However, I feel the intermediate asm will generally be close-ish to what is finally produced, and therefore, can be used as a rough gauge.
Here is the asm code for creating the matrix from columns. I'll include the register signature for this one. The other asm code samples use the same register signature.
// Resource Bindings:
//
// Name                                 Type  Format         Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// gInstanceBuffer                   texture  struct         r/o    0        1
// cbPerObject                       cbuffer      NA          NA    1        1
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// POSITION                 0   xyz         0     NONE   float   xyz
// SV_INSTANCEID            0   x           1   INSTID    uint   x
//
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// SV_POSITION              0   xyzw        0      POS   float   xyzw
//
imad r0.x, v1.x, cb1[8].y, cb1[8].x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
iadd r0.xyz, r0.xxxx, l(1, 2, 3, 0)
mov r2.xyz, v0.xyzx
mov r2.w, l(1.000000)
dp4 o0.x, r2.xyzw, r1.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
dp4 o0.y, r2.xyzw, r1.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.y, l(0), t0.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xyzw, r0.z, l(0), t0.xyzw
dp4 o0.w, r2.xyzw, r0.xyzw
dp4 o0.z, r2.xyzw, r1.xyzw
ret
// Approximately 13 instruction slots used
Creating the matrix from columns was actually very clean. The compiler knew what we wanted and got rid of all the swizzles completely. Instead, it loads each column straight from the buffer and does one dp4 per output component to get the final position.
Here is the asm for creating the matrix from rows:
imad r0.x, v1.x, cb1[8].y, cb1[8].x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
iadd r0.xyz, r0.xxxx, l(1, 2, 3, 0)
mov r2.x, r1.x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r3.xyzw, r0.x, l(0), t0.xzyw
mov r2.y, r3.x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r4.xyzw, r0.y, l(0), t0.xywz
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xyzw, r0.z, l(0), t0.xyzw
mov r2.z, r4.x
mov r2.w, r0.x
mov r5.xyz, v0.xyzx
mov r5.w, l(1.000000)
dp4 o0.x, r5.xyzw, r2.xyzw
mov r2.y, r3.z
mov r2.z, r4.y
mov r2.w, r0.y
mov r2.x, r1.y
dp4 o0.y, r5.xyzw, r2.xyzw
mov r4.y, r3.w
mov r3.z, r4.w
mov r3.w, r0.z
mov r4.w, r0.w
mov r3.x, r1.z
mov r4.x, r1.w
dp4 o0.w, r5.xyzw, r4.xyzw
dp4 o0.z, r5.xyzw, r3.xyzw
ret
// Approximately 27 instruction slots used
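Wow! Look at all those mov OPs. So even though the HLSL constructor expects rows, giving it rows leads to a huge number of movs. With the row-vector form mul(vector, matrix), each output component is the dot product of the vector with one column of the matrix, so when the loaded registers hold rows, the compiler has to assemble every column one component at a time before it can do the dp4.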
I also tried manually specifying the swizzles to see if that would help:
float4x4 CreateMatrixFromRows(float4 r0, float4 r1, float4 r2, float4 r3) {
    return float4x4(r0.x, r0.y, r0.z, r0.w,
                    r1.x, r1.y, r1.z, r1.w,
                    r2.x, r2.y, r2.z, r2.w,
                    r3.x, r3.y, r3.z, r3.w);
}
However, the asm generated was identical to the constructor with 4 row vectors.
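One thing that might be worth trying when the float4s really are rows is to skip the matrix entirely. For a row vector v, mul(v, M) is the same as v.x * row0 + v.y * row1 + v.z * row2 + v.w * row3, so the product can be written as a sum of scaled rows. The helper name MulVectorByRows is made up for this sketch, and I have not compiled or disassembled it, so whether it actually beats the 27 slots above is an open question.

```hlsl
// Hypothetical helper: computes mul(v, M), where r0..r3 are the rows of M,
// without ever constructing M.
// (v * M)[j] = v.x * r0[j] + v.y * r1[j] + v.z * r2[j] + v.w * r3[j]
float4 MulVectorByRows(float4 v, float4 r0, float4 r1, float4 r2, float4 r3)
{
    return v.x * r0 + v.y * r1 + v.z * r2 + v.w * r3;
}
```

In the test shader, the last line would then become something like return MulVectorByRows(float4(pos, 1.0f), c0, c1, c2, c3); assuming the four loaded vectors hold rows.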
So I guess the lesson learned here today is to always try to construct matrices in the way that they are stored. I want to mention that you can tell the compiler to use row-major matrices, but column-major matrices are generally favored because they can simplify some matrix math.
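For reference, here is a minimal sketch of the two in-source ways to ask for row-major packing. The cbuffer name, register slot, and variable name are placeholders and not part of the shader in this post.

```hlsl
// Option 1: change the default packing for every matrix declared after this line.
#pragma pack_matrix(row_major)

// Option 2: mark a single matrix explicitly (placeholder names and slot).
cbuffer cbExample : register(b2)
{
    row_major float4x4 gExampleWorld;
};
```

fxc also has a /Zpr switch that packs matrices in row-major order for the whole compile. As far as I know, all of these only change how matrix data is laid out in memory. The constructor still takes rows either way.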
My question to you all is: Are there better ways to do either of these two operations? As always, feel free to comment, ask questions, correct any errors I may have made, etc.
Happy coding
-RichieSams
Interesting post. I do not know of a better way to create 4x4s in HLSL. You probably know already it is possible to make 4x4s on the host side and pass them down to the device. I found your blog when looking for confirmation that the m[i][j] operations select the 'i' row and 'j' column.
Yeah, for non-instanced data, I just pass in the matrix from the CPU. However, for instanced data, the matrix data is passed in as 4 float4s in a StructuredBuffer, so I have to create the matrix on the fly.