Sunday, May 18, 2014

[HLSL] Turning float4's into a float4x4

In one of my vertex shaders I need to turn four float4's into a float4x4. Specifically, I'm building a world matrix. (For those who are curious, it's instance data. The design I'm using is very similar to the one put forth by DICE on slide 29.)

If the float4's are rows, then building a float4x4 is really easy:

float4x4 CreateMatrixFromRows(float4 r0, float4 r1, float4 r2, float4 r3) {
    return float4x4(r0, r1, r2, r3);
}

float4x4 has a constructor that takes vectors as arguments. However, as you can see above, it assumes these are rows. I find this a bit odd since HLSL, by default, packs matrices as column-major. So I suspected that, behind the scenes, it would have to do lots of swizzle copy-swaps.

If the float4's are columns, building the float4x4 becomes a bit more icky for our viewing, since we have to manually pick off each element and send it to the full float4x4 constructor. However, I suspect behind the scenes the compiler will know better.

float4x4 CreateMatrixFromCols(float4 c0, float4 c1, float4 c2, float4 c3) {
    return float4x4(c0.x, c1.x, c2.x, c3.x,
                    c0.y, c1.y, c2.y, c3.y,
                    c0.z, c1.z, c2.z, c3.z,
                    c0.w, c1.w, c2.w, c3.w);
}
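
Another way to write the column version, as a sketch that has not been disassembled here: hand the columns to the row constructor and transpose the result. transpose is a standard HLSL intrinsic; the function name CreateMatrixFromColsTransposed is made up for this example. It should return the same matrix as CreateMatrixFromCols, but whether fxc generates the same asm for it is an assumption.

```hlsl
// Hypothetical alternative to CreateMatrixFromCols (the name is made up for this sketch).
// The vector constructor treats c0..c3 as rows; transpose() then turns them into columns.
float4x4 CreateMatrixFromColsTransposed(float4 c0, float4 c1, float4 c2, float4 c3) {
    return transpose(float4x4(c0, c1, c2, c3));
}
```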


I wasn't happy with just guessing what the compiler would and wouldn't do, so I created a simple HLSL vertex shader to see how many OPs each function produced. hlsl_util.hlsli contains the two functions defined above. (Yes, I know the position isn't being transformed to clip space. It's just a trivial shader.)

#include "hlsl_util.hlsli"

cbuffer cbPerObject : register(b1) {
    uint gStartVector;
    uint gNumVectorsPerInstance;
};

StructuredBuffer<float4> gInstanceBuffer : register(t0);


float4 main(float3 pos : POSITION, uint instanceId : SV_INSTANCEID) : SV_POSITION {
    uint worldMatrixOffset = instanceId * gNumVectorsPerInstance + gStartVector;

    float4 c0 = gInstanceBuffer[worldMatrixOffset];
    float4 c1 = gInstanceBuffer[worldMatrixOffset + 1];
    float4 c2 = gInstanceBuffer[worldMatrixOffset + 2];
    float4 c3 = gInstanceBuffer[worldMatrixOffset + 3];

    float4x4 instanceWorldCol = CreateMatrixFromCols(c0, c1, c2, c3);
    //float4x4 instanceWorldRow = CreateMatrixFromRows(c0, c1, c2, c3);

    return mul(float4(pos, 1.0f), instanceWorldCol);    
}

I compiled the shader as normal and then used the following command to disassemble the compiled byte code:

fxc.exe /dumpbin /Fc <outputfile.txt> <compiledshader.cso> 


HUGE DISCLAIMER: This is the intermediate asm that fxc creates. The final number/form of OPs will depend on the final compile done by the graphics driver. However, I feel the intermediate asm will generally be close-ish to what is finally produced, and therefore, can be used as a rough gauge.


Here is the asm code for creating the matrix from columns. I'll include the register signature for this one. The other asm code samples use the same register signature.

// Resource Bindings:
//
// Name                                 Type  Format         Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// gInstanceBuffer                   texture  struct         r/o    0        1
// cbPerObject                       cbuffer      NA          NA    1        1
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// POSITION                 0   xyz         0     NONE   float   xyz 
// SV_INSTANCEID            0   x           1   INSTID    uint   x   
//
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// SV_POSITION              0   xyzw        0      POS   float   xyzw
//

imad r0.x, v1.x, cb1[8].y, cb1[8].x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
iadd r0.xyz, r0.xxxx, l(1, 2, 3, 0)
mov r2.xyz, v0.xyzx
mov r2.w, l(1.000000)
dp4 o0.x, r2.xyzw, r1.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
dp4 o0.y, r2.xyzw, r1.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.y, l(0), t0.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xyzw, r0.z, l(0), t0.xyzw
dp4 o0.w, r2.xyzw, r0.xyzw
dp4 o0.z, r2.xyzw, r1.xyzw
ret 
// Approximately 13 instruction slots used

Creating the matrix from columns was actually very clean. The compiler knew what we wanted and got rid of all the swizzles completely. It just loads each column directly and does a dot product with it to get each component of the final position.

Here is the asm for creating the matrix from rows:

imad r0.x, v1.x, cb1[8].y, cb1[8].x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
iadd r0.xyz, r0.xxxx, l(1, 2, 3, 0)
mov r2.x, r1.x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r3.xyzw, r0.x, l(0), t0.xzyw
mov r2.y, r3.x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r4.xyzw, r0.y, l(0), t0.xywz
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xyzw, r0.z, l(0), t0.xyzw
mov r2.z, r4.x
mov r2.w, r0.x
mov r5.xyz, v0.xyzx
mov r5.w, l(1.000000)
dp4 o0.x, r5.xyzw, r2.xyzw
mov r2.y, r3.z
mov r2.z, r4.y
mov r2.w, r0.y
mov r2.x, r1.y
dp4 o0.y, r5.xyzw, r2.xyzw
mov r4.y, r3.w
mov r3.z, r4.w
mov r3.w, r0.z
mov r4.w, r0.w
mov r3.x, r1.z
mov r4.x, r1.w
dp4 o0.w, r5.xyzw, r4.xyzw
dp4 o0.z, r5.xyzw, r3.xyzw
ret 
// Approximately 27 instruction slots used

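Wow! Look at all those mov OPs. So even though the HLSL constructor expects rows, giving it rows leads to a huge number of mov's. mul(vector, matrix) takes a dot product of the vector with each column of the matrix, so when the registers hold rows, the compiler has to gather each column one component at a time.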

I also tried manually specifying the swizzles to see if that would help:

float4x4 CreateMatrixFromRows(float4 r0, float4 r1, float4 r2, float4 r3) {
    return float4x4(r0.x, r0.y, r0.z, r0.w,
                    r1.x, r1.y, r1.z, r1.w,
                    r2.x, r2.y, r2.z, r2.w,
                    r3.x, r3.y, r3.z, r3.w);
}

However, the asm generated was identical to the constructor with 4 row vectors.
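
One more option, as an untested sketch: mul(v, M) is mathematically the same as mul(transpose(M), v), so the four columns can be handed to the row constructor as-is and the multiplication order flipped. mul(matrix, vector) takes a dot product of each row with the vector, so in principle nothing has to be gathered. TransformByColumns is a hypothetical helper name, and the asm it produces has not been checked here.

```hlsl
// Hypothetical helper: c0..c3 are the columns of the world matrix W.
// float4x4(c0, c1, c2, c3) builds transpose(W), because the constructor takes rows.
// mul(transpose(W), v) gives the same result as mul(v, W).
float4 TransformByColumns(float3 pos, float4 c0, float4 c1, float4 c2, float4 c3) {
    float4x4 worldTransposed = float4x4(c0, c1, c2, c3);
    return mul(worldTransposed, float4(pos, 1.0f));
}
```

In the test shader above, this would replace the CreateMatrixFromCols call and the final mul.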

So I guess the lesson learned here today is to always try to construct matrices in the way that they are going to be used. I want to mention that you can tell the compiler to use row-major matrices, but column-major matrices are generally favored because they can simplify some matrix math.
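
For reference, a sketch of the ways to ask for row-major packing: the row_major keyword on a single declaration, or #pragma pack_matrix for everything declared after it (fxc's /Zpr flag does the same for the whole compile). The cbuffer and variable names below are made up. Note that this only affects how matrices are laid out in memory such as constant buffers. The float4x4 vector constructor still takes rows either way.

```hlsl
// Option 1: per-declaration keyword (cbExample and gWorld are hypothetical names).
cbuffer cbExample : register(b2) {
    row_major float4x4 gWorld;
};

// Option 2: change the default packing for every matrix declared after this line.
#pragma pack_matrix(row_major)
```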

My question to you all is: Are there better ways to do either of these two operations? As always, feel free to comment, ask questions, correct any errors I may have made, etc.

Happy coding
-RichieSams

2 comments:

  1. Interesting post. I do not know of a better way to create 4x4s in HLSL. You probably know already it is possible to make 4x4s on the host side and pass them down to the device. I found your blog when looking for confirmation that the m[i][j] operations select the 'i' row and 'j' column.

  2. Yea, for non-instanced data, I just pass in the matrix from CPU. However, for instanced data, the matrix data is passed in as 4 float4s in a StructuredBuffer, so I have to create the matrix on the fly.
