Khronos Blog

VK_EXT_descriptor_buffer

We’ve just released an extension that I think will completely change how engines approach descriptors going forward.

tl;dr version for the busy graphics programmer: Descriptor sets are now backed by VkBuffer objects where you memcpy in descriptors. Delete VkDescriptorPool and VkDescriptorSet from the API, and have fun!

Rethinking how we see descriptors

To understand where descriptors are headed, we need to understand how we got here.

The old days

What is a descriptor really? In baseline Vulkan, the concept is completely abstracted away. Updating a descriptor is an API call, and we allocate them from an opaque descriptor pool. When descriptors are bound to the command buffer, the driver will somehow ensure that shaders can access them. The driver is free to do whatever it wants with those descriptors. Tooling developers have an easier time since everything is observed on the CPU timeline and everyone is happy … except of course, developers chasing peak performance.

There is a good reason for the opaque model inherited from older APIs. GPUs reference resources in a variety of ways, including (for example) storing descriptors in fixed registers rather than in memory. Making them (and the pools they are allocated from) opaque was necessary to hide these differences.

There is an argument to be made about GPU performance as well. Drivers are able to choose the optimal paths for how to program descriptors. The price of all this abstraction is, of course, flexibility and CPU performance, both of which give developers headaches. We lived through this world with OpenGL and D3D11, and then in the early days of Vulkan. It works, but the model was exhausted a long time ago by more advanced engines. It simply does not scale to a GPU-driven world with millions of objects in the scene.

Root cause #1 - VkDescriptorPool

The concept of descriptor pools is a Vulkan-ism that doesn’t really exist in any other API. They behave similarly to command pools in the sense that they abstract away exactly how much memory is required, but unlike command pools, descriptor pools are not bottomless. Finding the right trade-offs can be tricky.

The constant battle of reuse vs. recycle

Do we recycle descriptor sets or not? “It Depends™”

Separate pools per descriptor set layout or common pool?

When creating a descriptor pool, we have to specify how many of each descriptor we need to allocate. This is awkward since we don’t necessarily know how many of each descriptor set layout we’re going to need during a render. There are various strategies, none of which are ideal:

Pretending it’s a linear allocator

Just allocate a bunch and when we hit VK_ERROR_OUT_OF_POOL_MEMORY, make a new one and throw the old away. Allocate VkDescriptorSet on demand and throw them away as well. It’s not obvious how to balance maxSets and various descriptor counts. Hopefully we don’t waste too much memory. Be careful though, because some implementations have bottomless pools, and you’ll never get OUT_OF_POOL_MEMORY.

This is a simple model that mirrors how we would use command allocators.

Slab allocator per unique VkDescriptorSetLayout

If we subdivide descriptor pools per unique set layout we know exactly how to balance the maxSets and descriptor type counts to allocate. The downside is of course, more pools. The upside is that we can just recycle VkDescriptorSet objects themselves instead of allocating them all the time.

One pool per rendering thread

Vulkan is designed for multi-threading, but using descriptor pools with multiple threads isn’t possible without locks, so we’re forced to allocate a pool per thread. This gets wasteful very quickly, especially with the linear allocator method. A smart allocator could implement a thread safe ring buffer using atomics for example. An even smarter one could optimize away atomics contention by allocating N descriptor sets in one go.

Are we allocating too much?

Given enough descriptors in a scene, we will eventually run into a situation where descriptor memory can become a problem for the implementation to juggle. As a developer there is no mechanism to control where descriptor memory is located or how much memory it actually consumes.

Root cause #2 - vkUpdateDescriptorSets

Updating descriptor sets is an extremely hot command in most engines. In a renderer with a legacy binding model, almost every draw call needs to update descriptors in some way. It’s certainly possible to design a renderer around pre-baking descriptor sets to avoid this, but it can cause headaches due to its tight coupling with rendering abstraction and underlying API. Just like descriptor pools, there are many strategies for how to deal with this:

Expose a “bind group” abstraction to higher level code

Gives the responsibility to the rendering code to know about descriptor sets explicitly. This implies a tighter coupling between shaders and high level concepts.

Linear allocator style

Does not attempt to do anything clever. Allocate, update and bind until invalidation.

Hash’n’cache

Reuse descriptor sets through hashmaps with some mechanism for garbage collection. Trades the overhead of vkUpdateDescriptorSets and vkAllocateDescriptorSets for lookups. Not particularly elegant.
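To make the hash'n'cache idea a bit more concrete, here is a minimal sketch, assuming a tiny direct-mapped cache keyed by an FNV-1a hash of the bound resource handles. Everything in it is hypothetical: SetCache, cache_lookup, cache_insert, the 64-bit integers standing in for Vulkan handles, and the overwrite-on-collision policy are illustration only, not something Vulkan or any particular engine prescribes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 256u /* hypothetical size, must be a power of two */

typedef struct {
    uint64_t key;   /* hash of the bound resource handles */
    uint64_t set;   /* stand-in for a VkDescriptorSet handle */
    bool     valid;
} CacheEntry;

typedef struct {
    CacheEntry slots[CACHE_SLOTS];
} SetCache;

/* FNV-1a over the bytes of each handle value. */
static uint64_t hash_bindings(const uint64_t *handles, size_t count)
{
    uint64_t h = 0xcbf29ce484222325ull;
    for (size_t i = 0; i < count; i++) {
        for (int b = 0; b < 8; b++) {
            h ^= (handles[i] >> (8 * b)) & 0xffu;
            h *= 0x100000001b3ull;
        }
    }
    return h;
}

/* Returns true on a hit and writes the cached set to *out_set. */
static bool cache_lookup(const SetCache *c, uint64_t key, uint64_t *out_set)
{
    const CacheEntry *e = &c->slots[key & (CACHE_SLOTS - 1u)];
    if (e->valid && e->key == key) {
        *out_set = e->set;
        return true;
    }
    return false;
}

/* Overwrites whatever occupied the slot. A real cache would have to
 * recycle the evicted set once the GPU is done with it. */
static void cache_insert(SetCache *c, uint64_t key, uint64_t set)
{
    CacheEntry *e = &c->slots[key & (CACHE_SLOTS - 1u)];
    e->key = key;
    e->set = set;
    e->valid = true;
}
```

On a miss, the caller would allocate and update a set the usual way and then call cache_insert. The garbage collection part is the hard bit and is deliberately left out here.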

The special descriptor - VK_DESCRIPTOR_TYPE_*_BUFFER_DYNAMIC

Dynamic UBOs (and to a lesser degree SSBOs) are a special feature of Vulkan that tries to overcome the overhead of allocating and updating descriptor sets. Instead of updating a set, we rebind it with different offsets. This allows for a simple linear allocation of uniform buffers, and works best when we dedicate a descriptor set to hold dynamic UBOs.

It’s still a fine model for most use cases

All of the concerns mentioned above are not necessarily problems if you’re primarily targeting Vulkan, just constraints that have to be accounted for. In my hobby engine for example, this model works just fine. The tooling support is excellent, as there is no need to mess around with GPU-assisted validation for basic things. More advanced features like descriptor indexing can be tapped into on an as-needed basis.

If this model meets your needs, there’s no need to feel pressured into rewriting everything to support this newer model, but if you’re still reading, it probably means you need something more flexible, or faster, or both!

The bindless takeover

“Bindless” is a term that comes up when discussing descriptors and is one of those overloaded terms that can mean too many things. I like to think of it as:

  • Just bind less, i.e. make fewer calls to vkCmdBindDescriptorSets.
  • Free from the concept of assigning a fixed number of descriptors to a draw call.

Both concepts are implemented by having giant arrays of descriptors where we can freely refer to descriptors by index. Indices are plain-old-data that can be freely passed around in buffer objects.

VK_EXT_descriptor_indexing (core in 1.2) was introduced to support this bindless style, and we punched a hole through the descriptor abstraction where we could now modify descriptor sets up until command buffer submission, update descriptor sets concurrently, leave unused descriptors undefined, and do all sorts of shenanigans that make tool developers lose sleep at night.

The introduction of descriptor indexing revealed that the descriptor model is all just smoke and mirrors. Really, a descriptor is just a blob of binary data that the GPU can interpret in some meaningful way. The API calls to manage descriptors really just boil down to “copy magic bits here.”

Support is widespread on desktop-class GPUs, and more modern mobile hardware supports this just fine as well.

Free-form indexing

As the extension name suggests, descriptor indexing unlocked a new world for shader authors since we were now able to access descriptors by index.

// HLSL
// Replace vkCmdBindDescriptorSets with vkCmdPushConstants
Texture2D Tex[];
[[vk::push_constant]] cbuffer { uint index; };
float4 main(float2 uv : UV) : SV_Target {
    return Tex[index].Load(int3(uv, 0));
}

The idea extends to indexing which is not dynamically uniform as well.

// HLSL
// Let every triangle use a different texture.
Texture2D Tex[];
float4 main(nointerpolation uint index : INDEX, float2 uv : UV) : SV_Target {
    return Tex[NonUniformResourceIndex(index)].Load(int3(uv, 0));
}

Decoupling draw calls from binding descriptor sets

As we see above, we’re now able to think of descriptors in terms of indices. This style is basically what D3D12’s shader model 6.6 enables in a convenient way. The reason for calling it bindless is that we decouple individual draws and dispatches from concrete descriptor “binding” calls. Resources are accessed in a more indirect way.

// HLSL - SM 6.6 style
float4 main(nointerpolation uint index : INDEX, float2 uv : UV) : SV_Target {
    Texture2D Tex = ResourceDescriptorHeap[NonUniformResourceIndex(index)];
    return Tex.Load(int3(uv, 0));
}

The complication is that shaders must be written with this model in mind, i.e. the shaders become more tightly coupled with the engine design. Instead of declaring descriptors with set and binding decorations, you would need to pass around a uint index instead. D3D12 shader models pre-6.6 essentially did this kind of transform magic for you, converting register and space into a uint index automatically. For example, we could transform this kind of shader:

// HLSL
[[vk::binding(0, 0)]] Texture2D A : register(t0, space0);
[[vk::binding(1, 0)]] Texture2D B : register(t1, space0);
[[vk::binding(0, 1)]] Texture2D C : register(t0, space1);
[[vk::binding(1, 1)]] Texture2D D : register(t1, space1);
float main(float2 uv : UV) : SV_Target {
    float result = A.Load(int3(uv, 0));
    result += B.Load(int3(uv, 0));
    result += C.Load(int3(uv, 0));
    result += D.Load(int3(uv, 0));
    return result;
}

into this:

[[vk::push_constant]] cbuffer { uint setOffset0, setOffset1; };
float main(float2 uv : UV) : SV_Target {
    Texture2D A = ResourceDescriptorHeap[setOffset0 + 0];
    Texture2D B = ResourceDescriptorHeap[setOffset0 + 1];
    Texture2D C = ResourceDescriptorHeap[setOffset1 + 0];
    Texture2D D = ResourceDescriptorHeap[setOffset1 + 1];
    …
}

A VkDescriptorSet(Layout) is conceptually similar to this model where the driver does addressing logic for you behind the scenes.

With plain uints instead of actual descriptor sets, there are some design questions that come up. Do we assign one uint per descriptor, or do we try to group them together such that we only need to push one base offset? If we go with the latter, we might end up having to copy descriptors around. If we go with one uint per descriptor, we just added extra indirection on the GPU. We might end up with code looking like:

StructuredBuffer<uint> Set0;
StructuredBuffer<uint> Set1;
float main(float2 uv : UV) : SV_Target {
    Texture2D A = ResourceDescriptorHeap[Set0[0]];
    Texture2D B = ResourceDescriptorHeap[Set0[1]];
    Texture2D C = ResourceDescriptorHeap[Set1[0]];
    Texture2D D = ResourceDescriptorHeap[Set1[1]];
    …
}

which seems nonoptimal; GPU throughput might suffer with the added latency. On the other hand, having to group descriptors linearly one after the other can easily lead to copy hell. Copying descriptors is still an abstracted operation that requires API calls to perform, and we cannot perform it on the GPU. The overhead of all these calls in the driver can be quite significant, especially in API layering. I’ve seen up to 10 million calls to “copy descriptor” per second which adds up.

We need better ways to deal with descriptor churn.

This is really just memory, right?

Given the rambling in the previous sections, managing descriptors really starts looking more and more like just any other memory management problem. Let’s try translating existing API concepts into what they really are under the hood.

vkCreateDescriptorPool

vkAllocateMemory. Memory type unknown, but likely HOST_VISIBLE and DEVICE_LOCAL. Size of pool computed from pool entries.

vkAllocateDescriptorSets

Linear or arena allocation from pool. Size and alignment computed from VkDescriptorSetLayout.

vkUpdateDescriptorSets

Writes raw descriptor data by copying payload from VkImageView / VkSampler / VkBufferView. Write offset is deduced from VkDescriptorSetLayout and binding. The VkDescriptorSet contains a pointer to HOST_VISIBLE mapped CPU memory. Copies are similar.

vkCmdBindDescriptorSets

Binds the GPU VA of the VkDescriptorSet somehow.
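The translation above can be sketched as a toy model in plain C. This is only a mental model, not how any actual driver is written: ToyPool, ToySet and the toy_* functions are made-up names, the GPU VA is just a number, and the payload is an opaque blob.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins; none of these are Vulkan types. */
typedef struct {
    uint8_t *mapped;  /* CPU mapping of the pool memory */
    uint64_t gpu_va;  /* GPU virtual address of the same memory */
    size_t   size;
    size_t   used;
} ToyPool;

typedef struct {
    uint8_t *mapped;  /* where updates are written */
    uint64_t gpu_va;  /* what a bind would hand to the GPU */
} ToySet;

/* "vkAllocateDescriptorSets": bump allocation sized by the layout. */
static int toy_allocate_set(ToyPool *pool, size_t layout_size,
                            size_t alignment, ToySet *out)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t offset = (pool->used + alignment - 1) & ~(alignment - 1);
    if (offset > pool->size || layout_size > pool->size - offset)
        return -1; /* the toy version of running out of pool memory */
    pool->used = offset + layout_size;
    out->mapped = pool->mapped + offset;
    out->gpu_va = pool->gpu_va + offset;
    return 0;
}

/* "vkUpdateDescriptorSets": copy an opaque payload to the binding offset. */
static void toy_update(ToySet *set, size_t binding_offset,
                       const void *payload, size_t payload_size)
{
    memcpy(set->mapped + binding_offset, payload, payload_size);
}

/* "vkCmdBindDescriptorSets": all the GPU needs is an address. */
static uint64_t toy_bind(const ToySet *set)
{
    return set->gpu_va;
}
```

Seen this way, the only thing the API really hides is where the memory lives and what the payload bits look like.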

Rewriting the descriptor API

The descriptor buffer API effectively removes VkDescriptorPool and VkDescriptorSet. The APIs now expose lower level detail. For example, there’s now a bunch of properties to query:

typedef struct VkPhysicalDeviceDescriptorBufferPropertiesEXT {
    …
    size_t             samplerDescriptorSize;
    size_t             combinedImageSamplerDescriptorSize;
    size_t             sampledImageDescriptorSize;
    size_t             storageImageDescriptorSize;
    size_t             uniformTexelBufferDescriptorSize;
    size_t             robustUniformTexelBufferDescriptorSize;
    size_t             storageTexelBufferDescriptorSize;
    size_t             robustStorageTexelBufferDescriptorSize;
    size_t             uniformBufferDescriptorSize;
    size_t             robustUniformBufferDescriptorSize;
    size_t             storageBufferDescriptorSize;
    size_t             robustStorageBufferDescriptorSize;
    size_t             inputAttachmentDescriptorSize;
    size_t             accelerationStructureDescriptorSize;
    …
} VkPhysicalDeviceDescriptorBufferPropertiesEXT;

Allocating descriptors

In a VkDescriptorSetLayout we can query how much memory to allocate:

void vkGetDescriptorSetLayoutSizeEXT(
    VkDevice                                    device,
    VkDescriptorSetLayout                       layout,
    VkDeviceSize*                               pLayoutSizeInBytes);

and where to place descriptors within the allocated set:

void vkGetDescriptorSetLayoutBindingOffsetEXT(
    VkDevice                                    device,
    VkDescriptorSetLayout                       layout,
    uint32_t                                    binding,
    VkDeviceSize*                               pOffset);

The alignment of a descriptor set itself is constant and is found in the properties struct. For VARIABLE_DESCRIPTOR_COUNT style descriptor sets, you can compute the size yourself based on the offset of the binding and descriptor count you need. Arrays are tightly packed as you would expect.
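The arithmetic for arrays and variable-count sets is simple enough to write down. Below is a minimal sketch, assuming the binding offset was queried with vkGetDescriptorSetLayoutBindingOffsetEXT and the descriptor size and offset alignment come from the properties struct. The helper names are made up for this example, and the power-of-two alignment is an assumption that the code asserts rather than something I am promising here.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of element `index` within an arrayed binding.
 * Arrays are tightly packed, so this is a plain multiply-add. */
static uint64_t array_element_offset(uint64_t binding_offset,
                                     uint64_t descriptor_size,
                                     uint32_t index)
{
    return binding_offset + descriptor_size * index;
}

/* Bytes needed for a set whose variable-count binding holds
 * `descriptor_count` descriptors: the set ends right after the
 * last element of that binding. */
static uint64_t variable_set_size(uint64_t binding_offset,
                                  uint64_t descriptor_size,
                                  uint32_t descriptor_count)
{
    return array_element_offset(binding_offset, descriptor_size,
                                descriptor_count);
}

/* Rounds a candidate set start up to the required offset alignment.
 * Assumes (and asserts) a power-of-two alignment. */
static uint64_t align_set_offset(uint64_t offset, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For example, with a binding at offset 128, 32-byte descriptors and 1000 entries, the set needs 32128 bytes, and the next set in the same buffer would start at align_set_offset(32128, alignment).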

When allocating buffers to back descriptors, we need some new usage flags:

VK_BUFFER_USAGE_SAMPLER_DESCRIPTOR_BUFFER_BIT_EXT
VK_BUFFER_USAGE_RESOURCE_DESCRIPTOR_BUFFER_BIT_EXT

Do not slap these on every buffer allocation indiscriminately. On some drivers, descriptor buffers live in a restricted GPU VA space, which allows those drivers to spend only 32 bits instead of 64 bits to bind a descriptor VA. This VA range is precious. On these drivers, you’ll likely see new memory types that allocate 32-bit VA under the hood.

Updating descriptors

Extracting descriptor payloads is straightforward using:

void vkGetDescriptorEXT(
    VkDevice                                        device,
    const VkDescriptorGetInfoEXT*                   pDescriptorInfo,
    size_t                                          dataSize,
    void*                                           pDescriptor);

Here we can extract payloads from VkImageView and VkSampler objects. For raw buffers like UBOs and SSBOs, we give it VkDeviceAddress instead of VkBuffer + offset.

VkBufferView is also gone! We replaced it with VkDeviceAddress + range + VkFormat instead. This is a special feature that is particularly useful for D3D12 layering, but it’s generally just a nice feature that makes typed buffers easier to use.

pDescriptor can either point directly to a memory mapped VkBuffer or a payload we allocate ourselves outside Vulkan. vkUpdateDescriptorSets is gone—just use memcpy(). If you want an excuse to write a JIT code generator in your Vulkan backend, rejoice, because you can replace VkDescriptorUpdateTemplate with perfectly unrolled code that copies descriptor payloads. Just don’t tell your reviewer that you got the idea here :)

If pDescriptor points to uncached memory (i.e. write-combined), it is valuable to ensure multiple writes are written contiguously. Implementations are free to reorder VkDescriptorSetLayout offsets to achieve optimal descriptor packing, so a template-based update scheme should sort the update by offset rather than binding index. Note that this reordering is only for individual binding numbers. Arrays are of course packed linearly.
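A minimal sketch of such an offset-sorted update template, assuming the payloads were already extracted with vkGetDescriptorEXT into CPU memory. TemplateEntry, template_bake and template_apply are hypothetical names for this example, and the payloads array indexed by binding number is just one possible convention.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t binding;     /* binding number in the set layout */
    uint64_t dst_offset;  /* queried offset of that binding in the set */
    size_t   size;        /* descriptor size for the binding's type */
} TemplateEntry;

static int compare_by_offset(const void *a, const void *b)
{
    const TemplateEntry *ea = a, *eb = b;
    return (ea->dst_offset > eb->dst_offset) -
           (ea->dst_offset < eb->dst_offset);
}

/* Done once per layout: order the entries by destination offset
 * instead of by binding number. */
static void template_bake(TemplateEntry *entries, size_t count)
{
    qsort(entries, count, sizeof(entries[0]), compare_by_offset);
}

/* Done per update: payloads[] is indexed by binding number. Writes walk
 * the destination in increasing address order, which is the friendly
 * pattern for write-combined memory. */
static void template_apply(const TemplateEntry *entries, size_t count,
                           uint8_t *dst, const void *const *payloads)
{
    for (size_t i = 0; i < count; i++)
        memcpy(dst + entries[i].dst_offset,
               payloads[entries[i].binding], entries[i].size);
}
```

The bake step only needs to run when the layout is created, so the per-draw cost is the memcpy loop and nothing else.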

Keeping VkImageView and VkSampler objects around is a compromise. Implementations still need to keep the older binding model working in the driver, and completely rewriting that is not practical. The descriptor payload might be a reference to the view object, but to the application, the descriptor data is just memory that can be copied around.

VkBool32 allowSamplerImageViewPostSubmitCreation;

takes this into account.

In an ideal world, we’d replace vkCmdBindDescriptorSets() with vkCmdBindDescriptorSetAddress() or something like that, but hardware isn’t necessarily that convenient. In D3D12 we have ID3D12DescriptorHeap which enshrines this hardware quirk. Some hardware is only able to access descriptors from a narrow address space that is bound globally. We inherit the same concerns in Vulkan, although implementations are free to expose support for many descriptor buffers being bound concurrently for added flexibility.

Binding descriptors

void vkCmdBindDescriptorBuffersEXT(
    VkCommandBuffer                             commandBuffer,
    uint32_t                                    bufferCount,
    const VkDescriptorBufferBindingInfoEXT*     pBindingInfos);

void vkCmdSetDescriptorBufferOffsetsEXT(
    VkCommandBuffer                             commandBuffer,
    VkPipelineBindPoint                         pipelineBindPoint,
    VkPipelineLayout                            layout,
    uint32_t                                    firstSet,
    uint32_t                                    setCount,
    const uint32_t*                             pBufferIndices,
    const VkDeviceSize*                         pOffsets);

Binding a descriptor buffer is a perfect analogue to D3D12’s SetDescriptorHeaps, but binding offsets isn’t quite the same thing. In Vulkan, we bind by byte offset, whereas in D3D12 we bind by descriptor offset masquerading as a VA. Descriptor buffers look more similar to Metal’s indirect argument buffers than D3D12’s root tables.

Changing the descriptor buffer bindings is highly discouraged, just like D3D12, but this depends on the implementation. Changing between descriptor sets and descriptor buffers between commands is also highly discouraged. Do not mix and match if possible. A good mental model is that changing the descriptor buffer might imply an ALL_COMMANDS -> ALL_COMMANDS pipeline barrier. Push descriptors can still be used alongside descriptor buffers as a way to bridge the gap without the extra cost.

Note that it’s not possible to mix and match within a single draw call or dispatch. A pipeline is either enabled for descriptor buffers, or not. This is controlled by flags such as:

VK_PIPELINE_CREATE_DESCRIPTOR_BUFFER_BIT_EXT
VK_DESCRIPTOR_SET_LAYOUT_CREATE_DESCRIPTOR_BUFFER_BIT_EXT

Copying descriptors on GPU timeline? Why not

Since descriptors are just memory now, there is nothing stopping us from doing descriptor updates on the GPU timeline. Combining this with GPU-driven rendering is exceptionally powerful.

How to deal with different descriptor sizes?

When copying descriptors around, you need to know the descriptor size, since it varies wildly between implementations. Specialization constants are a perfect fit here: just record the various descriptor sizes as constants in the shader. You could also use vkCmdCopyBuffer. Note that when viewed as a buffer, descriptor data is just opaque binary data. This is not supported:

RWStructuredBuffer<Texture2D> Textures;
void main() { Textures[1] = Textures[0]; }

but this is:

[[vk::constant_id(0)]] const int SAMPLED_IMAGE_WORD_COUNT = 0;
[[vk::constant_id(1)]] const int STORAGE_IMAGE_WORD_COUNT = 0;
[[vk::constant_id(2)]] const int UNIFORM_BUFFER_WORD_COUNT = 0;
RWByteAddressBuffer DescriptorHeapRawBlobbyData;

A SPIR-V compiler would have no idea how big Texture2D is and what the alignment is, but specialization constants are here to save the day. This is in contrast to Metal 3’s indirect argument buffers where all descriptors are just uint64_t VAs, which simplifies shader based copying of descriptors, but would not be efficient on many Vulkan implementations. The size of an image descriptor can be completely different from a buffer descriptor for example. This style lets implementations reach descriptors with as few indirections as possible.
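To show what a word-count based copy amounts to, here is a CPU-side reference sketch in C of the loop a copy shader would run over the raw heap. It is not shader code and not part of the extension: descriptor_word_count and copy_descriptor_words are made-up helpers, and the assumption that descriptor sizes are multiples of 4 bytes is mine, checked with an assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts a queried descriptor size in bytes into a 32-bit word count,
 * i.e. the value one would feed into a specialization constant.
 * Assumes the size is a multiple of 4 and asserts it. */
static uint32_t descriptor_word_count(size_t size_in_bytes)
{
    assert(size_in_bytes % 4 == 0);
    return (uint32_t)(size_in_bytes / 4);
}

/* CPU reference for the shader loop: copies one descriptor slot to
 * another inside a heap viewed as raw 32-bit words. The slots are
 * `word_count` words apart and the payload is never interpreted. */
static void copy_descriptor_words(uint32_t *heap, uint32_t word_count,
                                  uint32_t dst_index, uint32_t src_index)
{
    uint32_t *dst = heap + (size_t)dst_index * word_count;
    const uint32_t *src = heap + (size_t)src_index * word_count;
    for (uint32_t i = 0; i < word_count; i++)
        dst[i] = src[i];
}
```

The point is that the copy only ever sees words. Which word count to use depends on the descriptor type, and that is exactly the piece of information the specialization constants carry.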

Synchronization

When synchronizing a shader stage, you can use VK_ACCESS_2_DESCRIPTOR_BUFFER_READ_BIT_EXT.

If you’re updating descriptors from the CPU, you get the implicit host -> device synchronization on vkQueueSubmit.

A massive foot-gun

While this is a ridiculously powerful feature, it’s also an equally ridiculous foot-gun. The requirements on debug infrastructure are extreme. Be warned!

A concrete use-case - D3D12 layering in vkd3d-proton

In the vkd3d-proton project, we implement D3D12 as a layered implementation over Vulkan, and we have been clamoring for this extension for a very long time now since emulating the descriptor model of D3D12 natively is painful when we’re chasing parity in performance against native drivers. Below are some features we use to make it work well. These points should be of interest to any D3D12-centric engine that wants to take advantage of this EXT.

Non-shader visible descriptor heap

Ideally, these descriptor heaps would be implemented with malloc(). Now we can do just that!

Avoiding API overhead of vkUpdateDescriptorSets

D3D12 applications tend to copy descriptors, and a lot of them. Now we can memcpy() instead of invoking API calls every time.

Greatly simplified static samplers

In D3D12, an ID3D12RootSignature contains a list of static samplers, which are closely related to Vulkan’s immutable samplers. In the earlier implementation, we had to create a VkDescriptorSetLayout with just immutable samplers, allocate a VkDescriptorSet and bind that when the root signature changed. This worked fine, but it couldn’t work with descriptor buffers since the sampler heap is controlled by the application now. For this reason, we added a simpler formulation for immutable samplers:

VK_DESCRIPTOR_SET_LAYOUT_CREATE_EMBEDDED_IMMUTABLE_SAMPLERS_BIT_EXT

Descriptor set layouts created with this flag are restricted to only contain samplers with descriptorCount = 1 (i.e. no arrays), but in return there is no need to allocate them. Drivers manage them internally just like D3D12, with similar restrictions on the number of unique samplers on the device. The descriptor sets still need to be bound, however, using:

vkCmdBindDescriptorBufferEmbeddedSamplersEXT(
    cmd, bindPoint, pipelineLayout, setIndex);

There is no vkCmdBindPipelineLayout() in Vulkan where we could have done this, so this was the next best option.

VkBufferView removal

Being able to place buffer views directly in buffers instead of holding onto VkBufferView objects was very helpful and even improved our SPIR-V code generation.

No shader backend changes

No shader translation work needed to change, because we still retain the model of VkDescriptorSetLayout. This is welcome since our descriptor model translation is complicated enough as-is.

Push descriptor interactions

Internally in our implementation, we often need to use our own shaders to implement things like copies, clears, blits, etc. These require actual descriptors. We cannot touch descriptor heaps owned by the application, and for performance reasons we cannot change the descriptor heap either, so the only option left to us is push descriptors. Fortunately, push descriptors are widely supported on the platforms where D3D12 layering is relevant, and they don’t require us to rebind descriptor heaps.

Using push descriptors is somewhat more complicated since it punches a hole in the abstraction model of some hardware, but it is supported.

typedef struct VkPhysicalDeviceDescriptorBufferFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           descriptorBuffer;
    VkBool32           descriptorBufferCaptureReplay;
    VkBool32           descriptorBufferImageLayoutIgnored;
    VkBool32           descriptorBufferPushDescriptors;
} VkPhysicalDeviceDescriptorBufferFeaturesEXT;

If the implementation sets this property to true:

typedef struct VkPhysicalDeviceDescriptorBufferPropertiesEXT {
    …
    VkBool32 bufferlessPushDescriptors;
    …
} VkPhysicalDeviceDescriptorBufferPropertiesEXT;

Then we can just use push descriptors as-is. If it is false, we need some extra hand-holding to help the driver. One (and only one) of the descriptor buffers we bind must be created with

VK_BUFFER_USAGE_PUSH_DESCRIPTORS_DESCRIPTOR_BUFFER_BIT_EXT

This allows the implementation to reserve magic scratch space it can use for push descriptors. Typically we would add this flag to a RESOURCE descriptor buffer.

When binding a descriptor buffer, we normally just pass in the GPU VA:

typedef struct VkDescriptorBufferBindingInfoEXT {
    VkStructureType                             sType;
    const void*                                 pNext;
    VkDeviceAddress                             address;
    VkBufferUsageFlags                          usage;
} VkDescriptorBufferBindingInfoEXT;

but if we need the buffer, we can pNext in the underlying buffer so that the driver can bind its magic scratch space:

typedef struct VkDescriptorBufferBindingPushDescriptorBufferHandleEXT {
    VkStructureType                             sType;
    const void*                                 pNext;
    VkBuffer                                    buffer;
} VkDescriptorBufferBindingPushDescriptorBufferHandleEXT;

The state of tooling

With descriptor buffers, we have to recognize that the tooling becomes that much more important and difficult to write. Full GPU validation will be necessary since everything is now descriptor indexing on steroids.

GPU validation is currently only required for descriptor indexing or UPDATE_AFTER_BIND, which means we have a schism of CPU-side validation and GPU-side validation, and many developers tend to avoid GPU-assisted validation due to the massive speed hit. We want a future where GPU validation is fast and robust by default.

It’s going to take a good while to land the improved GPU validation, but the CPU-side API validation will land promptly with the specification being released. LunarG is hard at work to land a revamped implementation of GPU-side descriptor validation that should make things faster and more accurate.

On the debugger side of things, the extension supports capture-replay, meaning that we can guarantee invariant descriptor payloads across runs. This is critical for replaying in, e.g., RenderDoc. Again, implementing something like this is extremely difficult and is not going to happen tomorrow.

Implementation strategies for various scenarios

If you’re a brave soul looking to abandon all hope of debugging and join the descriptor buffer club already, here are some scenarios where descriptor buffers can help:

Linear allocation strategy too slow

This is particularly useful when replacing the linear allocation method mentioned in the earlier sections, where sets are allocated on demand and thrown away. This is the common scenario where CPU overhead related to descriptor updates can become a problem. It is often seen in applications that directly implement legacy binding models on top of Vulkan, à la OpenGL or D3D11.

At the start of the frame, allocate or recycle a buffer large enough to hold all descriptors in your frame. Treat this as a linear allocator.

Instead of allocating new VkDescriptorSets and updating them, allocate a region of the big descriptor buffer. For bonus points, amortize the contention by allocating a semi-large block in one go per-thread, and then parcel out smaller blocks inside the thread. A VkDescriptorPool is not sophisticated enough for this style. We would have to lock the entire pool instead, and having separate pools per thread leads to excessive bloat as discussed earlier. This is the best of both worlds.

Basically, this looks exactly like how an engine would manage linear UBO allocation.
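Here is a minimal sketch of that two-level scheme using C11 atomics, assuming offsets are handed out as plain byte offsets into one big per-frame descriptor buffer. FrameDescriptorHeap, ThreadBlock and the function names are hypothetical, and the power-of-two alignment is again an assumption the code asserts. Block sizes should be a multiple of the alignment so that every block starts aligned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t head;  /* shared bump pointer, in bytes */
    uint64_t capacity;      /* size of the descriptor buffer */
} FrameDescriptorHeap;

typedef struct {
    uint64_t cursor;  /* next free byte in this thread's block */
    uint64_t end;     /* one past the end of the block */
} ThreadBlock;

static uint64_t align_up(uint64_t value, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* One atomic operation per block, not per descriptor set. On failure
 * the head stays past capacity until it is reset for the next frame. */
static bool heap_grab_block(FrameDescriptorHeap *heap, uint64_t block_size,
                            ThreadBlock *out)
{
    uint64_t start = atomic_fetch_add(&heap->head, block_size);
    if (start + block_size > heap->capacity)
        return false;
    out->cursor = start;
    out->end = start + block_size;
    return true;
}

/* Thread-local sub-allocation, no synchronization needed. The returned
 * offset is relative to the start of the descriptor buffer. */
static bool block_alloc_set(ThreadBlock *block, uint64_t set_size,
                            uint64_t alignment, uint64_t *out_offset)
{
    uint64_t offset = align_up(block->cursor, alignment);
    if (offset + set_size > block->end)
        return false;
    block->cursor = offset + set_size;
    *out_offset = offset;
    return true;
}
```

When block_alloc_set fails, the thread grabs a fresh block and retries. The offset it gets back is what would eventually be passed to vkCmdSetDescriptorBufferOffsetsEXT.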

To update descriptors, rather than setting up structs and calling vkUpdateDescriptorSets() or vkUpdateDescriptorSetWithTemplate(), we can just do a bunch of memcpy() without the API calls. For raw buffers there is vkGetDescriptorEXT(), which is much leaner than the full vkUpdateDescriptorSets().

If this idea is refined with a code-generating JIT we could envision something truly wild like:

typedef void (*PFN_bindGroupUpdateTemplate)(
    void *dst,                // Allocated from descriptor buffer
    const void * const *src); // An array of pointers to descriptor payloads

PFN_bindGroupUpdateTemplate compileBindGroupUpdateTemplate(JITCompiler *jit,
    VkDevice device,
    const VkDescriptorSetLayoutCreateInfo *create_info,
    VkDescriptorSetLayout *setLayout);

In effect, the only API call made per draw to update descriptors would be vkCmdSetDescriptorBufferOffsetsEXT(). Perhaps sprinkle in a few calls to vkGetDescriptorEXT() for UBOs.

Replacing DYNAMIC_UBO (and DYNAMIC_SSBO)

These special types are not supported with descriptor buffers, but we have alternatives.

Push descriptors

One option is to reserve a descriptor set and use push UBOs instead. The advantage of this is that no changes to shaders are required.

Push VkDeviceAddress

Buffer device address is already required for this extension, so this style is viable as well, but it requires shader changes.

// GLSL
layout(buffer_reference, std140) readonly buffer UBO0
{
    …
};

layout(buffer_reference, std140) readonly buffer UBO1
{
    …
};

layout(push_constant) uniform Registers
{
    UBO0 ubo0;
    UBO1 ubo1;
} registers;

// HLSL alternative in DXC: vk::RawBufferLoad(address)

Arguably, this style is more suited for compiler front-ends that can emit buffer device address code in SPIR-V.

Just use normal UBOs

The main reason to use DYNAMIC_UBO was that we didn’t need to allocate descriptor sets, but given that we avoid almost all that overhead anyways, it might be fine to just use vkGetDescriptorEXT(). Using INLINE_UNIFORM_BUFFER may also be an option. Inline UBO data is placed directly in the descriptor buffer. Nice!

More direct descriptor table implementation (older D3D12)

Using descriptor buffers means a far more unified implementation with a typical D3D12 backend. The concepts map almost 1:1 now. Here’s a gist:

  • ID3D12DescriptorHeap -> VkBuffer with Count * sizeof(largestDescriptorType)
  • A ROOT_TABLE -> VkDescriptorSetLayout
  • SetDescriptorHeaps -> vkCmdBindDescriptorBuffersEXT
  • Instead of allocating N descriptors per ROOT_TABLE:
    • Allocate N bytes queried from VkDescriptorSetLayout
  • Instead of CopyDescriptorsSimple():
    • memcpy() into offsets queried from VkDescriptorSetLayout
  • Instead of CreateCBV/SRV/UAV/Sampler:
    • vkGetDescriptorEXT() in place
  • Instead of Set*RootDescriptorTable:
    • vkCmdSetDescriptorBufferOffsetsEXT()
    • Be aware of alignment here, it might not be aligned to sizeof(descriptor).
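The index-to-byte-offset part of this mapping can be sketched as follows, assuming a heap where every slot is as large as the largest descriptor type it may hold. The helper names are made up for illustration, and the sizes and alignment would come from the properties struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stride of one heap slot: large enough for any descriptor type the
 * heap may hold. `sizes` would be filled from the queried
 * *DescriptorSize properties. */
static size_t heap_slot_stride(const size_t *sizes, size_t count)
{
    size_t stride = 0;
    for (size_t i = 0; i < count; i++)
        if (sizes[i] > stride)
            stride = sizes[i];
    return stride;
}

/* Byte offset of a table that starts at descriptor `index`.
 * D3D12 thinks in descriptor indices, Vulkan wants bytes. */
static uint64_t table_byte_offset(uint32_t index, size_t stride)
{
    return (uint64_t)index * stride;
}

/* Nonzero when `offset` satisfies the required offset alignment. */
static int offset_is_bindable(uint64_t offset, uint64_t alignment)
{
    assert(alignment != 0);
    return offset % alignment == 0;
}
```

If the required alignment turns out to be larger than the slot stride, not every descriptor index is a valid table start. A backend would then have to pad the stride or work around it some other way, and which option is best is a design choice this sketch does not settle.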

The main headache is that SRV and UAV can map to several different descriptor types. Ideally, the application should know if it expects to use SAMPLED_IMAGE, UNIFORM_TEXEL_BUFFER or ACCELERATION_STRUCTURE_KHR.

Improving a fully bindless design - a-la Shader Model 6.6

With descriptor indexing as-is, it’s already possible to reach a design where every resource is accessed by a uint index; the VkDescriptorSetLayout system does not change after all. We do this in vkd3d-proton already for example.

The main win of descriptor buffers for this kind of design is that it’s now far more efficient to shuffle descriptors around. We can also copy descriptors on the GPU timeline. I expect we’ll see some interesting innovation here.

Conclusion

Long term, I think descriptor buffers will change how Vulkan backends are designed, but it’s going to take a while for the ecosystem to catch up with tooling and debuggers. The work required to get there must not be underestimated.
