AMDGPU Asynchronous Operations

Introduction

Asynchronous operations are memory transfers (usually between global memory and LDS) that are completed independently at an unspecified scope. A thread that requests one or more asynchronous transfers can use asyncmarks to track their completion. The thread waits for an asyncmark to be completed; completion of an asyncmark indicates that all requests initiated in program order before it have also completed.

Operations

Memory Accesses

The following instructions request asynchronous transfer of data between global memory and LDS memory.

Note

These listings are merely representative. The actual function signatures and supported architectures are documented in the User Guide for AMDGPU Backend.

GFX9 Async Instructions (LDS DMA)

void @llvm.amdgcn.load.async.to.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.global.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.raw.buffer.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.raw.ptr.buffer.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.struct.buffer.load.async.lds(ptr %src, ptr %dst)
void @llvm.amdgcn.struct.ptr.buffer.load.async.lds(ptr %src, ptr %dst)

GFX12 Async Instructions

void @llvm.amdgcn.global.load.async.to.lds.type(ptr %dst, ptr %src)
void @llvm.amdgcn.global.store.async.from.lds.type(ptr %dst, ptr %src)
void @llvm.amdgcn.cluster.load.async.to.lds.type(ptr %dst, ptr %src)

GFX1250 Tensor DMA Instructions

void @llvm.amdgcn.tensor.load.to.lds(...)
void @llvm.amdgcn.tensor.store.from.lds(...)

Asyncmark Operations

An asyncmark in the abstract machine tracks all the async operations that are program-ordered before that asyncmark. An asyncmark M is said to be completed only when all async operations program-ordered before M are reported by the implementation as having finished, and it is said to be outstanding otherwise.

Thus we have the following sufficient condition:

An async operation X is completed at a program point P if there exists an asyncmark M such that X is program-ordered before M, M is program-ordered before P, and M is completed. X is said to be outstanding at P otherwise.

The abstract machine maintains a sequence of asyncmarks during the execution of a function body. This sequence excludes any asyncmarks produced by other functions that are called from the currently executing function.

@llvm.amdgcn.asyncmark()

When executed, inserts an asyncmark in the sequence associated with the currently executing function body.

@llvm.amdgcn.wait.asyncmark(i16 %N)

Waits until there are at most N outstanding asyncmarks in the sequence associated with the currently executing function body.
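The two intrinsics can be pictured as a pair of counters. The following is a minimal sketch in C of the per-function asyncmark sequence, not the backend's implementation. The type asyncmark_seq and the functions seq_asyncmark, seq_outstanding and seq_wait_asyncmark are hypothetical names introduced only for this illustration. The sketch relies on one consequence of the definitions above: if an asyncmark is completed, every earlier asyncmark in the sequence is completed as well, so the outstanding asyncmarks always form a suffix of the sequence.

```c
#include <assert.h>

/* Hypothetical model of the asyncmark sequence of one function body. */
typedef struct {
  unsigned issued;    /* asyncmarks inserted so far */
  unsigned completed; /* asyncmarks guaranteed completed (always a prefix) */
} asyncmark_seq;

/* Models @llvm.amdgcn.asyncmark(): append one asyncmark to the sequence. */
void seq_asyncmark(asyncmark_seq *s) { s->issued++; }

/* Number of asyncmarks that are not yet guaranteed to be completed. */
unsigned seq_outstanding(const asyncmark_seq *s) {
  assert(s->completed <= s->issued);
  return s->issued - s->completed;
}

/* Models @llvm.amdgcn.wait.asyncmark(n): on return, at most n asyncmarks
 * are outstanding. Returns how many asyncmarks this wait had to retire. */
unsigned seq_wait_asyncmark(asyncmark_seq *s, unsigned n) {
  unsigned outstanding = seq_outstanding(s);
  if (outstanding <= n)
    return 0; /* condition already holds; the wait is a no-op */
  unsigned retired = outstanding - n;
  s->completed += retired; /* oldest asyncmarks are retired first */
  return retired;
}
```

In this model, three calls to seq_asyncmark followed by seq_wait_asyncmark(&s, 2) retire exactly the oldest asyncmark. A real implementation may complete transfers earlier than the model records; the counters only track what a wait guarantees.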

Memory Consistency Model

Each asynchronous operation consists of a non-atomic read on the source and a non-atomic write on the destination. Async “LDS DMA” intrinsics result in async accesses that guarantee visibility relative to other memory operations as follows:

An asynchronous operation A program-ordered before an overlapping memory operation X happens-before X only if A is completed before X.

A memory operation X program-ordered before an overlapping asynchronous operation A happens-before A.

Note

The “only if” in the above wording implies that, unlike in the default LLVM memory model, certain program-order edges are not automatically included in happens-before.
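The practical consequence of these rules can be sketched with a deliberately pessimistic simulator that delays every transfer until a wait forces its asyncmark to complete. This is an illustrative model only: real hardware may perform a transfer at any point between the request and the wait, and reading the destination before the wait is not ordered by happens-before, so the stale value seen in this model is just one possible outcome. The type lazy_dma and the dma_* functions are hypothetical names, each transfer moves a single int, and plain pointers stand in for global and LDS addresses.

```c
#include <assert.h>
#include <stddef.h>

enum { DMA_CAP = 16 };

typedef struct {
  size_t issued;            /* transfers requested */
  size_t done;              /* transfers performed (a prefix) */
  size_t marks;             /* asyncmarks inserted */
  size_t marks_done;        /* asyncmarks completed (a prefix) */
  size_t mark_at[DMA_CAP];  /* value of `issued` when each mark was inserted */
  const int *src[DMA_CAP];  /* source of each requested transfer */
  int *dst[DMA_CAP];        /* destination of each requested transfer */
} lazy_dma;

/* Request a transfer; in this model nothing is copied yet. */
void dma_async_load(lazy_dma *d, int *dst, const int *src) {
  assert(d->issued < DMA_CAP);
  d->src[d->issued] = src;
  d->dst[d->issued] = dst;
  d->issued++;
}

/* Insert an asyncmark covering every transfer requested so far. */
void dma_asyncmark(lazy_dma *d) {
  assert(d->marks < DMA_CAP);
  d->mark_at[d->marks++] = d->issued;
}

/* Wait until at most n asyncmarks are outstanding. Completing a mark
 * performs, in request order, every transfer requested before it. */
void dma_wait_asyncmark(lazy_dma *d, size_t n) {
  while (d->marks - d->marks_done > n) {
    size_t upto = d->mark_at[d->marks_done++];
    for (; d->done < upto; d->done++)
      *d->dst[d->done] = *d->src[d->done];
  }
}
```

In this model, an ordinary store to the source that precedes dma_async_load is always observed by the transfer, matching the second rule. A load from the destination observes the transferred value only after a dma_wait_asyncmark call has completed the covering asyncmark, matching the first rule.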

Examples

Uneven blocks of async transfers

void foo(global int *g, local int *l) {
  // first block
  async_load_to_lds(l, g);
  async_load_to_lds(l, g);
  async_load_to_lds(l, g);
  asyncmark();

  // second block; longer
  async_load_to_lds(l, g);
  async_load_to_lds(l, g);
  async_load_to_lds(l, g);
  async_load_to_lds(l, g);
  async_load_to_lds(l, g);
  asyncmark();

  // third block; shorter
  async_load_to_lds(l, g);
  async_load_to_lds(l, g);
  asyncmark();

  // Wait for first block
  wait.asyncmark(2);
}

Software pipeline

void foo(global int *g, local int *l) {
  // first block
  asyncmark();

  // second block
  asyncmark();

  // third block
  asyncmark();

  for (;;) { // steady state; loop exit condition elided
    wait.asyncmark(2);
    // use data

    // next block
    asyncmark();
  }

  // flush one block
  wait.asyncmark(2);

  // flush one more block
  wait.asyncmark(1);

  // flush last block
  wait.asyncmark(0);
}
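The counting behind this pipeline can be checked with a small simulation. The sketch below is a hypothetical C model, not code from the backend: it assumes a finite trip count in place of the open-ended loop above, treats each block as one asyncmark, and uses the illustrative names pipeline_state, issue_block, wait_and_consume and run_pipeline. The assertion in wait_and_consume verifies that a block is used only after a wait has guaranteed its completion.

```c
#include <assert.h>

typedef struct {
  unsigned issued;          /* blocks (asyncmarks) issued */
  unsigned completed;       /* blocks guaranteed complete by a wait */
  unsigned consumed;        /* blocks whose data has been used */
  unsigned max_outstanding; /* high-water mark of issued - completed */
} pipeline_state;

/* Issue one block: its async transfers followed by one asyncmark. */
void issue_block(pipeline_state *p) {
  p->issued++;
  unsigned outstanding = p->issued - p->completed;
  if (outstanding > p->max_outstanding)
    p->max_outstanding = outstanding;
}

/* wait.asyncmark(n), then use the oldest block not yet consumed. */
void wait_and_consume(pipeline_state *p, unsigned n) {
  if (p->issued - p->completed > n)
    p->completed = p->issued - n;
  assert(p->consumed < p->completed); /* data in use must be complete */
  p->consumed++;
}

/* Prologue of `depth` blocks, `iterations` steady-state steps, then an
 * epilogue that drains the remaining blocks with counts depth-1 ... 0. */
pipeline_state run_pipeline(unsigned depth, unsigned iterations) {
  assert(depth >= 1);
  pipeline_state p = {0, 0, 0, 0};
  for (unsigned i = 0; i < depth; i++)
    issue_block(&p);
  for (unsigned i = 0; i < iterations; i++) {
    wait_and_consume(&p, depth - 1);
    issue_block(&p);
  }
  for (unsigned n = depth; n-- > 0;)
    wait_and_consume(&p, n);
  return p;
}
```

With depth 3, as in the example above, the steady-state wait count is 2 and the epilogue uses the counts 2, 1 and 0. Every issued block is consumed exactly once, and no more than three asyncmarks are ever outstanding.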

Ordinary function call

extern void bar(); // may or may not make async calls

void foo(global int *g, local int *l) {
  // first block
  asyncmark();

  // second block
  asyncmark();

  // function call
  bar();

  // third block
  asyncmark();

  wait.asyncmark(1); // wait for the second block
  wait.asyncmark(0); // will wait for third block, including bar()
}
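Why the final wait also covers the operations issued by bar() can be seen in a model that keeps one asyncmark sequence per function body while numbering async operations across the whole thread. This is a sketch under stated assumptions: thread_state, frame, async_op, frame_asyncmark, frame_wait_asyncmark and callee_bar are hypothetical names, and callee_bar stands in for a bar() that happens to issue two async operations and one asyncmark of its own.

```c
#include <assert.h>

enum { MAX_MARKS = 8 };

/* Thread-wide view: async operations numbered in program order. */
typedef struct {
  unsigned ops_issued;
  unsigned ops_done; /* operations guaranteed finished (a prefix) */
} thread_state;

/* Asyncmark sequence of one function body. */
typedef struct {
  unsigned marks;              /* asyncmarks inserted by this function */
  unsigned marks_done;         /* asyncmarks completed (a prefix) */
  unsigned mark_at[MAX_MARKS]; /* ops_issued when each mark was inserted */
} frame;

void async_op(thread_state *t) { t->ops_issued++; }

/* The mark joins only the sequence of the function that executes it. */
void frame_asyncmark(frame *f, const thread_state *t) {
  assert(f->marks < MAX_MARKS);
  f->mark_at[f->marks++] = t->ops_issued;
}

/* Wait until at most n marks of this function's sequence are outstanding. */
void frame_wait_asyncmark(frame *f, thread_state *t, unsigned n) {
  while (f->marks - f->marks_done > n) {
    unsigned upto = f->mark_at[f->marks_done++];
    if (upto > t->ops_done)
      t->ops_done = upto; /* everything before the mark has finished */
  }
}

/* Stand-in for bar(): its asyncmark lives in its own sequence. */
void callee_bar(thread_state *t) {
  frame callee = {0};
  async_op(t);
  async_op(t);
  frame_asyncmark(&callee, t); /* invisible to the caller's sequence */
}
```

If each block in foo() issues one operation, the caller's sequence holds three asyncmarks regardless of what the callee does. A wait with count 1 guarantees the first two operations, and a wait with count 0 guarantees all five, because the callee's operations are program-ordered before the caller's third asyncmark.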

Implementation notes

[This section is informational.]

Optimization

The implementation may eliminate asyncmark/wait intrinsics in the following cases:

  1. An asyncmark operation which is not included in the wait count of a later wait operation in the current function. In particular, an asyncmark which is not post-dominated by any wait.asyncmark.

  2. A wait.asyncmark whose wait count is more than the number of outstanding asyncmarks at that point. In particular, a wait.asyncmark that is not dominated by any asyncmark.
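For straight-line code, both cases reduce to a single forward scan. The sketch below is a simplified illustration in C, not the pass used by the backend: it ignores control flow, so "dominated" and "post-dominated" become "earlier" and "later" in the list. The names op, op_kind and find_removable are hypothetical. One detail is an inference rather than a statement from the list above: a wait whose count equals the upper bound on outstanding asyncmarks is also treated as removable, because "at most N outstanding" already holds at that point.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_MARK, OP_WAIT } op_kind;

typedef struct {
  op_kind kind;
  unsigned count; /* wait count; unused for OP_MARK */
} op;

/* Sets removable[i] to 1 for each operation that can be dropped from the
 * straight-line sequence ops[0..n), and returns how many were flagged. */
size_t find_removable(const op *ops, size_t n, unsigned char *removable) {
  size_t removed = 0;
  unsigned bound = 0;   /* upper bound on outstanding asyncmarks so far */
  size_t last_wait = n; /* index of the last wait that is kept, n if none */

  for (size_t i = 0; i < n; i++) {
    removable[i] = 0;
    if (ops[i].kind == OP_MARK) {
      bound++;
    } else if (ops[i].count >= bound) {
      removable[i] = 1; /* wait is already satisfied: a no-op */
      removed++;
    } else {
      bound = ops[i].count; /* the wait lowers the bound */
      last_wait = i;
    }
  }

  /* Asyncmarks after the last kept wait are never waited on. */
  for (size_t i = (last_wait == n ? 0 : last_wait + 1); i < n; i++) {
    if (ops[i].kind == OP_MARK) {
      removable[i] = 1;
      removed++;
    }
  }
  return removed;
}
```

For the sequence wait(0), mark, mark, wait(1), wait(1), mark, the scan flags the leading wait (no earlier asyncmark), the second wait(1) (the bound is already 1) and the trailing asyncmark (no later wait), and keeps the two asyncmarks and the first wait(1).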

In general, at a function call, if the caller uses sufficient waits to track its own async operations, the actions performed by the callee cannot affect correctness. But inlining such a call may result in redundant waits.

void foo() {
  asyncmark(); // A
}

void bar() {
  asyncmark(); // B
  asyncmark(); // C
  foo();
  wait.asyncmark(1);
}

Before inlining, the wait.asyncmark waits for asyncmark B to be completed.

void foo() {
}

void bar() {
  asyncmark(); // B
  asyncmark(); // C
  asyncmark(); // A from call to foo()
  wait.asyncmark(1);
}

After inlining, the wait.asyncmark now waits for asyncmark C to complete, which is longer than necessary. Ideally, the optimizer should have eliminated asyncmark A in the body of foo() itself.