.. _amdgpu-memmodel:

=====================
 AMDGPU Memory Model
=====================

.. contents::
   :local:

Introduction
============

The :ref:`LLVM memory model<memmodel>` provides broad guarantees that are
sufficient to implement inter-thread communication via memory. But in most
communication patterns, not all memory accesses performed by a thread need to be
exposed to other threads. Even when they do need to be exposed, not all threads
may need to observe these memory accesses. This document describes the *AMDGPU
memory model*, which allows the user to control how the side-effects of memory
accesses are propagated across threads. The programmer expresses this control
using **new intrinsics and metadata** as described below, and the implementation
can then choose a more efficient mechanism to complete the affected memory
accesses, such as the cache policy bits in an AMDGPU device.

The AMDGPU memory model allows executions that are not allowed by the LLVM
memory model. At the same time, a simple mapping can be used to implement these
new intrinsics and metadata using operations defined in the default LLVM memory
model. Thus, **there exists a safe-by-default implementation** that produces
executions that are valid in both models.

Terminology
===========

Memory Accesses
  Operations that read or write locations in memory are termed *memory
  accesses*. Typical examples are ``load``, ``store`` and atomic instructions,
  as well as many intrinsics.

Synchronization Operations
  Synchronization operations control how the side-effects of memory accesses are
  propagated in the system. Typical examples are atomic operations (including
  fences) with at least ``release`` or ``acquire`` ordering.

.. _amdgpu-scopes:

Scopes
======

A *scope* is an abstract description of sets of memory accesses and
synchronization operations in a multi-threaded execution environment. Each such
set is called an *instance* of that scope, or a *scope instance* for short.

- Each memory access or synchronization operation belongs to at most one
  instance of every scope defined by the target.
- When an operation ``X`` specifies a scope ``S``, it indicates the instance of
  ``S`` that contains ``X``. This scope instance is also termed *X's instance
  of scope S*, or just *X's scope instance* when ``S`` is implied by the
  context.
- When an operation does not specify a scope, it indicates the *system*
  scope defined below.

LLVM scopes
-----------

The LLVM Language Reference defines the following :ref:`scopes<syncscope>`:

*system scope* (empty string "")
  There exists a single instance of this scope that contains the memory accesses
  and synchronization operations performed by all threads.

"singlethread" scope
  Each thread corresponds to a "singlethread" scope instance that contains the
  memory accesses and synchronization operations performed by that thread.

AMDGPU scopes
-------------

The AMDGPU backend further refines the LLVM scopes with the following
target-defined scopes and constraints:

- *system scope* (same as LLVM)
- "agent" scope
- "cluster" scope
- "workgroup" scope
- "wavefront" scope
- "singlethread" scope (same as LLVM)

These are arranged from largest scope (*system scope*) to smallest scope
("singlethread").

- Every instance ``X`` of some scope ``S1`` other than "singlethread" scope is
  partitioned by the scope ``S2`` one level below it. Each subset defined by this
  partition is an instance of ``S2`` and is called a *subscope instance* of ``X``.
- It follows that if two scope instances ``X`` and ``Y`` intersect, then their
  intersection is the smaller of ``X`` and ``Y``.
- A scope ``S1`` is a *subscope* of a scope ``S2`` if every instance of ``S1``
  is a subscope instance of some instance of ``S2``.
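
The partition structure above can be illustrated with a small Python sketch. It
is only a model for building intuition, not part of the backend: the 5-tuple
used to identify a thread and the helper names ``instance_of``, ``contains``
and ``intersect`` are assumptions made for this example.

.. code-block:: python

   # Scopes from largest to smallest, as listed above.
   SCOPES = ["system", "agent", "cluster",
             "workgroup", "wavefront", "singlethread"]

   def instance_of(scope, thread):
       """Return the instance of `scope` that contains `thread`.

       `thread` is a hypothetical 5-tuple of indices:
       (agent, cluster, workgroup, wavefront, lane).
       """
       depth = SCOPES.index(scope)
       return (scope, thread[:depth])

   def contains(instance, thread):
       """True if `thread` belongs to the given scope instance."""
       _, prefix = instance
       return thread[:len(prefix)] == prefix

   def intersect(x, y):
       """Return the intersection of two scope instances, or None if disjoint."""
       small, large = (x, y) if len(x[1]) >= len(y[1]) else (y, x)
       nested = small[1][:len(large[1])] == large[1]
       return small if nested else None

   t0 = (0, 0, 0, 0, 0)
   t1 = (0, 0, 0, 1, 0)   # same workgroup as t0, different wavefront
   t2 = (0, 0, 1, 0, 0)   # same cluster as t0, different workgroup

   wg0 = instance_of("workgroup", t0)
   print(contains(wg0, t1))                         # True
   print(contains(wg0, t2))                         # False
   print(intersect(instance_of("agent", t2), wg0))  # ('workgroup', (0, 0, 0))

Because each scope instance is identified by a prefix of the thread
coordinates, two instances either nest or are disjoint, which is why a
non-empty intersection is always the smaller of the two.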

**Inclusive Scopes**: Two operations ``X`` and ``Y`` are said to have *inclusive
scopes* if the scope instance of each operation contains the other operation. In
that case, the *common scope instance* ``S'`` of ``X`` and ``Y`` is the
intersection of their scope instances. The scope corresponding to ``S'`` is also
termed the *common scope* of ``X`` and ``Y``.
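
The definition of inclusive scopes can be checked mechanically. The following
Python sketch is illustrative only; it assumes that a thread is identified by a
hypothetical 5-tuple of indices and that an operation is represented as a
``(thread, scope)`` pair. The names ``inclusive`` and ``common_scope`` are
invented for this example.

.. code-block:: python

   # Scopes from largest to smallest.
   SCOPES = ["system", "agent", "cluster",
             "workgroup", "wavefront", "singlethread"]

   def in_instance(op, thread):
       """True if the scope instance specified by `op` contains `thread`.

       `op` is a (thread, scope) pair; a thread is a hypothetical 5-tuple
       (agent, cluster, workgroup, wavefront, lane).
       """
       op_thread, scope = op
       depth = SCOPES.index(scope)
       return op_thread[:depth] == thread[:depth]

   def inclusive(x, y):
       """Each operation's scope instance must contain the other operation."""
       return in_instance(x, y[0]) and in_instance(y, x[0])

   def common_scope(x, y):
       """Return the common scope of x and y, or None if not inclusive."""
       if not inclusive(x, y):
           return None
       # The intersection of two nested instances is the narrower one.
       return max(x[1], y[1], key=SCOPES.index)

   x = ((0, 0, 0, 0, 0), "workgroup")
   y = ((0, 0, 0, 1, 0), "agent")      # other wavefront, same workgroup
   z = ((0, 0, 0, 1, 0), "wavefront")  # scope too narrow to reach x

   print(common_scope(x, y))  # workgroup
   print(common_scope(x, z))  # None

Note that ``z`` fails the check even though ``x``'s scope instance contains it:
the relation must hold in both directions.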

Availability and Visibility
===========================

The AMDGPU memory model is built on top of the :ref:`happens-before<memmodel>`
order defined by the LLVM memory model. But when one of the new intrinsics or
metadata is used, **happens-before by itself is not sufficient** to describe its
observable effects. Instead, the AMDGPU model uses *availability* and
*visibility* to describe how the side-effects of these operations propagate to
other threads.

Availability determines how *far* the side-effects of a write have been
forwarded in the system relative to that write. Visibility determines how
*close* the side-effects of that write have come to an observer operation
(typically a read).

The AMDGPU memory model *does not change the structure of happens-before*, but
changes the rules that determine how operations may observe the side-effects of
other operations that *happen-before* them.

Consider a write ``W`` that *happens-before* a read ``R`` to the same address:

- ``R`` can potentially observe the side-effects of ``W`` **only if W is
  visible** to ``R``.
- ``W`` can potentially be visible to ``R`` **only if W is first made
  available** to ``R``.

The instructions used in the default LLVM memory model automatically satisfy
these necessary conditions, and hence they can be explained using the rules from
either memory model. But the new intrinsics and metadata *opt out* of the LLVM
memory model, and can only be explained using the AMDGPU memory model.

.. _amdgpu-store-available:

store-available
---------------

.. code-block:: llvm

   @llvm.amdgcn.av.global.store.b128(ptr, value, scope)
   store atomic [syncscope("<target-scope>")]
   atomicrmw    [syncscope("<target-scope>")]
   cmpxchg      [syncscope("<target-scope>")]

The ``@llvm.amdgcn.av.global.store.b128`` intrinsic performs a non-atomic
*store-available* operation on ``ptr`` with scope ``scope``.

An atomic operation that results in a store operation is a *store-available*
operation with scope ``syncscope``.

.. _amdgpu-load-visible:

load-visible
------------

.. code-block:: llvm

   @llvm.amdgcn.av.global.load.b128(ptr, scope)
   load atomic  [syncscope("<target-scope>")]
   atomicrmw    [syncscope("<target-scope>")]
   cmpxchg      [syncscope("<target-scope>")]

The ``@llvm.amdgcn.av.global.load.b128`` intrinsic performs a non-atomic
*load-visible* operation on ``ptr`` with scope ``scope``.

An atomic operation that results in a read operation is a *load-visible*
operation with scope ``syncscope``.

.. note::

   Metadata cannot be used to model these operations on top of ordinary
   load/store operations, because the scope is necessary for correctness.
   Consider a hypothetical operation like this:

   .. code-block:: llvm

      store ptr, data, !mmra !{!"amdgcn-av", !"workgroup"}

   If the metadata is dropped or ignored, there is no guarantee that the store
   will become available at the intended scope. In implementation terms, the
   store may be completed at a nearer cache than the one required for that
   scope. A corresponding *load-visible* that does not access the same near
   cache will fail to observe this store.

MakeAvailable and MakeVisible
-----------------------------

.. code-block:: llvm

   store atomic [syncscope("<target-scope>")] <ordering> [, !mmra !{!"amdgcn-av", !"none"}]
   load atomic  [syncscope("<target-scope>")] <ordering> [, !mmra !{!"amdgcn-av", !"none"}]
   atomicrmw    [syncscope("<target-scope>")] <ordering> [, !mmra !{!"amdgcn-av", !"none"}]
   cmpxchg      [syncscope("<target-scope>")] <ordering> [, !mmra !{!"amdgcn-av", !"none"}]
   fence        [syncscope("<target-scope>")] <ordering> [, !mmra !{!"amdgcn-av", !"none"}]

A synchronization operation with at least ``release`` ordering is a
``MakeAvailable`` operation with scope ``syncscope``, if it is not marked as
``!{!"amdgcn-av", !"none"}``.

A synchronization operation with at least ``acquire`` ordering is a
``MakeVisible`` operation with scope ``syncscope``, if it is not marked as
``!{!"amdgcn-av", !"none"}``.

By default, these operations include ``MakeVisible`` and ``MakeAvailable``
operations. The presence of this metadata removes them, essentially creating
*non-av* ordering operations, i.e., ordering operations that do not establish
availability or visibility.

For an atomic operation which itself accesses memory (e.g., ``store atomic``
or ``load atomic``), the metadata does not affect the availability or the
visibility of the access performed by the operation itself. It only affects
the ordering of other memory accesses.

.. code-block:: llvm

   ; This includes the following operations:
   ; - The atomic store at "agent" scope,
   ; - A store-available operation at "agent" scope on `ptr`,
   ; - A `MakeAvailable` operation at "agent" scope that affects previous memory accesses.
   store atomic syncscope("agent") release ptr

   ; This includes the following operations:
   ; - The atomic store at "agent" scope,
   ; - A store-available operation at "agent" scope on `ptr`.
   ; Notably, it does not include a `MakeAvailable` operation on other memory accesses.
   store atomic syncscope("agent") release ptr, !mmra !{!"amdgcn-av", !"none"}
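
The rules in this section can be summarized as a small classification helper.
The Python sketch below is not an LLVM API; the function name ``av_roles`` and
its parameters are invented for illustration. It assumes, following the
correspondence table in :ref:`amdgcn-av-vulkan`, that ``MakeAvailable`` applies
only to fences and to atomics that result in a store, and ``MakeVisible`` only
to fences and to atomics that result in a read. ``cmpxchg`` is left out because
its success and failure orderings would need separate treatment.

.. code-block:: python

   HAS_RELEASE = {"release", "acq_rel", "seq_cst"}
   HAS_ACQUIRE = {"acquire", "acq_rel", "seq_cst"}

   WRITES = {"store", "atomicrmw"}   # atomic operations that result in a store
   READS = {"load", "atomicrmw"}     # atomic operations that result in a read

   def av_roles(op, ordering, non_av=False):
       """Return the availability/visibility roles of an atomic operation.

       op       -- "load", "store", "atomicrmw" or "fence"
       ordering -- an LLVM atomic ordering such as "monotonic" or "release"
       non_av   -- True if marked with !{!"amdgcn-av", !"none"}
       """
       roles = set()
       # The access performed by the operation itself is never affected
       # by the metadata.
       if op in WRITES:
           roles.add("store-available")
       if op in READS:
           roles.add("load-visible")
       # The metadata only removes the effect on *other* memory accesses.
       if not non_av:
           if ordering in HAS_RELEASE and (op in WRITES or op == "fence"):
               roles.add("MakeAvailable")
           if ordering in HAS_ACQUIRE and (op in READS or op == "fence"):
               roles.add("MakeVisible")
       return roles

   print(sorted(av_roles("store", "release")))
   # ['MakeAvailable', 'store-available']
   print(sorted(av_roles("store", "release", non_av=True)))
   # ['store-available']

The two calls mirror the two ``store atomic`` examples above: the metadata
drops ``MakeAvailable`` but leaves the *store-available* role of the store
itself intact.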

Ordering
========

.. note::

   **TODO:** These ordering operations affect all address spaces. We need to
   eventually make that a parameter similar to the storage class parameter on
   operations and orders in Vulkan.

Availability Operation
----------------------

An operation ``X`` is an *availability operation* on a write ``W`` if one of the
following holds:

- ``X`` is ``W`` itself, and ``W`` is a *store-available* operation, or,
- ``X`` is a ``MakeAvailable`` operation that follows ``W`` in program order,
  or,
- ``X`` is a ``MakeAvailable`` operation whose scope instance includes ``W``,
  and there is an availability operation ``Z`` on ``W`` such that:

  - ``Z`` happens-before ``X``, and,
  - ``Z``'s scope instance includes ``X``.

Then ``X`` makes ``W`` available in its own scope instance ``S`` and every
subscope instance of ``S`` that also includes ``W``.
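
The recursive definition above can be evaluated as a fixpoint computation. The
following Python sketch is a toy model under stated assumptions: threads are
hypothetical 5-tuples of indices, the program-order and happens-before
relations are supplied by the caller as sets of ``(earlier, later)`` name
pairs (with happens-before already transitively closed), and the names ``Op``,
``includes`` and ``availability_ops`` are invented for this example.

.. code-block:: python

   from dataclasses import dataclass

   SCOPES = ["system", "agent", "cluster",
             "workgroup", "wavefront", "singlethread"]

   @dataclass(frozen=True)
   class Op:
       name: str
       thread: tuple  # hypothetical (agent, cluster, workgroup, wavefront, lane)
       scope: str
       kind: str      # "store-available", "MakeAvailable" or "plain"

   def includes(op, other):
       """True if op's scope instance includes `other`."""
       depth = SCOPES.index(op.scope)
       return op.thread[:depth] == other.thread[:depth]

   def availability_ops(w, ops, program_order, happens_before):
       """Return the names of all availability operations on write `w`."""
       by_name = {o.name: o for o in list(ops) + [w]}
       found = {w.name} if w.kind == "store-available" else set()
       changed = True
       while changed:
           changed = False
           for x in ops:
               if x.kind != "MakeAvailable" or x.name in found:
                   continue
               follows_w = (w.name, x.name) in program_order
               chained = includes(x, w) and any(
                   (z, x.name) in happens_before and includes(by_name[z], x)
                   for z in found)
               if follows_w or chained:
                   found.add(x.name)
                   changed = True
       return found

   W = Op("W", (0, 0, 0, 0, 0), "singlethread", "plain")
   A = Op("A", (0, 0, 0, 0, 0), "workgroup", "MakeAvailable")  # after W
   B = Op("B", (0, 0, 0, 1, 0), "agent", "MakeAvailable")      # same workgroup
   C = Op("C", (0, 0, 1, 0, 0), "agent", "MakeAvailable")      # other workgroup

   po = {("W", "A")}
   hb = {("W", "A"), ("A", "B"), ("W", "B"), ("A", "C"), ("W", "C")}
   print(sorted(availability_ops(W, [A, B, C], po, hb)))  # ['A', 'B']

In this example ``C`` is not an availability operation on ``W`` even though
``A`` happens-before ``C``, because ``A``'s "workgroup" scope instance does
not include ``C``. Adding a happens-before edge from ``B`` to ``C`` would
bridge the gap, since ``B``'s "agent" scope instance does include ``C``.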

Visibility Operation
--------------------

An operation ``Y`` is a *visibility operation* on a write ``W`` if ``Y`` is a
*load-visible* operation to the same address, or a ``MakeVisible`` operation,
and one of the following holds:

- There exists an *availability* operation ``X`` on write ``W`` such that:

  - ``X`` happens-before ``Y``, and,
  - ``X`` and ``Y`` specify inclusive scopes.

  Then ``Y`` makes ``W`` visible in the common scope instance ``S`` of ``X`` and
  ``Y``, and every subscope instance of ``S`` that includes ``Y``.

- There exists a *visibility* operation ``X`` on write ``W`` such that:

  - ``X`` happens-before ``Y``, and,
  - ``X`` makes ``W`` visible in a scope instance ``S1`` that includes ``Y``, and,
  - ``X`` is included in the scope instance ``S2`` of ``Y``.

  Then ``Y`` makes ``W`` visible in the intersection ``S`` of ``S1`` and ``S2``,
  and every subscope instance of ``S`` that includes ``Y``.
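
The second rule, which narrows visibility towards the observer, can be sketched
in Python as follows. This is an illustrative model only: threads are
hypothetical 5-tuples of indices, a scope instance is a ``(scope, prefix)``
pair, and the function name ``narrow_visibility`` is invented for this example.
The happens-before relation is reduced to a boolean supplied by the caller.

.. code-block:: python

   SCOPES = ["system", "agent", "cluster",
             "workgroup", "wavefront", "singlethread"]

   def instance_of(scope, thread):
       """Instance of `scope` containing `thread`, as (scope, prefix)."""
       return (scope, thread[:SCOPES.index(scope)])

   def contains(instance, thread):
       _, prefix = instance
       return thread[:len(prefix)] == prefix

   def narrow_visibility(s1, x_thread, y_thread, y_scope, x_hb_y):
       """Apply the second visibility rule.

       s1       -- scope instance in which X already makes W visible
       x_thread -- thread executing X
       y_thread -- thread executing the candidate operation Y
       y_scope  -- scope specified by Y
       x_hb_y   -- True if X happens-before Y
       Returns the instance S in which Y makes W visible, or None.
       """
       s2 = instance_of(y_scope, y_thread)
       if not (x_hb_y and contains(s1, y_thread) and contains(s2, x_thread)):
           return None
       # Both instances contain Y, so they are nested; the intersection
       # is the one with the longer prefix.
       return s1 if len(s1[1]) >= len(s2[1]) else s2

   x_thread = (0, 0, 0, 0, 0)
   s1 = instance_of("agent", x_thread)  # X made W visible at "agent" scope

   print(narrow_visibility(s1, x_thread, (0, 0, 0, 1, 0), "workgroup", True))
   # ('workgroup', (0, 0, 0))
   print(narrow_visibility(s1, x_thread, (0, 0, 1, 0, 0), "workgroup", True))
   # None

The second call fails because ``Y`` runs in a different workgroup, so its
"workgroup" scope instance does not include ``X``. Using "agent" scope on
``Y`` instead would satisfy the rule.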

Location Order
--------------

A write ``W`` is *location-ordered* before an access ``Y`` to the same address
if ``W`` is program-ordered before ``Y``.

A write ``W`` is *location-ordered* before a write ``W1`` to the same address if
there exists an availability operation ``Z`` on ``W`` such that:

- ``Z`` happens-before ``W1``, and,
- ``W1`` is included in ``Z``'s scope instance.

A write ``W`` is *location-ordered* before a read ``R`` to the same address if
there exists a visibility operation ``Z`` on write ``W`` such that:

- ``Z`` is ``R`` itself, or,
- ``Z`` precedes ``R`` in program order.

The AMDGPU memory model overrides the rules in the
:ref:`LLVM memory model<memmodel>` that define the value of each byte read, as
follows.

Every (defined) read operation ``R`` reads a series of bytes written by
(defined) write operations. Each initialized global is assumed to have an
initial *system scoped* atomic write operation that is *location-ordered* before
any other read or write to that same location.

For each byte of a read ``R``, ``R`` may see any write to the same byte, except:

- If a write ``W1`` is *location-ordered* before a write ``W2``, and ``W2`` is
  *location-ordered* before a read ``R``, then ``R`` may not see ``W1``.
- If a read ``R`` happens-before a write ``W3``, then ``R`` may not see ``W3``.

The value returned by ``R`` is then defined as follows:

- If no write is *location-ordered* before a read ``R``, then ``R`` returns
  ``undef``.
- Otherwise, if the set consisting of ``R`` and all writes that ``R`` may see
  contains only atomic operations with inclusive scopes, then ``R`` returns the
  value written by one of those writes.
- Otherwise, if ``R`` may see some write that is not *location-ordered* before
  ``R``, then ``R`` returns ``undef``.
- Otherwise, if ``R`` may see exactly one write ``W``, then ``R`` returns the
  value written by ``W``.
- Otherwise, ``R`` returns ``undef``.
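
The case analysis above can be written as a short decision procedure. The
Python sketch below is illustrative only and makes several simplifying
assumptions: it handles a single byte, the caller supplies the list of writes
that ``R`` may see (with the two exclusions already applied), each write
carries a precomputed flag saying whether it is *location-ordered* before
``R``, and the inclusive-scopes check is reduced to a boolean. The names
``Write``, ``UNDEF`` and ``read_result`` are invented for this example.

.. code-block:: python

   from dataclasses import dataclass

   UNDEF = "undef"

   @dataclass(frozen=True)
   class Write:
       value: int
       atomic: bool
       location_ordered: bool  # is this write location-ordered before R?

   def read_result(may_see, r_atomic, inclusive_scopes):
       """Return the set of values R may return, or UNDEF.

       may_see          -- writes that R may see, after the exclusions
       r_atomic         -- True if R is an atomic operation
       inclusive_scopes -- True if R and all writes in may_see have
                           inclusive scopes
       """
       if not any(w.location_ordered for w in may_see):
           return UNDEF
       if r_atomic and inclusive_scopes and all(w.atomic for w in may_see):
           return {w.value for w in may_see}
       if any(not w.location_ordered for w in may_see):
           return UNDEF
       if len(may_see) == 1:
           return {may_see[0].value}
       return UNDEF

   ordered = Write(value=1, atomic=True, location_ordered=True)
   racing = Write(value=2, atomic=True, location_ordered=False)

   print(read_result([ordered, racing], True, True) == {1, 2})  # True
   print(read_result([ordered, racing], True, False))           # undef

The two calls differ only in whether the scopes are inclusive: racing atomics
yield a well-defined value only when their scopes include each other.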

Properties
==========

.. tip::

   This section is informational.

The following properties follow from the definitions above:

1. **Happens-before is necessary for location-order.** A write ``W`` is
   *location-ordered* before a read ``R`` only if ``W`` happens-before ``R``.
   This follows from the definition of availability and visibility operations,
   which always require a happens-before link with the preceding operation in
   the chain.

2. **A write cannot be made available in a scope that does not contain it.** The
   definition of an availability operation ``X`` requires that ``X``'s scope
   instance includes ``W`` as a precondition. Since every scope instance that
   includes ``X`` also includes ``W``, availability cannot reach a scope
   instance that excludes ``W``. In other words, availability can only "expand
   outwards" into progressively larger scopes.

3. **Visibility is bounded by availability.** When a write is available in a
   scope instance, it can be made visible in that scope instance by a visibility
   operation with the corresponding scope. Subsequent ``MakeVisible`` operations
   make that write visible into narrower scope instances towards the observer.

4. **A write can be made visible in a scope instance that does not contain it.**
   The definition of a *visibility operation* anchors scope instances to the
   observer (``Y``), not to the original write. The only precondition is that the
   write must already be visible or available in the scope instance of the
   visibility operation.

5. **Availability and visibility chains.** For a write ``W`` to be visible to a
   read ``R`` anywhere in the system, it is sufficient to have a chain of
   happens-before edges that includes availability and visibility operations
   with inclusive scopes. It is not necessary that ``W`` and ``R`` themselves have
   inclusive scopes. Each link in the availability and visibility definitions
   only checks the immediate predecessor, so intermediate operations can bridge
   scope gaps that the endpoints cannot satisfy directly. Such a chain passes
   through at least one availability operation and at least one visibility
   operation with inclusive scopes, such that their common scope includes both
   ``W`` and ``R``.

.. _amdgcn-av-vulkan:

The Vulkan Memory Model
=======================

The AMDGPU memory model draws heavily on the Vulkan memory model. In
particular, the following LLVM operations correspond to the listed SPIR-V
instructions and availability/visibility semantics.

.. csv-table::
   :header: "LLVM", "SPIR-V", "Available/Visible Semantics"
   :widths: 20, 20, 60

   "``load``", "``OpLoad NonPrivatePointer``", "\-"
   "``load-visible``", "``OpLoad NonPrivatePointer``", "``MakePointerVisible``"
   "``store``", "``OpStore NonPrivatePointer``", "\-"
   "``store-available``", "``OpStore NonPrivatePointer``", "``MakePointerAvailable``"
   "``load atomic``", "``OpAtomicLoad``", "``MakePointerVisible``. Also ``MakeVisible`` when order is at least ``acquire``."
   "``load atomic !{!""amdgcn-av"", !""none""}``", "``OpAtomicLoad``", "``MakePointerVisible``"
   "``store atomic``", "``OpAtomicStore``", "``MakePointerAvailable``. Also ``MakeAvailable`` when order is at least ``release``."
   "``store atomic !{!""amdgcn-av"", !""none""}``", "``OpAtomicStore``", "``MakePointerAvailable``"
   "``fence``", "``OpMemoryBarrier``", "``MakeAvailable`` when order is at least ``release``, and ``MakeVisible`` when order is at least ``acquire``."
   "``fence !{!""amdgcn-av"", !""none""}``", "``OpMemoryBarrier``", "\-"

.. note::

   The above table is representative only, and does not aim to be exhaustive. In
   particular, it does not list composite atomic operations like ``atomicrmw`` and
   ``cmpxchg``. The ordering and semantics of these operations can be determined
   by combining suitable rules such as:

   - "``MakeAvailable`` if the order is at least ``release``, and the operation
     results in a store",
   - "Only if it is not marked as ``!{!"amdgcn-av", !"none"}``", etc.

The AMDGPU memory model is a special case of the Vulkan memory model:

a. LLVM fence/atomic ordering operations have ``MakeAvailable`` /
   ``MakeVisible`` semantics by default, thus satisfying the availability and
   visibility chains required in Vulkan. Hence the LLVM memory model is a
   "strong" subset of the Vulkan memory model.
b. The AMDGPU memory model described here makes it possible to opt out of the
   default ``MakeAvailable`` and ``MakeVisible`` semantics, and instead specify
   them at selected places, including the new *load-visible* and
   *store-available* operations. This expands the subset of the Vulkan memory
   model that can now be expressed in LLVM IR.