tinygrad is a neural network framework built around one key idea: express the core of neural networks as simply as possible. Unlike larger frameworks such as PyTorch or TensorFlow, tinygrad doesn't rely on a large set of hand-written kernels. Instead, it generates kernels on the fly. The operation set is intentionally minimal, so higher-level ops like matrix multiplication or convolution are constructed from smaller primitives.
This simplicity enables aggressive kernel fusion, since the composed ops can be reorganized and optimized holistically. Rather than hand-tuning kernels for performance, tinygrad explores many possible kernel variants - via beam search - to discover fast implementations automatically.
WebGPU is a new web API designed to bring high-performance, GPU-accelerated applications to the browser. Unlike its predecessor WebGL, WebGPU includes a dedicated compute pipeline, making it practical to run compute kernels, an essential capability for efficiently executing neural networks.
While WebGL could approximate compute workloads through various workarounds (such as rendering into framebuffers), it offered no direct control over key compute concepts like workgroup sizes, memory layout, or synchronization. WebGPU exposes these capabilities natively, bringing the browser much closer to modern GPU APIs.
There are multiple WebGPU implementations in use today; for example, Google Chrome relies on its own engine, Dawn.
tinygrad’s simplicity makes runtime and backend integrations straightforward. Although tinygrad is written entirely in Python, it depends on libraries implemented in languages like C and C++. The key difference is how it uses these libraries: instead of relying on third-party wrapper packages, tinygrad autogenerates its Python bindings in-tree. Previously this was done using clang2py, but thanks to work by chrism, the binding generator is now fully in-tree. From a C header file, tinygrad automatically produces the corresponding enums, structs, function declarations, and so on.
For the WebGPU runtime, we use these autogenerated bindings to Dawn, which allowed us to remove tinygrad’s last third-party dependency: wgpu-py.
There is a short backstory behind the WebGPU runtime. It was originally implemented, then removed from core tinygrad, and later reintroduced. The initial version relied on numerous hacks (e.g., logic based on Device.DEFAULT == "WEBGPU"), which became hard to maintain given tinygrad’s rapid development pace and frequent refactors. To reduce maintenance burden and keep core changes smooth, the WebGPU backend - along with the LLVM backend - was removed.
As tinygrad matured, however, bringing it back made sense, since the WebGPU backend had real users. The difference this time is that the previous hacks were no longer acceptable; the backend had to integrate cleanly with tinygrad’s architecture.
Bringing back WebGPU introduced quite a few challenges, because it differs from the other backends in several important ways:
- It doesn’t support several data types that other backends do (for example, byte/ubyte, short/ushort).
- It imposes much stricter constraints on local and global sizes.
- f16 support is available only through an extension.
- WGSL is significantly different from C-style shading languages.
In this blog post we will look at the first point.
WebGPU Shading Language (WGSL) does not support integer types smaller than 32 bits, but tinygrad does, so we need a way to bridge that gap. Looking closer, sub-32-bit support breaks down into three sub-challenges:
- loading from memory
- performing arithmetic on the loaded values
- storing the result back to memory
The key idea is bitpacking. Let’s walk through the three sub-challenges in turn.
Even though WGSL can’t load a single byte directly, it can load a full 32-bit word. From there, we extract the desired byte by shifting and masking. This approach lets tinygrad emulate smaller integer types while staying fully compliant with WGSL’s type system.
```python
# Getting the 1st byte
my_dword = 0xAABBCCDD
my_byte_1 = my_dword & 0xFF           # 0xDD
# Getting the 2nd byte
my_byte_2 = (my_dword >> 8) & 0xFF    # 0xCC
# Getting the 3rd byte
my_byte_3 = (my_dword >> 16) & 0xFF   # 0xBB
# Getting the 4th byte
my_byte_4 = (my_dword >> 24) & 0xFF   # 0xAA
```
To load 16-bit data types like short or ushort, the idea is the same as with bytes: we load a full 32-bit word and then extract the 16-bit portion we need. The only differences are the mask (0xFFFF) and the shift amounts.
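To make that concrete, here is a small sketch of the 16-bit case, mirroring the byte example above (variable names are illustrative, not tinygrad internals):

```python
# A 32-bit word holds two 16-bit halves
my_dword = 0xAABBCCDD
# Getting the 1st (lower) short
lower_short = my_dword & 0xFFFF          # 0xCCDD
# Getting the 2nd (upper) short
upper_short = (my_dword >> 16) & 0xFFFF  # 0xAABB
```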
The tricky part is translating the index to decide which byte or short should be selected. tinygrad’s indexing logic assumes an array of bytes, while WebGPU sees memory in 32-bit chunks. That means when tinygrad asks for the second element of a 16-byte array, WebGPU interprets it as the second 32-bit integer of a 4-integer array.
To resolve this mismatch, we treat the tinygrad-generated index as if it were two-dimensional. The first component determines which 32-bit word to load, and the second component determines which byte or half word within that word to extract.
```python
# Select the 32-bit word by dividing the byte index by 4
my_dword = memory[index // 4]
# Select the specific byte by taking index % 4, converting that to a bit shift,
# and masking out the lower 8 bits
my_byte = (my_dword >> ((index % 4) * 8)) & 0xFF
```
The nice thing about tinygrad is that you can express all of this at the UOp layer - tinygrad’s intermediate representation - rather than directly inside the WGSL renderer. This means tinygrad can apply its own optimizations and generate code that’s often better than what you would hand-write at the WGSL level.
Now let’s look at how to perform arithmetic on these loaded values.
Using the byte-extraction logic described earlier always produces a 32-bit unsigned value. That’s not ideal, because both signed and unsigned types (like byte and ubyte) would otherwise become identical once loaded into 32 bits. This is where sign extension and zero extension come in.
Sign extension preserves the sign of a smaller-than-32-bit signed value when promoting it to 32 bits.

An illustration of 16-bit-to-32-bit sign extension; "S" denotes the sign bit. Source: https://www.scs.stanford.edu/05au-cs240c/lab/i386/s03_01.htm
We treat the most significant bit of the original value as the sign bit and replicate it through the higher bits up to bit 31. In two’s-complement, this ensures that a signed byte like 0xFF (that is, -1) remains -1 after being expanded to 32 bits. Without sign extension, that same byte would incorrectly become 255.
For unsigned types, we instead perform zero extension, filling all the upper bits with zeros. That same 0xFF example would correctly stay 255 when interpreted as an unsigned byte.
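A minimal Python sketch of both extensions (the function names are ours, chosen for illustration, not tinygrad’s):

```python
def sign_extend_byte(value: int) -> int:
    # value is an 8-bit pattern held in a wider int (0..255)
    if value & 0x80:           # sign bit set
        return value - 0x100   # equivalent to replicating the sign bit upward
    return value

def zero_extend_byte(value: int) -> int:
    # For unsigned types, just keep the low 8 bits; upper bits stay zero
    return value & 0xFF
```

With these, `sign_extend_byte(0xFF)` yields -1 while `zero_extend_byte(0xFF)` yields 255, matching the behavior described above.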
By using sign extension for signed types and zero extension for unsigned types, we ensure that sub-32-bit arithmetic behaves correctly inside WGSL, even though WGSL lacks native integer types smaller than 32 bits.
We also want to store the result of the sub-32-bit arithmetic operation back to memory, which involves indexing operations nearly identical to those used for loading. The main new concern is race conditions when multiple threads attempt to write to the same underlying memory location.
To understand why this happens, consider again how WebGPU represents a byte array. Suppose we have an array of 16 bytes. WGSL doesn’t support an 8-bit integer type in storage buffers, so this array is actually stored as an array of four 32-bit integers. When a GPU kernel that conceptually operates on bytes runs, each thread may compute a byte index between 0 and 15. But our physical buffer contains only four 32-bit words.
As shown earlier in the loading section, the first step in mapping a byte index to a 32-bit word index is to divide by 4. That means byte indices 0–3 all refer to word 0; indices 4–7 refer to word 1; and so on. This creates a race condition: four different threads may simultaneously attempt to update different bytes inside the same 32-bit word. Without protection, they would overwrite one another’s updates.
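To make the lost update concrete, here is a hypothetical interleaving sketched in sequential Python, where two "threads" each patch a different byte of word 0 using a plain read-modify-write (no atomics):

```python
word = 0x00000000

# Thread A reads the word, intending to set byte 0 to 0xAA
a_local = word
# Thread B reads the same (now stale) word, intending to set byte 1 to 0xBB
b_local = word

# Thread A writes back its patched copy
word = (a_local & ~0xFF) | 0xAA
# Thread B writes back its patched copy, clobbering A's update
word = (b_local & ~(0xFF << 8)) | (0xBB << 8)

# word is now 0x0000BB00: byte 0's 0xAA has been lost
```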
To avoid this, we must use atomic operations. WGSL provides atomic operations on atomic<u32>, such as atomicAnd and atomicOr (or atomicAdd if addition semantics are needed). The basic idea is:
1. Compute the target word index (index / 4).
2. Compute the byte position inside the word (index % 4).
3. Construct a mask that clears only the target byte. For example, to clear byte b, we create a mask like 0xFF << (b * 8) and invert it.
4. Clear the target byte using atomicAnd. This removes the old byte value while leaving the other three bytes untouched.
5. Set the new byte value using atomicOr (or atomicAdd if appropriate), shifting the new 8-bit value into the correct position.
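The steps above can be sketched in Python. Note that this is a sequential illustration of the masking logic only; the AND and OR below stand in for WGSL’s atomicAnd and atomicOr, and the function name is ours:

```python
def store_byte(memory: list[int], index: int, new_byte: int) -> None:
    word_index = index // 4                   # step 1: which 32-bit word
    shift = (index % 4) * 8                   # step 2: byte position as a bit shift
    clear_mask = ~(0xFF << shift) & 0xFFFFFFFF  # step 3: mask clearing only that byte
    memory[word_index] &= clear_mask             # step 4: atomicAnd in WGSL
    memory[word_index] |= (new_byte & 0xFF) << shift  # step 5: atomicOr in WGSL

# Patch byte 1 of word 0, leaving the other three bytes intact
memory = [0xAABBCCDD, 0x00000000]
store_byte(memory, 1, 0x11)  # memory[0] is now 0xAABB11DD
```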
Because both operations are atomic, the updates from multiple threads cannot interleave in a way that corrupts the 32-bit word; each thread safely updates only its own byte.
This atomic read-modify-write pattern is the only reliable way to emulate sub-32-bit stores in WGSL when multiple threads may write to the same underlying 32-bit element.
We hope this dive into tinygrad’s WebGPU internals gave you a clearer picture of what’s going on beneath the surface. We’ll be exploring more of these corners in upcoming posts.
And now, when you’re working with byte- and short-typed tinygrad tensors, you’ll know exactly what’s happening under the hood.