I am trying to understand how GPUs work. I saw that the Apple M2 Max, for example, has 30 GPU cores. I was surprised, because I had heard that GPUs have hundreds of cores. So I did a bit of research and got my answer:
on Apple silicon, each core is made up of 16 execution units, each of which has 8 distinct compute units (ALUs)
That makes more sense now. 30 cores is actually 30 × 16 × 8 = 3840 ALUs.
But why separate them into “cores” like this, then? I heard that GPUs, conversely to CPUs, can have all their units working on the SAME task.
Why split into 16 and then into 8, rather than one full grid? I don’t get it.
Does it have practical implications, or is it just marketing?
GPU architecture details can’t be compared directly across vendors; each one does things its own way.
All GPUs are organized in multi-level hierarchies. Nvidia has GPCs that contain TPCs, which contain SMs (probably the closest thing to a “core”), which are split into partitions containing the actual ALUs. There’s quite a lot of other hardware spread out over these levels, like caches and fixed-function units for blending or texture filtering.
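As a rough picture, the kind of nesting described above can be sketched as a data structure. The counts here are made up for illustration; they don't correspond to any real Nvidia part:

```python
# Made-up hierarchy in the spirit of the Nvidia naming above:
# GPC -> TPC -> SM -> partition -> ALUs, with shared hardware at each level.
gpcs, tpcs_per_gpc, sms_per_tpc = 6, 2, 2
partitions_per_sm, alus_per_partition = 4, 32

sm = {"partitions": partitions_per_sm,
      "alus_per_partition": alus_per_partition,
      "also_contains": ["L1 cache", "texture units", "schedulers"]}
gpu = {"GPCs": [{"TPCs": [{"SMs": [dict(sm) for _ in range(sms_per_tpc)]}
                          for _ in range(tpcs_per_gpc)]}
                for _ in range(gpcs)],
       "chip_level": ["L2 cache", "memory controllers"]}

# The ALU count is just the product down the hierarchy:
total_alus = (gpcs * tpcs_per_gpc * sms_per_tpc
              * partitions_per_sm * alus_per_partition)
print(total_alus)  # 6 * 2 * 2 * 4 * 32 = 3072
```

The point is that the "cores" number a vendor quotes depends entirely on which level of this tree they choose to call a core.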
You can find more details in documents like these:
https://cdrdv2-public.intel.com/758302/introduction-to-the-xe-hpg-architecture-white-paper.pdf
Simplified version: an Apple GPU core contains four execution units, each of which is 32-wide (it performs an operation on 32 data values in parallel). An instruction in a shader program is executed on one of these units. In other words, there are 128 scalar arithmetic units in an Apple GPU core, capable of executing up to four different 32-wide instructions per cycle.
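As a quick sanity check on that arithmetic, using only the numbers from the simplified description above:

```python
# Simplified Apple GPU core model: 4 execution units, each 32 lanes wide.
units_per_core = 4
lanes_per_unit = 32
lanes_per_core = units_per_core * lanes_per_unit
print(lanes_per_core)  # 128 scalar ALUs per core

# Scaled up to a 30-core part like the M2 Max:
cores = 30
total_lanes = cores * lanes_per_core
print(total_lanes)  # 3840
```

That 3840 matches the number the original post arrived at, just grouped differently (4 × 32 per core instead of 16 × 8).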
More complicated, but correct version: an Apple GPU core contains multiple execution units of different types. There are also four instruction schedulers, each of which selects a shader instruction and sends it to an execution unit. Each scheduler controls one 32-wide FP32 unit, one 32-wide FP16 unit, and (presumably; I'm not quite sure) one 16-wide INT32 unit, so in total you have 4x of those units in a core. On M1 and M2, a scheduler can dispatch one instruction to a suitable execution unit per cycle, which means the other units it controls are idling (it can do either an FP32, an FP16, or half of an INT32 operation per cycle). On M3, the schedulers are capable of dual issue and can dispatch two instructions per cycle (e.g. one FP32 and one FP16 or INT), assuming appropriate instructions can be found in the instruction stream. This is why the M3 can be much faster on complex shaders even though the nominal spec of the GPU didn’t change much.
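A toy model can show why dual issue only pays off when the instruction mix cooperates. This is not how a real scheduler works (no latencies, no dependencies, and the unit names are just labels), only an illustration of the single- vs. dual-issue difference described above:

```python
def cycles_needed(instructions, issue_width):
    """Toy model: count cycles to drain one scheduler's instruction stream,
    assuming up to `issue_width` instructions can issue per cycle, but only
    if they target *different* unit types (e.g. FP32 + FP16, not FP32 + FP32)."""
    cycles = 0
    i = 0
    while i < len(instructions):
        issued_types = {instructions[i]}
        i += 1
        # Try to co-issue further instructions that use a different unit type.
        while (len(issued_types) < issue_width and i < len(instructions)
               and instructions[i] not in issued_types):
            issued_types.add(instructions[i])
            i += 1
        cycles += 1
    return cycles

mixed = ["FP32", "FP16", "FP32", "FP16", "FP32", "FP16"]
print(cycles_needed(mixed, 1))          # 6: M1/M2-style single issue
print(cycles_needed(mixed, 2))          # 3: M3-style dual issue on a mixed stream
print(cycles_needed(["FP32"] * 6, 2))   # 6: dual issue can't help a pure-FP32 stream
```

The last case is the caveat in the post: dual issue only helps when the compiler and workload actually provide pairable instructions.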
Each GPU core executes a large number of shader programs in parallel and switches between shaders every cycle, in order to make as much progress as possible. If it can’t find an instruction to execute (for example, because all shaders are currently waiting for a texture load), the units have to go idle and your performance potential decreases. This is why it’s important to give the GPU as much work as possible: it helps fill those gaps (the hardware can run some shaders while others are waiting).
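The gap-filling effect is easy to reproduce in a toy simulation. The numbers here (4 compute cycles, then a 20-cycle memory wait) are made up; the point is just that one shader alone leaves the unit mostly idle, while enough shaders in flight keep it busy:

```python
def busy_fraction(num_shaders, compute=4, mem_wait=20, horizon=10_000):
    """Toy latency-hiding model: each shader alternates between `compute`
    cycles of ALU work and `mem_wait` cycles stalled on a memory/texture
    load. One instruction issues per cycle if any shader is ready.
    Returns the fraction of cycles the unit found work to do."""
    shaders = [[compute, 0] for _ in range(num_shaders)]  # [work left, wait left]
    busy = 0
    for _ in range(horizon):
        # Outstanding memory waits make progress every cycle.
        for s in shaders:
            if s[0] == 0 and s[1] > 0:
                s[1] -= 1
                if s[1] == 0:
                    s[0] = compute  # load returned; more ALU work is ready
        # Issue one instruction from the first ready shader, if any.
        for s in shaders:
            if s[0] > 0:
                s[0] -= 1
                if s[0] == 0:
                    s[1] = mem_wait  # now stalled on the next load
                busy += 1
                break
    return busy / horizon

print(busy_fraction(1))   # low: a lone shader can't hide its own stalls
print(busy_fraction(8))   # near 1.0: waits overlap with other shaders' work
```

With these made-up latencies, it takes roughly `mem_wait / compute` extra shaders to cover one shader's stall, which is the same occupancy reasoning the paragraph above describes.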
How a GPU is broken down usually determines how other things are shared between those ALUs. I’ll use Arc Alchemist for this because I have the spec sheet for it.
The A770 is broken down into 32 Xe Cores. This means it has 4096 shading units, 256 TMUs, 128 ROPs, 512 Execution Units, 512 Tensor Cores, 32 RT Cores, and 16MB of L2 cache.
You can also think of this as each Xe Core being made of 128 shading units, 8 TMUs, 4 ROPs, 16 Execution Units, 16 Tensor Cores, 1 RT Core, and 512KB of L2 cache.
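If you want to check that the two breakdowns are consistent, dividing the full-chip numbers by 32 reproduces the per-core figures (numbers taken from this post, not an official datasheet):

```python
# A770 chip totals from above; L2 is expressed in KB so the division stays integral.
a770_totals = {"shading units": 4096, "TMUs": 256, "ROPs": 128,
               "Execution Units": 512, "Tensor Cores": 512,
               "RT Cores": 32, "L2 cache (KB)": 16 * 1024}
xe_cores = 32
per_xe_core = {name: count // xe_cores for name, count in a770_totals.items()}
for name, count in per_xe_core.items():
    print(f"{name}: {count}")
# shading units: 128, TMUs: 8, ROPs: 4, Execution Units: 16,
# Tensor Cores: 16, RT Cores: 1, L2 cache (KB): 512
```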
That Xe Core is the smallest unit you could break an Alchemist GPU into and still have every part of the larger whole. You can’t literally cut the GPU apart like that, but if I had to draw a diagram of one, that is what would be inside each Xe Core block.
I’m not going to get into the technical side of how GPU design works, partly because that’s an entire doctoral thesis to write out, and also because I work on the CPU side and those guys are wizards to me.