This video entirely misses the point. The x86 emulator (technically a transpiler) is just one part of the equation. The important thing is that Apple emulates some of the x86 behavior in hardware: Apple Silicon can enforce x86's stronger memory ordering (TSO) natively. And that's not something a software emulator can do, at least not without significant performance problems. As far as I know, current Windows on ARM “cheats” by pretending memory ordering is not a problem. This works until it doesn't, and some software will crash or produce incorrect results. Microsoft offers different emulation levels, described in their support docs, each of which comes with increasing performance cost. As Qualcomm did not announce technology to safely emulate x86, I assume they don't have it. Which would mean the same clusterfuck of crashing and incompatible software Windows on ARM has to deal with now.
https://learn.microsoft.com/en-us/windows/arm/apps-on-arm-program-compat-troubleshooter
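To make the memory ordering issue concrete, here is the classic message-passing litmus test (my illustration, not from the video). Compiled for x86, the relaxed accesses below still behave as if ordered, because TSO guarantees it; run naively on ARM, the reader can legally observe the flag without the data:

```c++
// Message-passing litmus test (hypothetical example).
// Build with: g++ -O2 -pthread litmus.cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void writer() {
    data.store(1, std::memory_order_relaxed); // plain store
    flag.store(1, std::memory_order_relaxed); // plain store
}

void reader() {
    while (flag.load(std::memory_order_relaxed) == 0) {} // spin on plain load
    int d = data.load(std::memory_order_relaxed);        // plain load
    // On x86, d is always 1 here: TSO forbids reordering the two stores
    // or the two loads. On ARM, with these relaxed accesses, d == 0 is a
    // legal outcome; a correct emulator must insert barriers (or use ARM's
    // load-acquire/store-release instructions) to preserve x86 semantics,
    // and that is exactly the performance cost Apple avoids in hardware.
    std::printf("d = %d\n", d);
}

int main() {
    std::thread w(writer), r(reader);
    w.join();
    r.join();
    return 0;
}
```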
The most important thing about GPU cores is that they are parallel in nature. A lot of GPUs out there use 1024-bit arithmetic units that can process 32 numbers at the same time. That is, if you do something like a + b, both a and b are “vectors” consisting of 32 numbers each. Since a GPU is built to process large amounts of data simultaneously, for example shading all the pixels in a triangle, this is an optimal design that strikes a good balance between cost, performance, and power consumption.
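As a sketch of what “a + b is really 32 adds at once” looks like from the programmer's side, here is a trivial CUDA kernel (the kernel and its names are my example):

```c++
// CUDA sketch of one scalar add per thread, 32 threads per warp.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each thread performs one scalar add, but the hardware executes
        // the 32 threads of a warp in lockstep, so this line is in effect
        // a single 1024-bit operation on 32 floats.
        c[i] = a[i] + b[i];
    }
}
// Launched as e.g. add<<<(n + 255) / 256, 256>>>(a, b, c, n), so each
// block is a whole number of 32-wide warps.
```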
But the parallel design of GPU units also means they have problems when you want finer execution granularity, for example with common control logic like “if the condition is true do x, otherwise do y”, especially if both x and y are complex things. Remember that GPUs really want to do the same thing to 32 items at a time; if you don't have that many items to work with, their efficiency will suffer (see the divergence sketch below). So a lot of common problem solutions that are formulated with a “one value at a time” approach in mind won't translate directly to a GPU. Sorting is a good example. On a CPU it's easy to compare numbers one pair at a time and put them in sorted order. On a GPU you want to compare and reorder hundreds or even thousands of numbers simultaneously to get good performance, and it's much more difficult to design a program that will do that.
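Here is a minimal sketch of the branch problem in CUDA (the helpers are hypothetical stand-ins for the “complex things” x and y):

```c++
// CUDA sketch of branch divergence within a warp.
__device__ float do_x(float v) { return v * 2.0f; } // pretend this is expensive
__device__ float do_y(float v) { return v + 1.0f; } // pretend this is too

__global__ void branchy(const int* cond, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // If the 32 threads of a warp disagree on cond[i], the hardware
        // runs BOTH branches one after the other, masking off inactive
        // threads each time, so you pay for x plus y instead of one of them.
        out[i] = cond[i] ? do_x(out[i]) : do_y(out[i]);
    }
}
```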
If you are talking about math specifically, well, it depends on the GPU. Modern GPUs are very well optimised for many operations and have native instructions that approximate trigonometric functions (sin, cos), exponentials, and logarithms, as well as instructions for complex bit manipulation. They also natively support a range of data types such as 32- and 16-bit floating point. But 64-bit floating point (double) support is usually lacking: either low performance or missing entirely.
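For the curious, CUDA exposes those native instructions through intrinsics such as __sinf, __expf, __logf, and __popc; the kernel wrapping them here is my sketch:

```c++
// CUDA fast-math and bit intrinsics, mapping to native GPU instructions.
__global__ void fast_math(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float s = __sinf(x);                 // hardware sine approximation
        float e = __expf(x);                 // hardware exponential
        float l = __logf(fabsf(x) + 1.0f);   // hardware logarithm
        int   b = __popc(__float_as_int(x)); // native popcount bit manipulation
        out[i] = s + e + l + (float)b;
        // double d = 3.0 * x; // compiles, but on most consumer GPUs FP64
        // throughput is a small fraction of FP32, which is the "lacking" part.
    }
}
```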