@GomaEspumaRegional

GomaEspumaRegional@alien.top · 11 months ago

Because there is no need from an address space or compute standpoint.

to understand how large 128bit memory space really is; you’d need a memory size larger than all the number of atoms in the solar system

In the rare cases where you need to deal with a 128bit integer or floating, you can do it in software with not that much overhead by concatenating registers/ops. There hasn’t been enough pressure in terms of use cases that need 128bit int/fp precision for manufacturers to invest the resources in die area to add direct HW support for it.

FWIW there have been 64bit computers since the 60s/70s.

GomaEspumaRegional@alien.top · 1 year ago

The main issue is carrier relationships. Not patents.

The modem is not the hard part, not easy either. But having enough engagement with carriers worldwide to support all the use cases in terms of infrastructure combinations. The validation process for that is extremely expensive. That is one of the value propositions Qualcomm offers to the customers of their chipsets; they basically take care of all that headache for the phone vendor/integrator if they just go with their chipset (android) or modem (apple).

This is why the most successful modem companies (Qualcomm Huawei) either also offer a lot of infrastructure products themselves or have very strong connections with infrastructure manufacturers like Ericsson and Nokia (Samsung, Mediatek).

From the HW perspective, the issue is not the modem itself, but all the supporting chipset specially the antena/RF elements. Which in 5G involve a lot of beam “herding” whose power is hard to scale and are not that easy to manufacture. Also there are lots of thermal issues with those antenna elements.

Apple does not have, currently, the corporate culture for that type of engagement. Because they got a very good technical team from intel, but not the other side of the equation in terms of telco carrier infra engagement.

GomaEspumaRegional@alien.top · 1 year ago

analog compute in memory or silicon photonics

What does that word salad have to do with AI? ;-)

GomaEspumaRegional@alien.top · 1 year ago

Most HW startups fail because they never get the SW story right.

At the end of the day, hardware is used to run software. So unless you’re having access to a large software library from the get go (by accelerating a known entity or architecture), or you truly have a fantastic value proposition in terms of being orders of magnitude faster than the established competition and with a solid roadmap in terms of HW and SW. The best most HW startups can hope for is an exit plan where their IP is bought by a bigger player.

HW people some time miss the boat that if something is 2x as fast, but it takes 2x as long to develop for, you’re not giving that much in terms of leadership window for your customers. So they’ll remain with the known entity, even if it’s less efficient or performant, on paper.

GomaEspumaRegional@alien.top · 1 year ago

It depends on what you meant by 64bit computing, which is not the same as x86 becoming a 64bit architecture.

FWIW, 64 bit computing had been a thing for a very long time in the supercomputer/mainframe space since the 70s. And high end microprocessors had supported 64bit since the early 90s.

So by the time AMD introduced x86_64 there had been about a quarter century of 64bit computing ;-)

It was a big deal for x86 vendors, though. As that is when x86 took over most of the datacenter and workstation markets.

GomaEspumaRegional@alien.top · 1 year ago

Frame Buffers have been a thing since the 60s at least ;-)

Basically it is a piece of memory that contains the color information for a set of pixels. The simplest would be a black and white frame buffer, there the color of each pixel is defined by it being 1 (black) or 0 (white).

Let’s assume that you want to deal with a monitor that it is 1024x1024 pixels in resolution, so you need 1024x1024 (~1Mbit) bits of information to store the color of each pixel.

So in the simplest case, you had a CPU writing the 1Mbit BW image that it just generated (by whatever means) into the region of memory that the video hardware is aware of. Then the display generator would go ahead and read each of the pixels and generate the color based on the bit information it reads.

Rinse and repeat this process around 30 times per second and you can display video.

If you want to display color, you increase the number of bits per pixel to whatever color depth you want. And the process is basically the same, except the display generator is a bit more complex as to generate the proper shade of color by mixing Red/Gree/Blue/etc values.

That is the most basic frame buffer, unaccelerated meaning that the CPU does most of the work in generating the image data to be displayed.

So assuming you had a CPU that was incredibly fast, you could technically do just about the same that a modern GPU can do. It just would need to be thousands of times faster than the fastest modern CPU to match a modern GPU. ;-)

Hope this makes sense.

GomaEspumaRegional@alien.top · 1 year ago

It depends what type of GPU “core” you are talking about.

What NVIDIA refers to as CUDA/Tensor/RT cores are basically just glorified ALUs with their own itsy tiny control. But for the most part they are just an ALU.

For the most part CPUs tend to be more complete scalar processors, which they include the full control datapath as well as multiple Functional Units (FUs) not just an floating point unit.

The distictions are moot nowadays though; a modern GPU includes their own dedicated scalar core (usually in terms of a tiny ARM embedded core) for doing all the “housekeeping” stuff needed for them to interface with the outside world. And modern CPUs contain their own data-parallel functional units that can do some of the compute that GPUs can.

In the end the main difference is in terms of scale/width of data parallelism within a CPU (low) vs a GPU (high)

GomaEspumaRegional@alien.top · 1 year ago

GPUs are not necessarily Turing complete, BTW.

GomaEspumaRegional@alien.top · 1 year ago

AMD has had traidionally very competitive FLOPs with their shaders. The issue is that their software stack, for lack of a better word is; shit.

For specific customers, like national labs or research institutions, they can afford to pay a bunch of poor bastards to develop some of the compute kernels using the shitty tools. Because at the end of the day, most of their expenses are in terms of electricity and hardware, with salaries not being the critical cost for some of these projects. I.e. grad students are cheap!

However, when it comes to industry, things are a bit difference. First off, nobody is going to take a risk w a platform with little momentum behind it. Also they need to have access to talent pool that can develop and get the applications up and running as soon as possible. Under those scenarios, salaries (i.e. the people developing the tools) tend to be almost as important consideration as the HW. So you go with the vendor that gives you the biggest bang for your buck in terms of performance and time to market. And that is where CUDA wins hands down.

At this point AMD is just too behind, at least to get significant traction in industry.

GomaEspumaRegional@alien.top · 1 year ago

That’s just a pipedream.

Their current balance sheets seem to indicate otherwise…

GomaEspumaRegional@alien.top · 1 year ago

Standard float16 uses 1bit sign + 5bit exponent + 10bit fraction.

bfloat16 uses 1bit sign + 8bit exponent + 7bit fraction.

bfloat16 basically gives the same exponent precision as a standard float32. But most neural networks don’t require a huge fraction range. So bfloat16 gives you the possibility of executing 2x 8bit NP FLOPs vs using a float32 to do the same 1x8bit NP FLOP.

Having the ALU support this format allows for the scheduler to pack 4xbfloat16 that can be executed in parallel in a standard 64bit ALU. So basically you double or quadruple the 8bit NP FLOPs that you would get from using traditional float16/32 representations.

GomaEspumaRegional@alien.top · 1 year ago

And unlike things like automobiles they are long are in the chip business

EVs have been a huge priority for China, why would you think they are not long on their automotive sector?

GomaEspumaRegional@alien.top · 1 year ago

LOL. Are you seriously asking people to do all your work for you, for free?

GomaEspumaRegional@alien.top · 1 year ago

In a sense yes.

their current quarter was 279% larger than the same quarter last year. Which is insane.

GomaEspumaRegional@alien.top · 1 year ago

in x86 that’s not the case, only the critical path x86 instructions are implemented directly in logic lookup tables in the decoder. Some of the less used ones are on the uCode ROM on chip. And a bunch more on PAL code on off-chip ROM. And a few of the rarest ones are on the exception manager libraries of the OS.

A big chunk of the x86 ISA is rarely used so this tiered implementation has been used at least since Nehalem if not before.

GomaEspumaRegional@alien.top · 1 year ago

A lot of x86 ISA is in the micro and PAL codes. Only the most frequent and performance-limiting ones are on-core for modern x86.

x86 is a huge set, so “very few” is a relative term ;-)

GomaEspumaRegional@alien.top · 1 year ago

Mate, the 90s were a few decades back. ;-)

x86 decoding hasn’t been a limiter since then.

GomaEspumaRegional@alien.top · 1 year ago

Well, to be fair these are not responses to the Pro/Max M3 skus not just in terms of iGPUs, but also in core count.

GomaEspumaRegional@alien.top · 1 year ago

Well Top/Bottom is relative. But I assume you meant Top as in the side facing the heatsink.

Memory is “cooler” than dynamic logic, you want the colder element on top of the hotter one on the path of maximum dissipation.

GomaEspumaRegional@alien.top · 1 year ago

Honestly, it looks like Apple is pretty much conceding the upper end of the professional market. So a TR+High end NVIDIA GPU would likely obliterate a Mac Pro/Studio on those workloads, perhaps some very specific video encoding workflows may have an edge on AS.