• soggybiscuit93@alien.topB · 10 months ago

    Having Intel devs do manual, per-game scheduling optimization seems unsustainable in the long term.
    I wonder if the long-term plan is to try to automate this, or to use the NPU in upcoming generations to assist in scheduling.
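
    Most of that per-game tuning ultimately comes down to deciding which threads are allowed to run on which cores. A minimal sketch of the idea, explicitly not what APO actually does: it uses the third-party psutil package, and the process name and P-core numbering are assumptions.

    ```python
    # Minimal sketch: restrict an already-running game to the P-cores via CPU affinity.
    # Core numbering is an assumption: on a hypothetical 8P+16E part with Hyper-Threading,
    # logical CPUs 0-15 would be the P-core threads. Needs the third-party psutil package
    # and sufficient privileges.
    import psutil

    GAME_EXE = "game.exe"          # hypothetical process name
    P_CORE_CPUS = list(range(16))  # assumed logical CPU IDs of the P-cores

    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == GAME_EXE:
            proc.cpu_affinity(P_CORE_CPUS)  # pin the whole process to the P-cores
            print(f"pinned PID {proc.pid} to CPUs {P_CORE_CPUS}")
    ```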

    • igby1@alien.topB · 10 months ago

      NVIDIA has been optimizing their drivers for specific games each month for a long time.

      • YashaAstora@alien.topB · 10 months ago

        Nvidia is a graphics card company; they need constant driver work for their cards. Intel is a CPU company for whom gaming is a minor side hustle at best.

        • rorschach200@alien.topB · 10 months ago

          Also, GPUs are full of sharp performance cliffs and tuning opportunities, so there is a lot to be gained. CPUs are a lot more resilient and generic, so there is a lot less to be gained there.

            • rorschach200@alien.topB · 10 months ago

              1. The “gain” is largely a weighted average over all apps, not a max realized in a couple of outliers. It’s the bulk that determines the economics of the question, not singular exceptions (see the back-of-the-envelope numbers right after this list).
              2. The current status is heavily dominated by the historical state of affairs, as not enough time has passed to do much yet. Complex heterogeneous cache hierarchies that generalize poorly are a very recent thing in CPUs; in GPUs they have been the norm for decades, and in GPUs that is not the only source of large sensitivity to tuning.
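
              To put rough numbers on point 1 (all of them made up, purely for illustration):

              ```python
              # If only a couple of titles out of a large catalogue see a big uplift,
              # the catalogue-wide (weighted) average gain stays tiny.
              games = 100
              optimized = 2                       # e.g. Metro Exodus and Rainbow Six Siege
              uplift_per_optimized_game = 0.30    # +30% in the handful of tuned titles

              average_gain = optimized * uplift_per_optimized_game / games
              print(f"average gain across the catalogue: {average_gain:.1%}")  # 0.6%
              ```
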
  • PotentialAstronaut39@alien.topB · 10 months ago

    Kinda shitty they restrict it to the 14900K(F) and 14700K(F).

    No 14900? No 14700? They have the same number of P- and E-cores as the K versions. And they claim they focused on the gaming-oriented CPUs, but no 14600K, which must be the most gaming-oriented CPU in the 14th-gen lineup? The higher core counts are good for productivity, but gaming?

    I don’t know what to think about that.

  • Put_It_All_On_Blck@alien.topB · 10 months ago

    Not sure why people are so pessimistic about future support after seeing all the effort Intel has put into Arc drivers, which obviously involve manual tuning too. APO will never ever come to every game, most games won’t even benefit much from it, and it would be far too much work, but all they would need to do is look at like the top 100 games played every year, quickly go through them and see which are underperforming due to scheduling issues (not hard to do), then hand-tune the ones where they expect to find performance left on the table.
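
    For the triage step, a rough sketch of what “see which are underperforming due to scheduling issues” could look like; the third-party psutil package and the P-core/E-core split below are assumptions on my part, not how Intel actually does it:

    ```python
    # Crude triage sketch: sample per-logical-CPU load while a game runs and flag a
    # possible scheduling problem when E-cores are busy while P-cores sit idle.
    # Assumes a part where logical CPUs 0-15 are the P-core threads and 16-31 are
    # the E-cores; needs the third-party psutil package.
    import psutil

    P_CPUS = range(0, 16)   # assumed P-core logical CPUs
    E_CPUS = range(16, 32)  # assumed E-core logical CPUs

    load = psutil.cpu_percent(interval=5.0, percpu=True)  # 5-second sample
    if len(load) >= 32:
        p_avg = sum(load[i] for i in P_CPUS) / len(P_CPUS)
        e_avg = sum(load[i] for i in E_CPUS) / len(E_CPUS)
        if e_avg > 50 and p_avg < 50:
            print(f"possible scheduling issue: E-cores {e_avg:.0f}% busy, P-cores {p_avg:.0f}%")
    ```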

    Finding an additional 30% performance, plus lower power consumption, is definitely worth the effort; it’s far cheaper for Intel to go down this route than to get these gains in silicon. And it’s not like Intel has any plans to move away from heterogeneous designs anytime soon; even AMD is now doing them, and they have their own scheduler issues (X3D on one of two CCDs, and Zen 4 + Zen 4c).

    I’d obviously like to see support on 13th gen and the midrange SKUs too, and ideally not have a separate APO app.

    • rorschach200@alien.topB · 10 months ago

      but all they would need to do is look at like the top 100 games played every year

      My main hypothesis on this subject: perhaps they already did, and out of the top 100 games only 2 could be accelerated via this method, even after exhaustively checking all possible affinities and scheduling schemes, and only on CPUs with 2 or more 4-core clusters of E-cores.

      The hypothesis is supported by the following observations:

      1. how many behavioral requirements the game threads might need to satisfy
      2. how temporally stable the thread behaviors might need to be, probably disqualifying apps with any in-app task scheduling / load balancing
      3. the signal that they possibly didn’t find a single game where one 4-core E-cluster is enough (how rarely is this applicable if they apparently needed 2+, for… some reason?)
      4. the odd choice of Metro Exodus, as pointed out by HUB: it’s a single-player game with very high visual fidelity, pretty far down the list of CPU-limited games (nothing else benefited?)
      5. the fact that neither of the supported games (Metro and Rainbow Six) is based on either of the two most popular game engines (Unity and Unreal), possibly reducing how many apps could be hoped to have similar behavior and benefit.
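
      As a rough illustration of what “exhaustively checking all possible affinities” could look like in practice; the core layout and the benchmark hook below are placeholders of mine, not anything Intel has published:

      ```python
      # Sketch of an affinity sweep for one game: instead of all 2^N core subsets
      # (intractable), try a few candidate schemes and keep the fastest. Assumes a
      # hypothetical 8P+16E layout with HT on the P-cores (logical CPUs 0-15).
      P_CORES = list(range(0, 16))
      E_CLUSTERS = [list(range(16 + 4 * i, 20 + 4 * i)) for i in range(4)]  # 4-core E-clusters

      candidates = {
          "P-cores only":     P_CORES,
          "P + 1 E-cluster":  P_CORES + E_CLUSTERS[0],
          "P + 2 E-clusters": P_CORES + E_CLUSTERS[0] + E_CLUSTERS[1],
          "all cores":        P_CORES + sum(E_CLUSTERS, []),
      }

      def run_benchmark(cpus):
          # Placeholder: in reality you would pin the game to `cpus` and measure average FPS.
          # Here it just returns a dummy score so the sketch runs end to end.
          return len(cpus)

      best_scheme = max(candidates, key=lambda name: run_benchmark(candidates[name]))
      print("best scheme:", best_scheme)
      ```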

      Now, perhaps the longer list of games they show on their screenshot is actually the list of games that benefit, and we only got 2 for now because those are the only ones where they have figured out (at the moment) how to detect the threads’ identities (possibly in a manner not too far off from something as curious as this), or maybe that list is something else entirely and not indicative of anything. Who knows.

      And then there’s the discussion you’re having re: implementation, scaling, and maintenance, with its own can of worms.

    • Helpdesk_Guy@alien.topB · 10 months ago

      And it’s not like Intel has any plans to move away from heterogeneous designs anytime soon; even AMD is now doing them, and they have their own scheduler issues (X3D on one of two CCDs, and Zen 4 + Zen 4c).

      AMD isn’t really doing anything heterogeneous, pal.
      Correct me if I’m wrong here, but apart from the different clock-frequency properties, Zen4c cores are in fact *identical* to the usual full-grown Zen4 cores. A Zen4c core is barely anything other than a compactly built and neatly rearranged Zen4 core, without the micro-bumps for the 3D V-Cache. The only downside is the lower maximum clocks, and that’s literally it.

      The main reason AMD introduced Zen4c cores at all was their increased core density (server space; muh, racks!), so it was done purely for area savings and overall efficiency, and that’s it.
      Even the L2 cache is identical, isn’t it?

      → A Zen4c core is not an E-core, as it’s architecturally identical to any Zen4 core, with the same IPC.
      Same story for the X3D-enabled cores/chiplets: identical apart from a larger cache.

      So I don’t really know what you’re actually talking about when you erroneously claim AMD has also jumped on the heterogeneous hype train. That statement of yours is utter nonsense.

      On AMD there’s no heterogeneous mixing of cores with different IPC or different microarchitectures that would have to be scheduled differently to run properly. Only Intel needs to rely on a hybrid-aware (and capable!) scheduler and depends on proper scheduling to NOT kill performance.

      Meanwhile, on any mix-and-match AMD Zen4/Zen4c CPU, it’s fundamentally irrelevant which core a thread is running on. In fact, the scheduler doesn’t even need to know which core is a regular Zen4 and which is a Zen4c.

      AMD’s designs are heterogeneous in terms of different chiplets/configs, yes.
      But the heterogeneity you’re talking about isn’t remotely the same as heterogeneity in the sense of heterogeneous computing, i.e. a system (on a chip) that uses multiple types of compute cores with different architectures, the way Intel does in their hybrid SoCs. So no, no heterogeneity for you!

      • AgeOk2348@alien.topB · 10 months ago

        And this is why AMD’s 3D + normal chiplet CPUs aren’t having as hard a time as Intel’s mess. Heck, even if AMD wants to go big.LITTLE, they can have a big chiplet and a little chiplet to avoid many of these problems.

  • rorschach200@alien.topB · 10 months ago

    From the video for convenience:

    “Why did Intel only choose to enable Intel® Application Optimization on select 14th Gen processors? Settings within Intel® Application Optimization are custom determined for each supported processor, as they consider the number of P-cores, E-cores, and Intel® Hyperthreading Technology. Due to the massive amount of custom testing that went into the optimized setting parameters specifically gaming applications [sic], Intel chose to align support for our gaming-focused processors.”
    - Intel

    (original page quoted from)

  • SkillYourself@alien.topB · 10 months ago

    As for Metro Exodus, it’s more like:

    When game devs get too ambitious and spawn SystemInfo.processorCount() workers for a 32-thread battle royale in the L3$.

    APO still slightly outperforms E-cores off and HT off in this game, so I’m assuming APO does something useful with the 16MB of L2$ to take pressure off the L3$.
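
    For reference, the “processorCount workers” pattern in rough Python terms; os.cpu_count() stands in for Unity’s SystemInfo.processorCount, and the cap of 8 is just an illustration:

    ```python
    # Spawning one worker per logical CPU means 32 workers on a 32-thread part,
    # all contending for the shared L3$. Capping the pool is the usual mitigation.
    import os
    from concurrent.futures import ThreadPoolExecutor

    logical_cpus = os.cpu_count() or 1                                  # ~SystemInfo.processorCount
    naive_pool = ThreadPoolExecutor(max_workers=logical_cpus)           # 32 workers on a 14900K
    capped_pool = ThreadPoolExecutor(max_workers=min(8, logical_cpus))  # bounded worker count

    naive_pool.shutdown()
    capped_pool.shutdown()
    ```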

  • Healthy_BrAd6254@alien.topB · 10 months ago

    Why these games though?
    If this works best in games that already get high fps in the first place, then I think this is probably not going to be that useful. Games like Starfield need more CPU performance, not friggin R6 or Metro Exodus, both of which get hundreds of fps anyway.