It is probably good that Chips and Cheese stays technical and objective. But, this was from the pre-Zen bad days of AMD, right? I wonder if a “where’d it all go wrong” post would be more interesting. Or, maybe more optimistically, “how’d this set things up for Zen.”
hajile 2 hours ago [-]
The interconnect didn't have very much to do with what went wrong or setting up for Zen.
AMD made a killer design with Athlon64 that should have taken over the entire industry and made them the largest hardware company on the planet. Instead, Intel leveraged their market position to make it economically infeasible for computer manufacturers to buy AMD chips.
AMD was out of money, which limited its options. Dennard scaling had just failed, but Moore's Law was still in effect and multithreading was hyped as the future of everything. That made a strong argument for lots of smaller cores, and the most area-efficient way to get them was to share less-used resources, so AMD bet big on small-core CMT.
At the same time, AMD's ATI division was under pressure to make a new, flexible GPU design (which became GCN) and to contend with the cult of Nvidia (knowingly shipping massive numbers of defective chips and then having a worse GPU than GCN still wasn't enough for Nvidia to lose market dominance).
The interconnect was a lower-priority redesign, so they slapped a bandaid on it and pushed the redesign down the road.
ahartmetz 1 hours ago [-]
I think you're being too nice about Bulldozer. It really was a big fat unforced error. Approximately no one wants to buy a CPU that's significantly slower than the last one at common (single core) tasks.
Today, Intel is still selling more CPUs than AMD in most market segments even though they are usually worse.
Zardoz84 42 minutes ago [-]
They bet too soon on having a high number of cores. And the later evolutions of that microarchitecture weren't bad.
From a proud ex-user of an FX-8370E
mlinhares 41 minutes ago [-]
Athlon64 should have been the wake-up call for Intel to focus on engineering to beat AMD, but they decided they would bully the market into a worse product forever.
toast0 2 hours ago [-]
That's probably in their Bulldozer article [1]. But this article is about memory access on their APUs; you just have to accept the CPU was what it was, no need to dwell on it here.
Something in the article that I had to look up that might bother others. He uses the term 'DCT' in this sentence, but it's never defined in the article. AFAIK it stands for 'DRAM Memory Controller', but that could be an LLM hallucination. Running a web search defines it as Discrete Cosine Transform. :P
> "AMD’s BIOS and Kernel Developer’s Guide (BKDG) indicates there’s a 4-bit read pointer for a sideband signal FIFO between the GMC and DCT, so the “Garlic” link may have a queue with up to 16 entries."
Should maybe swap DCT in for MCT (memory controller)?
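For anyone wondering where "16 entries" comes from: a 4-bit pointer can address 2^4 = 16 slots. Here is a minimal sketch of that arithmetic as a toy ring buffer in Python. The class name, the separate `count` field, and the push/pop interface are my own assumptions for illustration; the BKDG quote only gives the pointer width, and real hardware that tells full from empty using the pointers alone might hold one entry fewer, which fits the article's "up to 16".

```python
POINTER_BITS = 4            # read-pointer width quoted from the BKDG
DEPTH = 1 << POINTER_BITS   # 2**4 = 16 addressable slots


class SidebandFifo:
    """Hypothetical ring buffer whose pointers wrap at DEPTH (illustration only)."""

    def __init__(self):
        self.slots = [None] * DEPTH
        self.rd = 0      # read pointer, always in 0..DEPTH-1
        self.wr = 0      # write pointer, always in 0..DEPTH-1
        self.count = 0   # assumed occupancy counter, not from the BKDG

    def push(self, item):
        """Store item and return True, or return False if the queue is full."""
        if self.count == DEPTH:
            return False
        self.slots[self.wr] = item
        self.wr = (self.wr + 1) & (DEPTH - 1)  # wrap like a 4-bit counter
        self.count += 1
        return True

    def pop(self):
        """Return the oldest item, or None if the queue is empty."""
        if self.count == 0:
            return None
        item = self.slots[self.rd]
        self.rd = (self.rd + 1) & (DEPTH - 1)  # wrap like a 4-bit counter
        self.count -= 1
        return item
```

Pushing 20 items into an empty `SidebandFifo` accepts the first 16 and rejects the rest, which is the upper bound the article infers from the pointer width.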
It's wild how much extra work was done to avoid coherency, yet share memory.
Ok, there's the first part, the Garlic bus, which gives the GPU its own access to the DRAM request controller, instead of going through the CPU's memory controller.
Since the GPU is mostly going to miss, it's great that it's not wasting energy trying to go to the CPU's cache. But it means if you do want to share memory now you need a whole other access path for the GPU to read from the CPU memory, even though it's literally the same RAM (but maybe different cache). So, add a new Onion link, that lets the GPU go through the crossbar, and get handled by the memory controller. And this one is slower.
Infinity Fabric seems conceptually so much easier, to keep things in sync. But the work to snoop the bus, to maintain coherency: it has to be pretty massive effort.
It's so, so different a thing, but I wonder how AMD deals with coherency (or not?) on the 6 Memory Cache Dies (MCDs) in the 7900 XTX GPU. Having separate chips whose job is to be cache and DRAM controller must need at least some understanding of who has what memory; that has to be wild.
One other comment, on:
> modern games struggle or won’t launch at all on Trinity, so I’ve selected a few older workloads
I wonder how many more games would run under Linux? There's an absurd amount of work still going into the radeonsi driver. The driver just switched to the newer ACO compiler pipeline by default last December, for example. That said, Trinity (2012) uses TeraScale 3 (2010, gfx4). This is old! But the improvements have been ongoing, in a way commercial systems would be unlikely to ever see; there have been so many wins over such a long time. Not compatibility as such, but getting multi-threaded driver support (2017) also comes to mind as a big leap!
https://www.phoronix.com/news/RadeonSI-ACO-Default-Pre-GFX10
https://www.phoronix.com/news/RadeonSI-G3D-Threads
https://www.google.com/search?q=site%3Aphoronix.com+radeonsi
I wonder how granular the breakdown/fallback modes are. I suspect that if there's an unsupportable feature somewhere in the graphics pipeline, the whole pipeline will usually need to fall back to CPU rendering, but perhaps there's some ability to fill in some GPU features via the CPU while running most of the pipeline on the GPU (and not having the latency destroy everything, perhaps using that Onion link/cacheable host memory)?
hajile 1 hours ago [-]
Redesigning their interconnect stuff for both GPU and CPU then implementing and validating would have been a massive expense and would have added additional time to ship.
With the company facing bankruptcy, I'd imagine that a small team hacking together the different GPU and CPU interconnects was cheaper and faster than designing a whole new interconnect and coherency then implementing and testing it everywhere.
toast0 1 hours ago [-]
> It's wild how much extra work was done to avoid coherency, yet share memory.
Having separate, non-coherent memory is status quo for GPUs. Bringing the GPU onto the die means you've got to share the path to memory, but access patterns are different.
Designing for the typical case where the addresses used are distinct is totally reasonable; it's not wild at all. After that works, you can try to make shared use faster, too, but from the article, that didn't really happen in this design; the features are there, but the bandwidth isn't.
[1] https://chipsandcheese.com/p/bulldozer-amds-crash-modernizat...