Came across a post on SI that gives a nice overview of the IPC improvements AMD has applied to the upcoming core (based in part on this 313-page PDF, which devotes a chapter to these improvements).
As the poster there puts it himself:
"Think about the time and resources it takes to implement these things. Would they be worth it if, on average, each modification didn't deliver at least ~1% IPC improvement?"
• Comprehensive Upgrades for SSE
- Dual 128-bit SSE dataflow
- Up to 4 double-precision FP OPs/cycle
- Dual 128-bit loads per cycle
- Can perform SSE MOVs in the FP "store" pipe
- Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
- FP Scheduler can hold 36 Dedicated x 128-bit ops
- SSE Unaligned Load-Execute mode:
Removes alignment requirements for SSE ld-op instructions
Eliminates awkward pairs of separate load and compute instructions, improving instruction packing and decoding efficiency (see the C sketch below)
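To make that concrete, here is a minimal C sketch of my own (not from the slides): _mm_loadu_ps tolerates any alignment, and on a core with the unaligned load-execute mode the compiler can fold such a load straight into the compute instruction instead of having to emit a separate MOVUPS + ADDPS pair.

#include <immintrin.h>
#include <stddef.h>
#include <stdio.h>

/* Sums n floats (n a multiple of 4) from a possibly unaligned pointer. */
static float sum_unaligned(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i)); /* unaligned load can fold into the add */
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}

int main(void)
{
    float buf[17];
    for (int i = 0; i < 17; ++i) buf[i] = (float)i;
    /* Deliberately misaligned view: skip the first element. */
    printf("%f\n", sum_unaligned(buf + 1, 16)); /* prints 136.000000 */
    return 0;
}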
• Advanced branch prediction
- Dedicated 512-entry Indirect Predictor
- Doubled return stack size
- More branch history bits and improved branch hashing
• 32B instruction fetch
- Benefits integer code too
- Reduced split-fetch instruction cases
• Sideband Stack Optimizer
- Perform stack adjustments for PUSH/POP operations “on the side”
- Stack adjustments don’t occupy functional unit bandwidth
- Breaks serial dependence chains for consecutive PUSH/POPs (see the sketch below)
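A tiny illustration of that dependence chain (my own example; GCC/Clang inline asm on x86-64, safe here because main() is not a leaf function and so the red zone is unused):

#include <stdio.h>

int main(void)
{
    unsigned long a = 1, b = 2, c = 3;
    __asm__ volatile(
        "push %0\n\t"  /* rsp -= 8 */
        "push %1\n\t"  /* needs the rsp produced by the PUSH above... */
        "push %2\n\t"  /* ...and again: a serial chain the Sideband
                          Stack Optimizer resolves off to the side */
        "pop  %2\n\t"
        "pop  %1\n\t"
        "pop  %0\n\t"
        : "+r"(a), "+r"(b), "+r"(c));
    printf("%lu %lu %lu\n", a, b, c); /* prints 1 2 3 */
    return 0;
}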
• Out-of-order load execution
- New technology allows load instructions to bypass:
Other loads
Other stores which are known not to alias with the load (see the C sketch below)
- Significantly mitigates L2 cache latency
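As an illustration of the pattern this helps (my own sketch, not AMD's): each iteration below stores to dst and loads from src; when the core's memory disambiguation sees the addresses don't overlap, a load can issue before older stores have completed.

#include <stddef.h>

/* restrict promises the compiler that dst and src never overlap; the
   hardware makes the equivalent no-alias call per address at run time,
   letting the load of src[i+1] start while the store to dst[i] is
   still in flight. */
void scale(float *restrict dst, const float *restrict src, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}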
• TLB Optimizations
- Support for 1G pages (see the mmap sketch after this list)
- 48-bit physical address
- Larger TLBs key for:
Virtualized workloads
Large-footprint databases and transaction processing
- DTLB:
Fully-associative 48-entry TLB (4K, 2M, 1G)
Backed by L2 TLBs: 512 x 4K, 128 x 2M
- ITLB:
16 x 2M entries
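And what a 1G page looks like from software, as a minimal sketch of my own (assumes a Linux box with 1 GiB hugepages reserved, e.g. hugepagesz=1G hugepages=1 on the kernel command line; not something the slides cover). One 1 GiB page covers the whole region with a single TLB entry:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)  /* 26 = MAP_HUGE_SHIFT, 2^30 = 1 GiB */
#endif

int main(void)
{
    size_t len = 1UL << 30;  /* 1 GiB: one page, one TLB entry */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    ((char *)p)[0] = 1;      /* touch it */
    munmap(p, len);
    return 0;
}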
• Data-dependent divide latency (DIV/IDIV finish early when operand values allow)
• More Fastpath instructions:
- CALL and RET-Imm instructions
- Data movement between FP & INT
• Bit Manipulation extensions
- LZCNT/POPCNT (see the C sketch below)
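Both have direct compiler builtins today; a minimal sketch (compile with e.g. gcc -O2 -mabm to get the actual LZCNT/POPCNT instructions on an AMD target):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 0x00F0u;
    /* POPCNT: count of set bits */
    printf("popcount = %d\n", __builtin_popcount(x)); /* 4 */
    /* LZCNT: leading-zero count.  Note __builtin_clz is undefined for
       x == 0, while the LZCNT instruction itself returns 32 there. */
    printf("clz      = %d\n", __builtin_clz(x));      /* 24 */
    return 0;
}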
• SSE extensions (SSE4a)
- EXTRQ/INSERTQ
- MOVNTSD/MOVNTSS (see the C sketch below)
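MOVNTSS/MOVNTSD are scalar non-temporal stores: they write one float/double straight to memory, bypassing the caches, which avoids polluting the hierarchy with data that won't be re-read. A minimal sketch of my own using GCC's AMD-specific header (compile with gcc -O2 -msse4a; SSE4a is AMD-only):

#include <ammintrin.h>
#include <stdio.h>

int main(void)
{
    float  dst_f = 0.0f;
    double dst_d = 0.0;
    _mm_stream_ss(&dst_f, _mm_set_ss(1.5f));  /* MOVNTSS */
    _mm_stream_sd(&dst_d, _mm_set_sd(2.5));   /* MOVNTSD */
    _mm_sfence();                             /* order the NT stores */
    printf("%f %f\n", dst_f, dst_d);
    return 0;
}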
• Independent DRAM controllers
- Concurrency
- More DRAM banks reduce page conflicts
- Longer burst length improves command efficiency
• Optimized DRAM paging
- Increase page hits
- Decrease page conflicts
• History-based pattern predictor
• Re-architect NB for higher BW
- Increase buffer sizes
- Optimize schedulers
- Ready to support future DRAM technologies
• Write bursting
- Minimize Rd/Wr Turnaround
• DRAM prefetcher
- Track positive and negative, unit and non-unit strides (see the C sketch after this list)
- Dedicated buffer for prefetched data
- Aggressively fill idle DRAM cycles
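For reference, the three stride patterns from that first point, in a sketch of my own:

#include <stddef.h>

float sum_patterns(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)       /* unit stride: +1 element */
        s += a[i];
    for (size_t i = 0; i < n; i += 16)   /* non-unit stride: +16 floats = 64 B, one cache line apart */
        s += a[i];
    for (size_t i = n; i-- > 0; )        /* negative stride: walking backwards */
        s += a[i];
    return s;
}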
• Core prefetchers
- DC Prefetcher fills directly to L1 Cache
- IC Prefetcher more flexible:
2 outstanding requests to any address
• Shared L3
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy