Efficient Hardware Support for the Partitioned Global Address Space
Holger Fröning and Heiner Litz
Computer Architecture Group, University of Heidelberg
Outline
- Motivation & Goal
- Architecture
- Performance Evaluation
- Conclusion
Motivation: PGAS
- Cache-coherent shared memory does not scale
  - Neither broadcast- nor directory-based cache protocols
  - See also AMD's Probe Filter
- Partitioned Global Address Space (PGAS)
  - Locally coherent, globally non-coherent
  - Yelick 2006: "Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing."
  - Leverage the advantages of local coherency, avoid the disadvantages of global coherency
Motivation: Goal
- PGAS relies on
  - High-bandwidth bulk transfers
  - Fine-grain accesses, for both communication and synchronization purposes
- Goal: provide the best possible support for fine-grain accesses with minimal software overhead
Architecture: Overview
1. Lean shared memory engine (SMFU)
   - Address translation
   - SrcTag management
   - Stateless on the origin side
   - Virtualized
2. Reliable network with in-order delivery
   - HT requests are expected to be answered
3. Leverage HyperTransport's latency advantage and direct CPU connectivity
4. Minimal protocol conversion
[Figure: block diagram of CPU, HT, on-chip network, network, and memory]
Architecture: Working Principle
1. LOAD / STORE instruction
2. HT request to SMFU
3. SMFU performs address translation and target node determination
4. Request is sent as an EXTOLL network packet to the target
5. SMFU performs SrcTag translation
6. HT request to target MC
7. MC handles the request
8. HT response to SMFU
9. SMFU re-translates the SrcTag
10. HT response encapsulated in an EXTOLL network packet
11. HT response to CPU
EXTOLL: custom FPGA-based high-performance interconnection network
[Figure: request/response path between two nodes, each with CPU, HT, SMFU, and memory, connected by EXTOLL network packets]
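To make step 1 concrete: from software, a remote access is just an ordinary load or store to a mapped window. A minimal user-space sketch, assuming a hypothetical device file /dev/smfu0 and a 4 KiB window; the device name and mapping interface are illustrative assumptions, not the actual driver API:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical device file exposing the SMFU's remote memory window. */
    int fd = open("/dev/smfu0", O_RDWR);
    if (fd < 0) return 1;

    /* Map 4 KiB of the global address space into this process. */
    volatile uint64_t *remote =
        mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (remote == MAP_FAILED) return 1;

    /* An ordinary store: issued by the CPU as a posted HT request and
     * forwarded by the SMFU to the target node (steps 1-7 above). */
    remote[0] = 42;

    /* An ordinary load: a non-posted HT request that travels the full
     * round trip (steps 1-11 above) before the CPU can continue. */
    printf("remote word: %lu\n", (unsigned long)remote[0]);

    munmap((void *)remote, 4096);
    close(fd);
    return 0;
}
```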
Architecture: Egress & Ingress Unit
[Figure: block diagram of the egress and ingress units]
Architecture: Address Translation

gaddr   = oaddr - ostartaddr
tnodeid = (gaddr & mask) >> shift_count
taddr   = (gaddr & ~mask) + tstartaddr
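Written out in C, the translation is only a handful of operations. A sketch whose struct and function names (smfu_window, smfu_translate) are illustrative assumptions, while the arithmetic follows the three equations above:

```c
#include <stdint.h>

/* Per-window translation parameters, mirroring the symbols above
 * (names and types are illustrative; real registers may differ). */
struct smfu_window {
    uint64_t ostartaddr;  /* start of the window in the origin's address space */
    uint64_t tstartaddr;  /* start of the window on the target node            */
    uint64_t mask;        /* bits of the global address holding the node ID    */
    unsigned shift_count; /* position of the node-ID field                     */
};

/* Translate an origin-side address into (target node ID, target address). */
static void smfu_translate(const struct smfu_window *w, uint64_t oaddr,
                           uint64_t *tnodeid, uint64_t *taddr) {
    uint64_t gaddr = oaddr - w->ostartaddr;         /* gaddr = oaddr - ostartaddr        */
    *tnodeid = (gaddr & w->mask) >> w->shift_count; /* node ID from the masked bits      */
    *taddr   = (gaddr & ~w->mask) + w->tstartaddr;  /* taddr = (gaddr & ~mask) + tstart  */
}
```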
Architecture: Matching Store
- The Matching Store keeps the origin information (onodeid, HT osrctag) of each outstanding request; it is only used on the target side
- Ingress Responder Unit, on each HT non-posted request arriving from the network:
  - Stores osrctag & onodeid
  - Assigns a free tsrctag
  - Forwards the transaction to the completer path
- Egress Responder Unit, on each HT response:
  - Uses the tsrctag for lookup, restoring osrctag & onodeid
  - Returns the tsrctag to the free pool
[Figure: egress and ingress units with requester, responder, and completer paths, address calculation stages I and II, and the Matching Store]
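A minimal software model of the Matching Store, assuming a flat table indexed by tsrctag and a linear search for free tags; the table size of 32 matches HyperTransport's 5-bit SrcTag space, but the structure and names are illustrative, not the actual hardware design:

```c
#include <stdint.h>

#define NUM_TAGS 32  /* HT's 5-bit SrcTag allows 32 outstanding transactions */

/* One entry per outstanding non-posted request on the target side. */
struct tag_entry {
    uint16_t onodeid;  /* originating node ID           */
    uint8_t  osrctag;  /* SrcTag assigned by the origin */
    uint8_t  in_use;
};

static struct tag_entry table[NUM_TAGS];

/* Ingress responder path: store origin info, hand out a free tsrctag.
 * Returns the tsrctag, or -1 if no tag is free (request must stall). */
static int tag_alloc(uint16_t onodeid, uint8_t osrctag) {
    for (int t = 0; t < NUM_TAGS; t++) {
        if (!table[t].in_use) {
            table[t] = (struct tag_entry){ onodeid, osrctag, 1 };
            return t;
        }
    }
    return -1;
}

/* Egress responder path: look up the origin info by tsrctag and free it,
 * so the response can be routed back with its original SrcTag. */
static void tag_release(int tsrctag, uint16_t *onodeid, uint8_t *osrctag) {
    *onodeid = table[tsrctag].onodeid;
    *osrctag = table[tsrctag].osrctag;
    table[tsrctag].in_use = 0;
}
```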
Architecture: Framework
- HT-Core (HyperTransport IP core):
  - Direct CPU connection
  - Fully synchronous
  - Efficient pipelined structure
  - Incoming / outgoing: 12 / 6 cycles
- HTAX (on-chip network):
  - Non-blocking crossbar
  - HT-derived protocol
  - 3 (+2) cycles latency
[Figure: HyperTransport IP core, on-chip network, network port, and network switch]
Architecture: Framework (cont.)
- Network switch:
  - In-order delivery of packets
  - Hardware retransmission
  - Virtual output queuing
  - Virtual channels
  - Cut-through switching
  - Source-path routing
  - Credit-based flow control
  - Fault tolerance
  - Remote management access
Performance Evaluation: Setup
- Two nodes, each:
  - 4x AMD Opteron 2.2 GHz quad core
  - 16 GB RAM
  - Standard Linux
- Virtex-4 FX100 prototype:
  - 156 MHz core clock
  - HT400 interface (1.6 GB/s)
  - 6 links, each 6.24 Gbps
  - FPGA: 75% utilization
Performance Evaluation: Limitations & Solutions
- CPU microarchitecture:
  - Only one outstanding load transaction on MMIO space
  - Max. size of a load on uncacheable memory: 64 bit
  - Solution: move to DRAM space (cacheable memory)
- BIOS:
  - MMIO space per PCI B:D:F number limited to 256 MB
  - Solutions: Extended MMIO (EMMIO), or move to DRAM space
Performance Evaluation: Remote Loads
- Full round-trip latency, including main memory access
- Read latency (4 B / 8 B loads):
  - Measurement, HT400 / 156 MHz core clock: 1.98 µs / 1.89 µs
  - Simulation, HT1000 / 800 MHz core clock: 1.21 µs / 1.27 µs
- Read throughput (4-64 B, non-posted):
  - Measurement, max. 1 outstanding
  - Simulation, max. 1 outstanding
  - Simulation, max. 32 outstanding
- Little's Law (1961, queuing theory) explains the low single-outstanding throughput:
  - Number outstanding N = 1, response time R ≈ 2 µs
  - Throughput X = N / R = 8 B / 2 µs = 4 MB/s
[Figure: read latency bar chart and read throughput curves up to 200 MB/s]
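A quick sanity check of the Little's Law figure, with the slide's numbers plugged in; the 32-outstanding extrapolation assumes the response time stays constant, which is an assumption, not a measurement:

```c
#include <stdio.h>

int main(void) {
    /* Little's Law: throughput X = N * payload / R, with N outstanding
     * requests of a given payload completing every R seconds. */
    double N = 1.0;       /* outstanding load transactions (MMIO limit)  */
    double R = 2e-6;      /* measured response time: ~2 us round trip    */
    double payload = 8.0; /* bytes per load (max. uncacheable load size) */

    double X = N * payload / R; /* bytes per second */
    printf("1 outstanding:   %.1f MB/s\n", X / 1e6); /* -> 4.0 MB/s */

    /* Extrapolation to 32 outstanding transactions (the full HT SrcTag
     * space), assuming R is unchanged: */
    printf("32 outstanding: %.1f MB/s\n", 32 * payload / R / 1e6); /* -> 128 MB/s */
    return 0;
}
```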
Performance Evaluation: Remote Stores
- Network peak bandwidth reached at 64-byte stores
- Outstanding transaction rate
[Figure: write throughput for 8-128 B stores, measurement (HT400 / 156 MHz core clock) vs. simulation (HT1000 / 800 MHz core clock), up to ~1400 MB/s]
[Figure: posted transaction rate for 8-4096 B stores, up to ~7,000,000 transactions/s]
Performance Evaluation: Put It into Context
[Figure: NetPIPE stream throughput (up to ~400 MB/s) and speedup (up to 20x) for two nodes (4x 2.2 GHz K10 Opteron), message sizes 8-8192 bytes, comparing shared memory via the SMFU (FPGA-based, remote stores) against the EXTOLL API (FPGA-based) and MPI over InfiniBand (ASIC-based); speedup of SMFU over EXTOLL and over IB]
Conclusion & Future Work
- Conclusion:
  - Proof of concept of distributed shared memory
  - Fine-grained remote stores & loads
  - Efficient and slim design
  - First performance numbers outstanding and encouraging (taking the technology differences into account)
- Future work:
  - Atomic operations
  - DRAM space (requires coherency)
  - Consistency: supporting strict and relaxed operations
  - Application-level evaluation: UPC/GASNet
  - Memory aggregation