Efficient Hardware Support for the Partitioned Global Address Space

Relaterede dokumenter
Interconnect. Front end interface

PARALLELIZATION OF ATTILA SIMULATOR WITH OPENMP MIGUEL ÁNGEL MARTÍNEZ DEL AMOR MINIPROJECT OF TDT24 NTNU

Scheduling Algorithms for Super 3G

Sorting on the SRC 6 Reconfigurable Computer

Serverteknologi I Project task list

IPv6 Application Trial Services. 2003/08/07 Tomohide Nagashima Japan Telecom Co., Ltd.

PEMS RDE Workshop. AVL M.O.V.E Integrative Mobile Vehicle Evaluation

HP ProDesk 400 G5 Minitower I GB 256GB

BlueGene/L System Status Update

Project Step 7. Behavioral modeling of a dual ported register set. 1/8/ L11 Project Step 5 Copyright Joanne DeGroat, ECE, OSU 1

Basic Design Flow. Logic Design Logic synthesis Logic optimization Technology mapping Physical design. Floorplanning Placement Fabrication

Avancerede Datanet. Udviklingen i Netværksarkitekturer. Ole Brun Madsen Professor Department of Control Engineering University of Aalborg

MPLS konfiguration. Scenarie hold 1 & 2

Generalized Probit Model in Design of Dose Finding Experiments. Yuehui Wu Valerii V. Fedorov RSU, GlaxoSmithKline, US

Engineering of Chemical Register Machines

GNSS/INS Product Design Cycle. Taking into account MEMS-based IMU sensors

VMware VMmark V1.1.1 Results

Serverteknologi I * Projekt * Opgaveliste

Simpel Network Management Protocol SNMP

Velkommen. Backup & Snapshot v. Jørgen Weinreich / Arrow ECS Technical Specialist

SIMD. 2.1 Computation Reuse [1] Memoization [2] Nagoya Institute of Technology. Nara Institute of Science and Technology

IP version 6. Kapitel 3: IPv6 in Depth Baseret på bogen: Cisco Self-study: Implementing Cisco IPv6 Networks Henrik Thomsen V1.0.

Algorithms & Architectures II

Self-Organization in Autonomous Sensor/Actuator Networks [SelfOrg]

ESG reporting meeting investors needs

En god Facebook historie Uddannelser og valgfag målrettet datacenterindustrien!?

talk outline sea bed network deployment issues; energy distribution seabed network issues;

Virtuel Hukommelse. Niels Olof Bouvin Institut for Datalogi Aarhus Universitet

Resource types R 1 1, R 2 2,..., R m CPU cycles, memory space, files, I/O devices Each resource type R i has W i instances.

LX5280. High-Performance RISC-DSP for IP Licensing

Aggregation based on road topologies for large scale VRPs

TCP/IP stakken. TCP/IP Protokollen består af 5 lag:

MOC On-Demand Administering System Center Configuration Manager [ ]

Vore IIoT fokus områder

Grow. With the Leader. IBM Storwize v7000. v/lars Kok

Sesam Automationstrend. Spørgsmål til leverandørerne? New Automation Technology

Architectural System Model

Financial Management -II

Dell Cloud Client Computing Hvordan virtualisere vi de tunge grafisk applikationer?

Dell Latitude 14" I5-8250U 8GB 256GB Intel UHD Graphics 620 Windows 10 Pro 64-bit

CoS. Class of Service. Rasmus Elmholt V1.0

Structured P2P over Wireless Multi-hop Networks

Kommunikationssysteme [KS]

Routeren. - og lag 3 switchen! Netteknik 1

26.499:- ekskl. moms. og fragt. Kampagne: PowerEdge R715. Server fra Dell. Ring eller skriv for yderligere information

OPC UA Information model for Advanced Manufacturing

Samsung Xpress SL-M2026W Laser

Breakfast Club: Lav dit eget Servicekatalog

Valg af Automationsplatform

Hardware og software på forskermaskinerne

Serverteknologi I Projektopgave. Mål for kurset

Lithium-Ion Jump-Starters / Jumper Cables

(Design and Implementation of Configuration Management for the MPLS VPN)

LEADit & USEit 2018 CampusHuset - Campus Bindslevs Plads i Silkeborg 25. Oktober 2018

Growth. Bahrain. Growthisneverbymerechance;itisthe resultofforcesworkingtogether. Oman Kuwait. JamesCashPenney. SaudiArabia

Constant Terminal Voltage. Industry Workshop 1 st November 2013

Evaluating Germplasm for Resistance to Reniform Nematode. D. B. Weaver and K. S. Lawrence Auburn University

Interim report. 24 October 2008

Varenr.: højre venstre º højre med coating º venstre med coating

SYNOLOGY DS418j 4-bay NAS server

Number of Hosts: 8 Uniform Hosts [yes/no]: yes Total sockets/core/threads in test: 16 Sockets / 128 Cores / 256 Threads

MB6-896 Distribution and Trade in Microsoft Dynamics 365 for Finance and Operations

Network. Grundlæggende netværk. Region Syd Grundlæggende netværk

Hardware og software på forskermaskinerne

HACKERNE BLIVER BEDRE, SYSTEMERNE BLIVER MERE KOMPLEKSE OG PLATFORMENE FORSVINDER HAR VI TABT KAMPEN? MARTIN POVELSEN - KMD

Finn Gilling The Human Decision/ Gilling September Insights Danmark 2012 Hotel Scandic Aarhus City

Dynamic Voltage and Frequency Management Based on Variable Update Intervals

Vi gør opmærksom på at undervisningen kan foregå på engelsk, afhængig af instruktør og deltagere. Alt kursusmateriale er på engelsk.

ก ก. ก (System Development) 5.7 ก ก (Application Software Package) 5.8 ก (System Implementation) Management Information System, MIS 5.

Unitel EDI MT940 June Based on: SWIFT Standards - Category 9 MT940 Customer Statement Message (January 2004)

Network. Grundlæggende netværk. Region Syd Grundlæggende netværk

Sider og segmenter. dopsys 1

Chapter 10: File System Interface

What s Our Current Position? Uddannelsesstruktur i AUE. What Can You Choose After DE5? Uddannelsesstruktur i AUE

Family of R744 control valves and suggestions for efficient testing

Styresystemer og tjenester

Applications. Computational Linguistics: Jordan Boyd-Graber University of Maryland RL FOR MACHINE TRANSLATION. Slides adapted from Phillip Koehn

OG-3600 Series Fiber Optic Transport for opengear card frame platform w/ SNMP Management

Breaking Industrial Ciphers at a Whim MATE SOOS PRESENTATION AT HES 11

High-Performance Data Mining med SAS Enterprise Miner 14.1

Peering in Infrastructure Ad hoc Networks

Information Meeting for DE5 and DE3 Further Study Possibilities

Challenges for the Future Greater Helsinki - North-European Metropolis

Basic statistics for experimental medical researchers

Netværksmålinger. - en introduktion! Netteknik. TCP - IP - Ethernet

Virtual classroom: CCNA Boot Camp Accelerated

Computer Literacy. En stationær bordmodel. En Bærbar Notebook, Labtop, Slæbbar, Blærebar mm.

NETOP WORKSHOP. Netop Business Solutions. Michael Stranau & Carsten Alsted Christiansen

Design til digitale kommunikationsplatforme-f2013

ARP og ICMP. - service protokoller, som vi ikke kan undvære! Netteknik 1

Dell Latitude 14" I7-8650U 16GB 256GB Intel UHD Graphics 620 Windows 10 Pro 64-bit

Status of & Budget Presentation. December 11, 2018

Network management. - hvad sker der på mit netværk?! Netteknik 1

Bilag 1 GPS dataudskrifter fra Stena Carisma ved passage af målefelt

Apple imac Retina 5K display AIO 16GB 1TB Apple macos Sierra

Partner session 3 Mamut - Temadag

Online kursus: Google Cloud

Black Jack --- Review. Spring 2012

Webkorpora: Yahoo API og perl

Studieordning del 3,

Transkript:

Efficient Hardware Support for the Partitioned Global Address Space and Heiner Litz Computer Architecture Group University of Heidelberg

Outline Motivation & Goal Architecture Performance Evaluation Conclusion 2

Motivation PGAS Cache coherent shared memory does not scale Neither Broadcast- nor Directory-based cache protocols See also AMD s Probe Filter Partitioned Global Address Space (PGAS) Locally coherent, globally non-coherent Yelick 2006: Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. Leverage local coherency advantages, avoid global coherency disadvantages 3

Motivation Goal PGAS relies on High bandwidth bulk transfers Fine grain accesses for both communication and synchronization purposes Goal Provide best support for fine grain accesses with minimal software overhead Bandwidth 4

Architecture Overview 1. Lean Shared Memory Engine Address Translation SrcTag Management Stateless on Origin side Virtualized 2. Reliable network with in-order delivery HT requests supposed to be answered 3. Leverage HyperTransport s latency advantage and direct CPU connectivity 4. Minimal protocol conversion CPU HT On-chip network Network Memory Memory 5

Architecture Working Principle 1. LOAD / STORE instruction 2. HT request to SMFU 3. SMFU performs address translation, target node determination 4. Request is send as Extoll network packet to target 5. SMFU performs SrcTag translation 6. HT request to target MC 7. MC handles request 8. HT response to SMFU 9. SMFU re-translates SrcTag 10. HT response encapsulated in Extoll network packet 11. HT response to CPU Memory 3 2 CPU HT 1 SMFU 11 4 EXTOLL Network Packet 10 CPU HT SMFU EXTOLL Custom FPGA-based high performance interconnection network 5 8 9 7 Memory 6

Architecture Egress & Ingress Unit 7

Architecture Address Translation ostartaddr tstartaddr taddr oaddr tnodeid gaddr = = oaddr ostartaddr ( gaddr & mask) >> shift _ count ( gaddr& mask) tstartaddr taddr = ~ + 8

Architecture Matching Store HT request Address Calculation I Egress Unit Requester Path Target Node ID Responder Path Address Calculation II Network Stores origin information Node ID, HT SrcTag Only used on target side HT response tsrctag Matching Matching Store Responder Path osrctag & onode ID Network Ingress Responder Unit Stores osrctag, onodeid tsrctag returned HT non-posted request HT request/response Store osrctag & onode ID Completer Path Forward transaction Assign free tsrctag Network Network Egress Responder Unit Uses tsrctag for lookup Ingress Unit 9

Architecture Framework Hyper-Transport IP Core On Chip Network Net-work Port Network Switch HT-Core: Direct CPU connection Fully synchronous Efficient pipelined structure Incoming / Outgoing : 12 / 6 cycles HTAX: Non-blocking crossbar HT-derived protocol 3(+2) cycles latency 10

Architecture Framework Hyper-Transport IP Core On Chip Network Net-work Port Network Switch Network Switch: In-order delivery of packets Hardware retransmission Virtual Output Queuing Virtual channels Cut-through switching Source-Path routing Credit based flow-control Fault tolerance Remote management access 11

Performance Evaluation Two nodes, each: 4x AMD Opteron 2.2GHz Quad Core 16GB RAM Standard Linux Virtex-4 FX100 Prototype 156MHz core clock HT400 interface (1.6GB/s) 6 links, each 6.24 Gbps FPGA: 75% utilization 12

Performance Evaluation Limitations & Solutions CPU microarchitecture Only one outstanding load transaction on MMIO space Max. size of load on uncacheable memory 64bit Move to DRAM space Cacheable memory BIOS MMIO space per PCI B:D:F number max. 256MB Extended MMIO (EMMIO) Move to DRAM space 13

Full round trip latency including main memory access Read Latency Read Latency Performance Evaluation Remote Loads Read Throughput Read Throughput Latency[μs] Latency[μs] 2.5 2.5 2.0 2.0 1.5 1.5 1.0 1.0 0.5 0.5 0.0 0.0 Measurement HT400 / 156MHz core clock Measurement HT400 / 156MHz core clock Simulation HT1000 / 800MHz core clock Simulation HT1000 / 800MHz core clock 1.98 1.89 1.98 1.89 1.21 1.27 1.21 1.27 4 8 4 Size[byte] 8 Size[byte] Throghput[MB/s] Throghput[MB/s] 200 200 180 180 160 160 140 140 120 120 100 100 80 80 60 60 40 40 20 20 0 0 4 4 Nonposted Measurement, max. 1 outstanding Nonposted Measurement, max. 1 outstanding Nonposted Simulation, max. 1 outstanding Nonposted Simulation, max. 1 outstanding Nonposted Simulation, max. 32 outstanding Nonposted Simulation, max. 32 outstanding 8 8 16 16 Size[byte] Size[byte] 32 32 64 64 Little s Law (1961, queuing theory) Number outstanding N = 1 Response Time R = 2 usec (approx.) Throughput X 1 8B X = N / R = = 4MB / 2us s 14

Performance Evaluation Remote Stores 1400 1400 1200 1200 1000 1000 Throghput[MB/s] Throghput[MB/s] 800 800 600 600 400 400 200 200 0 0 8 8 Write Throughput Write Throughput 16 16 32 32 Size[byte] Size[byte] Simulation HT1000 / 800MHz Simulation core HT1000 clock / 800MHz core clock Measurement HT400 / 156MHz Measurement core clock HT400 / 156MHz core clock 64 64 128 128 Rate [1/s] Rate [1/s] 7,000,000.000 7,000,000.000 6,000,000.000 6,000,000.000 5,000,000.000 5,000,000.000 4,000,000.000 4,000,000.000 3,000,000.000 3,000,000.000 2,000,000.000 2,000,000.000 1,000,000.000 1,000,000.000 0.000 0.000 Posted Transaction Rate Posted Transaction Rate 8 16 32 64 128 256 512 1024 2048 4096 8 16 32 64 128 256 512 1024 2048 4096 Size [Byte] Size [Byte] At network peak bandwidth for 64bytes Outstanding transaction rate 15

Performance Evaluation Put it into context! Comparing SMFU vs. EXTOLL (API) Comparing SMFU vs. EXTOLL (API) (two nodes, 4x 2.2GHz K10 Opteron) (two nodes, 4x 2.2GHz K10 Opteron) 400.00 400.00 20.00 20.00 MB/s MB/s 350.00 350.00 300.00 300.00 250.00 250.00 200.00 200.00 150.00 150.00 100.00 100.00 50.00 50.00 17.50 17.50 15.00 15.00 12.50 12.50 10.00 10.00 7.50 7.50 5.00 5.00 2.50 2.50 Speedup Speedup SMFU over EXTOLL SMFU over EXTOLL Speedup Speedup SMFU over IB SMFU over IB Shared Memory Shared Memory (FPGA-based) (FPGA-based) Remote Stores Remote Stores API over EXTOLL API over EXTOLL (FPGA-based) (FPGA-based) NetPIPE Stream NetPIPE Stream MPI over IB MPI over IB (ASIC-based) (ASIC-based) NetPIPE Stream NetPIPE Stream 0.00 0.00 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 size [bytes] size [bytes] 0.00 0.00 16

Conclusion & Future Conclusion Prove of concept of Distributed Shared Memory Fine grained remote stores & loads Efficient and slim design First performance numbers outstanding and encouraging (taken into account the technology differences) Future Atomic operations DRAM space (requires coherency) Consistency supporting strict and relaxed operations Application level evaluation UPC/GASNet Aggregating memory 17