Sorting on the SRC 6 Reconfigurable Computer

John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang
The George Washington University, Washington, DC
MAPLD2005/C178

Algorithms
- Quick Sort
- Heap Sort
- Radix Sort
- Bitonic Sort
- Odd/Even Merge

SRC System Architecture
[Figure: a 16-port crossbar switch with 1.6 GB/s peak port bandwidth and 64-wide ports connects processor nodes, FPGA nodes, and memory nodes. Up to 16 nodes per switch.]

Example - Quick Sort
index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:     [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
med:   [ 0][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][13]
QS1:   [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8][ 9][12][15][14][11][10][13]
m:     [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8]
PS:    [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8]
Rows: 0 is the input; med is the array after the median-of-3 step has ordered the keys at indices 0, 7, and 15; QS1 is the array after the first partition; m is the left partition (indices 0 to 8); PS is that partition after the pipeline sort.

Quick Sort - MIMD Architecture
- 6 instances
- Median of 3 to select pivot
- Pipeline Sort for small partitions (cutoff 10) vs. Insertion Sort (cutoff 20)
[Figure: OBM banks A to F each feed one Quick Sort instance, QS 1 to QS 6. FPGA 1 utilization 90%, FPGA 2 84%.]
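The median-of-3 pivot selection and the small-partition cutoff named on this slide can be sketched in portable C. This is a minimal illustration, not the authors' MAP C code: the function names, the Hoare-style partition, and the use of insertion sort at a cutoff of 20 (the figure the slide attaches to insertion sort) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMALL_CUTOFF 20 /* assumed: partitions this small go to insertion sort */

static void swap_keys(uint64_t *x, uint64_t *y)
{
    uint64_t t = *x; *x = *y; *y = t;
}

/* Order a[lo], a[mid], a[hi] so the median of the three ends up at mid. */
size_t median_of_three(uint64_t *a, size_t lo, size_t hi)
{
    size_t mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) swap_keys(&a[mid], &a[lo]);
    if (a[hi] < a[lo])  swap_keys(&a[hi], &a[lo]);
    if (a[hi] < a[mid]) swap_keys(&a[hi], &a[mid]);
    return mid;
}

/* Insertion sort on the inclusive range a[lo..hi]. */
void insertion_sort(uint64_t *a, size_t lo, size_t hi)
{
    for (size_t i = lo + 1; i <= hi; i++) {
        uint64_t key = a[i];
        size_t j = i;
        while (j > lo && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Quick sort on the inclusive range a[lo..hi]. */
void quick_sort(uint64_t *a, size_t lo, size_t hi)
{
    assert(lo <= hi);
    if (hi - lo + 1 <= SMALL_CUTOFF) {
        insertion_sort(a, lo, hi);
        return;
    }
    uint64_t pivot = a[median_of_three(a, lo, hi)];
    size_t i = lo, j = hi;
    for (;;) { /* Hoare-style partition around the pivot value */
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i >= j) break;
        swap_keys(&a[i], &a[j]);
        i++;
        j--;
    }
    quick_sort(a, lo, j);
    quick_sort(a, j + 1, hi);
}
```

Calling `median_of_three(a, 0, 15)` on the 16 example keys places 0, 9, and 13 at indices 0, 7, and 15, which is the state of the med row in the example.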

Example - Heap Sort
index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:     [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
8:     [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
7:     [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6:     [13][ 3][14][15][10][ 2][11][ 9][ 8][ 4][12][ 7][ 5][ 6][ 1][ 0]
max:   [15][13][14][ 9][12][ 7][11][ 3][ 8][ 4][10][ 2][ 5][ 6][ 1][ 0]
Rows: 0 is the input; 8, 7, and 6 show the array after the node at that index has been sifted down during heap construction (node 8 has no children, so nothing moves); max is the finished max-heap.
[Figure: the same keys drawn as a binary tree at each step.]
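The heap construction traced in this example can be sketched in C as follows. This is a generic textbook heap sort written for illustration, not the code from the talk; the function names are assumptions. Children of node i sit at 2i+1 and 2i+2, which matches the index arithmetic visible in the example (node 7 has the single child 15, node 6 has children 13 and 14).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sift a[i] down inside the heap a[0..n-1] until both children are smaller. */
void sift_down(uint64_t *a, size_t n, size_t i)
{
    assert(i < n);
    for (;;) {
        size_t child = 2 * i + 1;
        if (child >= n) return;                    /* leaf: nothing to do */
        if (child + 1 < n && a[child + 1] > a[child])
            child++;                               /* pick the larger child */
        if (a[i] >= a[child]) return;              /* heap property holds */
        uint64_t t = a[i]; a[i] = a[child]; a[child] = t;
        i = child;
    }
}

/* Build a max-heap bottom-up, starting at the last node that has a child. */
void build_max_heap(uint64_t *a, size_t n)
{
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, n, i);
}

/* Repeatedly move the maximum to the end and restore the heap. */
void heap_sort(uint64_t *a, size_t n)
{
    build_max_heap(a, n);
    for (size_t end = n; end > 1; end--) {
        uint64_t t = a[0]; a[0] = a[end - 1]; a[end - 1] = t;
        sift_down(a, end - 1, 0);
    }
}
```

Running `build_max_heap` on the 16 example keys yields the array shown in the max row; `heap_sort` then produces the keys 0 to 15 in ascending order.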

Heap Sort - MIMD Architecture
- 6 instances
- Almost identical to processor code
[Figure: OBM banks A to F each feed one Heap Sort instance, HS 1 to HS 6. FPGA 1 utilization 55%, FPGA 2 5%.]

Example - Radix Sort
The example sorts 4-bit keys two bits at a time, so there are four buckets (0 to 3).

index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1:     [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
binary: 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001

Pass1 (count the low digit):
  count 1 = 4, count 2 = 4, count 3 = 4, count 4 = 4
  index 0 = 0, index 1 = 4, index 2 = 8, index 3 = 12
  index_0 = 0, \quad index_n = \sum_{i=1}^{n} count_i \quad (n > 0)

Pass2 (distribute on the low digit into buffer 2 while counting the high digit for the next pass); first three keys:
  1101 goes to slot 4:  index 1 = 5,  count 3 = 1
  0011 goes to slot 12: index 3 = 13, count 0 = 1
  1110 goes to slot 8:  index 2 = 9,  count 3 = 2
2:     [  ][  ][  ][  ][13][  ][  ][  ][14][  ][  ][  ][ 3][  ][  ][  ]
  After Pass2: 0000 1000 0100 1100 1101 0101 0001 1001 1110 1010 0010 0110 0011 1111 0111 1011
  index 0 = 4, index 1 = 8, index 2 = 12, index 3 = 16

Pass3 (distribute on the high digit):
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
3:     [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
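One distribution pass of the kind traced in this example can be sketched in C. This is a plain counting-based pass written for illustration; the names `radix_pass`, `RADIX_BITS`, and `BUCKETS` are assumptions. Unlike the slide, which counts the next digit while distributing the current one, this sketch counts and distributes the same digit within one call, which is simpler but reads the input twice per pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RADIX_BITS 2u                 /* two bits per digit, as in the example */
#define BUCKETS (1u << RADIX_BITS)

/* Stable distribution of src[0..n-1] into dst on the digit at bit `shift`. */
void radix_pass(const uint64_t *src, uint64_t *dst, size_t n, unsigned shift)
{
    size_t count[BUCKETS] = {0};
    size_t index[BUCKETS];

    assert(src != dst);               /* the pass is not in-place */

    /* Histogram of the current digit. */
    for (size_t i = 0; i < n; i++)
        count[(src[i] >> shift) & (BUCKETS - 1)]++;

    /* Exclusive prefix sum: first output slot of each bucket. */
    index[0] = 0;
    for (unsigned b = 1; b < BUCKETS; b++)
        index[b] = index[b - 1] + count[b - 1];

    /* Scatter keys in input order, which keeps the pass stable. */
    for (size_t i = 0; i < n; i++) {
        unsigned d = (unsigned)((src[i] >> shift) & (BUCKETS - 1));
        dst[index[d]++] = src[i];
    }
}
```

With the 16 example keys, `radix_pass(in, tmp, 16, 0)` gives the order 0, 8, 4, 12, 13, 5, 1, 9, 14, 10, 2, 6, 3, 15, 7, 11, and a second call with `shift` set to 2 gives the sorted keys.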

Radix Sort - MIMD Architecture
- 3 instances
- Uses enumeration sort
- Radix 13 bits vs. 8 bits
[Figure: OBM banks A to F feed three instances, Radix Sort 1 to Radix Sort 3. FPGA 1 utilization 33%, FPGA 2 5%.]

MIMD Code Structure

main.c
    int main( ) {
        int n = 523770*6;
        int64 *buf;
        buf = cachealign(n);
        mapsort(buf, n);
        free(buf);
        exit(0);
    }

mapsort.mc
    void mapsort(int64 *buf, int n) {
        OBM_BANK_A (bufA, int64, n/6)
        OBM_BANK_B (bufB, int64, n/6)
        ...
        OBM_BANK_F (bufF, int64, n/6)

        DMA_CPU(dir, bufA, stripes, buf, n);    /* copy keys into the banks */
        #pragma src parallel sections
        {
            #pragma src section
            {Xsort(bufA, n/6);}
            #pragma src section
            {Xsort(bufB, n/6);}
            ...
            #pragma src section
            {Xsort(bufF, n/6);}
        }
        DMA_CPU(dir, bufA, stripes, buf, n);    /* copy results back */
        return;
    }
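The shape of this MIMD structure (one buffer split into six equal stripes, each sorted by an independent instance) can be mimicked on an ordinary host in portable C. This is only a sketch under stated assumptions: `mapsort_host`, `xsort`, and `NUM_BANKS` are hypothetical names, the standard library `qsort` stands in for the per-bank `Xsort`, and the six calls run one after another here, whereas on the MAP they run as parallel sections.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_BANKS 6 /* one stripe per OBM bank, A to F */

static int cmp_u64(const void *x, const void *y)
{
    uint64_t a = *(const uint64_t *)x;
    uint64_t b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

/* Stand-in for the per-bank sorter called Xsort on the slide. */
static void xsort(uint64_t *stripe, size_t len)
{
    qsort(stripe, len, sizeof *stripe, cmp_u64);
}

/* Sort each of the six stripes of buf independently. */
void mapsort_host(uint64_t *buf, size_t n)
{
    assert(n % NUM_BANKS == 0);
    size_t stripe = n / NUM_BANKS;

    /* The iterations share no data, so each could run as its own section. */
    for (size_t bank = 0; bank < NUM_BANKS; bank++)
        xsort(buf + bank * stripe, stripe);
}
```

The result is six independently sorted stripes rather than one globally sorted buffer; the slide does not show how the stripes are combined afterwards, so the sketch stops at the same point.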

Example - Bitonic Sort
Input Keys:
  0: [13][ 3][14][15]
  1: [10][ 2][ 6][ 0]
  2: [ 8][ 4][12][ 7]
  3: [ 5][11][ 1][ 9]
Schedule: (0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Animation: the rows of each scheduled pair stream through the sorting network; for example 13 3 14 15 becomes 3 13 15 14 after the first stage.]
Rows after the first two schedule steps, (0,1) and (3,2):
  0: [ 0][ 2][ 3][ 6]
  1: [10][13][14][15]
  2: [ 8][ 9][11][12]
  3: [ 1][ 4][ 5][ 7]

Bitonic Sort - SIMD Architecture
- 2 instances
- Parallel sorting network
[Figure: OBM banks A to F feed an 8-input bitonic sorting network (instance 1) and a 4-input bitonic sort (instance 2), driven by a SIMD controller. FPGA 1 utilization 27%, FPGA 2 5%.]
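The row schedule from the bitonic example can be reproduced in software with an 8-key bitonic network applied to pairs of 4-key rows. The sketch below is an illustration only: the function names are assumptions, the network is the standard iterative bitonic sort rather than the authors' hardware design, and the reading of a schedule pair (x,y) as "smaller four keys to row x, larger four to row y" is inferred from the example's row contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROW_KEYS 4
#define NUM_STEPS 6

/* Row pairs in the order given on the slide. */
static const size_t SCHEDULE[NUM_STEPS][2] = {
    {0, 1}, {3, 2}, {0, 2}, {1, 3}, {0, 1}, {2, 3}
};

/* Iterative bitonic sorting network; n must be a power of two. */
void bitonic_sort(uint64_t *a, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t k = 2; k <= n; k <<= 1) {
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            for (size_t i = 0; i < n; i++) {
                size_t l = i ^ j;
                if (l <= i) continue;          /* handle each pair once */
                int ascending = (i & k) == 0;
                if ((ascending && a[i] > a[l]) || (!ascending && a[i] < a[l])) {
                    uint64_t t = a[i]; a[i] = a[l]; a[l] = t;
                }
            }
        }
    }
}

/* Sort the 8 keys of two rows; low half goes to row lo, high half to row hi. */
void merge_rows(uint64_t rows[][ROW_KEYS], size_t lo, size_t hi)
{
    uint64_t tmp[2 * ROW_KEYS];
    for (size_t i = 0; i < ROW_KEYS; i++) {
        tmp[i] = rows[lo][i];
        tmp[ROW_KEYS + i] = rows[hi][i];
    }
    bitonic_sort(tmp, 2 * ROW_KEYS);
    for (size_t i = 0; i < ROW_KEYS; i++) {
        rows[lo][i] = tmp[i];
        rows[hi][i] = tmp[ROW_KEYS + i];
    }
}

/* Apply the first `steps` entries of the schedule to four rows. */
void run_schedule(uint64_t rows[][ROW_KEYS], size_t steps)
{
    assert(steps <= NUM_STEPS);
    for (size_t s = 0; s < steps; s++)
        merge_rows(rows, SCHEDULE[s][0], SCHEDULE[s][1]);
}
```

With the example input, `run_schedule(rows, 2)` leaves the rows as 0 2 3 6, 10 13 14 15, 8 9 11 12, and 1 4 5 7, and `run_schedule(rows, 6)` on fresh input leaves all 16 keys in ascending order across rows 0 to 3.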

Example - Odd/Even Merge
Input Keys:
  A: [ 0][ 1][ 2][ 4][ 7][11][12][14]
  B: [ 3][ 5][ 6][ 8][ 9][10][13][15]
Merged Keys (last frame shown):
  C: [ 0][ 1][ 2][ 3][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ][  ]
[Animation: two merge lanes, each built from a MUX and delay elements Z^-2 and Z^-1. The upper lane takes A[0], A[2], ... and B[0], B[2], ...; the lower lane takes A[1], A[3], ... and B[1], B[3], ...; merged keys leave two at a time into C.]

Odd/Even Merge - SIMD Architecture
- 1 instance
- Parallel sorting network
- A/B = odd; C/D = even
[Figure: OBM banks A to F feed an Odd Merge Two unit and an Even Merge Two unit, followed by a Merge Out stage. FPGA 1 utilization 40%, FPGA 2 5%.]
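The two-lane structure on this slide corresponds to Batcher's odd-even merge: merge every second key of the two inputs in one lane, the remaining keys in the other, then finish with one compare-exchange per output pair. The C sketch below follows that textbook scheme; it is not the authors' implementation, and the names `merge_stride2` and `odd_even_merge` are assumptions. It requires two sorted inputs of the same even length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Two-way merge of a[0], a[2], ... with b[0], b[2], ...; writes n keys. */
static void merge_stride2(const uint64_t *a, const uint64_t *b, size_t n,
                          uint64_t *out)
{
    size_t j = 0, k = 0;              /* both advance in steps of two */
    for (size_t i = 0; i < n; i++) {
        if (k >= n || (j < n && a[j] <= b[k])) {
            out[i] = a[j];
            j += 2;
        } else {
            out[i] = b[k];
            k += 2;
        }
    }
}

/* Merge sorted a[0..n-1] and b[0..n-1] into out[0..2n-1]; 0 on success. */
int odd_even_merge(const uint64_t *a, const uint64_t *b, size_t n,
                   uint64_t *out)
{
    assert(n >= 2 && n % 2 == 0);
    uint64_t *c = malloc(2 * n * sizeof *c);
    if (c == NULL) return -1;
    uint64_t *d = c + n;

    merge_stride2(a, b, n, c);         /* lane 1: keys at indices 0, 2, 4, ... */
    merge_stride2(a + 1, b + 1, n, d); /* lane 2: keys at indices 1, 3, 5, ... */

    /* Final stage: one compare-exchange between the lanes per output pair. */
    out[0] = c[0];
    for (size_t i = 1; i < n; i++) {
        uint64_t x = c[i], y = d[i - 1];
        out[2 * i - 1] = x < y ? x : y;
        out[2 * i]     = x < y ? y : x;
    }
    out[2 * n - 1] = d[n - 1];

    free(c);
    return 0;
}
```

Applied to the example inputs A and B with n = 8, the function fills the output with the keys 0 to 15 in order.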

SIMD Code Structure

main.c
    int main( ) {
        int n = 523770*6;
        int64 *buf;
        buf = cachealign(n);
        mapsort(buf, n);
        free(buf);
        exit(0);
    }

mapsort.mc
    void mapsort(int64 *buf, int n) {
        OBM_BANK_A (AA, int64, n/6)
        OBM_BANK_B (BB, int64, n/6)
        ...
        OBM_BANK_F (FF, int64, n/6)

        DMA_CPU(dir, AA, stripes, buf, n);      /* copy keys into the banks */
        for (i=0; i<rounds; i++) {
            schedule(&r1, &r2);                 /* pick the next pair of rows */
            bitonicsort8(AA[r1],BB[r1],CC[r1],DD[r1],
                         AA[r2],BB[r2],CC[r2],DD[r2],
                         &AA[r1],&BB[r1],&CC[r1],&DD[r1],
                         &AA[r2],&BB[r2],&CC[r2],&DD[r2]);
            bitonicsort4(EE[r1],FF[r1],EE[r2],FF[r2], ...);
        }
        DMA_CPU(dir, AA, stripes, buf, n);      /* copy results back */
        return;
    }

Implementation Comparisons

Algorithm     Processor  Complexity  Language  Lines of Code  Recursion  FPGA Util. (% slices)  Upper Bound (x10^6 keys/s)
Quick Sort    X86        N lg N      C         81             [mark]     n/a
              FPGA       N lg N      MC        97/96                     90, 84                 31.58
Heap Sort     X86        N lg N      C         55             -          n/a
              FPGA       N lg N      MC        56/54                     55, 0                  31.58
Radix Sort    X86        N           C         70             -          n/a
              FPGA       N           MC        81/64                     33, 0                  60.00
Bitonic Sort  X86        N lg^2 N    C         78             [mark]     n/a
              FPGA       lg^2 N      VHDL      53/478/365                27, 0                  6.32
O/E Merge     X86        N           C         52             -          n/a
              FPGA       N           MC        71/120                    40, 0                  60.87

X86 = Dual Xeon 2.8 GHz; FPGA = Virtex2 XC6000 @ 100 MHz; MC = MAP C
Compilers: icc v8.0 -fast, mcc v1.8, mcc v1.9
Refactoring scale: entirely, major changes, some, very little, almost none
[The Compiler, MIMD/SIMD, and Refactoring columns, and the entries shown as [mark], used graphical symbols on the slide that are not recoverable from the text.]

Lesson Learned #1
- Know your tools
- Develop accurate assessments early

Throughput in x10^6 keys/s on a 2.8 GHz Xeon, by compiler, against the FPGA upper bound estimate:

Algorithm     gcc    icc -fast  FPGA upper bound  Speedup bound vs gcc  Speedup bound vs icc
Quick Sort    1.99   5.66       31.58             15.87                 5.58
Heap Sort     0.50   1.06       31.58             63.16                 29.79
Radix Sort    1.63   4.72       60.00             36.81                 12.71
Bitonic Sort  -      -          6.32              -                     -
O/E Merge     -      -          60.87             -                     -
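The speedup bounds on this slide are consistent with dividing the FPGA upper bound estimate by the measured processor throughput. A small C helper makes the arithmetic explicit; the function name is an assumption made for this example, and the figures in the note that follows are the ones printed on the slide.

```c
#include <assert.h>

/* Upper bound on speedup: estimated FPGA throughput over measured CPU
 * throughput, both in millions of keys per second. */
double speedup_bound(double fpga_mkeys_per_s, double cpu_mkeys_per_s)
{
    assert(cpu_mkeys_per_s > 0.0);
    return fpga_mkeys_per_s / cpu_mkeys_per_s;
}
```

For Quick Sort, 31.58 / 1.99 is about 15.87 against gcc and 31.58 / 5.66 is about 5.58 against icc, so switching compilers alone cuts the best possible speedup by almost a factor of three.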

Test Conditions
- 64 bit unsigned integer keys
- Uniformly distributed
- Randomly permuted
- Scores are the average of 10 runs
- FPGA configuration time ~65 ms
- DMA time ~18 ms
- Typical key quantity 3.14M
- Processor comparison: Xeon 2.8 GHz, 1 GB memory

Experimental Results - 64 bit keys
[Bar charts: throughput in x10^6 keys/s per sorting algorithm, X86 vs. FPGA. O/E Merge is plotted on its own scale, 0 to 90; the others share a scale of 0 to 14.]

Algorithm   X86     FPGA
Quick       5.66    2.32
Heap        1.06    1.96
Radix       4.72    12.99
Bitonic     0.69    1.02
O/E Merge   77.03   36

mcc Compiler
- Attempts to pipeline inner loops
- Maintains sequential behavior of C
- Reports dependencies/penalties
  - Quick Sort: 1 penalty*
  - Heap Sort: 12 penalties
  - Radix Sort: 2 penalties
  - Bitonic Sort: 5 penalties
  - Odd/Even Merge: 1 penalty
- Easy to build embarrassingly parallel code
- Resource usage ~2x HDL

Conclusion
- FPGAs not best choice for sorting
- Sorting is memory bound
- Tight loops, low computation suited to processor
- More parallel memory accesses
- Faster clock rates
- Refactoring for better performance
- FPGAs underutilized
- Understand compiler limitations
- Eliminate dependencies

Tight Loop Example
Merge:
    a[N] = b[N] = infinity;     /* sentinels after the last real key */
    j = k = 0;
    for (i = 0; i < 2*N; i++) { /* i = 0 to 2N-1 */
        if (a[j] > b[k])
            merged[i] = b[k++];
        else
            merged[i] = a[j++];
    }
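A compilable version of this sentinel merge is given below as a sketch. The slide's "infinity" is represented by `UINT64_MAX`, which is an assumption of this example and means real keys must be strictly smaller than that value; the function name and the requirement that each input array has one spare slot for the sentinel are likewise choices made here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge sorted a[0..n-1] and b[0..n-1] into merged[0..2n-1].
 * Both a and b must have room for n+1 keys; slot n receives the sentinel.
 * All real keys must be smaller than UINT64_MAX. */
void sentinel_merge(uint64_t *a, uint64_t *b, size_t n, uint64_t *merged)
{
    size_t j = 0, k = 0;

    a[n] = b[n] = UINT64_MAX;   /* sentinel stops either input running out */
    for (size_t i = 0; i < 2 * n; i++) {
        assert(j <= n && k <= n);
        if (a[j] > b[k])
            merged[i] = b[k++];
        else
            merged[i] = a[j++];
    }
}
```

The loop body is one comparison, one copy, and one index increment, with each iteration depending on the indices updated by the previous one.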

Future Work
- More refactoring
- Greater use of block RAMs
- HW prediction to reduce penalties
- FPGA performance gain = ƒ(computation density/memory access)