Sorting on the SRC 6 Reconfigurable Computer
John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang
The George Washington University, Washington, DC
J. Harkins 1 of 51 MAPLD2005/C178
Algorithms
- Quick Sort
- Heap Sort
- Radix Sort
- Bitonic Sort
- Odd/Even Merge
SRC System Architecture
- 16 Port Crossbar Switch
- 1.6 GB/s Peak Port BW
- Node types on the switch (each link marked /64): Processor Node, FPGA Node, Memory Node
- Up to 16 Nodes per Switch
Example - Quick Sort
        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:    [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
med:  [ 0][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][13]
QS1:  [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8][ 9][12][15][14][11][10][13]
m:    [ 0][ 3][ 5][ 7][ 4][ 2][ 6][ 1][ 8]
PS:   [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8]
Rows: 0 = input keys; med = after median-of-3 pivot selection (keys at positions 0, 7, and 15 put in order, pivot 9); QS1 = after the first partition pass; m = the left partition; PS = the left partition after pipeline sort.
Quick Sort - MIMD Architecture
- 6 Instances
- Median of 3 to select pivot
- Pipeline Sort for partitions of 10 keys vs. Insertion Sort at 20 keys
- Banks A, B, C, D, E, F feed QS 1 through QS 6
- Utilization: FPGA 1 90%, FPGA 2 84%
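The processor-side variant named on this slide can be sketched in C as follows. This is a minimal illustration, not the presenters' code: the function names, the use of a Hoare-style partition, and reading the slide's "20" as a partition-size threshold are all assumptions.

```c
#include <stdint.h>

/* Assumed cutoff: partitions of this size or smaller go to insertion sort. */
#define SMALL_PARTITION 20

static void swap_keys(uint64_t *x, uint64_t *y)
{
    uint64_t t = *x; *x = *y; *y = t;
}

/* Put a[lo], a[mid], a[hi] in ascending order; the median lands at mid. */
void median_of_3(uint64_t *a, long lo, long hi)
{
    long mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo])  swap_keys(&a[mid], &a[lo]);
    if (a[hi]  < a[lo])  swap_keys(&a[hi],  &a[lo]);
    if (a[hi]  < a[mid]) swap_keys(&a[hi],  &a[mid]);
}

/* Insertion sort on the inclusive range [lo, hi]. */
static void insertion_sort(uint64_t *a, long lo, long hi)
{
    for (long i = lo + 1; i <= hi; i++) {
        uint64_t key = a[i];
        long j = i;
        while (j > lo && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Quick sort on the inclusive range [lo, hi] with a median-of-3 pivot. */
void quick_sort(uint64_t *a, long lo, long hi)
{
    if (hi - lo + 1 <= SMALL_PARTITION) {
        insertion_sort(a, lo, hi);
        return;
    }
    median_of_3(a, lo, hi);
    uint64_t pivot = a[lo + (hi - lo) / 2];
    long i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) i++;   /* find a key that belongs on the right */
        while (a[j] > pivot) j--;   /* find a key that belongs on the left  */
        if (i <= j) {
            swap_keys(&a[i], &a[j]);
            i++;
            j--;
        }
    }
    quick_sort(a, lo, j);
    quick_sort(a, i, hi);
}
```

Calling `median_of_3(a, 0, 15)` on the 16 example keys orders positions 0, 7, and 15 as 0, 9, 13, which is the "med" row of the example.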
Example - Heap Sort
        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0:    [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
8:    [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
7:    [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
7:    [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6:    [13][ 3][14][15][10][ 2][ 6][ 9][ 8][ 4][12][ 7][ 5][11][ 1][ 0]
6:    [13][ 3][14][15][10][ 2][11][ 9][ 8][ 4][12][ 7][ 5][ 6][ 1][ 0]
max:  [15][13][14][ 9][12][ 7][11][ 3][ 8][ 4][10][ 2][ 5][ 6][ 1][ 0]
[Figure: each step also draws the same keys as a binary tree in array order.]
Heap Sort - MIMD Architecture
- 6 Instances
- Almost identical to processor code
- Banks A, B, C, D, E, F feed HS 1 through HS 6
- Utilization: FPGA 1 55%, FPGA 2 5%
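The heap construction walked through in the example can be sketched in C as below. It is a textbook sift-down heap sort under assumed function names, not the MAP C source; it is included because it reproduces the "max" row of the example from the same input keys.

```c
#include <stddef.h>
#include <stdint.h>

/* Move a[root] down until neither child is larger (max-heap order). */
static void sift_down(uint64_t *a, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;            /* left child */
        if (child >= n) return;                 /* root is a leaf */
        if (child + 1 < n && a[child + 1] > a[child])
            child++;                            /* pick the larger child */
        if (a[root] >= a[child]) return;        /* heap order already holds */
        uint64_t t = a[root]; a[root] = a[child]; a[child] = t;
        root = child;
    }
}

/* Heapify from the last parent (n/2 - 1) down to the root. */
void build_max_heap(uint64_t *a, size_t n)
{
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, i, n);
}

/* Repeatedly move the maximum to the end and restore the heap. */
void heap_sort(uint64_t *a, size_t n)
{
    build_max_heap(a, n);
    for (size_t end = n; end > 1; end--) {
        uint64_t t = a[0]; a[0] = a[end - 1]; a[end - 1] = t;
        sift_down(a, 0, end - 1);
    }
}
```

The loop has no recursion and one data-dependent inner loop, which fits the slide's remark that the FPGA version is almost identical to the processor code.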
Example - Radix Sort
Position:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Buffer 1:  [13][ 3][14][15][10][ 2][ 6][ 0][ 8][ 4][12][ 7][ 5][11][ 1][ 9]
Binary:    1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001

Pass 1 (count the low two bits of every key, then convert counts to start indices):
  count 1 = 4, count 2 = 4, count 3 = 4, count 4 = 4
  index 0 = 0, index 1 = 4, index 2 = 8, index 3 = 12
  index_0 = 0;  index_n = \sum_{i=1}^{n} count_i  for n > 0

Pass 2 (scatter into buffer 2 by the low two bits while counting the high two bits for the next pass):
  Start:                     index = 0, 4, 8, 12;  count = 0, 0, 0, 0
  1101 (13) -> position 4:   index = 0, 5, 8, 12;  count = 0, 0, 0, 1
  0011 ( 3) -> position 12:  index = 0, 5, 8, 13;  count = 1, 0, 0, 1
  1110 (14) -> position 8:   index = 0, 5, 9, 13;  count = 1, 0, 0, 2
  Buffer 2 after three keys: [ ][ ][ ][ ][13][ ][ ][ ][14][ ][ ][ ][ 3][ ][ ][ ]
  Buffer 2 after the full pass:
    0000 1000 0100 1100 1101 0101 0001 1001 1110 1010 0010 0110 0011 1111 0111 1011

Pass 3 (scatter into buffer 3 by the high two bits):
    0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
  Buffer 3:  [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13][14][15]
  Final indices: index 0 = 4, index 1 = 8, index 2 = 12, index 3 = 16
Radix Sort - MIMD Architecture
- 3 Instances
- Uses enumeration sort
- Radix 13 bits vs. 8 bits
- Banks A, B, C, D, E, F; instances Radix Sort 1, Radix Sort 2, Radix Sort 3
- Utilization: FPGA 1 33%, FPGA 2 5%
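The pass structure in the radix sort example (one counting pass, then scatter passes that also count the next digit) can be sketched in C as below. This is an illustration with a 2-bit digit to match the 4-bit example keys; the function names, the `tmp` scratch buffer, and the requirement that `key_bits` be a multiple of `RADIX_BITS` and at most 64 are assumptions, and the 13-bit radix of the FPGA version is not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BITS 2                      /* digit width used in the example */
#define BUCKETS    (1u << RADIX_BITS)
#define DIGIT_MASK ((uint64_t)BUCKETS - 1)

/* Scatter src into dst by the digit at `shift`, using `count` (histogram of
 * that digit) to derive start indices. When next_count is non-NULL, also
 * histogram the following digit so the next pass needs no separate count. */
void scatter_pass(const uint64_t *src, uint64_t *dst, size_t n,
                  unsigned shift, const size_t *count, size_t *next_count)
{
    size_t index[BUCKETS], sum = 0;
    for (unsigned b = 0; b < BUCKETS; b++) {  /* exclusive prefix sum */
        index[b] = sum;
        sum += count[b];
    }
    if (next_count)
        memset(next_count, 0, BUCKETS * sizeof next_count[0]);
    for (size_t i = 0; i < n; i++) {
        uint64_t k = src[i];
        dst[index[(k >> shift) & DIGIT_MASK]++] = k;
        if (next_count)
            next_count[(k >> (shift + RADIX_BITS)) & DIGIT_MASK]++;
    }
}

/* LSD radix sort of n keys that fit in key_bits bits; tmp holds n keys. */
void radix_sort(uint64_t *keys, uint64_t *tmp, size_t n, unsigned key_bits)
{
    size_t count[BUCKETS] = {0}, next[BUCKETS];
    for (size_t i = 0; i < n; i++)            /* pass 1: count lowest digit */
        count[keys[i] & DIGIT_MASK]++;

    uint64_t *src = keys, *dst = tmp;
    for (unsigned shift = 0; shift < key_bits; shift += RADIX_BITS) {
        int more = shift + RADIX_BITS < key_bits;
        scatter_pass(src, dst, n, shift, count, more ? next : NULL);
        if (more)
            memcpy(count, next, sizeof count);
        uint64_t *t = src; src = dst; dst = t; /* ping-pong the buffers */
    }
    if (src != keys)                           /* odd number of passes */
        memcpy(keys, src, n * sizeof keys[0]);
}
```

With the example keys and counts of 4 per bucket, the first `scatter_pass` at shift 0 yields 0, 8, 4, 12, 13, 5, 1, 9, 14, 10, 2, 6, 3, 15, 7, 11, the same order as buffer 2 in the example.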
MIMD Code Structure

main.c:
    int main( ) {
        int n = 523770*6;
        int64 *buf;
        buf = cachealign(n);
        mapsort(buf, n);
        free(buf);
        exit(0);
    }

mapsort.mc:
    void mapsort(int64 *buf, n) {
        OBM_BANK_A (bufa, int64, n/6)
        OBM_BANK_B (bufb, int64, n/6)
        OBM_BANK_F (buff, int64, n/6)
        DMA_CPU(dir, bufa, stripes, buf, n);
        #pragma src parallel sections
        {
            #pragma src section
            {Xsort(bufa, n/6);}
            #pragma src section
            {Xsort(bufb, n/6);}
            #pragma src section
            {Xsort(buff, n/6);}
        }
        DMA_CPU(dir, bufa, stripes, buf, n);
        return;
    }
Example - Bitonic Sort
Schedule: (0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
Input Keys:
  0: [13][ 3][14][15]
  1: [10][ 2][ 6][ 0]
  2: [ 8][ 4][12][ 7]
  3: [ 5][11][ 1][ 9]
[Animation: rows are read in the order 0, 1, 3, 2 and stream through the compare-exchange stages of the network. After the first stages, row 0 appears as 3 13 15 14 while row 1 (10 2 6 0) enters.]
Keys written back after schedule steps (0,1) and (3,2), in the order rows 0, 1, 3, 2:
  0: [ 0][ 2][ 3][ 6]
  1: [10][13][14][15]
  2: [ 8][ 9][11][12]
  3: [ 1][ 4][ 5][ 7]
Bitonic Sort - SIMD Architecture
- 2 Instances
- Parallel sorting network
- Banks A, B, C, D, E, F; 8 Input Bitonic Sorting Network 1 and 4 Input Bitonic Sort 2 under one SIMD Controller
- Utilization: FPGA 1 27%, FPGA 2 5%
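The row schedule in the bitonic example can be read as a 4-input bitonic network whose compare-exchange steps operate on whole rows: each pair (r1, r2) sorts the eight keys of the two rows and writes the smaller four to r1 and the larger four to r2. The C sketch below follows that reading; the function names and the software loop standing in for the 8-input hardware network are assumptions.

```c
#include <stdint.h>

#define ROW_KEYS 4

/* 8-key bitonic sorting network, written as the usual iterative loops. */
static void bitonic_sort8(uint64_t v[8])
{
    for (int k = 2; k <= 8; k <<= 1) {          /* size of sorted runs */
        for (int j = k >> 1; j > 0; j >>= 1) {  /* compare distance */
            for (int i = 0; i < 8; i++) {
                int l = i ^ j;
                if (l <= i) continue;
                int ascending = (i & k) == 0;
                if ((ascending && v[i] > v[l]) || (!ascending && v[i] < v[l])) {
                    uint64_t t = v[i]; v[i] = v[l]; v[l] = t;
                }
            }
        }
    }
}

/* One schedule step: row `lo` receives the 4 smallest keys, `hi` the rest. */
void merge_split(uint64_t rows[][ROW_KEYS], int lo, int hi)
{
    uint64_t v[2 * ROW_KEYS];
    for (int i = 0; i < ROW_KEYS; i++) {
        v[i] = rows[lo][i];
        v[ROW_KEYS + i] = rows[hi][i];
    }
    bitonic_sort8(v);
    for (int i = 0; i < ROW_KEYS; i++) {
        rows[lo][i] = v[i];
        rows[hi][i] = v[ROW_KEYS + i];
    }
}

/* Apply the six-step schedule from the slide to four rows of four keys. */
void block_bitonic_sort(uint64_t rows[4][ROW_KEYS])
{
    static const int schedule[6][2] = {
        {0, 1}, {3, 2}, {0, 2}, {1, 3}, {0, 1}, {2, 3}
    };
    for (int s = 0; s < 6; s++)
        merge_split(rows, schedule[s][0], schedule[s][1]);
}
```

Applying only the first two steps to the example input gives rows 0 to 3 as (0 2 3 6), (10 13 14 15), (8 9 11 12), (1 4 5 7), which matches the written-back keys in the example; the remaining four steps leave all 16 keys in order.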
Example - Odd/Even Merge
Input Keys:
  A: [ 0][ 1][ 2][ 4][ 7][11][12][14]
  B: [ 3][ 5][ 6][ 8][ 9][10][13][15]
Merged Keys:
  C: 16 empty slots at the start
[Figure: two merge lanes feed a MUX; one lane ends in delay Z^-2, the other in delay Z^-1.]
Each step lists the keys still shown in A and B, the merged keys in C, and the lane contents as drawn:
  Step 1: A and B full; C empty; lanes empty
  Step 2: A and B full; C empty; lanes: 0 3 Z^-2 | 1 5 Z^-1
  Step 3: A: [ 2][ 4][ 7][11][12][14]; B: [ 6][ 8][ 9][10][13][15]; C empty; lanes: 2 3 0 Z^-2 | 4 5 1 Z^-1
  Step 4: A: [ 7][11][12][14]; B: [ 6][ 8][ 9][10][13][15]; C empty; lanes: 7 3 2 0 Z^-2 | 11 5 4 1 Z^-1
  Step 5: A: [12][14]; B: [ 6][ 8][ 9][10][13][15]; C: [ 0][ 1]; lanes: 7 6 3 2 Z^-2 0 | 11 8 5 Z^-1 4 1
  Step 6: A: [12][14]; B: [ 9][10][13][15]; C: [ 0][ 1][ 2][ 3]; lanes: 7 9 6 4 Z^-2 2 | 11 10 8 Z^-1 5 3
Odd/Even Merge - SIMD Architecture
- 1 Instance
- Parallel sorting network
- A/B = odd; C/D = even
- Banks A, B, C, D, E, F; Odd Merge Two and Even Merge Two feed Merge Out
- Utilization: FPGA 1 40%, FPGA 2 5%
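The split into an odd merge, an even merge, and a final output stage follows the shape of Batcher's odd-even merge, sketched below in C. This is the textbook algorithm under assumed names and an assumed size limit (`MAX_KEYS`), not a model of the delay registers or bank assignment on the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 64   /* assumed upper limit on the length of each input */

/* Merge every second key of a and b, starting at index `first` (0 = even
 * lane, 1 = odd lane). a and b are sorted and hold n keys each. */
static void merge_lane(const uint64_t *a, const uint64_t *b, size_t n,
                       size_t first, uint64_t *out)
{
    size_t i = first, j = first, k = 0;
    while (i < n && j < n) {
        if (b[j] < a[i]) { out[k++] = b[j]; j += 2; }
        else             { out[k++] = a[i]; i += 2; }
    }
    while (i < n) { out[k++] = a[i]; i += 2; }
    while (j < n) { out[k++] = b[j]; j += 2; }
}

/* Odd-even merge of two sorted n-key lists into c (2n keys).
 * Returns 0 on success, -1 if n is zero, odd, or above MAX_KEYS. */
int odd_even_merge(const uint64_t *a, const uint64_t *b, size_t n, uint64_t *c)
{
    uint64_t even[MAX_KEYS], odd[MAX_KEYS];
    if (n == 0 || n % 2 != 0 || n > MAX_KEYS)
        return -1;

    merge_lane(a, b, n, 0, even);   /* keys at even positions of a and b */
    merge_lane(a, b, n, 1, odd);    /* keys at odd positions of a and b  */

    /* Final stage: the smallest even key is the overall minimum, the
     * largest odd key is the overall maximum, and each remaining pair
     * (odd[i-1], even[i]) needs a single compare-exchange. */
    c[0] = even[0];
    for (size_t i = 1; i < n; i++) {
        uint64_t lo = odd[i - 1], hi = even[i];
        if (lo > hi) { uint64_t t = lo; lo = hi; hi = t; }
        c[2 * i - 1] = lo;
        c[2 * i] = hi;
    }
    c[2 * n - 1] = odd[n - 1];
    return 0;
}
```

For the example inputs the even lane merges to 0 2 3 6 7 9 12 13 and the odd lane to 1 4 5 8 10 11 14 15, so the two lanes can run independently and only the final pairwise stage couples them.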
SIMD Code Structure

main.c:
    int main( ) {
        int n = 523770*6;
        int64 *buf;
        buf = cachealign(n);
        mapsort(buf, n);
        free(buf);
        exit(0);
    }

mapsort.mc:
    void mapsort(int64 *buf, n) {
        OBM_BANK_A (AA, int64, n/6)
        OBM_BANK_B (BB, int64, n/6)
        OBM_BANK_F (FF, int64, n/6)
        DMA_CPU(dir, AA, stripes, buf, n);
        for (i=0; i<rounds; i++) {
            schedule( &r1, &r2);
            bitonicsort8(AA[r1],BB[r1],CC[r1],DD[r1],
                         AA[r2],BB[r2],CC[r2],DD[r2],
                         &AA[r1],&BB[r1],&CC[r1],&DD[r1],
                         &AA[r2],&BB[r2],&CC[r2],&DD[r2]);
            bitonicsort4(EE[r1],FF[r1],EE[r2],FF[r2], );
        }
        DMA_CPU(dir, AA, stripes, buf, n);
        return;
    }
Implementation Comparisons

Algorithm     Processor  Complexity  Language  Lines Of Code  FPGA Util. % Slices  Upper Bound (x10^6 keys/s)
Quick Sort    X86        N lg N      C         81             n/a
              FPGA       N lg N      MC        97/96          90,84                31.58
Heap Sort     X86        N lg N      C         55             n/a
              FPGA       N lg N      MC        56/54          55,0                 31.58
Radix Sort    X86        N           C         70             n/a
              FPGA       N           MC        81/64          33,0                 60.00
Bitonic Sort  X86        N lg^2 N    C         78             n/a
              FPGA       N lg^2 N    VHDL      53/478/365     27,0                 6.32
O/E Merge     X86        N           C         52             n/a
              FPGA       N           MC        71/120         40,0                 60.87

[The Compiler, Recursion, MIMD/SIMD, and Refactoring columns use graphical symbols on the slide. Compiler symbols: icc v8.0 -fast, mcc v1.8, mcc v1.9. Refactoring scale: entirely, major changes, some, very little, almost none. A "-" entry appears for Heap Sort, Radix Sort, and O/E Merge.]
X86 = Dual Xeon 2.8 GHz; FPGA = Virtex2 XC6000 @ 100 MHz; MC = MAP C
Lesson Learned #1
- Know your tools
- Develop accurate assessments early

                                             Quick Sort  Heap Sort  Radix Sort  Bitonic Sort  O/E Merge
2.8 GHz Xeon, gcc (x10^6 keys/s)             1.99        0.50       1.63        -             -
2.8 GHz Xeon, icc -fast (x10^6 keys/s)       5.66        1.06       4.72        -             -
FPGA upper bound estimate (x10^6 keys/s)     31.58       31.58      60.00       6.32          60.87
Upper bound on speedup vs gcc                15.87       63.16      36.81       -             -
Upper bound on speedup vs icc                5.58        29.79      12.71       -             -

Each speedup bound is the FPGA upper bound estimate divided by the measured Xeon rate, e.g. 31.58 / 1.99 = 15.87 for Quick Sort vs gcc.
Test Conditions
- 64 bit unsigned integer keys
- Uniformly distributed
- Randomly permuted
- Scores average of 10 runs
- FPGA configuration time ~65 ms
- DMA time ~18 ms
- Typical key quantity 3.14M
- Processor comparison: Xeon 2.8 GHz, 1 GB mem
Experimental Results - 64 bit keys
[Figure: two bar charts of throughput in x10^6 keys/s per sorting algorithm, X86 vs. FPGA; O/E Merge is plotted on its own 0 to 90 scale.]

Sorting Algorithm  X86    FPGA
Quick              5.66   2.32
Heap               1.06   1.96
Radix              4.72   12.99
Bitonic            0.69   1.02
O/E Merge          77.03  36
mcc Compiler
- Attempts to pipeline inner loops
- Maintains sequential behavior of C
- Reports dependencies/penalties
  - Quick Sort: 1 penalty*
  - Heap Sort: 12 penalties
  - Radix Sort: 2 penalties
  - Bitonic Sort: 5 penalties
  - Odd/Even Merge: 1 penalty
- Easy to build embarrassingly parallel code
- Resource usage ~2x HDL
Conclusion
- FPGAs not best choice for sorting
- Sorting is memory bound
- Tight loops, low computation suited to processor
- More parallel memory accesses
- Faster clock rates
- Refactoring for better performance
- FPGAs underutilized
- Understand compiler limitations
- Eliminate dependencies
Tight Loop Example
Merge
    a[n]=b[n]=infinity;
    j=k=0;
    Loop i = 0 to 2N-1 {
        if (a[j] > b[k]) merged[i] = b[k++];
        else merged[i] = a[j++];
    }
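A compilable C rendering of the slide's merge loop is sketched below. The function name is hypothetical, `UINT64_MAX` stands in for "infinity", and the sketch assumes that real keys are strictly smaller than `UINT64_MAX` and that each input array has one spare slot at index n for the sentinel.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge two sorted n-key arrays into merged (2n keys).
 * a and b must each have room for n + 1 keys: slot n receives a sentinel
 * so the loop body needs no bounds checks. Keys must be < UINT64_MAX. */
void sentinel_merge(uint64_t *a, uint64_t *b, size_t n, uint64_t *merged)
{
    a[n] = b[n] = UINT64_MAX;          /* "infinity" from the slide */
    size_t j = 0, k = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        if (a[j] > b[k])
            merged[i] = b[k++];
        else
            merged[i] = a[j++];
    }
}
```

Each iteration does one comparison and one store, and the next read address depends on the outcome of that comparison, which is the kind of loop-carried dependency the compiler reports as a penalty.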
Future Work
- More refactoring
- Greater use of block rams
- HW prediction to reduce penalties
- FPGA performance gain = ƒ(computation density/memory access)