Design and Evaluation of an Auto-Memoization Processor

0

No comments posted yet

Comments

Slide 1

Design and Evaluation of an Auto-Memoization Processor Tomoaki TSUMURA (Nagoya Inst. of Tech.) Ikuma SUZUKI (Toyohashi Univ. of Tech.) Yasuki IKEUCHI (Toyohashi Univ. of Tech.) Hiroshi MATSUO (Nagoya Inst. of Tech.) Hiroshi NAKASHIMA (Kyoto Univ.) Yasuhiko NAKASHIMA (Nara Inst. of Sci. and Tech.)

Slide 2

Table of Contents Background ILP-based Speedup Techniques Memoization Auto-Memoization Processor Overview Structure and Execution Model Parallel Speculative Execution Performance Evaluation

Slide 3

Background: ILP Modern commercial microprocessors High performance w/ ILP-based technique Superscalar VLIW Limit of ILP-based technique Many programs have little distinct ILPs Resource conflict, Memory throughput Power dissipation Memory barrier

Slide 4

Background Conventional programs become impossible to gain MORE performance only with ILP Our goal: auto-memoization Binary compatible speedup technique It is not ILP-based It can be used together with ILP-based techniques How to skip execution

Slide 5

Background: Memoization Memoization Programming technique Stores the result of functions Reuse the result (next call) and omit execution int fib(int x){ if( x == 0 || x == 1){ return 1; else return fib(x-1)+fib(x-2); } int memo[N]; int fib(int x){ if( memo[x] ) return( memo[x] ); if( x == 0 || x == 1){ memo[x] = 1; }else{ memo[x] = fib(x-1)+fib(x-2); } return( memo[x] ); } Memoize

Slide 6

ILP and Memoization Processing stream lattice order ILP-based techniques shrink lattice horizontally Memoization shrink lattice vertically

Slide 7

Table of Contents Background ILP-based Speedup Techniques Memoization Auto-Memoization Processor Overview Structure and Execution Model Parallel Speculative Execution Performance Evaluation

Slide 8

Auto-Memoization Processor: Overview Memoization by HW Automatically Fast input matching enlarges the target Target Functions and Loop iterations = instruction regions func: : : ret %x main: : call func : : .LL3: : : ba .LL3 : functions between a callee label and return instruction loop iterations between bkwd branch and branch target

Slide 9

Inputs/Outputs of Instruction Region Inputs arguments (for functions) variables read in the region exclude function-local variables Outputs return values (for functions) variables written in the region exclude function-local variables data area stack area upper limit of stack size (by OS) current stack pointer stack pointer just before current function is called global variables local variables parent-local variables

Slide 10

Save inputs & outputs into linear lists Structure & Execution Model Memo Tbl. Main memory Processor Core D$2 D$1 Memoize system Regs ALU Save inputs & outputs for future reuse Compare past input set & current input Write back outputs skip exec. & speedup Memo Buf.

Slide 11

Memo Tables MemoBuf. (RAM) linear lists (we assume six lines) inputs/outputs of nested inner region are also memorized as inputs/outputs of outer region(s) each line corresponds to each of the multilevel regions works as write buffer for MemoTbl. MemoTbl. (CAM) assosiative memory each linear list of MemoBuf. is folded into MemoTbl.

Slide 12

Input Tree Input sequence of region some values of inputs can alter the following input addresses int g = 4; int h = 5; int main(){ return B(3); } int B(int x){ int b = x; if(x > 2) b += g; else b += h; return b; } &g &x PC B() 3 4 &h 2 5 Whole inputs patterns can be represented by multiway tree

Slide 13

Storing/Searching MemoTbl. &g &x PC B() 3 4 &h 2 5 00 01 02 03 04 05 06 07 00 01 03 04 02 cache Storing Searching h x all inputs are matched

Slide 14

Extension with Speculative Cores Memoization can reuse the region iff the input set is completely same with one of the past input sets. memoization-unfriendly region/programs ex.) A loop which uses its iterator variable as its input Speculative Cores executes the same region with main core using predicted input set store the result into MemoTbl.

Slide 15

Performance Evaluation Simulator SPARC V8 simulator 2MB CAM: based on MOSAID DC18288 Parameters CAM  Reg. matching latency: 32Byte/cycle CAM  Mem. matching latency: 32Byte/2cycle Workloads GENEsYs: Genetic Algorithm single core SPEC CPU95 (train) w/ three speculative cores

Slide 16

GENEsYs Genetic Algorithms (single core) genotype op. (crossover & mutation) calc fitness genotype selection initialization etc. Reduced cycles: 83% (max) 28% (ave.)

Slide 17

SPEC CPU95 (w/ three speculative cores) Reduced cycles (max) one core :32% w/ Sp. cores :65% Reduced cycles (ave.) one core :13% w/ Sp. cores :27% cache misses are also reduced

Slide 18

Conclusion Auto-Memoization Processor Non-ILP-based speedup technique automatically memoize multilevel regions memoization and ILP-based technique can complement each other Memoization with speculative cores helps memoization-unfriendly regions main core need to pay no backout penalty Future work hardware cost estimation energy-efficiency estimation better management algorithms for memo tables

Slide 19

Thank you Design and Evaluation of an Auto-Memoization Processor

Summary: Tomoaki Tsumura, Ikuma Suzuki, Yasuki Ikeuchi, Hiroshi Matsuo, Hiroshi Nakashima and Yasuhiko Nakashima: "Design and Evaluation of an Auto-Memoization Processor", Proc. of Parallel and Distributed Computing and Networks (PDCN2007), pp.245-250 (Feb. 2007)

URL:
More by this User
Most Viewed