|
|
Design and Evaluation of an Auto-Memoization Processor Tomoaki TSUMURA (Nagoya Inst. of Tech.) Ikuma SUZUKI (Toyohashi Univ. of Tech.) Yasuki IKEUCHI (Toyohashi Univ. of Tech.) Hiroshi MATSUO (Nagoya Inst. of Tech.) Hiroshi NAKASHIMA (Kyoto Univ.) Yasuhiko NAKASHIMA (Nara Inst. of Sci. and Tech.)
Table of Contents Background ILP-based Speedup Techniques Memoization Auto-Memoization Processor Overview Structure and Execution Model Parallel Speculative Execution Performance Evaluation
Background: ILP Modern commercial microprocessors High performance w/ ILP-based technique Superscalar VLIW Limit of ILP-based technique Many programs have little distinct ILPs Resource conflict, Memory throughput Power dissipation Memory barrier
Background Conventional programs become impossible to gain MORE performance only with ILP Our goal: auto-memoization Binary compatible speedup technique It is not ILP-based It can be used together with ILP-based techniques How to skip execution
Background: Memoization Memoization Programming technique Stores the result of functions Reuse the result (next call) and omit execution int fib(int x){ if( x == 0 || x == 1){ return 1; else return fib(x-1)+fib(x-2); } int memo[N]; int fib(int x){ if( memo[x] ) return( memo[x] ); if( x == 0 || x == 1){ memo[x] = 1; }else{ memo[x] = fib(x-1)+fib(x-2); } return( memo[x] ); } Memoize
ILP and Memoization Processing stream lattice order ILP-based techniques shrink lattice horizontally Memoization shrink lattice vertically
Table of Contents Background ILP-based Speedup Techniques Memoization Auto-Memoization Processor Overview Structure and Execution Model Parallel Speculative Execution Performance Evaluation
Auto-Memoization Processor: Overview Memoization by HW Automatically Fast input matching enlarges the target Target Functions and Loop iterations = instruction regions func: : : ret %x main: : call func : : .LL3: : : ba .LL3 : functions between a callee label and return instruction loop iterations between bkwd branch and branch target
Inputs/Outputs of Instruction Region Inputs arguments (for functions) variables read in the region exclude function-local variables Outputs return values (for functions) variables written in the region exclude function-local variables data area stack area upper limit of stack size (by OS) current stack pointer stack pointer just before current function is called global variables local variables parent-local variables
Save inputs & outputs into linear lists Structure & Execution Model Memo Tbl. Main memory Processor Core D$2 D$1 Memoize system Regs ALU Save inputs & outputs for future reuse Compare past input set & current input Write back outputs skip exec. & speedup Memo Buf.
Memo Tables MemoBuf. (RAM) linear lists (we assume six lines) inputs/outputs of nested inner region are also memorized as inputs/outputs of outer region(s) each line corresponds to each of the multilevel regions works as write buffer for MemoTbl. MemoTbl. (CAM) assosiative memory each linear list of MemoBuf. is folded into MemoTbl.
Input Tree Input sequence of region some values of inputs can alter the following input addresses int g = 4; int h = 5; int main(){ return B(3); } int B(int x){ int b = x; if(x > 2) b += g; else b += h; return b; } &g &x PC B() 3 4 &h 2 5 Whole inputs patterns can be represented by multiway tree
Storing/Searching MemoTbl. &g &x PC B() 3 4 &h 2 5 00 01 02 03 04 05 06 07 00 01 03 04 02 cache Storing Searching h x all inputs are matched
Extension with Speculative Cores Memoization can reuse the region iff the input set is completely same with one of the past input sets. memoization-unfriendly region/programs ex.) A loop which uses its iterator variable as its input Speculative Cores executes the same region with main core using predicted input set store the result into MemoTbl.
Performance Evaluation Simulator SPARC V8 simulator 2MB CAM: based on MOSAID DC18288 Parameters CAM Reg. matching latency: 32Byte/cycle CAM Mem. matching latency: 32Byte/2cycle Workloads GENEsYs: Genetic Algorithm single core SPEC CPU95 (train) w/ three speculative cores
GENEsYs Genetic Algorithms (single core) genotype op. (crossover & mutation) calc fitness genotype selection initialization etc. Reduced cycles: 83% (max) 28% (ave.)
SPEC CPU95 (w/ three speculative cores) Reduced cycles (max) one core :32% w/ Sp. cores :65% Reduced cycles (ave.) one core :13% w/ Sp. cores :27% cache misses are also reduced
Conclusion Auto-Memoization Processor Non-ILP-based speedup technique automatically memoize multilevel regions memoization and ILP-based technique can complement each other Memoization with speculative cores helps memoization-unfriendly regions main core need to pay no backout penalty Future work hardware cost estimation energy-efficiency estimation better management algorithms for memo tables
Thank you Design and Evaluation of an Auto-Memoization Processor
Summary: Tomoaki Tsumura, Ikuma Suzuki, Yasuki Ikeuchi, Hiroshi Matsuo, Hiroshi Nakashima and Yasuhiko Nakashima: "Design and Evaluation of an Auto-Memoization Processor", Proc. of Parallel and Distributed Computing and Networks (PDCN2007), pp.245-250 (Feb. 2007)
| URL: |
No comments posted yet
Comments