University of Colorado at Boulder Core Research Lab Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald University of Colorado.

Transcript:

University of Colorado at Boulder Core Research Lab
Towards a Toolchain for Pipeline-Parallel Programming on CMPs
John Giacomoni
Tipp Moseley, Graham Price, Brian Bushnell, Manish Vachharajani, and Dirk Grunwald
University of Colorado at Boulder

Problem
Uniprocessor (UP) performance at end of life
Chip-Multiprocessor systems
– Individual cores less powerful than UP
– Asymmetric and Heterogeneous
– 10s-100s-1000s of cores
How to program?
[Figure: example chips: Intel (2x2-core), MIT RAW (16-core), 100-core, 400-core]

Programmers…
Programmers are:
– Bad at explicitly parallel programming
– Better at sequential programming
Solutions?
– Hide parallelism
    Compilers
    Sequential libraries?
    – Math, iteration, searching, and ??? Routines

Using Multi-Core
Task Parallelism
– Desktop
Data Parallelism
– Web serving
– Split/Join, MapReduce, etc…
Pipeline Parallelism
– Video Decoding
– Network Processing

Joining the Minority Chorus
We believe that the best strategy for developing parallel programs may be to evolve them from sequential implementations. Therefore we need a toolchain that assists programmers in converting sequential programs into parallel ones. This toolchain will need to support all four conversion stages: identification, implementation, verification, and runtime system support.

The Toolchain
Identification
– LoopProf and LoopSampler
– ParaMeter
Implementation
– Concurrent Threaded Pipelining
Verification
Runtime system support

LoopProf and LoopSampler
Thread level parallelism benefits from coarse grain information
– Not provided by gprof, et al.
Visualize relationship between functions and hot loops
No recompilation
LoopSampler is effectively overhead free

Partial Loop Call Graph
[Figure: partial loop call graph. Boxes are functions; ovals are loops.]

ParaMeter
Dynamic Instruction Number vs. Ready Time graph
Visualize dependence chains
– Fast random access of trace information
– Compact representation
Trace Slicing
– Moving forward or backwards in a trace based on a flow (control, dependences, etc.)
– Requires information from disparate trace locations
Variable Liveness Analysis
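
The slide names the DIN vs. Ready Time graph and trace slicing without defining them. A minimal sketch of both ideas, assuming that "ready time" means the earliest cycle at which all of an instruction's producers have finished on an idealized machine with unlimited resources. The trace format, the unit latencies, and the names `ready_times`, `backward_slice`, and `TRACE` are hypothetical and not taken from ParaMeter.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical trace record: (DIN, DINs of producers, latency in cycles).
TraceEntry = Tuple[int, Sequence[int], int]

# Two independent dependence chains (0->2->4 and 1->3->5) that join at DIN 6.
TRACE: List[TraceEntry] = [
    (0, [], 1), (1, [], 1),
    (2, [0], 1), (3, [1], 1),
    (4, [2], 1), (5, [3], 1),
    (6, [4, 5], 1),
]


def ready_times(trace: Sequence[TraceEntry]) -> Dict[int, int]:
    """Map each DIN to the earliest cycle at which all its inputs exist."""
    latency = {din: lat for din, _, lat in trace}
    ready: Dict[int, int] = {}
    for din, deps, _ in trace:  # the trace is in DIN (program) order
        ready[din] = max((ready[d] + latency[d] for d in deps), default=0)
    return ready


def backward_slice(trace: Sequence[TraceEntry], din: int) -> List[int]:
    """Collect every DIN that the given instruction transitively depends on."""
    producers = {d: tuple(deps) for d, deps, _ in trace}
    seen = set()
    stack = [din]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(producers[current])
    return sorted(seen)


if __name__ == "__main__":
    for din, time in ready_times(TRACE).items():
        print(din, time)  # one (DIN, ready time) point per instruction
    print(backward_slice(TRACE, 4))
```

Plotting the (DIN, ready time) pairs gives one line per dependence chain. Instructions that share a ready time but sit on different chains are candidates for running in parallel. The slice walks producer links that may be far apart in the trace, which is why fast random access matters.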

DIN vs. Ready Time
[Figure: DIN vs. ready time plot]

DIN vs. Ready Time
[Figure: DIN plot for 254.gap (IA64, gcc, inf), showing multiple dependence chains]

Handling the Information Glut
Challenging
Trace size
Trace complexity
Need fast random access
Solution
– Binary Decision Diagrams
– Compression ratios: 16-60x
    10^9 instructions in 1GB

Implementation
Well researched
– Task-Parallel
– Data-Parallel
More work to be done
– Pipeline-Parallel
    Concurrent Threaded Pipelining
    – FastForward
    – DSWP
    Stream languages
    – StreamIt

Concurrent Threaded Pipelining
Pipeline-Parallel organization
– Each stage bound to a processor
– Sequential data flow
Data Hazards are a problem
Software solution
– FastForward
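
The pipeline-parallel organization can be sketched as one thread per stage, with a FIFO queue between neighbouring stages so that data still flows in sequential order. This shows the structure only. Python threads are not pinned to processors, `queue.Queue` is a locking queue rather than FastForward, and the names `run_pipeline` and `_DONE` are hypothetical.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List, Sequence

_DONE = object()  # sentinel passed down the pipeline to shut each stage down


def run_pipeline(stages: Sequence[Callable[[Any], Any]],
                 items: Iterable[Any]) -> List[Any]:
    """Run each stage in its own thread; return the outputs in input order."""
    # One bounded FIFO in front of every stage, plus one for the final output.
    queues = [queue.Queue(maxsize=64) for _ in range(len(stages) + 1)]

    def worker(stage: Callable[[Any], Any],
               inbox: queue.Queue, outbox: queue.Queue) -> None:
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # forward the shutdown signal
                return
            outbox.put(stage(item))

    def feed() -> None:
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    threads = [threading.Thread(target=worker,
                                args=(stage, queues[i], queues[i + 1]))
               for i, stage in enumerate(stages)]
    threads.append(threading.Thread(target=feed))
    for thread in threads:
        thread.start()

    # Drain the last queue in the calling thread so bounded queues never stall.
    results: List[Any] = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5)))
```

Each stage is a single thread reading a single FIFO, so outputs keep the input order without extra synchronization. This is the property that suits pipelines to workloads with dependencies between frames.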

Threaded Pipelining
[Figure: concurrent vs. sequential threaded pipelining]

Related Work
Software architectures
– Click, SEDA, Unix pipes, sysv queues, etc…
– Locking queues take >= 1200 cycles (600ns)
    Additional overhead for cross-domain communication
Compiler extracted pipelines
– Decoupled Software Pipelining (DSWP)
    Modified IMPACT compiler
    Communication operations <= 100 cycles (50ns)
    – Assumes hardware queues
Decoupled Access/Execute Architectures

FastForward
Portable software only framework
~70-80 cycles (35-40ns)/queue operation
Core-core & die-die
– Architecturally tuned CLF queues
    Works with all consistency models
    Temporal slipping & prefetching to hide die-die communication
– Cross-domain communication
    Kernel/Process/Thread
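
The slide does not describe the queue's internals. The sketch below shows one common way to build a single-producer/single-consumer queue without locks, and it rests on assumptions: that "CLF" means concurrent lock-free, and that each slot's own value marks it empty or full, so the producer and consumer never read each other's index. The class name `SlotFlagQueue` and its methods are hypothetical. Python gives none of the cache-line or memory-ordering behaviour that a tuned C implementation would depend on, so this shows only the indexing logic.

```python
from typing import Any, List, Optional


class SlotFlagQueue:
    """Single-producer/single-consumer ring; a slot holding None is empty."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[Any]] = [None] * capacity
        self._head = 0  # touched only by the producer
        self._tail = 0  # touched only by the consumer

    def try_enqueue(self, item: Any) -> bool:
        """Producer side: store item, or return False if the queue is full."""
        if item is None:
            raise ValueError("None is reserved as the empty-slot marker")
        if self._slots[self._head] is not None:
            return False  # consumer has not yet drained this slot
        self._slots[self._head] = item
        self._head = (self._head + 1) % len(self._slots)
        return True

    def try_dequeue(self) -> Optional[Any]:
        """Consumer side: return the next item, or None if the queue is empty."""
        item = self._slots[self._tail]
        if item is None:
            return None  # producer has not yet filled this slot
        self._slots[self._tail] = None  # hand the slot back to the producer
        self._tail = (self._tail + 1) % len(self._slots)
        return item


if __name__ == "__main__":
    q = SlotFlagQueue(2)
    print(q.try_enqueue("frame-0"), q.try_enqueue("frame-1"))
    print(q.try_enqueue("frame-2"))  # queue is full at this point
    print(q.try_dequeue(), q.try_dequeue(), q.try_dequeue())
```

Because the full/empty test reads only the slot itself, the two sides share no index variable. The head and tail can also sit several slots apart, which leaves room for the consumer to lag behind the producer.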

Network Scenario
GigE Network Properties:
– 1,488,095 frames/sec
– 672 ns/frame
– Frame dependencies
[Figure: network processing pipeline; labels: FShm, "How do we protect?"]
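
The GigE figures follow from minimum-size Ethernet frames. A short worked calculation, assuming the standard 64-byte minimum frame, 8-byte preamble, and 12-byte inter-frame gap (none of which are stated on the slide), then compares the per-operation queue costs quoted on the Related Work and FastForward slides against the per-frame time budget. The function names are hypothetical.

```python
# Standard Ethernet framing constants (assumed; not given on the slide).
MIN_FRAME_BYTES = 64
PREAMBLE_BYTES = 8
INTERFRAME_GAP_BYTES = 12
LINK_BITS_PER_SECOND = 10**9  # gigabit Ethernet: one bit per nanosecond


def frame_time_ns(frame_bytes: int = MIN_FRAME_BYTES) -> int:
    """Time on the wire for one frame, including preamble and gap."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return wire_bits * 10**9 // LINK_BITS_PER_SECOND


def frames_per_second(frame_bytes: int = MIN_FRAME_BYTES) -> int:
    """Whole frames that fit in one second of link time."""
    return 10**9 // frame_time_ns(frame_bytes)


def budget_share(operation_ns: float, frame_ns: float) -> float:
    """Fraction of the per-frame budget spent on one queue operation."""
    return operation_ns / frame_ns


# Per-operation costs as quoted on the slides (bounds, not measurements here).
QUEUE_COSTS_NS = {
    "locking queue (>= 600 ns)": 600,
    "DSWP hardware queue (<= 50 ns)": 50,
    "FastForward (up to 40 ns)": 40,
}

if __name__ == "__main__":
    budget = frame_time_ns()
    print(budget, "ns per frame;", frames_per_second(), "frames/sec")
    for name, cost in QUEUE_COSTS_NS.items():
        print(f"{name}: {budget_share(cost, budget):.1%} of the frame budget")
```

If a stage pays one dequeue and one enqueue per frame, a 600 ns locking queue already exceeds the 672 ns budget before the stage does any work. Two 40 ns operations leave most of the budget free.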

Verification
Characterize run-time behavior with static analysis
– Test generation
– Code verification
– Post-mortem root-fault analysis
    Identify the frontier of states leading to an observed fault
    Use formal methods to find fault-lines

Runtime System Support
Hardware virtualization
– Asymmetric and heterogeneous cores
– Cores may not share main memory (GPU)
Pipelined OS services
Pipelines may cross process domains
– FShm
– Each domain should keep its private memory
    Protection
Need label for each pipeline
– Co/gang-scheduling of pipelines

Questions?