
The Current State, Trends, and Future of Supercomputing
Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory

Overview
- Computational Science
- High Performance Computing
- Projections for the Future

Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build the system.
Limitations:
- Too difficult -- build large wind tunnels.
- Too expensive -- build a throw-away passenger jet.
- Too slow -- wait for climate or galactic evolution.
- Too dangerous -- weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.

Computational Science Fuses Three Distinct Elements (diagram)

Look at the Fastest Computers
Supercomputing matters:
- Essential for scientific discovery
- Critical for national security
- Fundamental contributor to the economy and competitiveness through use in engineering and manufacturing
- Supercomputers are the tool for solving the most challenging problems through simulations

H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark -- Ax=b, dense problem
- Updated twice a year: SCxy in the States in November, meeting in Germany in June
- All data available from www.top500.org
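A minimal sketch of what that yardstick measures -- timing a dense Ax=b solve and converting to Gflop/s with the standard LINPACK flop count. This uses NumPy's LAPACK-backed solver rather than the actual HPL benchmark code, and the matrix size n is an arbitrary illustrative choice:

    import time
    import numpy as np

    def linpack_like_gflops(n=4096, seed=0):
        """Time a dense Ax=b solve and report an HPL-style Gflop/s rate."""
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((n, n))
        b = rng.standard_normal(n)

        t0 = time.perf_counter()
        x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
        elapsed = time.perf_counter() - t0

        flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # standard LINPACK flop count
        residual = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
        return flops / elapsed / 1e9, residual

    if __name__ == "__main__":
        gflops, res = linpack_like_gflops()
        print(f"~{gflops:.1f} Gflop/s, scaled residual {res:.2e}")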

Top50 Supercomputers: Russia (chart)

Top100 Supercomputers: China (chart)

Top100 Supercomputers: India (chart)

Performance Development (chart of TOP500 performance over time; annotations: "6-8 years", "My Laptop")

Performance and concurrency milestones:
- Cray-2: 1 Gflop/s, O(1) thread
- ASCI Red: 1 Tflop/s, O(10^3) threads
- RoadRunner: 1.1 Pflop/s, O(10^6) threads
- Exascale: 1 Eflop/s, O(10^9) threads
(Time annotations on the slide -- ~1000 years, ~1 year, ~8 hours, ~1 min. -- show roughly how long each generation needs for the work a 1 Eflop/s system does in about a minute.)
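A back-of-the-envelope check of those time annotations, assuming the fixed workload is one minute of computing at 1 Eflop/s (an assumption made only to make the comparison concrete):

    # How long does one "exaflop-minute" of work (1e18 flop/s * 60 s) take on each system?
    WORK = 1e18 * 60  # floating-point operations

    systems = {
        "Cray-2":     1e9,     # 1 Gflop/s
        "ASCI Red":   1e12,    # 1 Tflop/s
        "RoadRunner": 1.1e15,  # 1.1 Pflop/s
        "Exascale":   1e18,    # 1 Eflop/s
    }

    for name, rate in systems.items():
        seconds = WORK / rate
        if seconds >= 3.15e7:
            print(f"{name:>10}: ~{seconds / 3.15e7:,.0f} years")
        elif seconds >= 3600:
            print(f"{name:>10}: ~{seconds / 3600:,.0f} hours")
        else:
            print(f"{name:>10}: ~{seconds / 60:,.0f} minutes")
    # Cray-2 ~1,905 years, ASCI Red ~2 years, RoadRunner ~15 hours, Exascale ~1 minute --
    # the same ballpark as the ~1000 years / ~1 year / ~8 hours / ~1 min annotations.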

Processors Used in Supercomputers (chart): Intel 71%, AMD 13%, IBM 7%

How Are the Processors Connected? (chart of interconnect share, in percent)

Efficiency

Countries / System Share (chart; shares shown: 58%, 9%, 5%, 4%, 3%, 2%, 1%)

Russian Top500 Systems
- Rank 35: Joint Supercomputer Center -- HP Cluster 3000, Xeon 3 GHz, InfiniBand
- Moscow State University -- SKIF/T-Platforms T60, Intel quad-core 3 GHz, InfiniBand
- Kurchatov Institute, Moscow -- HP Cluster 3000, Xeon 2.33 GHz, InfiniBand
- Moscow State University -- IBM Blue Gene/P Solution
- Ufa State Aviation Technical University -- IBM HS21 Cluster, Xeon quad-core 2.33 GHz, InfiniBand
- Vyatsky State University -- HP Cluster 3000, Xeon 2.33 GHz, InfiniBand
- Roshydromet -- SGI Altix ICE 8200, Xeon quad-core 2.83 GHz
- Siberian National University -- IBM HS21 Cluster, Xeon quad-core 2.33 GHz, InfiniBand

Customer Segments

Industrial Use of Supercomputers
Of the 500 fastest supercomputers worldwide, industrial use is > 60%:
Aerospace, Automotive, Biology, CFD, Database, Defense, Digital Content Creation, Digital Media, Electronics, Energy, Environment, Finance, Gaming, Geophysics, Image Processing/Rendering, Information Processing Service, Information Service, Life Science, Media, Medicine, Pharmaceutics, Research, Retail, Semiconductor, Telecomm, Weather and Climate Research, Weather Forecasting

Distribution of the Top500 (chart of Rmax in Tflop/s vs. rank)
- Range of Rmax: 12.6 Tflop/s (entry level) to 1.1 Pflop/s (#1)
- 2 systems > 1 Pflop/s
- 19 systems > 100 Tflop/s
- 51 systems > 50 Tflop/s
- 119 systems > 25 Tflop/s
- 8 Russian supercomputers on the list
- 267 systems replaced since the previous list

32nd List: The TOP10 (table)

LANL Roadrunner: A Petascale System in 2008
- Hybrid design (2 kinds of chips & 3 kinds of cores): dual-core Opteron chips, plus a Cell chip for each Opteron core
- Based on the 100 Gflop/s (DP) Cell chip; 13,000 Cell HPC chips deliver 1.33 Pflop/s
- 7,000 dual-core Opterons; 122,000 cores in total
- Connected Unit cluster: 192 Opteron nodes (180 w/ 2 dual-Cell blades connected w/ 4 PCIe x8 links)
- 17 clusters, 2nd-stage InfiniBand 4x DDR interconnect (18 sets of 12 links to 8 switches)
- Programming required at 3 levels
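The 1.33 Pflop/s figure follows directly from the chip counts on the slide; a one-line sanity check:

    # Peak from the Cell chips alone: ~13,000 chips x 100 Gflop/s (DP) each.
    cell_chips = 13_000
    gflops_per_cell = 100                               # double-precision peak per Cell chip
    peak_pflops = cell_chips * gflops_per_cell / 1e6    # Gflop/s -> Pflop/s
    print(f"Cell peak: ~{peak_pflops:.2f} Pflop/s")     # ~1.30 Pflop/s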

ORNL's Newest System: Jaguar XT5 (Office of Science)
The systems will be combined after acceptance of the new XT5 upgrade. Each system will be linked to the file system through 4x DDR InfiniBand.
- Peak performance: 1,645 TF combined
- AMD Opteron cores: 181,504 total (150,176 in the XT5, 31,328 in the XT4)
- Disk space: 10,750 TB combined
- (Table rows also listed system memory (TB), disk bandwidth (GB/s), and interconnect bandwidth (TB/s) per partition.)
Center (40,000 ft^2, ~3,700 m^2):
- Upgrading power to 15 MW
- Deploying a 6,600-ton chiller plant
- Tripling UPS and generator capability

HPC System at the University of Tennessee's National Institute for Computational Sciences
- Housed at ORNL, operated for the NSF, named Kraken
- Today: Cray XT5 (608 TF) + Cray XT4 (167 TF)
  XT5: 16,512 sockets, 66,048 cores
  XT4: 4,512 sockets, 18,048 cores
- Number 15 on the Top500
- Later in 2009: upgrading to 1 Pflop/s

ORNL/UTK Computer Power Cost Projections
- Over the next 5 years ORNL/UTK will deploy 2 large petascale systems
- Using 15 MW today; by 2012 close to 50 MW!!
- Power costs close to $10M per year today (cost estimates based on $0.07 per kWh), growing past $20M and $30M per year as the systems scale up
- Power becomes the architectural driver for future large systems
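The ~$10M figure follows from the load and the quoted electricity rate; a minimal sketch of that arithmetic (the 2008 and 2012 loads are the 15 MW and ~50 MW figures from the slide, and continuous full-load operation is assumed):

    # Annual electricity cost = load (MW) * 1000 kW/MW * 8760 h/year * $/kWh
    RATE = 0.07  # dollars per kWh, from the slide

    def annual_power_cost(load_mw, rate=RATE):
        return load_mw * 1_000 * 8_760 * rate

    for year, load_mw in [(2008, 15), (2012, 50)]:
        print(f"{year}: {load_mw:>2} MW -> ${annual_power_cost(load_mw) / 1e6:.1f}M per year")
    # 2008: 15 MW -> $9.2M per year
    # 2012: 50 MW -> $30.7M per year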

Power is an Industry-Wide Problem
- "Hiding in Plain Sight, Google Seeks More Power," by John Markoff, NYT, June 14, 2006 (photo: Google plant in The Dalles, Oregon)
- Google facilities leveraging hydroelectric power at old aluminum plants
- Microsoft and Yahoo are building big data centers upstream in Wenatchee and Quincy, Wash., to keep up with Google -- which means they need cheap electricity and readily accessible data networking
- Microsoft Quincy, Wash.: 470,000 sq ft, 47 MW!

Something's Happening Here…
- In the old days: each year processors would become faster
- Today the clock speed is fixed or getting slower
- Things are still doubling roughly every two years
- Moore's Law reinterpreted: the number of cores doubles roughly every two years (from K. Olukotun, L. Hammond, H. Sutter, and B. Smith)
- A hardware issue just became a software problem

Moore's Law Reinterpreted
- Number of cores per chip doubles every 2 years, while clock speed decreases (not increases)
- Number of threads of execution doubles every 2 years
- Need to deal with systems with millions of concurrent threads; future generations will have billions of threads!
- Need to be able to easily replace inter-chip parallelism with intra-chip parallelism
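A toy projection of where a two-year doubling leads, assuming a starting point of O(10^6) threads in 2008 (RoadRunner-class concurrency; the start year and count are illustrative assumptions):

    # Project thread counts under a "doubles every 2 years" rule.
    start_year, start_threads = 2008, 1e6   # assumed: RoadRunner-class concurrency

    for year in range(start_year, start_year + 21, 4):
        threads = start_threads * 2 ** ((year - start_year) / 2)
        print(f"{year}: ~{threads:.0e} threads")
    # Reaches ~1e9 threads roughly 20 years out under this simple rule.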

Power Cost of Frequency
- Power ∝ Voltage^2 × Frequency (V^2 F)
- Frequency ∝ Voltage
- Therefore Power ∝ Frequency^3
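Written out, this is the standard dynamic-power argument (generic CMOS reasoning, not specific to any particular chip):

    % Dynamic power with switched capacitance C, supply voltage V, clock frequency f:
    %   P \propto C V^2 f
    % In the regime where achievable frequency scales with voltage, f \propto V, so:
    \[
      P \;\propto\; C V^{2} f, \qquad f \propto V \;\Longrightarrow\; P \;\propto\; C f^{3}
    \]
    % Hence a 10% drop in frequency (and voltage) cuts dynamic power by roughly
    % 1 - 0.9^3 ~ 27%, which is why many slower cores win on performance per watt.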

Today's Multicores: 98% of Top500 Systems Are Based on Multicore
- Examples: Sun Niagara2 (8 cores), Intel Polaris (80 cores), IBM BG/P (4 cores), AMD Opteron (4 cores), IBM Cell (9 cores), Intel Clovertown (4 cores), SiCortex (6 cores)
- 282 systems use quad-core, 204 use dual-core, 3 use nona-core

Cores Per Socket
- 4 cores: 67%
- 2 cores: 31%
- 9 cores: 7 systems
- Single core: 4 systems

What's Next? (diagram of candidate chip designs)
- All large cores; mixed large and small cores; all small cores / many small cores; many floating-point cores + SRAM; + 3D stacked memory
- Different classes of chips: home, games/graphics, business, scientific

And Then There's the GPGPUs: NVIDIA's Tesla T10P
- T10P chip: 240 cores; 1.5 GHz; Tpeak 1 Tflop/s (32-bit floating point); Tpeak 100 Gflop/s (64-bit floating point)
- S1070 board: 4 T10P devices; 700 Watts
- GTX card: 1 T10P; 1.3 GHz; Tpeak 864 Gflop/s (32-bit floating point); Tpeak 86.4 Gflop/s (64-bit floating point)
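The single- to double-precision gap on these parts is what makes the mixed-precision techniques discussed later in the talk attractive; a quick ratio check using only the slide's own numbers:

    # Peak ratios from the slide's figures (Gflop/s).
    parts = {"T10P (1.5 GHz)": (1000, 100), "GTX (1.3 GHz)": (864, 86.4)}
    for name, (sp, dp) in parts.items():
        print(f"{name}: SP/DP peak ratio = {sp / dp:.1f}x")
    # Both come out at 10x, matching the "with GP-GPUs 10x" note on the
    # mixed-precision slide later in the deck.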

Intel's Larrabee Chip
- Many x86 IA cores, scalable to Tflop/s
- New cache architecture
- New vector instruction set: vector memory operations, conditionals, integer and floating-point arithmetic
- New vector processing unit / wide SIMD

Architecture of Interest: Manycore Chip
- Composed of hybrid cores: some general purpose, some graphics, some floating point

Architecture of Interest: Board composed of multiple chips sharing memory

Architecture of Interest: Rack composed of multiple boards

Architecture of Interest: A room full of these racks -- think millions of cores

Moore's Law Reinterpreted
- Number of cores per chip doubles every 2 years, while clock speed decreases (not increases)
- Number of threads of execution doubles every 2 years
- Need to deal with systems with millions of concurrent threads; future generations will have billions of threads!
- Need to rethink the design of our software -- a very disruptive technology

Five Important Features to Consider When Computing at Scale
1) Effective use of many-core and hybrid architectures: dynamic data-driven execution; block data layout
2) Exploiting mixed precision in the algorithms: single precision is 2x faster than double precision (with GP-GPUs, 10x); see the sketch after this list
3) Self-adapting / auto-tuning of software: too hard to do by hand
4) Fault-tolerant algorithms: with 1,000,000s of cores, things will fail
5) Communication-avoiding algorithms: for dense computations, from O(n log p) to O(log p) communications; s-step GMRES computes (x, Ax, A^2 x, ..., A^s x)
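A minimal sketch of the mixed-precision idea for Ax=b -- solve in single precision, then recover double-precision accuracy with a few refinement steps carried out in double. This is an illustration of the technique, not the tuned LAPACK/PLASMA kernels the talk has in mind, and the test matrix is a hypothetical well-conditioned example:

    import numpy as np

    def mixed_precision_solve(A, b, iters=5):
        """Solve Ax=b: solve in float32, iterative refinement in float64."""
        A32, b32 = A.astype(np.float32), b.astype(np.float32)
        x = np.linalg.solve(A32, b32).astype(np.float64)    # cheap low-precision solve
        for _ in range(iters):
            r = b - A @ x                                    # residual in double precision
            x += np.linalg.solve(A32, r.astype(np.float32))  # correction via the fast solver
            # (a real code would factor once in single precision and reuse the LU factors)
        return x

    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
    b = rng.standard_normal(n)
    x = mixed_precision_solve(A, b)
    print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))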

Exascale Computing
- Exascale systems (10^18 Flop/s) are likely feasible by 2017±
- Millions of processing elements (cores or mini-cores), with chips perhaps as dense as 1,000 cores per socket; clock rates will grow more slowly
- 3D packaging likely
- Large-scale optics-based interconnects
- Petabytes of aggregate memory
- > 10,000s of I/O channels to exabytes of secondary storage; disk bandwidth to storage ratios not optimal for HPC use
- Hardware- and software-based fault management
- Achievable performance per watt will likely be the primary measure of progress

Conclusions
- Moore's Law reinterpreted: the number of cores per chip doubles every two years, while clock speed stays roughly stable; threads of execution double every 2 years -- 100M cores
- Need to deal with systems with millions of concurrent threads; future generations will have billions of threads!
- MPI and programming languages from the '60s will not make it
- Power limiting clock rate growth; power becomes the architectural driver for exascale systems.

Conclusions
- For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware. This strategy needs to be rebalanced -- barriers to progress are increasingly on the software side. Moreover, the return on investment is more favorable to software: hardware has a half-life measured in years, while software has a half-life measured in decades.
- The high performance ecosystem is out of balance: hardware, OS, compilers, software, algorithms, applications
- No Moore's Law for software, algorithms, and applications

Collaborators / Support
Top500:
- Hans Meuer, Prometeus
- Erich Strohmaier, LBNL/NERSC
- Horst Simon, LBNL/NERSC

Weather and Economic Loss
- 40% of the $14T U.S. economy is impacted by weather and climate
- $1M in economic loss to evacuate each 1 mile of coastline
- We now over-warn by a factor of 3; average over-warning is 200 miles, or $200M per event
- Improved forecasts save lives and resources
(Source: Kelvin Droegemeier, Oklahoma)
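The per-event figure on the slide is just the two numbers above it multiplied together:

    # Over-warning cost per event = average over-warned coastline * evacuation cost per mile.
    over_warn_miles = 200
    cost_per_mile = 1e6      # dollars, from the slide
    print(f"~${over_warn_miles * cost_per_mile / 1e6:.0f}M per event")  # ~$200M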