About This Workshop
This workshop is being held on March 2 and 3, 2015.
Extreme Model Adaptation
Predictive multiphysics simulations must be verified and validated across several spatial and temporal scales. For realistic engineering problems, uniform discretization is often not feasible, so models must adapt in space, in time, and in their constitutive equations. The first theme of the workshop focuses on novel computational algorithms for solving partial differential equations over heterogeneous material domains. One such algorithm, the wavelet adaptive multiresolution representation (WAMR), adaptively compresses the solution and converges rapidly on unevenly spaced collocation points. Because heterogeneous materials with different chemical compositions and disparate thermo-mechanical behavior exhibit a complex response, constitutive model adaptation is also an important component. A hierarchy of models bridging all fundamental scales therefore has to be constructed, with error tolerances and uncertainty quantification controlling where each model applies. In this theme of the E^4 workshop, we will discuss Extreme Model Adaptation of multi-scalar data with multiphysics exhibiting highly nonlinear dynamics and constitutive behavior.
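The idea of adaptive compression under an error tolerance can be illustrated with a minimal sketch. This is not WAMR itself: it swaps in a plain Haar transform on a uniform grid, and the function names, the test profile, and the tolerance are assumptions chosen only for illustration.

```python
import math

def haar_forward(values):
    """Decompose a list (length a power of two) into its mean and detail levels."""
    approx = list(values)
    details = []  # finest level first
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        details.append([(approx[i] - approx[i + 1]) / 2 for i in pairs])
        approx = [(approx[i] + approx[i + 1]) / 2 for i in pairs]
    return approx[0], details

def haar_inverse(mean, details):
    """Rebuild the signal from the mean and the detail levels."""
    approx = [mean]
    for level in reversed(details):  # coarsest level first
        rebuilt = []
        for a, d in zip(approx, level):
            rebuilt.extend((a + d, a - d))
        approx = rebuilt
    return approx

def compress(values, tol):
    """Zero every detail coefficient whose magnitude is at most tol."""
    mean, details = haar_forward(values)
    kept = [[d if abs(d) > tol else 0.0 for d in level] for level in details]
    count = sum(1 for level in kept for d in level if d != 0.0)
    return mean, kept, count

# Hypothetical test profile: a sharp front in an otherwise flat field.
signal = [math.tanh(50 * (i / 63 - 0.5)) for i in range(64)]
tol = 1e-3
mean, kept, count = compress(signal, tol)
approx = haar_inverse(mean, kept)
max_error = max(abs(a - b) for a, b in zip(signal, approx))
print(f"kept {count} of {len(signal) - 1} detail coefficients, "
      f"max error {max_error:.2e}")
```

In this sketch the retained coefficients cluster around the front, and the pointwise error is bounded by the tolerance times the number of levels, which is the sense in which the tolerance controls the adaptation.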
Extreme Runtime System
The High Performance ParalleX (HPX) runtime system amortizes the complexity and overhead of dynamic action away from applications, allowing developers to focus their efforts on the problem domain rather than designing algorithms to match a particular machine architecture and/or topology. HPX features a system-global address space for distributed memory, a variety of barrier-free synchronization mechanisms, and a fast distributed lightweight threading system. Together, these features present an abstraction that makes an extreme-scale supercomputer appear to applications as a single machine, and it can be programmed in ways very similar to traditional medium-grained (multi-threaded) parallel software.
Extreme Software Productivity
Computational scientists work most effectively when they can express their programs at a level of abstraction appropriate for their problem domain. Unfortunately, to achieve acceptable performance, they must often devote considerable effort to adapting and tuning their software to particular execution environments and data sets. This development overhead is an unnecessary distraction from the science that the scientist actually wants to accomplish. The goal of domain specific languages and libraries (DSLLs) is to establish a framework for software development that provides high levels of expressiveness as well as high levels of performance.
DSLLs have the potential to drastically improve the workflow for computational scientists by separating scientific concerns from implementation concerns (particularly performance-related issues specific to a given hardware architecture). In a traditional development workflow, the computational scientist must consider the algorithms, data structures, run-time environment, and hardware to develop an efficient implementation for a single platform. DSLLs improve the workflow by separating the expression of the problem to be solved from how it is best solved on a given platform. Optimizations for specific environments and hardware are encapsulated in the DSLL, allowing the computational scientist to implement performance-portable algorithms and applications. Leveraging this encapsulation, computer scientists can provide general, domain-independent algorithms and data structures that can be used by entire communities of computational scientists. DSLLs will enable a high-productivity environment where computational scientists can concentrate on science and computer scientists can concentrate on computer science.
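A toy sketch of this separation, under stated assumptions: the registry, the `backend` decorator, the `axpy` operation, and the platform names below are all hypothetical and do not come from any particular DSLL. The scientist calls one domain-level operation, and the platform-specific variants live behind it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry mapping a platform name to an implementation.
_BACKENDS = {}

def backend(name):
    """Register an implementation under a platform name."""
    def register(func):
        _BACKENDS[name] = func
        return func
    return register

@backend("serial")
def _axpy_serial(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

@backend("threaded")
def _axpy_threaded(a, x, y, chunks=4):
    # Split the vectors into contiguous spans and process each in a worker.
    # Threads stand in here for a tuned, platform-specific implementation.
    n = len(x)
    size = max(1, -(-n // chunks))  # ceiling division
    spans = [(lo, min(lo + size, n)) for lo in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        parts = list(pool.map(
            lambda s: _axpy_serial(a, x[s[0]:s[1]], y[s[0]:s[1]]), spans))
    return [v for part in parts for v in part]

def axpy(a, x, y, platform="serial"):
    """Domain-level operation a*x + y; the platform choice is the only
    implementation detail the caller sees."""
    return _BACKENDS[platform](a, x, y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
print(axpy(2.0, x, y))
print(axpy(2.0, x, y, platform="threaded"))
```

Both calls return the same values; only the encapsulated implementation differs, so adding a platform means registering another variant rather than changing the scientific code.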
Modern computing architectures are the culmination of 30 years of optimization for dense matrix operations, including deep cache hierarchies, high bandwidth memories optimized for large unit-stride accesses, multiple cores, high bandwidth networking links between nodes, and hybrid systems. Floating point efficiencies often exceed 90%, even for TOP10 systems. There are, however, significant red flags that this will not continue. Main memory capacity growth has dropped from matching computing's 2X per year to a much lower 1.6X per year. This means that the available memory footprint relative to cores, ops, nodes, etc. is declining precipitously. Even worse, the shift from dense-matrix-oriented applications to sparse or simply data-intensive ones indicates radical changes in scalability. The new HPCG (High Performance Conjugate Gradient) benchmark is exhibiting efficiencies of 1-4% on TOP10 systems. Graph-intensive problems such as GRAPH500 have gone flat in peak performance for the last 2.5 years. The forcing problem appears to be that memory access for these applications looks nothing like that of LINPACK.
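The effect of the two growth rates quoted above can be worked through in a few lines. The function name and the chosen year counts are assumptions for illustration; the 2X and 1.6X rates are the figures from the text.

```python
def memory_per_compute(years, compute_growth=2.0, memory_growth=1.6):
    """Memory capacity per unit of compute, relative to year 0."""
    return (memory_growth / compute_growth) ** years

for years in (1, 5, 10):
    print(f"after {years:2d} years: {memory_per_compute(years):.3f}")
```

Each year the ratio shrinks by a factor of 1.6/2 = 0.8, so memory per unit of compute falls to roughly a third of its starting value after five years and to roughly a tenth after ten.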
Looking forward, the presumed goal for extreme systems has been to consume about 20 pJ per flop, so that powering them requires something less than a dedicated nuclear reactor. The trend for systems optimized for dense operations appears to reach within a factor of 2 or 3 of this goal. However, recent data from a TOP5 system indicates that the energy cost of accessing a word is on the order of 7900 pJ from local memory and 9000 pJ from remote memory. To come even close to 20 pJ, the average flop could then afford only about 1/1000th of a memory access. This flies in the face of emerging applications.
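The energy budget can be checked with simple arithmetic. The sketch below uses only the figures quoted above; the function names and the assumption that memory energy scales linearly with accesses per flop are illustrative simplifications.

```python
BUDGET_PJ = 20.0    # target energy per flop
LOCAL_PJ = 7900.0   # energy per word from local memory
REMOTE_PJ = 9000.0  # energy per word from remote memory

def memory_energy_per_flop(accesses_per_flop, access_pj):
    """Average memory energy charged to each flop, in pJ."""
    return accesses_per_flop * access_pj

def max_accesses_per_flop(budget_pj, access_pj, compute_pj=0.0):
    """Largest access rate that fits the budget after compute energy."""
    return (budget_pj - compute_pj) / access_pj

for name, cost in (("local", LOCAL_PJ), ("remote", REMOTE_PJ)):
    spent = memory_energy_per_flop(1 / 1000, cost)
    limit = max_accesses_per_flop(BUDGET_PJ, cost)
    print(f"{name}: 1/1000 access per flop costs {spent:.1f} pJ; "
          f"the budget allows at most 1 access per {1 / limit:.0f} flops")
```

Even if the arithmetic itself were free, the whole 20 pJ budget would cover only about one local access per 395 flops. At one access per 1000 flops, memory alone takes roughly 8 to 9 pJ, which leaves about half the budget for everything else.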