The keynotes will cover several parallel languages and approaches for high-performance computing that are important both for their applications and for the key concepts they introduce: early PGAS (partitioned global address space) languages such as CAF (Coarray Fortran) and UPC (Unified Parallel C); APGAS (asynchronous PGAS) languages such as X10 and Chapel; languages for heterogeneous platforms such as Lime and OmpSs; streaming languages such as StreamIt and OpenStream; and DSL-type approaches such as Faust (Functional Audio Stream), CnC (concurrent collections), and TCE (tensor contraction engine). These languages will be presented by the main actors involved in their design and development. Many other approaches exist that cannot be covered in the limited time, such as TBB (threading building blocks), Cilk, Charm++, Swarm (scalable and efficient parallelism for exascale), and StarPU (runtime). Some of them, however, are the subject of talks at CPC’13.

The keynotes run from Saturday morning, June 29, until lunch on Tuesday, July 2, with a short introduction followed by 13 talks of various lengths (mostly 1.5 to 2 hours).

The schedule is subject to change but should be as follows.

Sat. 29th Sun. 30th Mon. 1st Tue. 2nd
9 AM Welcome Brad Chamberlain François Bodin Yann Orlarey
Rob Schreiber
10 AM Kathleen Knobe
Break Break
11 AM John Mellor-Crummey Break Rodric Rabbah Break
Vivek Sarkar P. Sadayappan
12 PM
1 PM Lunch
2 PM Kathy Yelick Free Rosa Badia
3 PM
4 PM Break Albert Cohen
Dave Grove
5 PM

Here are the abstracts and slides of the keynotes.

Short presentation of the keynotes
Alain Darte, CNRS, ENS-Lyon.
Get the slides.

Introduction. HPC languages and compilation: où en sommes-nous?
Robert Schreiber, Hewlett Packard.
Get the slides.

I will give a personal view of how the machines and the languages in HPC have evolved, trying to distill the big picture from experiences of the past, aiming to ask the right questions rather than provide the answers. What is the nature of the HPC programming problem? How is it changing? What are the approaches that have worked, and where is it worthwhile to look for improvements? How can we cope with a new round of changes to the machines and the problems they are built to solve? I may not succeed in this, but I hope you will enjoy the effort.

From HPF to Coarray Fortran 2.0
John Mellor-Crummey, Rice University.
Get the slides.

Fortran has long been a favorite among scientific programmers. For that reason, it has always been viewed as an important language to map to scalable parallel systems. In the 1990s, High Performance Fortran (HPF) was designed as a set of language extensions that would enable compilers to partition computation, add data movement and synchronization, and manage non-local storage. At Rice University, we developed the dHPF compiler, which used polyhedral analysis and code generation to compile HPF programs for distributed memory machines. While dHPF was able to generate efficient, scalable parallel code for complex programs, the compiler technology was brittle and it was easy to write programs that were beyond the range of its capabilities.

Ultimately, scientific programmers were uncomfortable with the lack of control and expressiveness offered by HPF. In 1998, Numrich and Reid conceived the Coarray Fortran (CAF) programming model as a lower-level set of language extensions that provide application programmers more control. In 2010, coarrays were added to the Fortran 2008 standard. Our experience with Coarray Fortran led us to conclude that it lacked several important features for supporting parallel libraries and generating high performance code for extreme-scale parallel systems. In response, we developed a richer set of language extensions for Fortran that we call Coarray Fortran 2.0 (CAF 2.0). CAF 2.0, a partitioned global address space programming model based on one-sided communication, is a coherent synthesis of concepts from MPI, Unified Parallel C, and IBM’s X10 programming language. CAF 2.0 includes a broad array of features including process subsets known as teams, team-based asynchronous collective communication, communication topologies, dynamic allocation of shared data, and global pointers, along with synchronization constructs including finish, a communication fence, and events.

This talk will trace the evolution of Fortran-based parallel programming models, examine the role of compiler technology, and discuss our experiences compiling and using these language models.

Beyond UPC. Antisocial parallelism: Avoiding, hiding, and managing communication
Katherine Yelick, University of California at Berkeley and Lawrence Berkeley National Laboratory.
Get the slides.

Future computing system designs will be constrained by power density and total system energy, and will require new programming models and implementation strategies. Data movement in the memory system and interconnect will dominate running time and energy costs, making communication cost reduction the primary optimization criterion for compilers and programmers. Communication cost can be divided into latency costs, which are per communication event, and bandwidth costs, which grow with total communication volume. The trends show growing gaps for both of these relative to computation, with the additional problem that communication congestion can conspire to worsen both in practice.
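As a rough illustration of the latency/bandwidth split described above (not taken from the talk), a common simplification charges a fixed cost per message plus a cost per byte moved. The function name `comm_cost`, the parameters `alpha` and `beta`, and the numeric values are assumptions chosen only for this sketch.

```python
def comm_cost(num_messages, total_bytes, alpha, beta):
    """Estimated communication time: latency term plus bandwidth term.

    alpha: assumed cost per communication event (seconds per message)
    beta:  assumed cost per byte moved (seconds per byte)
    """
    return num_messages * alpha + total_bytes * beta


# Hypothetical machine parameters, for illustration only.
ALPHA = 1e-6  # assumed 1 microsecond per message
BETA = 1e-9   # assumed 1 nanosecond per byte

# The same 1 MB of data, sent as 1000 small messages or as one large message.
many_small = comm_cost(num_messages=1000, total_bytes=1_000_000, alpha=ALPHA, beta=BETA)
one_large = comm_cost(num_messages=1, total_bytes=1_000_000, alpha=ALPHA, beta=BETA)
```

Under these assumed parameters the volume is identical in both cases, so the difference between `many_small` and `one_large` comes entirely from the per-message latency term.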

In this talk I will start with an overview of the UPC (Unified Parallel C) language, which uses one-sided communication to avoid synchronization costs and encourages latency hiding. UPC is one of the earliest examples of a Partitioned Global Address Space (PGAS) language, which combines the convenience of a global address space with the locality control and scalability of user-managed data partitioning. I will describe some of the key programming concepts in UPC, the performance benefits from overlapped and pipelined communication, and open problems that arise from increasingly hierarchical computing systems, with multiple levels of memory spaces and communication layers.

Bandwidth reduction often requires more substantial algorithmic transformations, although some techniques, such as loop tiling, are well known. These can be applied as hand-optimizations, through code generation strategies in autotuned libraries, or as fully automatic compiler transformations. Less obvious techniques for communication avoidance have arisen in the so-called “2.5D” parallel algorithms, which I will describe more generally as “.5D” algorithms. These ideas are applicable to many domains, from scientific computations to database operations. In addition to having provable optimality properties, these algorithms also perform well on large-scale parallel machines. I will end by describing some recent work that lays the foundation for automating transformations to produce communication optimal code for arbitrary loop nests.
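For readers unfamiliar with loop tiling, the technique named above, here is a minimal sketch on a dense matrix product. It is only an illustration of the general transformation, not code from the talk; the function names and the `tile` parameter are assumptions. The tiled version visits the same iterations as the naive one, but in blocks, so that each block of the operands can be reused while it is still in fast memory.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n matrices (lists of lists)."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation, with each loop split into tiles of size `tile`."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Work on one tile; min() handles n not divisible by tile.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

With integer inputs the two functions return identical matrices for any positive tile size, which is a convenient way to check that the transformation preserved the computation.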

Reflections on X10: Towards Performance and Productivity at Scale
Dave Grove, IBM Watson.
Get the slides.

The origin of the X10 language was the PERCS project, IBM’s response to DARPA’s HPCS (High Productivity Computing Systems) initiative. The PERCS project set out to develop a petaflop computer, which could be programmed ten times more productively than a computer of similar scale in 2002. The specific charter of the X10 team was to develop a programming model for such large scale, concurrent systems that could be used to program a wide variety of computational problems, and could be accessible to a large class of professional programmers. The result is X10: a modern language in the strongly typed, object-oriented programming tradition whose design fundamentally focuses on concurrency and distribution, and is capable of running with good performance at scale.

In this talk, I will focus on the development of the Asynchronous Partitioned Global Address Space (APGAS) programming model that lies at the heart of the X10 language. I will motivate why we believe the APGAS model addresses the fundamental challenges of performance and productivity at scale by illustrating how we have applied it in X10 in a variety of problem domains, ranging from traditional HPC kernels to commercial scale-out workloads. I will also describe the challenges and opportunities in designing and implementing a language that simultaneously targets both traditional HPC and commercial workloads and systems.

Chapel: The design and implementation of a multiresolution language
Brad Chamberlain, Cray, Inc.
Get the slides.

Chapel is an emerging parallel programming language that strives to dramatically improve the productivity of parallel programmers from desktops to supercomputers. Compared to currently adopted parallel programming models, it also strives to support more general styles of parallelism in software and hardware. Chapel’s design and development are being led by Cray Inc., in collaboration with members of academia, computing centers, and industry. It is being developed in a portable, open-source manner under the BSD license.

In this talk, I will provide a brief overview of Chapel’s motivating themes and central concepts, as well as an introduction to its implementation approach. A central theme in Chapel’s design is its “multiresolution philosophy,” in which the language supports a mix of lower- and higher-level features to give the programmer a spectrum of choices between explicit control and more productive abstractions. A key feature of this philosophy is that higher-level features are implemented within Chapel in terms of the lower-level concepts, ensuring that the various levels are compatible and composable. In describing Chapel’s implementation, I will explain how its low-level task parallelism and PGAS namespace are implemented by the compiler and runtime; I will then describe how higher-level abstractions, like distributed arrays and forall loops, are specified within the language itself.

Analysis and transformation of programs with explicit parallelism
Vivek Sarkar, Rice University.
Get the slides.

It is widely agreed that spatial parallelism in the form of multiple power-efficient cores must be exploited to compensate for the lack of clock frequency scaling. Two complementary compiler approaches to address this problem are 1) automatic extraction of parallelism from sequential programs, and 2) compilation and optimization of explicitly parallel programs. This lecture addresses opportunities and challenges related to the second approach, which is increasing in importance with the availability of more and more programming languages with explicit parallelism, e.g., Chapel, Cilk, Coarray Fortran (CAF), CUDA, Habanero-Java (HJ), OpenMP, Unified Parallel C (UPC), and X10. While these HPC languages increase productivity by allowing the programmer to express multiple levels of parallelism, programs written in these languages can often suffer large performance degradations due to increased overheads.

The first part of this lecture describes a transformation framework for optimizing task-parallel programs with a focus on task creation and termination for determinate parallelism. These operations appear explicitly in constructs such as async, finish in HJ and X10, task, taskwait in OpenMP, and begin, sync in Chapel, or implicitly in parallel loop constructs. This framework includes a definition of data dependence in task-parallel programs, a happens-before analysis algorithm, and a range of program transformations for optimizing task parallelism, which cover three different but interrelated optimizations: (1) finish-elimination, (2) forall-coarsening, and (3) loop-chunking. All three optimizations are specified in an iterative transformation framework that applies a sequence of relevant transformations until a fixed point is reached.
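To make the idea of loop chunking concrete, here is a small sketch using Python's standard thread pool. It illustrates the general transformation only and is not the framework presented in the lecture; the helper names `run_unchunked` and `run_chunked` are assumptions. Submitting a task plays the role of task creation, and waiting for all results plays the role of a finish.

```python
from concurrent.futures import ThreadPoolExecutor


def run_unchunked(f, n, pool):
    """One task per iteration: n task creations."""
    futures = [pool.submit(f, i) for i in range(n)]
    return [fut.result() for fut in futures], len(futures)


def run_chunked(f, n, chunk, pool):
    """One task per block of `chunk` consecutive iterations."""
    def body(lo, hi):
        # Iterations inside a chunk run sequentially within a single task.
        return [f(i) for i in range(lo, hi)]

    futures = [pool.submit(body, lo, min(lo + chunk, n))
               for lo in range(0, n, chunk)]
    results = []
    for fut in futures:
        results.extend(fut.result())
    return results, len(futures)


def square(i):
    return i * i


with ThreadPoolExecutor(max_workers=4) as pool:
    plain, plain_tasks = run_unchunked(square, 10, pool)
    chunked, chunked_tasks = run_chunked(square, 10, 4, pool)
```

Both versions compute the same results; the chunked one creates fewer tasks, trading some available parallelism for lower task creation and termination overhead.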

The second part of this lecture focuses on optimization of critical sections, a major source of sequential bottlenecks in parallel programs. We introduce compiler optimization techniques to reduce the amount of time spent in critical sections, thereby improving performance and scalability. Specifically, we focus on redundancy elimination for critical sections and describe three transformations with their accompanying legality conditions: (1) scalar replacement within critical sections, (2) non-critical code motion to hoist local computations out of critical sections, and (3) critical section specialization to replace critical sections by non-critical code blocks on certain control flow paths. The effectiveness of the first and third transformations is further increased by interprocedural analysis of parallel programs. Finally, we report on recent results for speculative parallelism among critical sections using the delegated isolation approach.
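The second transformation, non-critical code motion, can be sketched by hand as follows. This is an illustration under simple assumptions, not the compiler algorithm from the lecture: the class `Accumulator`, its methods, and the helper `run` are hypothetical, and the hoisted computation depends only on the thread-local argument `x`, which is what makes moving it out of the critical section legal here.

```python
import threading


class Accumulator:
    def __init__(self):
        self.lock = threading.Lock()
        self.total = 0

    def add_before(self, x):
        # Original form: the local computation sits inside the critical section.
        with self.lock:
            y = x * x + 1
            self.total += y

    def add_after(self, x):
        # Transformed form: y depends only on x, so it is computed outside
        # the lock; only the shared update remains critical.
        y = x * x + 1
        with self.lock:
            self.total += y


def run(method_name, values):
    """Apply the chosen method to each value from its own thread."""
    acc = Accumulator()
    threads = [threading.Thread(target=getattr(acc, method_name), args=(v,))
               for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acc.total
```

Both variants produce the same total, but the second holds the lock only for the update of shared state.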

Dealing with portability & performance on heterogeneous systems with directive-based programming approaches
François Bodin, University of Rennes 1, Irisa.
Get the slides.

Directive-based programming is a very promising technology for dealing with heterogeneous many-core architectures. Emerging standards such as OpenACC and other initiatives such as OpenHMPP provide solid ground for users to invest in such a paradigm. On the one hand, portability is required to ensure a long software lifetime and to reduce maintenance costs. On the other hand, obtaining efficient code requires a tight mapping between the code and the target architecture. In this presentation, we describe the challenges in building programming tools based on directives. We show how OpenACC and OpenHMPP directives offer an incremental development path for various heterogeneous architectures, from AMD, Intel, and Nvidia to ARM.

Liquid metal, StreamIt, and Lime
Rodric Rabbah, IBM Watson.
Get the slides.

Heterogeneous systems show a lot of promise for extracting high performance by combining the benefits of conventional architectures with specialized accelerators in the form of graphics processors (GPUs) and reconfigurable hardware (FPGAs). Extracting this performance often entails programming in disparate languages and models, making it hard for a programmer to work equally well on all aspects of an application. Further, relatively little attention is paid to co-execution, the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together.

I will present Liquid Metal, a comprehensive compiler and runtime system for a new programming language called Lime. This work enables the use of a single language for programming heterogeneous computing platforms, and the seamless co-execution of the resultant programs on CPUs and accelerators that include GPUs and FPGAs. We have developed a number of Lime applications, and successfully compiled some of these for co-execution on various GPU and FPGA enabled architectures. Our experience so far leads us to believe the Liquid Metal approach is promising and can make the computational power of heterogeneous architectures more easily accessible to mainstream programmers.

Liquid Metal is joint work with Joshua Auerbach, David Bacon, Ioana Baldini, Perry Cheng, Stephen Fink, and Sunil Shukla. More information, demos, and publications are available from our project web site.

Programming heterogeneous platforms with OmpSs
Rosa Badia, UPC Barcelona.
Get the slides.

OmpSs is a task-based programming model that aims to provide portability and flexibility to sequential codes, while performance is achieved by dynamically exploiting parallelism at the task level. OmpSs targets the programming of heterogeneous and multi-core architectures, and extends OpenMP 3.0 by offering asynchronous parallelism in the execution of the tasks. The main extension provided by OmpSs is the concept of data dependences between tasks. Tasks in OmpSs are annotated with data directionality clauses that specify the data they use and how these data will be used (read, write, or read&write). The underlying OmpSs runtime uses this information during execution to control the synchronization of the different task instances, creating a dependence graph that guarantees the proper order of execution. This mechanism provides a simple way to express the order in which tasks must be executed, without the need to add explicit synchronization.
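The way a dependence graph can be derived from directionality annotations may be sketched as below. This is a simplified model written for illustration, not OmpSs syntax and not the actual OmpSs runtime: the function `build_dependences`, the task tuples, and the use of plain names instead of memory regions are all assumptions. A read&write access is modeled by listing the datum in both the read set and the write set.

```python
def build_dependences(tasks):
    """tasks: list of (name, reads, writes) in program order.

    Returns a set of (earlier, later) edges: `later` must wait for `earlier`.
    """
    edges = set()
    last_writer = {}          # datum -> most recent task that wrote it
    readers_since_write = {}  # datum -> tasks that read it since that write

    for name, reads, writes in tasks:
        for d in reads:
            if d in last_writer:
                edges.add((last_writer[d], name))  # read after write
        for d in writes:
            if d in last_writer:
                edges.add((last_writer[d], name))  # write after write
            for reader in readers_since_write.get(d, []):
                edges.add((reader, name))          # write after read
        # Record this task's accesses only after its own edges are computed.
        for d in reads:
            readers_since_write.setdefault(d, []).append(name)
        for d in writes:
            last_writer[d] = name
            readers_since_write[d] = []
    return edges


# Hypothetical task sequence: A writes x; B and C read x; D reads y, z and writes x.
example_tasks = [
    ("A", set(), {"x"}),
    ("B", {"x"}, {"y"}),
    ("C", {"x"}, {"z"}),
    ("D", {"y", "z"}, {"x"}),
]
example_edges = build_dependences(example_tasks)
```

In this example no edge connects B and C, so a scheduler would be free to run them concurrently once A has finished, while D waits for A, B, and C.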

Additionally, the OmpSs syntax offers the flexibility to express that the given tasks can be executed on heterogeneous target cores (i.e., regular processors, GPUs, or FPGAs). The runtime system is able to schedule and run those tasks, taking care of the required data transfers and synchronizations. Furthermore, more than one implementation can be provided for a given task and the runtime system will be able to choose the best suited one. The talk will present the basics of the OmpSs programming model: its syntax, the main features of its runtime, and the main target computing platforms: multicore, GPUs and clusters. The talk will be illustrated with examples and results on different target architectures.

Streaming data flow: A story about performance, programmability, and correctness
Albert Cohen, Inria, ENS Paris.
Get the slides.

Stream computing is often associated with regular, data-intensive applications, and more specifically with the family of cyclo-static data-flow models. The term has also been borrowed by bulk-synchronous data-parallel architectures favoring local, pipelined computations. Both interpretations are valid but incomplete: streams have underlain the formal definition of Kahn process networks for about four decades, a foundation for deterministic parallel languages and systems with a solid heritage. While some static analyses and optimizations impose strong restrictions on expressiveness, some stream languages push the complexity to a runtime system, gaining expressiveness without sacrificing much in terms of correctness and productivity. The combination of stream processing with data-flow computing is particularly attractive for its promise of functional determinism in parallel programs. The presentation will focus on two concrete research experiments: the OpenStream extension of OpenMP and the data-flow synchronous parallel language Decades. These languages have different objectives, leading to different tradeoffs in expressiveness and correctness.

OpenStream is meant for general-purpose computing on multi- and many-core architectures. It supports modular composition, nested parallelism, dynamic point-to-point and multi-cast synchronization, and first-class streams. The language combines unprecedented expressiveness and performance within the constraints of an existing imperative language. Streams are exposed as a language construct and also serve as a backbone to exploit fine-grain thread-level parallelism on conventional hardware. Decades is designed for safety-critical and embedded multiprocessor systems. It makes it possible to harness the high levels of concurrency in reactive systems while preserving functionally deterministic semantics, with liveness, bounded memory, and bounded execution time guarantees. It allows for the modular construction of complex systems combining state machines and data-flow equations, relying on certified compilation methods that generate sequential code competitive with low-level C programming.

If time permits, the presentation will conclude with a discussion of memory model, language evolutions, and the role of the compiler for many-core architectures. We will use many examples and short demonstrations throughout the presentation.

From Music III to Faust: A journey into audio and music DSLs
Yann Orlarey, Grame laboratory, Lyon.
Get the slides.

Since Music III (the first programming language for digital audio synthesis developed by Max Mathews in 1959 at Bell Labs) and MUSICOMP (the first music composition language developed by Lejaren Hiller and Robert Baker in 1963), research in music DSLs has been very active and innovative. Today, programming languages like Csound, OpenMusic, Max, Puredata, Faust, or Supercollider, to name only a few of them, are routinely used as a creative means by electronic musicians and avant-garde composers.

This talk is a discovery journey into the field of the audio and music DSLs and through its historical evolution. It will end with a presentation of Faust (Functional Audio Stream), a synchronous functional programming language specifically designed for high-performance real-time signal processing and synthesis. Thanks to several semantic and symbolic techniques, the Faust compiler can generate highly optimized scalar or parallel code for a variety of languages, from C++ to LLVM-IR.

CnC: A data and control flow language for high performance computing
Kathleen Knobe, Intel, Massachusetts.
Get the slides.

CnC (concurrent collections) is a parallel language in the sense that its goal is parallel execution but in fact a CnC program does not indicate what runs in parallel. Instead, it identifies what precludes parallel execution. There are exactly two reasons that computations cannot execute in parallel. If one computation produces data that the other one consumes, the producer must execute before the consumer. If one computation determines if another will execute, the controller must execute before the controllee. So CnC is not a dataflow language. It is more of a data and control flow language. It is closer in philosophy to the PDG (Program Dependence Graph) intermediate form than to other parallel programming languages.

The initial motivation was to indicate the minimum constraints on program execution. This maximizes efficiency by maximizing flexibility. It also supports portability because these constraints are based on the application, not on the platform. A nice benefit of explicit specification of scheduling constraints is that the tuning process can be isolated. This isolation can have real productivity implications.

The talk will present CnC, the CnC tuning language, distributed CnC, analyses and optimizations, support for resilience and the CnC runtime.

Domain-specific abstractions and performance portability
P. (Saday) Sadayappan, Ohio State University.
Get the slides.

Recent trends in architecture are making multicore parallelism as well as heterogeneity ubiquitous. This creates significant challenges for application developers as well as compiler implementers. Currently it is impossible to achieve performance portability of high-performance applications from a single version of a program: different code versions are necessary for different target platforms, e.g., for multicore CPUs versus GPUs.

A promising approach to performance portability, i.e., “write once, execute anywhere,” is via identifying suitable domain-specific abstractions and compiler techniques to transform high-level specifications automatically to high-performance implementations on different targets. This talk will discuss efforts to develop performance-portable compiler techniques for domain-specific abstractions.