12th Workshop on Compiler-Driven Performance

November 19, 2013
Hilton Suites Toronto/Markham Conference Centre,
Markham, Ontario, Canada

Associated with CASCON 2013
(http://www.cas.ibm.com/cascon)


Final Program


Invited Talk:

10:00-10:40 Optimization by assembly (It's not what you think!)
Bob Blainey, IBM Canada Software Laboratory

10:40-11:00 Coffee Break

Mini Session 1 - Chair: José Nelson Amaral - University of Alberta

11:00-11:30 A Coherence Protocol for Optimizing Global Shared Data Accesses
Jeeva Paudel¹, J. Nelson Amaral¹, Olivier Tardieu² - ¹University of Alberta, ²IBM T. J. Watson Centre
11:30-12:00 Experiences Implementing Lightweight Synchronization in Java for the Intel Haswell Transactional Memory features
Yi Zhang, Nikola Grcevski - IBM Canada Software Laboratory
12:00-12:30 Mc2For: A MATLAB to Fortran 95 Compiler
Xu Li, Laurie Hendren - School of Computer Science, McGill University
12:30-13:00 MiX10: Compiling MATLAB for High Performance Computing via X10
Vineet Kumar, Laurie Hendren - School of Computer Science, McGill University

13:00-14:00 Lunch


Invited Talk:

14:00-14:40 Examining Trends in the Hardware/Software Interface
Marcel Mitran, IBM Canada Software Laboratory

Mini Session 2 - Chair: Laurie Hendren - School of Computer Science, McGill University

14:40-15:10 Quantifying Active Data Sharing in Multi-threaded Programs
Hao Luo, Chen Ding, Pengcheng Li - University of Rochester
15:10-15:40 Simple Profile Rectifications Go A Long Way: Demystifying the Influence of Sampling Errors on Feedback Driven Program Optimizations
Bo Wu¹, Mingzhou Zhou¹, Xipeng Shen¹, Yaoqing Gao², Raul Silvera², Graham Yiu² - ¹College of William & Mary, ²IBM Canada Software Laboratory

15:40-16:00 Coffee Break

Mini Session 3 - Chair: Bob Blainey - IBM Canada Software Laboratory

16:00-16:30 Using Machine Learning to Automatically Tune GPU Program Performance
Tianyi David Han and Tarek S. Abdelrahman - University of Toronto
16:30-17:00 ABC-Optimizer: An Affinity-Based Code Layout Optimizer
Chen Ding, Rahman Lavaee, Pengcheng Li - University of Rochester


Optimization by assembly (It's not what you think!)
Bob Blainey, IBM Canada Software Laboratory
Presentation Slides: [PPT]

A Coherence Protocol for Optimizing Global Shared Data Accesses
Jeeva Paudel¹, J. Nelson Amaral¹, Olivier Tardieu² - ¹University of Alberta, ²IBM T. J. Watson Centre
Presentation Slides: [PPT]

Shared data variables are fundamental abstractions in programming languages supporting Distributed Shared-Memory Machines (DSMs). Programming languages intended for high-end DSMs offer notions of data locality, wherein a shared data variable is allocated on one processor and globally accessible from all other processors. Managing global references to a shared data variable allocated on a designated processor presents a significant scalability challenge: efficiently maintaining data consistency across the processors. Popular programming paradigms, such as partitioned-global-address-space models, typically employ the following strategies to manage consistency of the shared data: (1) use one-sided put and get messages to write or read the variables at their site of allocation; (2) migrate the task referencing the shared variable to the site of the data itself, using an active message that either creates a task on the remote side or has the active message handler execute the remote operation itself. When the accesses are local, message transmissions and task migration are usually avoided by the underlying runtime system of the programming language. In this talk, we will describe a technique for managing global shared variables using multiple copies of the variable, performing invalidations and write-backs on chosen copies. Such a technique is akin to managing cache-line consistency in DSMs. Thus, we borrow ideas from directory-based cache coherence protocols and apply them to maintain the consistency of global shared variables. Prior research has applied such a technique in the context of the SPMD programming model; this work investigates the technique in the X10 programming system, which supports the MIMD programming model as well.
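
As a rough illustration of the directory-based idea described above, the Python sketch below models a per-variable directory that tracks which places hold a valid copy: a remote read registers a sharer, and a write invalidates all other copies before writing back. The class and names are illustrative assumptions, not the actual X10 runtime implementation.

    # Minimal sketch of a directory-based protocol for one global shared
    # variable, in the spirit of the talk. Names are illustrative only.

    class Directory:
        def __init__(self, home, value):
            self.home = home          # place where the variable is allocated
            self.value = value        # authoritative value at the home place
            self.sharers = {home}     # places currently holding a valid copy
            self.copies = {home: value}

        def read(self, place):
            # On a remote read, fetch a copy from home and record the sharer.
            if place not in self.sharers:
                self.copies[place] = self.value
                self.sharers.add(place)
            return self.copies[place]

        def write(self, place, new_value):
            # Invalidate every other copy, then update (write-back to home).
            for p in list(self.sharers):
                if p != place:
                    self.sharers.discard(p)
                    self.copies.pop(p, None)
            self.sharers.add(place)
            self.copies[place] = new_value
            self.value = new_value    # write back the authoritative value

    d = Directory(home=0, value=42)
    d.read(1); d.read(2)      # places 1 and 2 cache copies
    d.write(1, 7)             # invalidates place 2's copy, updates home
    assert d.read(2) == 7     # place 2 re-fetches the fresh value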


Experiences Implementing Lightweight Synchronization in Java for the Intel Haswell Transactional Memory features
Yi Zhang, Nikola Grcevski - IBM Canada Software Laboratory
Presentation Slides: [PPT]

Synchronization in Java is based on monitor enter and exit operations, and it can add large overhead to Java program execution. Programmers and compiler researchers have tried to reduce this overhead by eliminating redundant monitors or by replacing existing logic with new algorithms based on volatile variables. This presentation demonstrates an optimistic approach to increasing the concurrency of Java programs by exploiting the transactional memory extensions of the Intel Haswell processor. We will share our experiences and the challenges we faced in implementing both HLE (Hardware Lock Elision) and RTM (Restricted Transactional Memory) in a production Java Just-in-Time compiler. Selected synchronized regions are transformed into transactional memory regions, allowing several Java threads to enter a region simultaneously. The main challenge is that context switches happen quite often on the Intel platform and cause transaction aborts even in uncontended monitor scenarios, so the design must handle frequent aborts. For RTM, two factors affect performance: (1) the number of retries after a transaction aborts, and (2) correctly detecting whether an abort is permanent or transient. We will also present experimental results, with a tuned retry count, from micro-benchmarks using standard Java library collection classes. The results show that HLE and RTM are more than 5 times faster in both contended and uncontended monitor scenarios. As the number of threads increases, RTM shows very good scalability, improving performance by more than 15 times.
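
The retry policy described above can be pictured with the following Python sketch. Here try_transaction is a hypothetical stand-in for the hardware xbegin/xend path (the real logic lives inside the JIT-compiled code); the sketch only shows the shape of the decision: retry transient aborts up to a tuned bound, give up immediately on persistent aborts, and fall back to the real monitor.

    import threading

    MAX_RETRIES = 5                    # the "tuned retry count" from the talk
    TRANSIENT, PERSISTENT, COMMITTED = "transient", "persistent", "committed"
    fallback_lock = threading.Lock()

    def try_transaction(body):
        # Hypothetical stand-in for the hardware xbegin/xend path: run the
        # body transactionally and report a commit or the kind of abort.
        body()
        return COMMITTED

    def run_synchronized(body):
        for _ in range(MAX_RETRIES):
            outcome = try_transaction(body)
            if outcome == COMMITTED:
                return
            if outcome == PERSISTENT:
                break                  # e.g. capacity abort: retrying is futile
            # transient abort (conflict, context switch): retry the transaction
        with fallback_lock:            # give up on elision, take the real monitor
            body()

    counter = [0]
    run_synchronized(lambda: counter.__setitem__(0, counter[0] + 1))
    print(counter[0])                  # 1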


Mc2For: A MATLAB to Fortran 95 Compiler
Xu Li, Laurie Hendren - School of Computer Science, McGill University
Presentation Slides: [PPT]

MATLAB is a dynamic numerical scripting language used widely around the world by scientists, engineers and students. While MATLAB's high-level syntax and dynamic types make it ideal for prototyping, programmers often prefer high-performance static programming languages such as Fortran for their final distributed code. One solution is to allow the programmer to program in MATLAB and then automatically produce the equivalent Fortran program. In this talk, I will introduce a MATLAB to Fortran compiler, Mc2For. Since there are no type declarations for variables in MATLAB, the first step of the compiler is an interprocedural analysis to estimate static types, which are used to generate variable declarations in the translated Fortran code. I will introduce the two static analyses used for this step: shape analysis and range value analysis. Shape analysis infers the shape of every variable in a given MATLAB program. Range value analysis extends constant propagation to infer the range of values a scalar variable can take at each point of the program. The second step of the compiler is mapping program constructs from MATLAB to Fortran. I will highlight solutions to several challenges in this step, including how to handle the case where a variable is associated with more than one shape, how to transform MATLAB's general linear indexing to Fortran, and how to map the numerous MATLAB built-in functions to Fortran. This work has been implemented within the McLab framework, and I will also present preliminary performance results for a set of MATLAB benchmarks.
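
To make the range value analysis concrete, here is a toy Python sketch of the idea: abstract each scalar as an interval [lo, hi] and join intervals where control-flow paths merge. The transfer functions shown are simplified assumptions, not the analysis as implemented in Mc2For.

    # Toy sketch of range value analysis: like constant propagation, but the
    # abstract value of a scalar is an interval [lo, hi]. Illustrative only.

    def interval_add(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def interval_join(a, b):
        # Merge values flowing in from two control-flow paths.
        return (min(a[0], b[0]), max(a[1], b[1]))

    # i = 1; if (cond) i = i + 1; else i = i + 3;  =>  i is in [2, 4]
    env = {"i": (1, 1)}
    then_branch = interval_add(env["i"], (1, 1))   # [2, 2]
    else_branch = interval_add(env["i"], (3, 3))   # [4, 4]
    env["i"] = interval_join(then_branch, else_branch)
    assert env["i"] == (2, 4)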


MiX10: Compiling MATLAB for High Performance Computing via X10
Vineet Kumar, Laurie Hendren - School of Computer Science, McGill University
Presentation Slides: [PPT]

MATLAB is a popular dynamic array-based language commonly used by students, scientists and engineers, who appreciate the interactive development style, the rich set of array operators, the extensive builtin library, and the fact that they do not have to declare static types. Even though these users like to program in MATLAB, their computations are often very compute-intensive and better suited to emerging high-performance computing systems. Our solution to help scientific programmers make better use of HPC systems is a source-to-source MATLAB compiler, MiX10, which translates MATLAB programs to X10, a language designed for "Performance and Productivity at Scale". In this talk, we will discuss: (1) the key differences between MATLAB, a dynamically typed array-based language with unconventional semantics, and X10, a statically typed object-oriented language with Cilk-style region-based arrays and rail-backed multidimensional arrays, and the design of the MiX10 compiler to handle these differences efficiently; (2) solutions to the challenges of mapping various MATLAB constructs to equivalent X10 constructs for both kinds of X10 arrays; (3) important static analyses, such as identification of complex numerical values, handling of variables with type conflicts, array-bounds checks, and identification of non-mutable variables; (4) a template-based specialization approach to compiling the builtin MATLAB operators; and finally (5) preliminary performance results for a set of benchmarks.
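
The template-based specialization in (4) can be pictured as instantiating a per-builtin code template for the inferred operand types. The sketch below is only a guess at the flavor of the approach; the template text, type names and X10-ish output are invented for illustration and are not the actual MiX10 templates.

    # Sketch of template-based specialization of a MATLAB builtin: one
    # generic template per builtin, instantiated for the inferred operand
    # types. Template and type names are invented for illustration.

    PLUS_TEMPLATE = (
        "public static def plus_{t}(a: {atype}, b: {btype}): {rtype} "
        "= a + b;"
    )

    def specialize_plus(atype, btype):
        # A real compiler would consult the inferred shapes/types here.
        rtype = "Double" if "Double" in (atype, btype) else "Long"
        tag = f"{atype}_{btype}".lower()
        return PLUS_TEMPLATE.format(t=tag, atype=atype, btype=btype, rtype=rtype)

    print(specialize_plus("Double", "Long"))
    # public static def plus_double_long(a: Double, b: Long): Double = a + b;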


Examining Trends in the Hardware/Software Interface
Marcel Mitran, IBM Canada Software Laboratory
Presentation Slides: [PPT]


Quantifying Active Data Sharing in Multi-threaded Programs
Hao Luo, Chen Ding, Pengcheng Li - University of Rochester
Presentation Slides: [PPT]

When running threaded code, multiple cores may access the same data at the same time. As the number of cores increases and the cache hierarchy becomes more complex, active data sharing is increasingly important in determining the scalability and stability of parallel performance. This talk presents a concept called shared footprint: the amount of data actively used by two or more threads in a period of execution. As an analogue and extension of footprint, which measures active data usage in a sequential program, shared footprint measures the common working set of a parallel program. The talk presents a set of metrics parameterized by the number of sharers and a linear-time algorithm to measure all of them, and discusses the uses of shared footprint in program characterization, false sharing detection, and scalability analysis.
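
The definition of shared footprint can be stated directly in code. The brute-force Python sketch below averages, over all windows of a fixed length, the number of data blocks touched by at least k distinct threads; it is quadratic and purely illustrative, whereas the talk presents a linear-time algorithm covering all window lengths.

    # Brute-force sketch of the shared footprint definition: the number of
    # data blocks touched by at least k distinct threads within a window
    # of the interleaved access trace.

    from collections import defaultdict

    def shared_footprint(trace, window, k=2):
        """trace: list of (thread_id, block_address) accesses."""
        sizes = []
        for start in range(len(trace) - window + 1):
            sharers = defaultdict(set)
            for tid, addr in trace[start:start + window]:
                sharers[addr].add(tid)
            sizes.append(sum(1 for s in sharers.values() if len(s) >= k))
        return sum(sizes) / len(sizes)   # average over all windows

    trace = [(0, "a"), (1, "a"), (0, "b"), (1, "c"), (0, "c"), (1, "b")]
    print(shared_footprint(trace, window=4))  # avg blocks shared by >= 2 threads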


Simple Profile Rectifications Go A Long Way: Demystifying the Influence of Sampling Errors on Feedback Driven Program Optimizations
Bo Wu¹, Mingzhou Zhou¹, Xipeng Shen¹, Yaoqing Gao², Raul Silvera², Graham Yiu² - ¹College of William & Mary, ²IBM Canada Software Laboratory
Presentation Slides: [PPT]

Feedback-driven program optimization (FDO) is common in modern compilers, including the Just-In-Time compilers increasingly adopted for object-oriented and scripting languages. This paper describes a systematic study of understanding and alleviating the effects of sampling errors on the usefulness of the obtained profiles for FDO. Taking a statistical approach, it offers a series of counter-intuitive findings and identifies two kinds of profile errors that critically affect FDO: zero-count errors and inconsistency errors. It further proposes statistical profile rectification, a simple approach to correcting profiling errors by leveraging statistical patterns in a profile. Experiments show that this simple approach dramatically enhances the effectiveness of FDO based on sampled profiles, increasing the average FDO speedup from 1.16X to 1.3X, around 92% of what full profiles can yield.
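
To illustrate the two error kinds, the Python sketch below applies two simple rectifications to sampled basic-block counts on a tiny control flow graph: give a plausible nonzero count to a block whose neighbors are clearly hot, and clamp counts that exceed the flow into a block. The concrete rules here are assumptions for illustration; the paper's statistical rectification is more principled.

    # Sketch of the two rectifications on a tiny CFG: (1) zero-count
    # errors, (2) inconsistency errors. Rules are illustrative only.

    def rectify(block_counts, edges):
        """block_counts: {block: sampled count}; edges: list of (src, dst)."""
        counts = dict(block_counts)
        # (1) Zero-count errors: a block with a hot predecessor and a hot
        # successor almost certainly executed; give it a plausible count.
        for b in counts:
            preds = [counts[s] for s, d in edges if d == b]
            succs = [counts[d] for s, d in edges if s == b]
            if counts[b] == 0 and preds and succs and min(max(preds), max(succs)) > 0:
                counts[b] = min(max(preds), max(succs))
        # (2) Inconsistency errors: a block's count should not exceed the
        # total flow into it; clamp to restore consistency.
        for b in counts:
            preds = [counts[s] for s, d in edges if d == b]
            if preds:
                counts[b] = min(counts[b], sum(preds))
        return counts

    counts = {"entry": 100, "loop": 0, "exit": 98}   # "loop" was never sampled
    edges = [("entry", "loop"), ("loop", "exit")]
    print(rectify(counts, edges))  # "loop" gets a plausible nonzero count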


Using Machine Learning to Automatically Tune GPU Program Performance
Tianyi David Han and Tarek S. Abdelrahman - University of Toronto
Presentation Slides: [PPT]

Today, there is an ever-increasing need for tools that automatically tune the performance of parallel applications on target platforms. This is particularly the case for GPU accelerators, which expose the complex underlying architecture to the programmer. Determining which optimizations to apply is a complex process that is often done by exhaustive experimentation, a tedious and error-prone approach. In this presentation we describe our preliminary efforts to use machine learning to predict which GPU optimizations to apply to a given program. Two key challenges in designing an accurate machine learning model are finding a large and diverse training set and overcoming the high dimensionality of the program feature space, particularly for real programs. We describe how to generate large synthetic training sets for building a machine learning model. We also propose and give a preliminary evaluation of a novel approach that combines several simple and accurate machine learning models, using the control flow graph of the program, to obtain more accurate predictions than a single whole-program model.
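
The prediction task itself can be sketched in a few lines. The Python example below trains a small decision tree on a synthetic training set to predict whether one optimization is profitable; scikit-learn, the two features, and the labeling oracle are all invented stand-ins (the talk does not name a library), chosen only to show the shape of the setup.

    # Minimal sketch: learn, from program features, whether a given GPU
    # optimization is profitable. Features and labels are synthetic.

    import random
    from sklearn.tree import DecisionTreeClassifier

    random.seed(0)

    def synthetic_kernel():
        # Stand-in "features": (arithmetic intensity, coalesced fraction).
        return (random.random() * 10, random.random())

    def profitable(features):
        # Stand-in oracle: pretend the optimization pays off for
        # memory-bound, poorly coalesced kernels.
        intensity, coalesced = features
        return intensity < 4 and coalesced < 0.5

    X = [synthetic_kernel() for _ in range(500)]
    y = [profitable(f) for f in X]

    model = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(model.predict([(2.0, 0.2), (8.0, 0.9)]))  # [True, False] expected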


ABC-Optimizer: An Affinity-Based Code Layout Optimizer
Chen Ding, Rahman Lavaee, Pengcheng Li - University of Rochester
Presentation Slides: [PPT]

As the code size of modern applications increases, code layout becomes an increasingly important problem for cache utilization and, as a result, program performance. The hierarchical model of reference affinity provides an approach to this problem: it processes the instruction trace of the program and groups code units that are often accessed close together in time. In this paper, we present the profile-guided affinity-based code layout optimizer (ABC-optimizer), which enhances the utilization of instruction and data caches. Our sampling algorithm for finding the affinity hierarchy outperforms previous approaches in asymptotic complexity, and our evaluation on the Python interpreter shows that ABC-optimizer can improve performance by up to double-digit percentages.
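
The core intuition of affinity-based layout can be sketched as follows: count how often two code units appear within a small window of the execution trace, then place the highest-affinity pairs adjacently. This greedy Python sketch shows only that intuition; the paper builds a full affinity hierarchy with a more efficient sampling algorithm.

    # Greedy sketch of affinity-based code layout. Illustrative only.

    from collections import Counter
    from itertools import combinations

    def affinity_layout(trace, window=4):
        affinity = Counter()
        for i in range(len(trace) - window + 1):
            for a, b in combinations(set(trace[i:i + window]), 2):
                affinity[tuple(sorted((a, b)))] += 1
        layout = []
        for (a, b), _ in affinity.most_common():
            for f in (a, b):
                if f not in layout:
                    layout.append(f)   # co-hot units end up near each other
        return layout

    trace = ["interp", "dispatch", "interp", "dispatch", "gc", "alloc", "gc"]
    print(affinity_layout(trace))  # e.g. ['dispatch', 'interp', 'gc', 'alloc']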
