Author: Alexis Platonoff platonof@cri.mines-paristech.fr

The compilation is a sequence of five phases. The first phase checks that
the imput is a static control program, *i.e.*, fulfills some
restrictions. The second phase builds an array data flow graph (DFG). The
third one constructs a scheduling function compatible with the DFG. This
function minimizes the execution time for an unbounded PRAM. The fourth
one computes a mapping value for a physical distributed memory parallel
machine. Finally, the last phase generates the parallel code.

Although this compilation scheme is implemented in PIPS, it is NOT interprocedural yet.

- DO loop and the IF test control structures;
- assignment and input/output statements;

To each node of the DFG are associated an *instruction* and an
*execution domain*. The execution domain is a set of constraints
specifying the variation domain of the surrounding loop indices of the
instruction.

Nodes are connected by arcs which represent true data dependences. To each
edge of the DFG are associated the *reference* of the use
dependence, a *transformation function* in which each index of the
definition iteration domain is expressed as a linear function of the
indices of the use iteration domain, and a *governing predicate*
that specifies the sub-space of the sink's iteration domain on which the
edge exists (it is a system of constraints upon the indices of the sink's
iteration domain).

For a given time `t`, the scheduling function defines a set of
independent operations which can be executed simultaneously. This set is
called a front. The execution of two successive fronts must be sequential.

Even with the static control restrictions, it is not possible to always
have a unidimensional linear schedule. Typically, this happens when there
are two or more sequential loops in the same loop nest. In that case, a
*multidimensional* linear schedule is computed [Fea92b].

Each dimension of the placement function defines a placement direction for the instruction and all these directions constitute the distribution space of the target machine. The number of dimensions to compute is arbitrary with a maximum equal to the dimension of iteration space minus the dimension of the scheduling function.

Initially, the goal of the algorithm that computes the placement function was to reduce the number of communications. For an edge of the DFG, there is a potential communication between its source and its destination, which can be represented by a distance. If such a distance is equal to zero (the edge is ``cut'') then the source and the destination will be mapped onto the same processor, and there will be no communication at all.

The principle of the method is to nullify as many distances as possible.
To each edge is associated a *cutting condition* represented by a
system of equalities. The length of an edge is nullified if its cutting
condition is satisfied, i.e. its equalities are satisfied. These
equalities corresponds to the nullification of the factors of the
variables (loop indices and structure parameters) appearing in the
distance. In most cases, satisfying the cutting conditions of all the
edges will lead to the trivial solution: some null dimensions,
*i.e.*, all the occurences of the instruction are placed on the same
processor. To avoid this, some edges should not be cut. To choose them, a
greedy algorithm has been proposed [Fea93], it treats
the edges by decreasing importance. The importance of an edge is
represented by its weight which is equal to the volume of data which has
to be sent if the edge is not cut.

This method does not take into account the type (and hence the temporal cost) of the communications. To that purpose, an extension of the method has been implemented. It is based on a special treatment of potential communications that can be optimized [Pla95b], for instance spread, reduction...

do t = 1, number of fronts execute concurrently all operations in F(t) synchronize end doThe code generation is based upon three transformations:

- Total expansion of scalars and arrays:
- transformation of the initial program into a dynamic single assignment form [Fea88].
- Loop reordering:
- rearrangment of the iteration domain of the initial loops according to the scheduling and placement functions (the first gives the sequential loops, the second the parallel ones). The reordering is equivalent to scanning polyhedra with do loops [AI91].
- Reindexing:
- substitution of all the array access functions with new ones computed according to the new loops.

In PIPS, the generated parallel code can be either CM Fortran (for the CM-5) or CRAFT Fortran (for the Cray-T3D).

[CF93] J.-F. Collard and P. Feautrier. Automatic generation of data parallel code. In Fourth International Workshop on Compilers for Parallel Computers, Delft University of Technology, The Netherlands, December 1993.

[Fea88] P. Feautrier. Array expansion. in ACM Int. Conf. on Supercomputing, Saint-Malo, jul 1988.

[Fea91] P. Feautrier. Dataflow Analysis of Array and Scalar References. Int. Journal of Parallel Programming, 20(1):23-53, February 1991.

[Fea92a] P.Feautrier. Some Efficient Solutions to the Affine Scheduling Problem, Part I : One-dimensional Time. Int. J. of Parallel Programming, 21(5):313-348, October 1992.

[Fea92b] P. Feautrier. Some Efficient Solutions to the Affine Scheduling Problem, Part II : Multidimensional Time. Int. J. of Parallel Programming, 21(6):389-420, December 1992.

[Fea93] P. Feautrier. Toward Automatic Partitioning of Arrays on Distributed Memory Computers. In ACM ICS'93, pages 175-184, Tokyo, July 1993.

[Pla95a] A. Platonoff. Contribution à la distribution Automatique des Données pour Machines Massivement Parallèles. Thèse de Doctorat de l'Université P. et M. Curie, March 9th 1995.

[Pla95b] A. Platonoff. Automatic Data Distribution for Massively Parallel Computers. In 5th International Workshop on Compilers for Parallel Computers, Malaga University, Spain, June 1995.