APPENDIX 8 High Performance Fortran

Introduction
Some links to further HPF information
Data storage
Execution
Intrinsic functions
Storage rules
HPF Subset

Introduction

Very important, at the introduction of a new programming language standard, is nowadays that the language should permit efficient compilation and execution not only on conventional computers, but also on parallel and massively parallel systems. Somewhat simplified it could be stated that Fortran 90 is efficient on conventional computers and on vector processors, but less efficient on parallel processors. The reason for this is that in Fortran it does not exist any commands for partition into parallel processes, except for those that follow directly from the array concept, with its operators and functions.

A group based in Houston, High Performance Fortran Forum, has developed a proposal in the form of an extension to Fortran 90. The aim of this project High Performance Fortran or HPF is to offer a portable language which permits an efficient use on different parallel systems. The project has issued a final proposal in May 1993, and aims towards a de facto standard, and not formal standard from ANSI or ISO. In order to facilitate the introduction and general acceptance of HPF the group has also defined a subset based on Fortran 77 and just a few parts of Fortran 90. A number of manufacturers were involved in the group, and the proposal will hopefully get a fast acceptance, with many implementations. Those parts of the proposal that were controversial, and therefore requires further study, is available in a separate document "Journal of Development".

A very good book on HPF is the one by Koelbel et al.

The High Performance Fortran Forum has continued to develop the HPF language and are aiming to publish a new specification later this year. New features address asynchronous I/O, tasking, more general distributions, interfacing to ANSI C, reduction operations and more. It is likely that the core language will not be significantly expanded and that the specification will define extensions which are standardised but optional for a given implementation. There will also be a definition of a kernel of facilities which should be most efficient across a wide range of implementations.

The proposal is a Fortran-based language with facilities to control distribution of arrays onto distributed memory parallel computers and includes extensions for

Parallel execution
Good performance both on MIMD and SIMD computers
Distribution of data on the available processors
Location of data within a specific processor

The complete HPF is based on Fortran 90 and includes

New directives
New syntax (the statement FORALL and new intrinsic functions)
Restrictions in the rules for storage

The Subset HPF is instead based on Fortran 77 in order to become implemented already in 1993 and includes

MIL-STD-1753 (US military extension to Fortran 77, it is also included in Fortran 90)
Array-operations
All HPF-constructs, but with certain limitations in their functionality

The directives have been given such a form that it will be possible to use them as ordinary statements in a future version of Fortran. At present they will have to be written as comments.


CHPF$ directive	   ! Fortran 77 and fix form Fortran 90
!HPF$ directive	   ! Fortran 90 (both fix and free form)

Since the founding of the group HPFF, much has been accomplished and commercial implementations of HPF version 1.0 are now appearing: Applied Parallel Research, Digital, Intel, Kuck and Associates, Meiko, Motorola, NEC, ACE, Hitachi, the Portland Group, Inc. and SofTech have already announced commercial products based on the HPF 1.0 standard (some of these are joint ventures).

Further information on the DEC Fortran 90, which includes HPF, is available.

The Portland Group has a High Performance Fortran Compiler available.

The Applied Parallel Research has a High Performance Fortran Compiler available.

The current (source: September 1995 HPFF Minutes) vendor implementation list is:

Announced Products
Applied Parallel Research, Digital, Hitachi, Intel, Meiko, Motorola, NA Software, NEC, Pacific Sierra Research, The Portland Group, Inc. (PGI), and SofTech.
Announced Efforts
ACE, Fujitsu, IBM, Lahey, MasPar, NAG, nCube, and Thinking Machines.
Interested

Cray Research, EPC, Convex, HP, Silicon Graphics, and Sun.

Some links to further HPF information

Data storage

HPF has directives on how data ought to be stored. These directives do not influence the computational results, but they may influence the computational speed (hopefully increasing it). The reason for introducing the directives for data storage are the following two very simple observations:

An operation on several data objects ought to be faster if all the objects are on the same processor.
It ought to be possible to perform many operations at the same time if the operations are performed on different processors.

Using the data storage directives we can distribute the data on the various processors. Sometimes it is desirable first to align an array against a template, and then distribute the template on the processors. A first example can explain the situation.

	REAL A(1000), B(1000), C(1000), X(500), Y(0:501)
	INTEGER INX(1000)
!HPF$   PROCESSORS PROCS(10)
!HPF$   DISTRIBUTE A(BLOCK), B(BLOCK) ON PROCS
!HPF$   DISTRIBUTE INX(BLOCK) ON PROCS
!HPF$   DISTRIBUTE C(CYCLIC) ON PROCS
!HPF$   ALIGN X(I) WITH Y(I-1)

	A(I) = B(I)                             ! (1)
	X(I) = Y(I-1)                           ! (2)
	A(I) = C(I)                             ! (3)
	A(I) = A(I-1) + A(I) + A(I+1)           ! (4)
	C(I) = C(I-1) + C(I) + C(I+1)           ! (5)
	X(I) = Y(I)                             ! (6)
	A(I) = A(INX(I)) + B(INX(I))            ! (7)

We here work with 10 processors, and we have distributed the floating point vectors A and B and also the integer vector INX block wise over these processors. We then have the first one hundred elements from each one of the vectors on the first processor, the next one hundred on the next processor, and so on. The floating point vector C is however distributed cyclically, the first, eleventh, twenty-first etc. elements are on the first processor, the second, twelfth, twenty-second etc. elements are on the second, and so on.

Without instructing the system exactly how the two vectors X and Y are to be distributed, we require that the I-th element of the vector X is on the same processor as the (I-1)-th element of the vector Y.

The assignments in the cases (1) and (2) above can thus be done without any inter-processor communication, since all data all the time is on the same processor.

The assignment in case (3) will in most instances be between different processors, only in exceptional cases will the data be on the same processor.

The assignment in case (4) on the other hand will in most instances be between data on the same processor, except in the exceptional cases at block boundaries. The case (5) looks like case (4), but in this case the data is all the time on different processors.

For case (6) we only know that X(I) and Y(I-1) are on the same processor, but this gives no information whatsoever on how X(I) and Y(I) are distributed. At block wise storage both of them are normally on the same processor, at cyclic storage they are always on different processor (provided that the cycles are the same).

We do not know very much about the distribution in case (7). It is however easy to realize that A(INX(I)) and B(INX(I)) always are on the same processor, and that also A(I) and INX(I) are on the same processor. It is however not likely that these two sets will coincide.

Also the formal arguments in a function or a subroutine can be stored in different ways using directives. Assume that we have a subroutine with two arguments, both floating point vectors.

	SUBROUTINE CAROLUS (KARL, CHARLES)
	REAL, DIMENSION (1718) :: KARL, CHARLES
!HPF$   ALIGN WITH * :: KARL
!HPF$   ALIGN WITH KARL :: CHARLES
        ...
        END

This is called with

	REAL, DIMENSION (1718) :: GUSTAV, ADOLF
	...
	CALL CAROLUS (GUSTAV, ADOLF)

This means that the first formal argument KARL is distributed in the same way as the actual argument GUSTAV. The second formal argument CHARLES is distributed in the same way as the first formal argument, in this case in the same way as GUSTAV, and not as the second actual argument ADOLF.

Another example is when you require the four corner submatrices of the size N*N from a larger matrix of the size (N+1)*(N+1) . In this case we call the large matrix EARTH and the submatrices are called NW, NE, SW, and SE. In some cases the large matrix is not required to be used in any calculations, its only purpose is to assist at the distribution of the data in the submatrices. It is therefore a TEMPLATE, which is not stored.

!HPF$   TEMPLATE, DISTRIBUTE (BLOCK, BLOCK) :: EARTH (N+1, N+1)
	REAL, DIMENSION (N, N) :: NW, NE, SW, SE
!HPF$   ALIGN NW(I, J) WITH EARTH(I, J)
!HPF$   ALIGN NE(I, J) WITH EARTH(I, J+1)
!HPF$   ALIGN SW(I, J) WITH EARTH(I+1, J)
!HPF$   ALIGN SE(I, J) WITH EARTH(I+1, J+1)

Since a TEMPLATE does not represent any real storage, it can not be part of any COMMON. The alignment subscripts can be just a little more complicated than the old array indices in Fortran 66, they are of the general form m*i+n , where i is a variable which is permitted to appear only once in the expression, and m and n are constants (PARAMETER or numerical constant).
A somewhat more complicated directive is

!HPF$   ALIGN A(:) WITH D(:,*)

which means that a copy of the vector A is associated with each column of the matrix C. An example on this follows, where it is also shown that the HPF syntax permits several ways for expressing identical requirements.

!HPF$   TEMPLATE, D1(N), D2(N, N)
	REAL, DIMENSION (N, N) :: X, A, B, C, AR1, AR2, &
                                  P, Q, R, S
!HPF$   ALIGN X(:,*) WITH D1(:)
!HPF$   ALIGN (:,*) WITH D1 :: A, B, C, AR1, AR2
!HPF$   ALIGN WITH D2, DYNAMIC :: P, Q, R, S

The following is a more complete example, where an intrinsic function finds the present number of processors, which is used at the distribution of the arrays. In addition a usual (external) function is being used.

        PARAMETER ( N = NUMBER_OF_PROCESSORS() )
!HPF$   PROCESSORS MPP(N)

!HPF$   TEMPLATE T(1000), S(5000)
!HPF$   DISTRIBUTE T(BLOCK) ONTO MPP    ! Block size may
!HPF$   DISTRIBUTE S(CYCLIC) ONTO MPP   ! be specified

        REAL, DIMENSION (1000)  :: X, Y, Z
        REAL, DIMENSION (5000)  :: V, W
!HPF$   ALIGN WITH T :: X, Y, Z
!HPF$   ALIGN WITH S :: V, W

        REAL, DIMENSION (10,1000) :: A
!HPF$   ALIGN WITH T :: A(*,:)          ! Vector of columns
	...
        TEMP = F(V)
	...
        END

	FUNCTION F(X)
	REAL, DIMENSION (:) :: X
!HPF$	ALIGN WITH * :: X       ! Distribute the formal argument
                                ! X as the real argument V
	REAL, DIMENSION (SIZE(X)) :: S
!HPF$	ALIGN WITH X :: S       ! Distribute the local vector S
                                ! as the formal argument X, or 
                                ! as the actual argument V
	...
	RETURN
	END

You can specify different distributions along different dimensions. The following specifications

        REAL A(100,100), B(100,100), C(200)
!HPF$   DISTRIBUTE A(BLOCK,*), B(*,CYCLIC), C(BLOCK(5))

means that the first processor of a four processor computer stores the following array sections

	A(1:25, 1:100)
	B(1:100, 1:97:4)
	C(1:5), C(21:25), C(41:45), C(61,65), C(81:85),
	C(101:105), C(121:125), C(141:145), C(161,165),
	C(181:185)

It is also possible to distribute several dimensions in completely independent ways,

	REAL D(8,100,100)
!HPF$   DISTRIBUTE D(*,BLOCK, CYCLIC)

means that the first processor of a four processor computer, configured as a 2*2 matrix, stores the following array sections

	D(1:8, 1:50, 1:99:2)

In addition to the static directives discussed above, there are also the two dynamic directives REDISTRIBUTE and REALIGN, who permit an array to switch its distribution within the subroutine. At the use of a subroutine it does exist three different possibilities for the formal arguments.

A formal argument can inherit the distribution from the actual argument.
A formal argument can be redistributed to a requested distribution.
A formal argument can be specified to have a certain distribution, and is an error if the actual argument does not have that distribution.

At the exit from a subroutine it is necessary that the effect of any redistribution is removed.

Execution

A new statement FORALL is introduced as an alternative to the DO-loop. The intent with the FORALL statement is that its content can be executed in any order, independent of the index! It therefore gives the possibility of a parallel implementation. Some rules are however introduced in order to avoid side effects. Some simple examples on FORALL follow.

	FORALL (I = 1:N, J = 1:N) H(I, J) = 1.0/REAL(I+J-1)
	FORALL (I = 1:N, J = 1:N, Y(I, J) .NE. 0.0) &
			X(I,J) = 1.0/Y(I,J)
	FORALL (I = 1:N) A(I,I+1:N) = 3.141592654

The first of these define a Hilbert matrix of order N, the second inverts the elements of a matrix, avoiding division with zero. In the third example all elements above the main diagonal of the matrix are assigned the value of the mathematical constant pi.

In all the three statements above, FORALL can be considered as a double loop, which can be executed in arbitrary order. The general form of the FORALL statement is

	FORALL ( v1 = l1:u1:s1, ... , vn = ln:un:sn, mask ) &
		a(e1, ... , em) = right_hand_side

and is evaluated according to certain well specified rules, in principle all indices are evaluated first.

In addition there is FORALL construct, which has to be distinguished from the single line FORALL statement. The construct is interpreted in the same way as if its internal statements were written as FORALL statements in the same order. This restriction is important in order to achieve a unique computational result. Within a FORALL construct calls of functions or subroutines are permitted only if these subprograms are pure (without side effects). First we give a simple example of a FORALL construct.

	REAL, DIMENSION(N, N) :: A, B
	...
	FORALL (I = 2:N-1, J = 2:N-1)
	   A(I,J) = 0.25*(A(I,J-1)+A(I,J+1)+A(I-1,J)+A(I+1,J))
	   B(I,J) = A(I,J)
	END FORALL

When these statements have been executed the arrays A and B have identical values in the internal points, while B has kept its previous values on the boundaries.

In addition a directive INDEPENDENT has been introduced for both DO loops and FORALL constructs. This directive is placed immediately before the DO statement or FORALL construct, and is valid until the corresponding END DO (or the old form of terminating a DO loop) or END FORALL. The directive assures the system that this part of the program can be executed in an arbitrary order, including parallel, without any computational differences in the result (no semantic change).

In the example below it is thus assured that the integer vector P does not have any repeated values (which would have meant that last one wins at a normal sequential execution). A potential conflict at parallel execution is thus avoided. It is also implicitly assured that all values of P are within the permitted bounds of 1 and 200.

	REAL, DIMENSION(200) 	 :: A
	REAL, DIMENSION(100) 	 :: B
	INTEGER, DIMENSION(100)  :: P
	...
!HPF$   INDEPENDENT
	DO I = 1, 100
	   A(P(I)) = B(I)
	END DO

It is also possible to indicate that certain parts of a nested loop shall be considered as independent. In the example below the innermost loop is not independent since each element of A is assigned several times, for all values of I4.

	REAL, DIMENSION(N, N, N) :: A, B, C
        ...
!HPF$   INDEPENDENT, NEW (I2)
	DO I1 = 1, N1
!HPF$     INDEPENDENT, NEW (I3)
	  DO I2 = 1, N2
!HPF$       INDEPENDENT, NEW (I4)
	    DO I3 = 1, N3
	      DO I4 = 1, N4     ! The innermost loop is NOT
                                ! independent !
		 A(I1, I2, I3) = A(I1, I2, I3) &
		   + B(I1, I2, I4) * C(I2, I3, I4)
	      END DO
	    END DO
	  END DO
	END DO

The NEW clauses are required, since the inner loop indices are assigned and used in different iterations of the outer loops.

A DO loop and a FORALL construct with the same content are equivalent if they both have been given the INDEPENDENT directive.

Intrinsic functions

In addition to all the intrinsic functions of Fortran 90 the language extension HPF introduces a few new, and modifies two of those already in Fortran 90.

To the functions MAXLOC and MINLOC have been added an optional parameter DIM, which informs along which dimension the maximum or minimum is to be taken. An example illustrates how it works on MAXLOC. Without the extra parameter the index for the largest element is returned. If the optional parameter DIM = 1 the row numbers for the largest element in each column are returned. If the optional parameter DIM = 2 the column numbers for the largest element in each row are returned. The corresponding applies to MINLOC. These new facilities are the same as those that already exist in Fortran 90 for MAXVAL and MINVAL.

                0	-5	 8	-3
        A =     3	 4	-1	 2
                1	 5	 6	-4

gives the following values

	MAXLOC(A)               = (/ 1, 3 /)
	MAXLOC(A, DIM = 1)      = (/ 2, 3, 1, 2 /)
	MAXLOC(A, DIM = 2)      = (/ 3, 2, 3 /)

The following completely new functions have been added. The inquiry functions are to be intrinsic, but the others may instead be available in a library as external functions.

Inquiry functions:

NUMBER_OF_PROCESSORS(DIM)
Total number of processors
PROCESSORS_SHAPE()
Shape for the processors

Example. For a DECmpp 12000 Model 8B we get

NUMBER_OF_PROCESSORS()           8192
NUMBER_OF_PROCESSORS(DIM=1)      128
NUMBER_OF_PROCESSORS(DIM=2)      64
PROCESSORS_SHAPE()               (/ 128, 64 /)

while on an ordinary workstation we get

NUMBER_OF_PROCESSORS() 	        1
PROCESSORS_SHAPE()              (/ 1 /)

Bit Manipulation Functions:
These three use the model in section 13.5.7 of the Fortran 90 standard. Note that the results are machine dependent, since they require the number of bits in an integer, which in Fortran 90 is available with the intrinsic function BIT_SIZE.
- POPCNT, gives the number of 1-bits in the integer argument
- POPPAR, gives the result 1 if the number of 1-bits in the integer argument is odd, else the result 0.
- LEADZ, gives the number of leading zeros in the integer argument.
Other functions:
- The function ILEN is an integer function which returns the number of bits required to store a signed integer in the form of a 2 complement. Examples of its use are that 2**ILEN(N-1) rounds an arbitrary positive integer N upwards towards the closest power of 2, and 2**(ILEN(N)-1) rounds down.
- The following routines are not required to be intrinsic but may be in a normal external library, and in that case they require a USE statement. They are not explained below, we refer the reader to the HPF standard.
  The array reduction functions IALL, IANY, IPARITY and PARITY are available and they correspond to the following intrinsic functions of Fortran 90, namely IAND, IOR, IEOR and the operator .NEQV.
  A large number of functions are available to gather and scatter data, they have names of the form XXX_SCATTER, where XXX can be SUM, COUNT, PRODUCT, MAXVAL, MINVAL, IALL, IANY, IPARITY, ALL, ANY and PARITY. For parallel operations there are the functions XXX_PREFIX and XXX_SUFFIX, where XXX has the same possibilities as for XXX_SCATTER.
  Example: SUM_PREFIX sums successively the elements of the array, the first remains unchanged, the second becomes the sum of the first two, and so on. With SUFFIX the summation is done in the other direction (backwards). In addition there are two sorting functions, GRADE_UP and GRADE_DOWN. Operations for parallel input and output are being considered, and can be found in the Journal of Development.

Storage rules

For Fortran 90 there exist certain rules about storage of variables, especially for arrays who have to be stored in a well specified order, with the first index varying fastest. There are also special rules for the storage of variables that are in COMMON and/or EQUIVALENCE. These rules on sequential storage can conflict with the distribution requested by the HPF statements ALIGN, REALIGN, DISTRIBUTE and REDISTRIBUTE. The Fortran model works with linear storage with one storage unit for integers and ordinary floating point variables, and two storage units for double precision floating point numbers.

Fortran has storage association and sequence association. The Fortran 90 standard states in (14.6.3) that storage association is the association of two or more data objects that occurs when two or more storage sequences share or are aligned with one or more storage units, and in (12.4.1.4) that sequence association is the order that Fortran requires when an array is associated a formal argument. The rank and shape of the actual argument neeed not agree with the rank and shape of the dummy argument, but the number of elements in the dummy argument must not exceed the number of elements in the element sequence of the actual argument. If the dummy argument is assumed size, the number of elements in the dummy argument is exactly the number of elements in the element sequence.

Note that HPF has no problem with array parameters distributed over the processors, as long as both the actual and the dummy arguments have the same rank and shape. It is when the properties of Fortran, with respect to COMMON and EQUIVALENCE, are used too much, that we get into problems. If we use a subroutine that contains the following specifications

        SUBROUTINE HOME(X)
	DIMENSION X(20,10)

it can be called with CALL HOME(Y(2,1)) provided that the array X is specified SEQUENTIAL in the subroutine HOME and the array Y is also specified SEQUENTIAL in the calling program unit. The simple call CALL HOME(Y) is permitted if X and Y are both sequential, or if X and Y are dimensioned in exactly the same way.

A directive SEQUENCE has therefore been introduced in order to inform the system that a variable or a COMMON block shall be considered sequential. Using HPF the normal case is that COMMON blocks are considered non-sequential, while variables normally also are non-sequential (except in a few cases, as when the variable is part of a sequential COMMON block or it is an array with assumed dimension).

In order to get old programs to produce the same results with HPF the following is recommended:

Include a facility to check that all parts of COMMON blocks agree within the whole program.
Include a facility so that all COMMON blocks are considered sequential, if no explicit directive for the opposite case has been given.

HPF Subset

The following news from Fortran 90 are included in the HPF Subset:

Ordinary concepts:
- Identifiers with up to 31 characters including the underline symbol _
- Comments starting with an exclamation mark !
Array concepts
- Array sections
- Array assignment
- The statements WHERE and ELSEWHERE
- Array valued external functions
- Automatic arrays
- Allocatable arrays
- Arrays with assumed shape
- New intrinsics
- Array valued new intrinsics
Additional concepts from Fortran 90
- Optional arguments
- Keyword arguments
- INTERFACE (but not generic or for modules)
- Type specifications (but not KIND and TYPE)
The military extension MIL-STD-1753 to Fortran 77
- The statements DO WHILE, END DO, IMPLICIT NONE and INCLUDE
- The bit manipulation functions BTEST, IAND, IBCLR, IBITS, IBSET, IEOR, IOR, ISHFT, ISHFTC and NOT
- The bit copy routine MVBITS
- The binary, octal and hexadecimal constants.

The following from the complete HPF are however not included in the HPF Subset:

The directives DYNAMIC, EXTRINSIC, LOCAL, PROCESSOR_VIEW, PURE, REALIGN and REDISTRIBUTE
The construct FORALL ... END FORALL
The HPF-library
The HPF_LIBRARY module

In addition there are certain restrictions in the subset on the permitted values of the argument DIM to MAXLOC and MINLOC, and there is no definition of extrinsic subprograms and no binding to Fortran 90.

Last modified: 30 March 1996
boein@nsc.liu.se

APPENDIX 8

High Performance Fortran