Optimizing compiler. Vectorization
Option for vectorization control
/Qvec-report[n]
Controls the amount of vectorizer diagnostic information:
- n=0 no diagnostic information
- n=1 indicate vectorized loops (DEFAULT)
- n=2 indicate vectorized/non-vectorized loops
- n=3 indicate vectorized/non-vectorized loops and prohibiting data dependence information
- n=4 indicate non-vectorized loops
- n=5 indicate non-vectorized loops and prohibiting data dependence information
Usage: icl -c -Qvec_report3 loop.c
- C:\loops\loop1.c(5) (col. 1) : remark: LOOP WAS VECTORIZED.
- C:\loops\loop3.c(5) (col. 1): remark: loop was not vectorized: vectorization possible but seems inefficient.
- C:\loops\loop6.c(5) (col. 1) : remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Simple criteria of vectorization admissibility
Let us express the vectorization of a loop using Fortran array sections.
A good criterion for vectorization is that introducing array sections does not create a dependency.
There is a dependency because the section A(I+1:I+1+VL) on iteration I and the section A(I+VL:I+2*VL) on iteration I+1 intersect.
There is no dependency because the section A(I-1:I-1+VL) on iteration I and the section A(I+VL:I+2*VL) on iteration I+1 do not intersect.
PROGRAM TEST_VEC
INTEGER,PARAMETER :: N=1000
#ifdef PERF
INTEGER,PARAMETER :: P=4
#else
INTEGER,PARAMETER :: P=3
#endif
INTEGER A(N)
DO I=1,N-P
  A(I+P)=A(I)
END DO
PRINT *,A(50)
END
A loop can be vectorized if the dependence distance is greater than or equal to the number of array elements that fit in a vector register.
build.sh:
ifort test.F90 -o a.out -vec_report3
echo -------------------------------------
ifort test.F90 -DPERF -o b.out -vec_report3

Output of ./build.sh:
test.F90(11): (col. 1) remark: loop was not vectorized: existence of vector dependence.
-------------------------------------
test.F90(11): (col. 1) remark: LOOP WAS VECTORIZED.
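The distance criterion can be checked by emulating vector execution by hand. The following is a minimal sketch in C, not the compiler's actual analysis. The names shift_scalar, shift_vector and vectorization_is_safe, the array length N=64 and the vector length VL=4 are all illustrative assumptions. It runs the loop A(I+P)=A(I) once element by element and once in chunks of VL elements (load the whole chunk, then store the whole chunk), and compares the results.

```c
#include <assert.h>
#include <string.h>

#define N  64   /* array length (assumption; smaller than N=1000 in the test) */
#define VL 4    /* emulated vector length: four 32-bit integers per 128-bit register */

/* Reference: the scalar loop a[i+p] = a[i], one element at a time. */
static void shift_scalar(int *a, int n, int p) {
    for (int i = 0; i + p < n; i++)
        a[i + p] = a[i];
}

/* Emulated vector loop: each step loads VL elements, then stores VL elements. */
static void shift_vector(int *a, int n, int p) {
    int i = 0;
    for (; i + p + VL <= n; i += VL) {
        int tmp[VL];
        for (int k = 0; k < VL; k++) tmp[k] = a[i + k];       /* vector load  */
        for (int k = 0; k < VL; k++) a[i + p + k] = tmp[k];   /* vector store */
    }
    for (; i + p < n; i++)                                    /* scalar remainder */
        a[i + p] = a[i];
}

/* Returns 1 if the emulated vector loop matches the scalar loop for distance p. */
int vectorization_is_safe(int p) {
    int s[N], v[N];
    assert(p > 0 && p < N);
    for (int i = 0; i < N; i++) s[i] = v[i] = i;
    shift_scalar(s, N, p);
    shift_vector(v, N, p);
    return memcmp(s, v, sizeof s) == 0;
}
```

Under these assumptions, vectorization_is_safe(4) returns 1 and vectorization_is_safe(3) returns 0, which mirrors the two compiler remarks for P=4 and P=3.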
Dependency analysis and directives
The compiler must perform two tasks to evaluate dependencies:
- Alias analysis (pointers which can address the same memory should be detected)
- Definition-use chains analysis
The compiler must prove that no objects are aliased and must calculate the dependencies precisely. This is a hard task, and sometimes the compiler is unable to solve it.
There are methods of providing additional information to the compiler:
- Option -ansi_alias (pointers can refer only to objects of the same or a compatible type).
- restrict attributes for pointer arguments (C/C++).
- #pragma ivdep tells the compiler that there are no dependencies in the loop that follows (C/C++).
- !DEC$ IVDEP is the Fortran analogue of #pragma ivdep.
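The two C/C++ hints from the list above can be sketched as follows. The function names add_one and scale_shifted are hypothetical. Note that #pragma ivdep is specific to the Intel compiler; other compilers may ignore it with a warning or require their own spelling. In both functions the promise is made by the programmer, and the compiler does not check it.

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises that dst and src never overlap, so the compiler
   does not have to assume aliasing between the two pointers. */
void add_one(int *restrict dst, const int *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* k is only known at run time. A negative k would create a real
   loop-carried dependence, so the compiler has to assume one may exist.
   The programmer knows k >= 0 and states it with ivdep.
   The array a must hold at least m + k elements. */
void scale_shifted(int *a, int k, int c, int m) {
    assert(k >= 0);   /* the assumption that makes ivdep legitimate here */
#pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
```

If the promise is false (overlapping arguments to add_one, or a negative k), the vectorized loop may silently produce wrong results, so these hints should be used only when the property is guaranteed by construction.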
Some performance issues for the vectorized code
INTEGER :: A(1000),B(1000)
INTEGER I,K
INTEGER, PARAMETER :: REP = 500000
A = 2
DO K=1,REP
  CALL ADD(A,B)
END DO
PRINT *,SHIFT,B(101)
CONTAINS
SUBROUTINE ADD(A,B)
  INTEGER A(1000),B(1000)
  INTEGER I
  !DEC$ UNROLL(0)
  DO I=1,1000-SHIFT
    B(I) = A(I+SHIFT)+1
  END DO
END SUBROUTINE
END
Consider a simple test with an assignment that is suitable for vectorization. We build vectorized code with the Intel Fortran compiler for different values of the SHIFT macro.
The /fpp option enables the preprocessor.
The Intel compiler performs vectorization at optimization levels 2 and 3 (-O2 or -O3).
The -Ob0 option is used to disable inlining.
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 -Fea.exe -Qvec_report >a.out 2>&1
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 -Feb.exe -Qvec_report >b.out 2>&1

time.exe a.exe
0 3
CPU time for command: 'a.exe'
real 0.125 sec
user 0.094 sec
system 0.000 sec

time.exe b.exe
1 3
CPU time for command: 'b.exe'
real 0.297 sec
user 0.281 sec
system 0.000 sec
ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=0 /Fas -Ob0 -S -Fafast.s

fast.s
.B2.5:                                            ; Preds .B2.5 .B2.4
$LN83:
;;; B(I) = A(I+SHIFT)+1
        movdqa xmm1, XMMWORD PTR [eax+ecx*4]      ;17.11
$LN84:
        paddd xmm1, xmm0                          ;17.4
$LN85:
        movdqa XMMWORD PTR [edx+ecx*4], xmm1      ;17.4
$LN86:
        add ecx, 4                                ;16.3
$LN87:
        cmp ecx, 1000                             ;16.3
$LN88:
        jb .B2.5 ; Prob 99%                       ;16.3

ifort test1.F90 -O2 -Ob0 /fpp /DSHIFT=1 /Fas -Ob0 -S -Faslow.s

slow.s
.B2.5:                                            ; Preds .B2.5 .B2.4
$LN81:
;;; B(I) = A(I+SHIFT)+1
        movdqu xmm1, XMMWORD PTR [4+eax+ecx*4]    ;17.11
$LN82:
        paddd xmm1, xmm0                          ;17.4
$LN83:
        movdqa XMMWORD PTR [edx+ecx*4], xmm1      ;17.4
$LN84:
        add ecx, 4                                ;16.3
$LN85:
        cmp ecx, 996                              ;16.3
$LN86:
        jb .B2.5 ; Prob 99%                       ;16.3
CONCLUSION:
MOVDQA—Move Aligned Double Quadword
MOVDQU—Move Unaligned Double Quadword
In the fast version, aligned instructions are used, so the vector registers are filled faster.
Unaligned instructions are slower. On the latest architectures, they show the same performance as aligned instructions when applied to aligned data.
The performance of a vectorized loop depends on where in memory the objects it uses are located. Memory alignment of the data is therefore an important aspect of program performance.
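Whether an address qualifies for an aligned 16-byte load such as MOVDQA can be tested with simple address arithmetic. The following is a minimal C sketch. The helper is_aligned and the buffer a_buf are hypothetical names, and the code assumes a C11 compiler (for _Alignas) and 4-byte int.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p lies on a `boundary`-byte boundary.
   boundary must be a power of two (16 for 128-bit SSE registers). */
int is_aligned(const void *p, uintptr_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)p & (boundary - 1)) == 0;
}

/* Hypothetical buffer placed on a 16-byte boundary, in the role of array A. */
_Alignas(16) int a_buf[1000];
```

With these assumptions, &a_buf[0] is 16-byte aligned, while &a_buf[1] lies 4 bytes past the boundary. This corresponds to the SHIFT=1 case, where the load address [4+eax+ecx*4] is offset by one element and the compiler falls back to MOVDQU.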
Data structure alignment describes how data is placed in computer memory. The concept covers two distinct but related issues: data alignment and data structure padding.
Data alignment specifies how data is located relative to memory address boundaries. This property is usually associated with a data type.
Data structure padding is the insertion of unnamed fields into a data structure in order to preserve the alignment of the structure's fields.
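Padding can be observed directly with sizeof and offsetof. The following is a minimal C sketch. The structure sample and the function padding_bytes are hypothetical, and the exact sizes depend on the compiler and ABI.

```c
#include <assert.h>
#include <stddef.h>

struct sample {
    char tag;     /* 1 byte */
    int  value;   /* usually preceded by padding so that it is aligned */
    char flag;    /* usually followed by tail padding */
};

/* Number of unnamed padding bytes the compiler inserted into struct sample. */
size_t padding_bytes(void) {
    size_t payload = sizeof(char) + sizeof(int) + sizeof(char);
    assert(sizeof(struct sample) >= payload);
    return sizeof(struct sample) - payload;
}
```

On common x86 ABIs with 4-byte int, one would expect sizeof(struct sample) to be 12 and padding_bytes() to return 6, but this is an assumption about the platform. Ordering fields from largest to smallest alignment usually reduces the padding.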