Published: 12.07.2012 | Access: free | Students: 355 / 24 | Rating: 4.00 / 4.20 | Duration: 11:07:00
Specialties: Programmer
Lecture 5:

Optimizing compiler. Vectorization


Three Sets of Switches to Enable Processor-specific Extensions

  1. Switches -x<EXT> like -xSSE4_1
    • Imply an Intel processor check
    • Run-time error message at program start when launched on a processor without <EXT>
  2. Switches -m<EXT> like -mSSE3
    • No processor check
    • Illegal instruction fault when launched on a processor without <EXT>
  3. Switches -ax<EXT> like -axSSE4_2
    • Automatic processor dispatch: multiple code paths
    • Processor check only available on Intel processors
    • Non-Intel processors take the default path
    • Default path is -mSSE2 but can be modified again by another -x<EXT> switch
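
The idea behind the third group of switches (several code paths selected by a run-time processor check) can be sketched by hand. This is only an illustration and not the code the Intel compiler generates. It assumes GCC or Clang on x86, since the __builtin_cpu_init and __builtin_cpu_supports builtins are specific to those compilers, and the names sum_default, sum_tuned and sum_dispatch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Default code path: plain scalar loop that runs on any processor. */
static float sum_default(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Stand-in for a path tuned for a newer instruction set.
   Hypothetical: to keep the sketch short it reuses the scalar loop. */
static float sum_tuned(const float *x, size_t n) {
    return sum_default(x, n);
}

/* Run-time dispatch: check the processor once per call and pick a path.
   The builtins below are GCC/Clang extensions (an assumption of this sketch). */
float sum_dispatch(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return sum_tuned(x, n);     /* processor-specific path */
    return sum_default(x, n);       /* default path */
}
```

Both paths must compute the same result; only the instructions used to get there differ.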

Simple estimation of vectorization profitability

A typical vector instruction operates on two fixed-length vectors held in memory or in registers. Such a vector can be loaded from memory in a single operation or piece by piece.
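
A minimal sketch of both kinds of load, using SSE intrinsics from xmmintrin.h. The function names add_ps and low_lane_product are hypothetical. Unaligned loads and stores are used so that the sketch makes no assumption about where the arrays are placed in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* c[i] = a[i] + b[i], four floats per vector instruction.
   n must be a multiple of 4. */
void add_ps(const float *a, const float *b, float *c, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* whole vector in one load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}

/* Partial load: _mm_load_ss reads a single float into the lowest
   element and sets the other three elements to zero. */
float low_lane_product(const float *p, const float *q) {
    __m128 vp = _mm_load_ss(p);
    __m128 vq = _mm_load_ss(q);
    return _mm_cvtss_f32(_mm_mul_ss(vp, vq));
}
```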

A description of the foundations of SSE technology can be found in Chapter 10, "Programming with Streaming SIMD Extensions (SSE)", of the "Intel 64 and IA-32 Architectures Software Developer's Manual" (Volume 1).

Microsoft Visual Studio supports a set of SSE intrinsics that allows you to use SSE instructions directly from C/C++ code. You need to include xmmintrin.h, which defines the vector type __m128 and vector operations.

For example, suppose we need to manually vectorize the following loop:

for(i=0;i<N;i++)
 for(j=0;j<N;j++)
  for(k=0;k<N;k++)
    c[i][j][k]=a[i][j][k]*b[i][j][k];


To do this:

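  1. set up (fill) the vector variables
  2. apply the multiply intrinsic to the vector variables
  3. write the results of the calculation back to memory

#include <stdio.h>
#include <xmmintrin.h>
#define N 40
int main() {
  float a[N][N][N], b[N][N][N], c[N][N][N];
  int i, j, k, rep;
  __m128 *xa, *xb, *xc;
  /* initialize the input arrays */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++) {
        a[i][j][k] = 1.0f;  b[i][j][k] = 2.0f;
      }
  for (rep = 0; rep < 10000; rep++) {
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
#ifdef PERF
        /* vector version: 4 floats per iteration, requires 16-byte alignment */
        for (k = 0; k < N; k += 4) {
          xa = (__m128*)&(a[i][j][k]);  xb = (__m128*)&(b[i][j][k]);
          xc = (__m128*)&(c[i][j][k]);
          *xc = _mm_mul_ps(*xa, *xb);
        }
#else
        /* scalar version (reconstructed) */
        for (k = 0; k < N; k++)
          c[i][j][k] = a[i][j][k] * b[i][j][k];
#endif
      }
  }
  printf("%f\n", c[N-1][N-1][N-1]);  /* use the result (reconstructed line) */
  return 0;
}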

An example illustrating the vectorization with SSE intrinsics

icl -Od test.c -Fetest1.exe
icl -Od test.c -DPERF -Fetest_opt.exe
time test1.exe

CPU time for command: 'test1.exe'

real    3.406 sec
user    3.391 sec
system  0.000 sec
time test_opt.exe

CPU time for command: 'test_opt.exe'

real    1.281 sec
user    1.250 sec
system  0.000 sec

Intel compiler 12.0 was used for this experiment

The resulting speedup is 2.7x.
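
The figure follows directly from the measured real times. For comparison, an SSE register holds four single-precision values, so a factor of 4 is a natural upper estimate for this loop. This rests on the simplifying assumption that a vector instruction costs about as much as its scalar counterpart. The symbols S and T are introduced here only for illustration.

```latex
S = \frac{T_{\mathrm{scalar}}}{T_{\mathrm{SSE}}}
  = \frac{3.406\ \mathrm{s}}{1.281\ \mathrm{s}} \approx 2.66,
\qquad
S_{\max} = \frac{128\ \mathrm{bits\ per\ register}}{32\ \mathrm{bits\ per\ float}} = 4
```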

This example uses memory access instructions that require 16-byte alignment. Here the arrays happened to be aligned by accident; in real code you must take care of alignment yourself.
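
One way to guarantee alignment is to request it in the declaration. The sketch below assumes a C11 compiler (the _Alignas specifier). The helper names is_aligned16 and mul_aligned are hypothetical, and one-dimensional arrays are used for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

#define N 40   /* multiple of 4 */

/* C11 _Alignas places each array on a 16-byte boundary. */
_Alignas(16) float a[N];
_Alignas(16) float b[N];
_Alignas(16) float c[N];

/* Returns 1 if p is suitable for aligned SSE loads and stores. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* c = a * b using aligned memory access instructions. */
void mul_aligned(void) {
    assert(is_aligned16(a) && is_aligned16(b) && is_aligned16(c));
    for (int k = 0; k < N; k += 4) {
        __m128 va = _mm_load_ps(&a[k]);   /* requires a 16-byte aligned address */
        __m128 vb = _mm_load_ps(&b[k]);
        _mm_store_ps(&c[k], _mm_mul_ps(va, vb));
    }
}
```

When alignment cannot be guaranteed, the unaligned variants _mm_loadu_ps and _mm_storeu_ps can be used instead.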

The same test, optimized by the compiler itself, shows the following result:

icl test.c -Qvec_report3 -Fetest_intel_opt.exe
time test_intel_opt.exe

CPU time for command: 'test_intel_opt.exe'

real 0.328 sec
user 0.313 sec
system 0.000 sec

Admissibility of vectorization

Vectorization is a permutation optimization: it changes the original execution order of the operations.

A permutation optimization is admissible if it preserves the order of dependencies. This gives a criterion for the admissibility of vectorization in terms of dependencies.

The simplest case is when there are no dependencies inside the processed loop.

In a more complicated case there are dependencies inside the vectorized loop, but their order is the same as in the original scalar loop.
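
These cases can be checked on a toy loop a[i] = a[i + off] + 1. The sketch below emulates vector execution in plain C: each block of VL iterations performs all its reads first and all its writes afterwards, the way a vector load followed by a vector store would. The vector length VL = 4 and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define N  16
#define VL 4   /* emulated vector length */

/* Original scalar order: one iteration after another. */
void run_scalar_order(float *a, int lo, int hi, int off) {
    for (int i = lo; i < hi; i++)
        a[i] = a[i + off] + 1.0f;
}

/* Emulated vector order: per block, all reads first, then all writes. */
void run_vector_order(float *a, int lo, int hi, int off) {
    for (int i = lo; i < hi; i += VL) {
        float tmp[VL];
        int len = (hi - i < VL) ? hi - i : VL;
        for (int v = 0; v < len; v++)
            tmp[v] = a[i + v + off] + 1.0f;   /* vector load and add */
        for (int v = 0; v < len; v++)
            a[i + v] = tmp[v];                /* vector store */
    }
}

/* Returns 1 if both execution orders produce identical arrays. */
int same_result(int lo, int hi, int off) {
    float s[N], v[N];
    assert(lo + off >= 0 && hi + off <= N);   /* stay inside the arrays */
    for (int i = 0; i < N; i++)
        s[i] = v[i] = 10.0f * (float)i;
    run_scalar_order(s, lo, hi, off);
    run_vector_order(v, lo, hi, off);
    return memcmp(s, v, sizeof s) == 0;
}
```

With off = 0 there is no dependency between iterations. With off = 1 each element is read before a later iteration overwrites it, and the vector order keeps that read-before-write order. With off = -1 each iteration reads the value written by the previous one, and the emulated vector order reads it too early, so the results differ.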

Let’s recall the definition of a dependency in a loop:

There is a loop dependency between statements S1 and S2 in a set of nested loops if and only if

  1. there are two loop nest iteration vectors i and j such that either i < j, or i = j and there is a path from S1 to S2 inside the loop body;
  2. statement S1 at iteration i and statement S2 at iteration j refer to the same memory location;
  3. at least one of these statements writes to this memory.
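
For a single loop the iteration vectors reduce to plain indices, and the three conditions can be checked by brute force. The sketch below assumes a loop body of the form S1: a[i + w] = ...; S2: ... = a[i + r]; with S1 textually before S2. The function name and its parameters are hypothetical.

```c
#include <assert.h>

/* Loop: for (i = 0; i < n; i++) { S1: a[i + w] = ...;  S2: ... = a[i + r]; }
   Returns 1 if there is a dependency from S1 to S2, 0 otherwise. */
int has_dependence_s1_s2(int n, int w, int r) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)          /* iteration that executes S1 */
        for (int j = i; j < n; j++)      /* condition 1: i < j, or i = j with S1 before S2 */
            if (i + w == j + r)          /* condition 2: same memory location */
                return 1;                /* condition 3 holds because S1 writes */
    return 0;
}
```

For example, w = 1, r = 0 gives a dependency with distance 1, while w = 0, r = 1 gives none in this direction: there S2 reads a[i + 1] before S1 writes it in the next iteration, so the dependency runs from S2 to S1.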