INTUIT | Introduction to performance optimization using Intel SW tools. Lecture 5: Optimizing compiler. Vectorization


Abstract: The lecture reviews the basic principles of vectorization. The topics discussed are: the problems of automatic vectorization; the influence of data alignment and of the kind of memory access; the compiler options associated with the autovectorizer; preprocessor directives and language constructs related to vectorization; the vectorization profitability criterion.


The presentation can be downloaded here.

Fig. 5.1. The trend in microprocessor development

Different kinds of parallelism:

Pipelining
Superscalar
Vector operations
Multi-core and multiprocessor tasks

An optimizing compiler is a tool that translates source code into an executable module and optimizes it for better performance. Parallelization is the transformation of a sequential program into a multi-threaded or vectorized one (or both) so that it utilizes multiple execution units simultaneously without breaking the correctness of the program.

Fig. 5.2. Vectorization is an example of data parallelism (SIMD)

Fig. 5.3. The approximate scheme of loop vectorization

A typical vector instruction is an operation on two vectors held in memory or in fixed-length registers. These vectors can be loaded from memory by one or several operations.

Vectorization is a compiler optimization that replaces scalar instructions with vector instructions. The optimization "packs" the data into vectors; scalar operations are replaced by operations on these vectors (packets).

This optimization can also be performed manually by the developer.
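As a minimal sketch of what manual vectorization can look like, the C code below adds two float arrays first with a scalar loop and then with SSE intrinsics from <xmmintrin.h>. The function names add_scalar and add_sse are hypothetical and chosen only for this illustration, and unaligned loads and stores are assumed so that no alignment requirement is placed on the arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Scalar version: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Manually vectorized version (illustrative name): four additions per
   loop iteration, followed by a scalar loop for the remaining elements. */
void add_sse(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* load a[i..i+3] */
        __m128 vb = _mm_loadu_ps(b + i);           /* load b[i..i+3] */
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* c[i..i+3] = a + b */
    }
    for (; i < n; ++i)                             /* scalar tail */
        c[i] = a[i] + b[i];
}
```

Both functions produce the same result; the vectorized one simply processes four elements per instruction where the scalar one processes a single element.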

In Fortran, the array section A(1:n:k) is a very convenient notation for representing the contents of a vector register.
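For readers less familiar with Fortran, the section A(1:n:k) selects the elements A(1), A(1+k), A(1+2k), ... with index not greater than n. The C sketch below gathers such a section into a contiguous buffer; the function name copy_section is hypothetical, and the shift from 1-based Fortran indices to 0-based C indices is made explicit in the code.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the Fortran-style section A(1:n:k) into `out` (illustrative helper).
   Fortran indices are 1-based, so A(j) corresponds to a[j - 1] in C.
   `out` must have room for (n - 1) / k + 1 elements when n > 0.
   Returns the number of elements copied. */
size_t copy_section(const float *a, size_t n, size_t k, float *out)
{
    assert(k > 0);  /* a zero stride would never terminate */

    size_t count = 0;
    for (size_t j = 1; j <= n; j += k)
        out[count++] = a[j - 1];
    return count;
}
```

With k = 1 the section is a run of consecutive elements, which is the memory layout that maps most directly onto a vector register.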

Fig. 5.4.

MMX, SSE vector instruction sets

MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, announced in 1996 and introduced in 1997 with the P5-based Pentium line of microprocessors known as "Pentium with MMX Technology".

MMX (Multimedia Extensions) is a set of instructions that perform operations typical of streaming audio/video encoding and decoding.

MMX is:

MM0-MM7 64-bit registers (aliased onto the existing 80-bit x87 FP stack registers)
the concept of packed data (each register can hold one 64-bit integer, two 32-bit, four 16-bit, or eight 8-bit integers)
57 instructions divided into groups: data movement, arithmetic, comparison, conversion (pack and unpack), logical, shift, and state management.

MMX provides only integer operations. Because the MMX registers share their storage with the x87 FP registers, floating-point operations and packed integer operations cannot be used simultaneously.
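The idea of packed data can be sketched in portable C without any MMX intrinsics. The function below emulates a wrap-around packed addition on a 64-bit value that is treated as eight 8-bit, four 16-bit, two 32-bit, or one 64-bit lane. The name packed_add is hypothetical, and the code only models lane-wise behaviour; it is not how the hardware implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate a wrap-around packed add on one 64-bit "register" (illustrative).
   lane_bits selects how the 64 bits are split: 8, 16, 32 or 64 bits per lane.
   Each lane is added independently; a carry out of a lane is discarded. */
uint64_t packed_add(uint64_t x, uint64_t y, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16 ||
           lane_bits == 32 || lane_bits == 64);

    if (lane_bits == 64)
        return x + y;                       /* a single 64-bit lane */

    const uint64_t mask = (UINT64_C(1) << lane_bits) - 1;
    uint64_t result = 0;
    for (unsigned shift = 0; shift < 64; shift += lane_bits) {
        uint64_t sum = ((x >> shift) & mask) + ((y >> shift) & mask);
        result |= (sum & mask) << shift;    /* keep only this lane's bits */
    }
    return result;
}
```

For example, adding 0xFF and 0x01 with 8-bit lanes gives 0x00 in the lowest lane, because the carry does not propagate into the neighbouring lane, whereas the same addition with 16-bit lanes gives 0x0100.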

Streaming SIMD Extensions (SSE) is a SIMD instruction set extension of the x86 architecture, designed by Intel and introduced with the Pentium III series of processors in 1999. SIMD instructions can greatly increase performance when exactly the same operation is performed on multiple data objects. Typical applications are digital signal processing and computer graphics.

SSE is:

eight 128-bit registers (XMM0 to XMM7)
a set of instructions for operations on scalar and packed data types
70 new instructions, mainly for single-precision floating-point data

SSE2, SSE3, SSSE3 and SSE4 are further extensions of SSE.

SSE2 adds a packed double-precision floating-point data type.
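A minimal sketch of the SSE2 packed double-precision type, assuming the intrinsics from <emmintrin.h>: each 128-bit register holds two doubles, so the loop below handles two elements per iteration. The function name axpy_sse2 is hypothetical, and unaligned loads and stores are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_set1_pd, _mm_mul_pd, _mm_add_pd */

/* y[i] = alpha * x[i] + y[i], two doubles per iteration (illustrative name). */
void axpy_sse2(double alpha, const double *x, double *y, size_t n)
{
    assert(x != NULL && y != NULL);

    const __m128d valpha = _mm_set1_pd(alpha);   /* {alpha, alpha} */
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);        /* x[i], x[i+1] */
        __m128d vy = _mm_loadu_pd(y + i);        /* y[i], y[i+1] */
        _mm_storeu_pd(y + i, _mm_add_pd(_mm_mul_pd(valpha, vx), vy));
    }
    for (; i < n; ++i)                           /* scalar tail */
        y[i] = alpha * x[i] + y[i];
}
```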

Advanced Vector Extensions (AVX) is an extension of the x86 instruction set architecture for Intel and AMD microprocessors, proposed by Intel in March 2008. It was first supported by the Intel Sandy Bridge processors in Q1 2011 and by the AMD Bulldozer processors in Q3 2011.

AVX provides new features, new instructions and a new coding scheme.

The size of vector registers is increased from 128 to 256 bits (YMM0-YMM15).

The existing 128-bit instructions use the lower half of YMM registers.
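The wider registers can be sketched with the AVX intrinsics from <immintrin.h>, where a 256-bit register holds eight floats. The code below is an illustration under several assumptions: it relies on the GCC/Clang-specific target attribute and the __builtin_cpu_supports built-in so that it can be compiled without AVX enabled globally, and the function names add_avx_impl and add_f32 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>  /* AVX intrinsics: __m256, _mm256_loadu_ps, _mm256_add_ps */

/* GCC/Clang-specific attribute: only this function is compiled for AVX. */
__attribute__((target("avx")))
static void add_avx_impl(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {                 /* eight floats per YMM register */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                           /* scalar tail */
        c[i] = a[i] + b[i];
}

/* Use the AVX path only if the CPU reports AVX support at run time;
   otherwise fall back to a plain scalar loop. */
void add_f32(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    if (__builtin_cpu_supports("avx")) {         /* GCC/Clang built-in */
        add_avx_impl(a, b, c, n);
        return;
    }
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The run-time check matters because executing an AVX instruction on a processor that does not support it raises an illegal-instruction fault.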
