AVX and other SIMD

Does anyone have an example of code that tends to generate AVX or other SIMD instructions in the resulting compiled assembly file?
I'm on Linux with g++, but can switch to clang if it's more likely to produce SIMD instructions...

I'm certainly curious if there's simple code in which these instructions tend to occur (A kind of hello world SIMD).

Is it as simple as adding two vectors, or is there more to it than that?

Would they have to be vectors of doubles, or could it also handle floats and ints?
Should I be careful about how many elements are in the vector/array, or is it more arbitrary than that?

Also, do they need to be enabled on the command line at compile time? I'm seeing -mtune=native as an option, if I can guarantee that the client computer is not less capable than my own.
https://www.codeproject.com/articles/874396/crunching-numbers-with-avx-and-avx

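I found this; I haven't used it, but maybe it's helpful for you?
For gcc and g++ (and clang), you can use -march=<cpu-type> to set the target CPU type. Note that -mtune only tunes the generated code for a given CPU; it does not enable additional instruction sets, so -march (e.g. -march=native) is the option you want here.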

See here for a list of available options:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

You'd have to pick one that supports AVX, so that the compiler is at least allowed to generate AVX instructions. But the resulting binary will then require AVX support from both the CPU and the OS.
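If the binary may end up on a machine you don't control, one possible way to guard against missing AVX support is a runtime check. The following is only a minimal sketch: it relies on the `__builtin_cpu_supports` builtin offered by GCC and Clang on x86, and the function name `cpu_has_avx2` is made up for this illustration.

```cpp
// Runtime check for AVX2 support, as a sketch.
// Assumption: GCC or Clang on an x86 target; on anything else this
// simply reports "not supported".
bool cpu_has_avx2()
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // The builtin returns a non-zero int when the feature is available.
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}
```

A program could call this once at startup and fall back to a non-AVX code path (or exit with a clear message) when it returns false. The function that performs the check must itself be compiled without requiring AVX, of course.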

Probably you want to use -O3 (or at least -O2) too, so that the compiler will extensively optimize your code.


If you want to know if a binary does actually contain AVX instructions, you can do something like this:
objdump -d my_program > disassembled.asm

Then simply check whether the assembly code contains any AVX instructions. A list can be found here:
https://docs.oracle.com/cd/E36784_01/html/E36859/gntbd.html


What code is likely to use AVX instructions, or SIMD instructions in general?

I think your best bet is a loop that performs the same computation on a long sequence (array) of elements.

(And make sure that those computations cannot be optimized away!)
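As a rough "hello world" for this, here is a sketch of such a loop. The function name `add_arrays` is just illustrative, and `__restrict` is a GCC/Clang extension (not standard C++) that promises the arrays do not overlap, which tends to make vectorization easier for the compiler. Because the function has external linkage and writes its results through a pointer, the compiler cannot simply optimize the work away.

```cpp
#include <cstddef>

// Element-wise sum: out[i] = a[i] + b[i].
// Every iteration does the same independent operation, which is the
// pattern auto-vectorizers look for.
void add_arrays(const float* __restrict a,
                const float* __restrict b,
                float* __restrict out,
                std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Compiling this with something like g++ -O3 -march=native -S and looking at the output for instructions such as vaddps should show whether the loop was vectorized on your machine. The element count does not need to be a multiple of the vector width; the compiler normally emits a scalar or masked remainder loop for the leftover elements. The same pattern works with double or int elements.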
This is old, but might still help:

[EDIT]
???

https://gcc.gnu.org/projects/tree-ssa/vectorization.html
Thanks guys, those are exactly the resources I needed.
The article from TheIdeasMan is going to keep me busy for a while.
Much appreciated.

Keskiverto, that just links back to this page... would love to see what it was meant to be.
You could try looking at the generated code for parallel algorithms in the C++ standard library.
For example:
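#include <numeric>    // std::reduce
#include <execution>  // std::execution::unseq (C++20)

alignas(16) int xs[1'000'000];

// Sum all elements; the unseq policy permits a vectorized reduction.
int f()
{
    return std::reduce(std::execution::unseq, xs, xs + 1'000'000);
}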

https://godbolt.org/z/xYTnEbdz1
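For comparison, a plain loop doing the same integer reduction can be pasted into the compiler explorer next to the std::reduce version. This is only a sketch with a made-up function name (`sum_ints`), not part of the example above. Because integer addition can be reordered freely, GCC and Clang will usually vectorize it at -O3 without any execution policy; a float or double sum is different, since reordering changes the rounding, so compilers generally won't vectorize it unless told that is acceptable.

```cpp
#include <cstddef>

// Plain-loop sum of n ints, written so the compiler is free to
// accumulate several partial sums in one SIMD register.
int sum_ints(const int* data, std::size_t n)
{
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```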