scientific computing - optimize speed

Greetings,

This is a longer thread, so thank you in advance to anyone who takes the time. It may be that nobody can help without seeing the whole GitHub repository (which I can't share at the moment), but I will still try to summarize the issues.

I need some help with optimizing my simulation code for speed.


Particles move in a cubic simulation box, with their positions and velocities being updated in a loop. In each iteration, a force is computed (the most expensive part of the simulation), for which the distances between all particle pairs are required, like
    for (int i = 0; i < model.N_particles; ++i) {
        for (int j = i + 1; j < model.N_particles; ++j) {
        
            // get distance between particle i and j
            
            // compute interaction between i and j 
        
        }
    }


I have implemented the whole simulation in an OO setup, where the force loop uses OpenMP multithreading (for which I had gotten help here: https://legacy.cplusplus.com/forum/beginner/285710/). I am quite happy with the implementation of the full simulation, except for one thing: I have two different code versions, one for the one-dimensional case (where particles move in 1D) and one for the two-dimensional case. I wanted to unify these into a single code where the dimension is passed via `argv`. This turned out to be surprisingly difficult to do without losing a lot of speed (due to dynamic memory allocation).

More details:
In the fixed dimension case, I work with a struct to store positions, velocities, forces etc.:
struct coordinate{ 
    double x{0};
    double y{0};
};


This is the multithreaded force function:
inline void simulation::compute_force_par()    // part of simulation class
{
    // forces_for_all_tasks holds one force vector per thread.
    for (auto& forces_for_specific_task : forces_for_all_tasks)
        std::fill(forces_for_specific_task.begin(), forces_for_specific_task.end(), coordinate{0, 0});

    // iterate over particle pairs
    #pragma omp parallel for schedule(dynamic) num_threads(THREADS)
    for (int i = 0; i < model.N_particles; ++i) {
        for (int j = i + 1; j < model.N_particles; ++j) {

            coordinate distance = model.get_distances_ij(i, j);             // get distance from model
            coordinate force_ij = (model.*(model.get_force_ij))(distance);  // compute interaction

            // accumulate into this thread's private force vector
            std::vector<coordinate>& forces_for_this_task = forces_for_all_tasks[omp_get_thread_num()];

            forces_for_this_task[i].x += force_ij.x;
            forces_for_this_task[i].y += force_ij.y;
            forces_for_this_task[j].x += -force_ij.x;  // Newton's third law: F_ji = -F_ij
            forces_for_this_task[j].y += -force_ij.y;
        }
    }

    // Sum all of the task-specific forces into the output parameter
    std::fill(model.forces.begin(), model.forces.end(), coordinate{0, 0});
    for (auto const& forces_for_specific_task : forces_for_all_tasks)
        for (int i = 0; i < model.N_particles; ++i)
        {
            model.forces[i].x += forces_for_specific_task[i].x;
            model.forces[i].y += forces_for_specific_task[i].y;
        }
}

The distance is computed via
inline coordinate IPS_model::get_distances_ij(const size_t i, const size_t j){  // part of IPS_model class

    double dx {positions[i].x - positions[j].x};   // positions is a member of IPS_model, of
    double dy {positions[i].y - positions[j].y};   // type std::vector<coordinate>

    // take care of periodic boundaries...

    coordinate distances {dx, dy};

    return distances;
}

In particular, I get away with creating a `coordinate` object from scratch each time the function is run.
The ugly syntax for the interaction function pointer, `coordinate force_ij = (model.*(model.get_force_ij))(distance);`, is there because I have several possible interaction functions `get_force_ij` which are chosen at runtime (also passed via `argv`); their inner structure is similar to `get_distances_ij`.

Now, the problem with creating a version that works for d dimensions is that the struct coordinate cannot be defined like this, because the number of double members it needs to store is not known at compile time. Naively, in a first attempt, I just got rid of that struct altogether and used std::vector&lt;double&gt; instead. This, however, led to code that was 10 times slower!! The reason for that seemed to be two-fold:
1) Recreating a std::vector inside get_distances_ij, instead of the line `coordinate distances {dx, dy};`, seems to be much more expensive. I think this is because the memory for the vector is allocated dynamically there, as opposed to the struct, whose size is known at compile time.
2) The `positions`, `velocities`, and `forces` vectors are no longer of type std::vector&lt;coordinate&gt; (each coordinate being a struct holding two doubles) but std::vector&lt;std::vector&lt;double&gt;&gt;. Apparently this setup is less cache-friendly, as the inner vectors are not stored contiguously (as opposed to the double values of the struct).

I then created a version where no vectors were recreated in the tight double loop, only filled with elements. I also turned the 2D vectors into flattened 1D vectors to improve cache friendliness. That led to a much faster version, but it was still twice as slow as the original one.

(end of post for word length limit)
I asked ChatGPT for advice, and it recommended using a std::array-based structure like
constexpr size_t MAXDIM = 3; // Maximum allowed dimension.

class coordinate {

    public:
        
        // Default constructor.
        coordinate() 
            : dimension {0} {
                coords.fill(0.0); // Initialize all elements to 0.0.
        }

        coordinate(size_t dim)
            : dimension {dim} {
                coords.fill(0.0);
        }

        // Method to set the runtime dimension.
        void set_dimension(size_t dim) {
            if (dim > MAXDIM) {
                throw std::out_of_range("Dimension exceeds MAXDIM.");
            }
            dimension = dim;
        }

        // Access operator for reading/writing within the runtime dimension.
        double& operator[](size_t index) {
            // if (index >= dimension) {
            //     throw std::out_of_range("Index exceeds runtime dimension.");
            // }
            return coords[index];
        }

        const double& operator[](size_t index) const {
            // if (index >= dimension) {
            //     throw std::out_of_range("Index exceeds runtime dimension.");
            // }
            return coords[index];
        }

        // Getter for the runtime dimension.
        size_t size() const { return dimension; }

    private:
        std::array<double, MAXDIM> coords; // Fixed-size array.
        size_t dimension;                  // Actual dimension.

};

The trick here is that while the actual dimension is passed at runtime, one always allocates memory for the maximum allowed dimension, so that the size of the struct is known at compile time.

These are the resulting force routines:
inline void simulation::compute_force_par()
{
    #pragma omp parallel num_threads(THREADS)
    {
        thread_local coordinate distance(model.dimension);
        thread_local coordinate force_ij(model.dimension);

        #pragma omp for schedule(dynamic)
        for (int i = 0; i < model.N_particles; ++i) {
            for (int j = i + 1; j < model.N_particles; ++j) {

                model.get_distances_ij(i, j, distance);
                (model.*(model.get_force_ij))(distance, force_ij);

                std::vector<coordinate>& forces_for_this_task = forces_for_all_tasks[omp_get_thread_num()];

                for (size_t dim = 0; dim < model.dimension; ++dim){
                    forces_for_this_task[i][dim] += force_ij[dim];
                    forces_for_this_task[j][dim] += -force_ij[dim];
                }
            }
        }
    }

    // Sum all of the task-specific forces into the output parameter.
    std::fill(model.forces.begin(), model.forces.end(), coordinate(model.dimension));
    for (auto const& forces_for_specific_task : forces_for_all_tasks)
        for (int i = 0; i < model.N_particles; ++i)
            for (size_t dim = 0; dim < model.dimension; ++dim){
                model.forces[i][dim] += forces_for_specific_task[i][dim];
            }
}


inline void IPS_model::get_distances_ij(const size_t i, const size_t j, coordinate& distances){

    const double two_L {2*L};   // presumably for the elided periodic-boundary handling
    double dx;

    for (size_t dim = 0; dim < dimension; ++dim){
        dx = positions[i][dim] - positions[j][dim];
        // take care of periodic boundaries...
        distances[dim] = dx;
    }
}


ChatGPT promised that this would be the most efficient implementation possible. Yet the code is 2 to 3 times slower than the original version with the compile-time-known dimension.

I have absolutely no clue why there is such a huge difference in speed.


I compile with `g++ -O0` and use a single thread to run it.

Does anyone have any idea what's going on or what the best way to handle this is?
Using `perf`, I verified that 98% of the work actually comes from the force routine. These are the first lines of the `perf` record:
The new version with flexible dimension:
+   99.46%     0.00%  IPS_stdarr.out  libgomp.so.1.0.0      [.] GOMP_parallel                                                                                                                                                                                    
+   99.46%     0.00%  IPS_stdarr.out  [unknown]             [.] 0x00000000000003e8                                                                                                                                                                               
+   99.46%     0.00%  IPS_stdarr.out  [unknown]             [.] 0x0000000000000003                                                                                                                                                                               
+   99.41%     0.00%  IPS_stdarr.out  [unknown]             [.] 0x0000560a677dc290                                                                                                                                                                               
+   99.15%    10.41%  IPS_stdarr.out  IPS_stdarr.out        [.] simulation::compute_force_par() [clone ._omp_fn.0]                                                                                                                                               
+   39.23%    12.42%  IPS_stdarr.out  IPS_stdarr.out        [.] coordinate::operator[](unsigned long)                                                                                                                                                            
+   31.86%    13.47%  IPS_stdarr.out  IPS_stdarr.out        [.] IPS_model::get_distances_ij(unsigned long, unsigned long, coordinate&)                                                                                                                           
+   31.31%     9.97%  IPS_stdarr.out  IPS_stdarr.out        [.] IPS_model::get_force_ij_gauss(coordinate const&, coordinate&)                                                                                                                                    
+   26.26%    13.52%  IPS_stdarr.out  IPS_stdarr.out        [.] std::array<double, 3ul>::operator[](unsigned long)                                                                                                                                               
+   18.38%    13.56%  IPS_stdarr.out  IPS_stdarr.out        [.] std::__array_traits<double, 3ul>::_S_ref(double const (&) [3], unsigned long)                                                                                                                    
+   16.19%     5.31%  IPS_stdarr.out  IPS_stdarr.out        [.] coordinate::operator[](unsigned long) const                                                                                                                                                      
+   11.72%     6.33%  IPS_stdarr.out  IPS_stdarr.out        [.] std::array<double, 3ul>::operator[](unsigned long) const                                                                                                                                         
+    7.03%     4.76%  IPS_stdarr.out  IPS_stdarr.out        [.] std::vector<coordinate, std::allocator<coordinate> >::operator[](unsigned long)  


Original 2D version:
+   98.82%     0.00%  IPS_2D_perf.o  [unknown]             [.] 0000000000000000
+   98.82%     0.00%  IPS_2D_perf.o  [unknown]             [.] 0x00000000000003e8
+   98.82%     0.00%  IPS_2D_perf.o  [unknown]             [.] 0x0000000000000003
+   98.82%     0.00%  IPS_2D_perf.o  libgomp.so.1.0.0      [.] GOMP_parallel
+   96.32%    20.63%  IPS_2D_perf.o  IPS_2D_perf.o         [.] simulation::compute_force_par() [clone ._omp_fn.0]
+   32.42%    23.42%  IPS_2D_perf.o  IPS_2D_perf.o         [.] IPS_model::get_distances_ij(unsigned long, unsigned long)
+   23.85%    16.33%  IPS_2D_perf.o  IPS_2D_perf.o         [.] std::vector<coordinate, std::allocator<coordinate> >::operator[](unsigned long)
+   19.33%    17.45%  IPS_2D_perf.o  IPS_2D_perf.o         [.] IPS_model::get_force_ij_gauss(coordinate)
+   10.22%     9.25%  IPS_2D_perf.o  libm.so.6             [.] __ieee754_exp_fma
+    7.65%     5.66%  IPS_2D_perf.o  libm.so.6             [.] exp@@GLIBC_2.29
The time difference disappears when using the optimization flag -O3.

It seems to be normal that more abstract code runs slower without optimizations? Then I assume a "good" abstraction is one that the compiler can optimize away.
I guess that was a mostly rhetorical question, but yes, abstract code will normally run slower without optimizations, so a "good" abstraction is one that has the least overhead, or one that a compiler can optimize away. But unfortunately, depending on your setup and hardware, there may be times when you have to debug or run the unoptimized version of the software.

Btw: small thing, probably won't make a difference: if your "double dx" variable is only needed in the inner scope, then define it there, too. Often the compiler can reason about things better if they're in the smallest scope possible. Exceptions to this are things like branching, which the compiler cannot always account for. There are tons of YouTube C++ talks about compiler optimizations and what the compiler can and can't reason about; I like Mike Acton's CppCon talk in particular.

Yes, a std::vector&lt;std::vector&lt;whatever&gt;&gt; is almost certainly going to be slow af. There's so much discontiguous, separately allocated data that it will grind your cache to a halt. I don't have a lot of experience in OMP so I don't know how to help there.
Thanks for the response Ganado.

Weirdly, even when optimizing with `-O3`, the original 1D version is slower than the new version, by 30%.

The original version uses:
struct coordinate{
    double x;  // 1 coordinate for the one-dimensional case.
};


New version
constexpr size_t MAXDIM = 3; // Maximum allowed dimension.

class coordinate {

    public:
        
        // Default constructor.
        coordinate() 
            : dimension {0} {
                coords.fill(0.0); // Initialize all elements to 0.0.
        }

        coordinate(size_t dim)
            : dimension {dim} {
                coords.fill(0.0);
        }

        // Method to set the runtime dimension.
        void set_dimension(size_t dim) {
            if (dim > MAXDIM) {
                throw std::out_of_range("Dimension exceeds MAXDIM.");
            }
            dimension = dim;
        }

        // Access operator for reading/writing within the runtime dimension.
        double& operator[](size_t index) {
            return coords[index];
        }

        const double& operator[](size_t index) const {
            return coords[index];
        }

        // Getter for the runtime dimension.
        size_t size() const { return dimension; }

    private:
        std::array<double, MAXDIM> coords; // Fixed-size array.
        size_t dimension;                  // Actual dimension.
};

The new version is even slower if I set `constexpr size_t MAXDIM = 1`. So using an array with a single element has surprisingly large overhead compared to simply using a named double x.
A double x does not even have to exist: it can live in a register, unless it persists across a wide enough swath of code that the compiler needs to spill it to memory. With an array of a single element, the compiler may not be set up to optimize that back down to a plain double; instead it may use pointer math to access it, and odds are it will fail to boil it down to just a register.

As for vector of vector, it's about how you access it. The computer is a really simple thing inside. You want to access your data linearly, in order, in blocks. That is, like an array... array[0], array[1], ... is how you want to access it. Random access means cache problems, and convoluted constructs like looking at a small number of the columns in a 2D array are bad. Look at how that accesses memory: with array[10][10], to access the third column, it skips 10 values every time to get to the next one, and if those 10 values are big enough or you have a lot of them, that leads to page faults and memory-access slowdowns. Transpose the data so you instead go across a row, and it will go much faster. Or, in other situations, similar techniques to ensure linear access yield great results. Sometimes you split the objects up and have 1D arrays of object fields that you can iterate linearly, so the object itself is made up of field1[object_id_num], field2[object_id_num], ... fieldN[...] instead.

Sometimes all you need is to redo the accessing loops so they do it more efficiently instead of 'how the human thinks about the problem'. Sometimes you have to redo the data structures. But linear access is what the machine does best.
In the fixed dimension case, I work with a struct to store positions, velocities, forces etc.:
struct coordinate{ 
    double x{0};
    double y{0};
};

... and then you can have vector<coordinate> positions;
An alternative is:
vector<double> pos_x;
vector<double> pos_y;


Depending on hardware and the (random) access pattern of the iteration, one or the other may make more efficient use of the cache.

---

"Scientific computing" does also use "math". If floats are in the picture, then "numerical stability" is also a concern. In a computer, a+b+c is not always the same as c+b+a, let alone the same thing computed with a CPU and a GPU. Precision is ... naughty.


@keskiverto,
yes I am using std::vector<coordinate>.

I have now written the whole simulation in terms of a template `template <size_t DIMENSION>` and use the following object to store the coordinates of a single particle:
 
template <size_t DIMENSION>
class coordinate {

    public:
        
        // Default constructor.
        coordinate() 
        {
                coords.fill(0.0); // Initialize all elements to 0.0.
        }

        // Access operators for reading/writing.
        double& operator[](size_t index) {
            return coords[index];
        }

        const double& operator[](size_t index) const {
            return coords[index];
        }

    private:
        std::array<double, DIMENSION> coords; // Fixed-size array.

};



This works quite well. It is faster for the 2D and 3D cases and slightly slower for the 1D case, but this is OK.

@jonnin thanks for the advice. I wonder how I can avoid using std::vector&lt;std::vector&lt;coordinate&gt;&gt; in the force routine. Each coordinate denotes one particle. The 2D vector is of shape (T, N), where T is the number of threads and N the number of particles. While the dimension will only ever be 1, 2 or 3 (so I can treat it with a template and make it known at compile time), the parameters T and N are arbitrary integers read in through argv at runtime.

It seems like an obvious step would be to make the vector one-dimensional, i.e. use std::vector&lt;coordinate&gt;(N*T).
That ^^ would make sure that it's a single, linear chunk of memory, but I am not convinced it's necessary. Going to assume a bit, but assuming that each thread is on its own CPU core with its own cache and all, then it's OK if each thread just gets its own solid block of memory to chew on and the outer level can be fragmented; it won't change anything.

But if that is not the case, and the cache/memory follows some other model or the threads don't relate 1-to-1 to CPU cores, then maybe collapsing to 1D is a good idea. I hate to tell you to do it and have it have no effect, but on the flip side, doing it should not have any negative effect, so the worst-case scenario of trying it is about the same result as you get now, with a small chance, depending on details, that it could run faster.

I hate to say it, because it will make everyone's backside seize up, but ... if it were me, I would try not having coordinate at all. Your object does absolutely nothing except add a layer of crap on top of an array. It may not have any effect on the performance, but then again, it may. I personally would start there, and depending on whether you USE ANY of the std::array features, I would even consider going C array here if you do not use those tools. Again, I can't promise it will be faster doing this, because I don't know how much scaffolding the compiler inserted for the standard array object and again for your wrapper around it, but if it's more than zero, there was some small cost to it, and where that may bite you is just in accessing the data and whatever the compiler ended up with there in assembly. I dunno if it has a real name, but I think of it as helping the compiler: simplify the code where it matters, in the hope that the less confusion/complexity you feed it, the less chance it has to make a mess. So by going C array, you limit the compiler's desire to do anything weird with memory, pointers, hidden bloat, ... It does not always work. Compilers are really good, and the C++ objects have been streamlined well, so there may not be anything to eliminate or discourage.
In absolute time, how long does it take to run? Are we talking minutes or months here?

Next question, how often do you need to run it? Once a millisecond (it's a game) or once a month (it's for the bean counters)?

https://en.wikipedia.org/wiki/Big_O_notation
You have nested loops, so right away you're looking at O(N^2) algorithms.

If you want it to be fundamentally quicker, you need a better algorithm, not a better implementation.

The overall time is measured by a function looking like C*N^2, where C is some measure of your implementation efficiency and N is your data size. Turning on the optimiser, changing std::vector to std::array, or running it in parallel on some small finite number of cores means you're just messing with C. This is where your "30%" or "50%" improvements come from.

If you want to turn months into minutes, you need to tackle that "squared" part of the problem, because this is the only thing that matters when the amount of data gets large enough.

Since you seem to be modelling forces between particles, perhaps https://en.wikipedia.org/wiki/Binary_space_partitioning

As the distance increases, the more you can aggregate the effect of one localised group of objects on some other more distant localised group.

So if you want it to be fast, you make groups larger and the minimum aggregate distance smaller.

If you want precision, you make the groups smaller and the minimum aggregate distance larger. The limit of this is what you have now, every particle is in a group by itself, and the distance is infinite.

An extreme example: you have two stars 10Ly apart. You don't model 10^60 hydrogen atoms, you model two very heavy point-like objects.
@jonnin
yes, every thread is supposed to be worked on by one dedicated core. I did indeed try using C arrays rather than std::array and found that it does make a small difference (despite reading everywhere that it shouldn't, but I guess much C++ advice is not targeting the HPC case). Thanks for your insight!

@salem
Cheers for the response. How large the systems and the required number of iterations will be depends on the user (we want to turn it into a little app for other researchers in the field to use). I am aware of the N^2 complexity of the algorithm and that one could save a lot by finding a better scheme to iterate over the pairs. For now, this thread was just about making the implementation as efficient as possible - in particular since the risk that I use the language in an inefficient way is still there :)
If a better way to iterate exists, finding it will pay off a lot more than tinkering with the code to eke out a little speed. When I was in high school, we coded all the sorts on a very old computer... I think it was to sort like 1000 or maybe 10000 items. Bubble sort ran overnight, and gave its time back the next morning as several hours. Quicksort came back in minutes for the same data.

I don't know the math and approaches here. I have some physics, but it's more ballistics than particles, and my subatomic/CFD/etc type programming days are long behind me. But doing something the wrong way faster is never going to work as well as doing something the right way slowly. That is, the worst-written but working and algorithmically correct quicksort in the world (without adding sleeps or something silly to prove me wrong) will never be beaten by the best-written, fastest bubblesort in the world on a large data set.

I am just going to assume you know what you are doing here, and if you tell me the best it can be is N*N, then so be it. If I notice some way to do it otherwise, I will tell you, but I am not going to try to solve that. If we are just polishing a bubblesort, though, ... you are wasting your time and ours. I don't have a lot to do, though :)

> For now, this thread was just about making the implementation as efficient as possible

How about starting a new thread where we discuss the algorithm?

Because if jonnin's (and my) suspicions are right, you've basically implemented a bubble sort.

As the saying goes, it's time to work smarter, not harder.

What we would likely need:
- an example data set, say a .csv of 10,000 x,y,z points and their corresponding dx,dy,dz velocities.
- your result after say 10,000 iterations, as another .csv file of final position and velocities.
- your reference implementation of your 'distance' and your 'force' functions.
- the opposite corners of "Particles move in a cubic simulation box", like 0,0,0 to 1E6,1E6,1E6

What happens when a particle hits a wall? A simple reflection?

Maybe this simulation could be moved to a graphics card (or cards).
Hi guys,

thanks for your interest and hints that focusing on the algorithm could lead to much greater speedups.

It is indeed the case that the force calculation can be written in a way that it becomes a matrix-vector product.
The procedure would be like
// Compute forces
for (int i=0; i<N_particles; ++i)
    force[i] = Matrix * vector      // could be done on GPU 


There are plans to create a GPU version as well.


Also, cheers for the hint about bubblesort. I have never heard bubblesort connected with molecular-dynamics force calculations, but I will check out whether I truly am just "implementing bubblesort".