Ok, I finally had a chance to take a look. I was able to get a 25x speedup over baseline with only small changes. Because of how the program is structured there is plenty more room for improvement, but I think there are correctness issues that need to be addressed first. I used ny = nx = 128 the whole time.
First I looked at the measured problem areas. I replaced r2cfft1d and c2rfft1d to avoid resource leaks, nonessential memory allocations, and redundant work. Another resource leak, in Ecalc_dt, was removed by moving the free call above the return statement. After that, measurement showed that the program spent 90% of its execution time within potentialk's inner loop, particularly in the call to Eigen's colPivHouseholderQr, so that was changed to the marginally faster householderQr.
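For reference, the solver change inside potentialk amounts to something like the sketch below. I'm only showing the pattern: L is the name the code uses for the matrix, but the right-hand-side name b and the wrapper function are placeholders.

#include <Eigen/Dense>

// Sketch only: solve L * phi = b for one iteration of the inner loop.
Eigen::VectorXd solve_system(const Eigen::MatrixXd& L, const Eigen::VectorXd& b)
{
    // Before: column-pivoting Householder QR (more robust, slower).
    // Eigen::VectorXd phi = L.colPivHouseholderQr().solve(b);

    // After: plain Householder QR (no pivoting, a bit faster).
    Eigen::VectorXd phi = L.householderQr().solve(b);
    return phi;
}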
The iterations of the inner loop are independent, so I parallelized it. For simplicity, a new batch of worker threads is created each time potentialk is called, which isn't ideal, but it should be alright for now. There may still be substantial room for improvement here, because the processor spends roughly 90% of its time inside this loop stalled waiting on data rather than computing. For example, it might be beneficial to switch to float if you can afford the reduced precision. Additionally, the elements of L are linear functions of the loop variable l; maybe this can be exploited to avoid paying the full cost of factoring the matrix every time. Finally, Eigen has faster solvers than householderQr, but most of them have constraints on the kinds of matrices they can solve, and I don't know enough linear algebra to say whether they're suitable. See:
https://eigen.tuxfamily.org/dox/group__TutorialLinearAlgebra.html
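For what it's worth, the parallelization is nothing fancier than splitting the independent iterations across a batch of std::thread objects, roughly like the sketch below. do_one_iteration and the loop bounds are placeholders, not names from your program; the point is just the strided split plus join.

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Sketch: run the independent iterations l = 0 .. n-1 on all hardware threads.
// do_one_iteration stands in for the body of potentialk's inner loop.
void parallel_inner_loop(int n, const std::function<void(int)>& do_one_iteration)
{
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> workers;
    workers.reserve(nthreads);
    for (unsigned t = 0; t < nthreads; ++t)
    {
        workers.emplace_back([=]
        {
            // Each worker takes a strided slice of the iteration space.
            for (int l = static_cast<int>(t); l < n; l += static_cast<int>(nthreads))
                do_one_iteration(l);
        });
    }
    for (auto& w : workers)
        w.join();
}

Creating and destroying the threads on every call is the "not ideal" part; a persistent thread pool would avoid that overhead, but this version is the simplest thing that works.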
At this point the program spends roughly half of its wall-clock time inside potentialk. This corresponds to about 12 ms per call, or 50 ms per iteration of the main loop, and the processors are severely starved for data during this period. The rest of the main loop consumes the other 50 ms per iteration, and a good chunk of that time is spent in memset, probably because setZero is frequently called for no reason.
Then I looked at the rest of the program. As a whole, it does a lot of redundant work. For example, this snippet from EcalcPotSourcek_inertia:
Eigen::MatrixXd D2((ny + 1), nx);
D2.setZero();
D2 = D * D;
Could be written as

Eigen::MatrixXd D2 = D * D;

Note this still incurs a nonessential memory allocation, which can be eliminated if the matrix is allocated just once at the start of the program and then reused inside the loop:

D2 = D * D;
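If you want to go one step further, Eigen's noalias() writes the product straight into the preallocated matrix and skips the temporary that a plain product assignment would otherwise create. A minimal sketch, assuming D2 never aliases D (it doesn't here):

#include <Eigen/Dense>

// Sketch: D2 has already been sized once at startup; write D * D directly
// into it, with no per-call allocation, no setZero, and no hidden temporary.
void square_into(const Eigen::MatrixXd& D, Eigen::MatrixXd& D2)
{
    D2.noalias() = D * D;
}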
In addition, EcalcPotSourcek_inertia doesn't do anything as far as I can tell; in particular, it never appears to modify any of its arguments. Beyond that, it does a whole bunch of (unrelated?) work and throws away the results. Same deal with Ecalc_residualt, which overall seems very broken. I'd strongly recommend that you spend some time writing automated tests for each of your functions and do a thorough review of the entire program, focusing on correctness.
I think you're well on track to outperform your original program. A performance profile shows 18% of program time spent inside memset, which is mostly unnecessary; another 11% inside memcpy, also mostly unnecessary; 32% inside some implementation of the BLAS routine gemm; and 5% inside pow taking integer powers, which is trivial to remove. You'll also be interested to know that when the call to EcalcPotSourcek_inertia was removed from the main loop, the subsequent (slow) call to potentialk sped up by 2x, likely because of memory effects.
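The pow part is the easiest win: integer powers are cheaper written out as multiplications, or as square()/cube() on whole Eigen arrays. This is only illustrative, since I haven't listed the exact call sites:

#include <cmath>
#include <Eigen/Dense>

double scalar_square(double x)
{
    // double slow = std::pow(x, 2);   // goes through the generic pow machinery
    return x * x;                      // same result, far cheaper
}

Eigen::ArrayXd array_square(const Eigen::ArrayXd& a)
{
    // Elementwise square of a whole array without any pow calls.
    return a.square();
}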
As far as how to proceed, I would suggest doing a thorough code review & writing some tests. In case you've never done testing before, the most basic way to do it is:
1. Write a new C++ file with an alternate main function.
2. Call each function-under-test individually from this alternate main function, and compare each function's results against expected answers. Print out an error message if something's screwed up.
3. Compile against the "test" main function when you want to do tests; compile against the "real" main function when you want to run the program.
You can then run these tests whenever you want some assurance that your code is correct.
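A bare-bones version of that test file might look like the sketch below. function_under_test and the expected answer are placeholders you'd swap for your real functions and hand-checked results; the structure is the point.

#include <cstdio>
#include <Eigen/Dense>

// Placeholder for one of your real functions (e.g. Ecalc_dt or potentialk).
Eigen::MatrixXd function_under_test(const Eigen::MatrixXd& in)
{
    return in;   // placeholder body
}

// Compare two matrices within a tolerance, since floating-point results
// rarely match exactly.
static bool approx_equal(const Eigen::MatrixXd& a, const Eigen::MatrixXd& b,
                         double tol = 1e-9)
{
    return a.rows() == b.rows() && a.cols() == b.cols()
        && (a - b).cwiseAbs().maxCoeff() <= tol;
}

int main()
{
    int failures = 0;

    // One test case: a small input whose correct output you know by hand
    // (or from a trusted run of the same math elsewhere).
    Eigen::MatrixXd input    = Eigen::MatrixXd::Identity(4, 4);
    Eigen::MatrixXd expected = Eigen::MatrixXd::Identity(4, 4);   // placeholder

    if (!approx_equal(function_under_test(input), expected))
    {
        std::puts("FAIL: function_under_test(identity) gave the wrong answer");
        ++failures;
    }

    std::printf("%d test(s) failed\n", failures);
    return failures == 0 ? 0 : 1;
}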
As far as performance improvements are concerned, try to avoid unnecessary work. Let Eigen do your matrix math and get rid of the hand-written for loops. Stop reallocating matrices. Stop calling setZero by rote. Avoid computing things that aren't used later. Then you can look at changing the program to make better use of the machine's full capabilities: parallelization and so on.
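As an illustration of the "let Eigen do the matrix math" point, an elementwise loop collapses into a single expression that Eigen can vectorize. The names here are made up:

#include <Eigen/Dense>

// Hand-written version: element-by-element loop.
void scale_and_add_loop(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B,
                        double c, Eigen::MatrixXd& out)
{
    for (int j = 0; j < A.cols(); ++j)
        for (int i = 0; i < A.rows(); ++i)
            out(i, j) = c * A(i, j) + B(i, j);
}

// Eigen version: one expression, no explicit loop, no setZero needed beforehand.
void scale_and_add_eigen(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B,
                         double c, Eigen::MatrixXd& out)
{
    out = c * A + B;
}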
If you have clang++ available, try that. For me it offered a consistent 1.15x speedup over g++.
The exact compiler command I used is
clang++ -std=c++17 -D_CRT_SECURE_NO_DEPRECATE -Wall -Wextra -pedantic -Wno-comment -Wno-unused-parameter -Wno-unused-variable arithmeticFunctions3D.cpp spectralFunctions3D.cpp 2DPACMAN.cpp -I D:\opt\vcpkg\installed\x64-windows\include -lfftw3 -pthread -Ofast -march=native -ffast-math -flto -fno-color-diagnostics -isystem d:\jamieal\eigen -o bin/main.exe -LD:\opt\vcpkg\installed\x64-windows\lib -fuse-ld=lld
Hope this helps. The modified code is at
https://mbozzi.com/t1G8gXj.zip
along with this commentary, the original code, and patches.