I've just currently been to multi-threading and kept myself revolving around it to learn and get through that. A couple questions about the code below:
#include <iostream>
#include <thread>
#include <vector>
#include <numeric>
void AccumulateRange(uint64_t& sum, uint64_t start, uint64_t end) {
sum = 0;
for (uint64_t i = start; i < end; ++i)
sum += i;
}
void printVector(const std::vector<uint64_t>& vec) {
for (auto v : vec)
std::cout << v << ' ';
std::cout << '\n';
}
int main() {
const uint64_t number_of_elements{ 1000000000 }; // one billion elements
const uint16_t number_of_threads{ 10 };
constint step{ number_of_elements / number_of_threads };
std::vector<std::thread> threads;
std::vector<uint64_t> partial_sums(number_of_threads);
for (int i = 0; i < number_of_threads; ++i)
threads.push_back(std::thread(AccumulateRange, std::ref(partial_sums[i]),
i * step, (i + 1) * step));
for (auto& t : threads)
if (t.joinable())
t.join();
// for (int i = 0; i < 10; ++i)
// AccumulateRange(partial_sums[i], i * step, (i + 1) * step);
uint64_t total = std::accumulate(partial_sums.begin(), partial_sums.end(), uint64_t(0));
printVector(partial_sums);
std::cout << "Total = " << total << '\n';
system("pause");
return 0;
}
1) Is it a good simple example of multi-threading in C++, please?
2) Here 10 other threads, than the main() thread, have been used to enhance performance. But the time needed to perform the task with or without those ten helper threads is the same! In either case, my PC takes about 6 or 7 seconds to carry out the task! Why!? I thought the program would be much faster with using those helper threads!
> const uint16_t number_of_threads{ 10 };
How many actual cores do you have?
> printVector(partial_sums);
You need to get the times without any I/O.
I/O is expensive, and I/O to your console screen is even more expensive than just writing to a file.
>How many actual cores do you have?
(corei-3 4160): 2 cores, 4 threads
> You need to get the times without any I/O. If you want to measure the effect of threads... https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now
I did these. The output:
With helper threads: 3.9 seconds
Without them: 4.15 seconds.
Still no big difference!
> With helper threads: 3.9 seconds
> Without them: 4.15 seconds.
> Still no big difference!
TBH, without showing your latest code, it's not that informative.
With cpu-bound thread(s), increasing the number of threads beyond the physical number of cores will not further increase performance and could degrade performance! Note that there is a performance hit in actually creating and setting up a thread. That's one reason why moving from single to multi-thread doesn't always give the expected performance improvement - and in some cases can make things worse! Also why using a multi-thread execution-policy of aware functions doesn't necessarily give improved performance.
your needs and whatnot vary, but after an analysis like the above, I would cut the threads where the linear progression stops -- so when 484/#threads isnt close to the time -- around 7 threads, and no more than 10, is where I would stop this one. You also should always leave 1 thread on the computer free unless its a special purpose and the only thing you care about is THIS program. Note that 12 threads is about 10x faster, not too far off linear.
@Salem C
> TBH, without showing your latest code, it's not that informative.
Here you are. You can comment out the threads part (line 28 to 35) and uncomment the for-loop on line 37 to see the difference in their performance times.
#include <iostream>
#include <thread>
#include <vector>
#include <numeric>
void AccumulateRange(uint64_t& sum, uint64_t start, uint64_t end) {
sum = 0;
for (uint64_t i = start; i < end; ++i)
sum += i;
}
void printVector(const std::vector<uint64_t>& vec) {
for (auto v : vec)
std::cout << v << ' ';
std::cout << '\n';
}
int main() {
const uint64_t number_of_elements{ 1000000000 }; // one billion elements
const uint16_t number_of_threads{ 10 };
constint step{ number_of_elements / number_of_threads };
std::vector<uint64_t> partial_sums(number_of_threads);
// record start time
auto start = std::chrono::high_resolution_clock::now();
std::vector<std::thread> threads;
for (int i = 0; i < number_of_threads; ++i)
threads.push_back(std::thread(AccumulateRange, std::ref(partial_sums[i]),
i * step, (i + 1) * step));
for (auto& t : threads)
if (t.joinable())
t.join();
// for (int i = 0; i < 10; ++i)
// AccumulateRange(partial_sums[i], i * step, (i + 1) * step);
uint64_t total = std::accumulate(partial_sums.begin(), partial_sums.end(), uint64_t(0));
// printVector(partial_sums);
std::cout << "Total = " << total << '\n';
// record end time
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end - start;
std::cout << diff.count() << " s\n";
system("pause");
return 0;
}
@Seeplus:
1) I completely agree with you and what you mentioned is exactly what I was thinking of when experiencing the execution of the program with or without those helper threads!
So in what kinds of programs do you step forward to using multi-threading, please?
2) And do you intend to use the number of threads, when using multi-threading, based on the number of actual cores? That is, since my system now uses two actual cores, so it's better to add only one extra thread, apart from main(), for execution performance when needed. Right?
@jonnin
I'm not sure I got what you meant completely, but will be thankful if you use my code to prove performance enhancement using multi threads, if you think the other way around my thinking above, . Unfortunately I can't understand JLBorges's code; it's too complex for me.
Unfortunately I can't understand JLBorges's code; it's too complex for me.
- I consider JL to be one of our best coders. It is very worth your time to go through it slowly and try to understand it, look up what you don't know, and ask if you still don't get it. You very likely WILL learn something from his posts.
-All you needed to get from my ramble was that you need to find the point of diminishing returns. Ignoring thread overhead, a typical routine runs twice as fast one 2 processors as 1 when split evenly. It runs 3 times faster on 3 processors than 1 when split evenly. This is what I mean by linear returns from threading it. After doing a study like that one that runs the code over different numbers of threads, find the LEAST number of threads you need to get a reasonable speedup. As you can see, on my 20 core processor, the differences between 10 threads and 24 is pretty much nothing gained for throwing more threads on the pile. The difference between 7 and 10 threads is tangible but small. So, just common sense: pick a value between 7 and 10 that you feel will do your code 'well enough' and run that many, and no more. Also, if the cpu does not support that many .. say its an old 4 core machine, then you run 3 and leave one free. Always leave a core free for the OS to do its background garbage... there is always something eating a CPU in modern OS unless you are on an embedded computer with a minimal OS or a R&D box.
so you need to know how many your computer supports, which the test code detects, and run one less than that up N where N was found by the testing results, in our case N is 7 to 10. This isnt perfect, as hardware varies, but its about as good as I know how to do it without running a test program when you install to find the best values for that specific machine (and even then, that varies when the machine has more or less programs running on it).
also note that this is for O(n) problems. this matters ;) a NlgN sort split actually does fewer operations split than on one cpu ... sorting 1 million items and reassembling them on 4 cpus, for example, turns about 20 million operations (n=1M, lg(1m)~= 20) into about 19M (250k * lg(250k)(18)) + N to reassemble the 4 parts back into 1) ... its pretty much the same concepts as above but the actual gains from threading vary depending on the algorithm and complexity of reassembly if any and so on. It doesn't matter too much since you do the same kind of analysis to choose the # of threads to split into... choose where the diminishing returns of more threads isnt worthwhile.
#include <chrono>
#include <type_traits>
// the highest resolution clock that is suitable for measuring time intervals
// the high resolution clock if it is monotonic, otherwise the steady clock
using clock_type = std::conditional< std::chrono::high_resolution_clock::is_steady,
std::chrono::high_resolution_clock,
std::chrono::steady_clock >::type ;
@JLBorges
Please let's have this post and the answers you will be writing as replies in a simple language.
volatile std::uintmax_t sum = 0; // volatile: inhibit a clever optimiser from optimising the loop away
1) volatile means, hey compiler, try not to consider me as const and step forward to optimize the code, right?
2) Why did you think a good compiler may optimize the for loop code and avoid going through it?
3) Why didn't you use sum += start; instead of sum = sum + start;, please? (Any reason!?)
4) Why did you use static storage area here, please? staticconstunsignedlonglong N = 2*3*5*7*2'000'000LL;. You wanted an object which is initialized once and doesn't change throughout the program's life time. So why not only const? Why not constexpr or even static constexpr?
5) In first_val += N_PER_THREAD;, why did you think unsigned int can embrace the data inside unsigned long long, please? If they both can hold the same size of data why did you define N as type of unsigned long long?
6) Why did you declare async this way: ... std::async( std::launch::async, .... I mean why using std::launch::async?
These are my questions until line 30 of your code. When the above questions are sorted out properly so that I can understand them well, I will go further for the next questions until the end of your code.
Thanks for your time.
> volatile means, hey compiler, try not to consider me as const and step forward to optimize the code, right?
Reads and writes to a volatile object are deemed to be part of the 'observable behaviour' of the program.
So, sum + sum i; would involve updating (reading and writing) the value of sum in each iteration of the loop.
> Why did you think a good compiler may optimize the for loop code and avoid going through it?
I've observed similar optimisations; optimisers are capable of doing quite clever things.
1 2 3 4 5 6 7 8 9 10
int foo()
{
int sum = 0 ;
for( int i = 0 ; i < 1000 ; ++i ) sum += i ;
return sum ;
// the optimiser knows that the sum of the series 0+1+2+..+999 is 499500
// optimise the loop (and sum) away; apply the as-if rule and rewrite he function as
// return 499500 ;
}
Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access.
Not sure what you mean by "not consider me as const" , const is const, unless: one casts it away with a const cast, which is strongly discouraged ; or the variable is mutable. But neither of those affect volatile. const volatile is non mutable.
volatile is often used because we want to see the visible side effect.
> I've observed similar optimisations; optimisers are capable of doing quite clever things.
OK, here in the example you wrote, albeit the compiler was clever and optimised the for loop returning the value 499500, it's not bad, since sum is a local variable, whether its value hasn't change and is still 0 or has change, right? I guess it'd be undesirable in situations where we also need the variable's value apart from the return value, like:
1 2 3 4 5 6 7 8 9 10 11 12
int foo(int& sum)
{
int sum = 0 ;
for( int i = 0 ; i < 1000 ; ++i ) sum += i ;
return sum ;
}
int main() {
int sum {0};
std::cout<< foo(sum) << ' ';
// do something with sum
}
Agree?
>'Usual arithmetic conversions'.
So in this case, you meant: no need to declare first_val as unsigned long long. If the value it gets is small enough (say, when nthreads == 10), well, it can place it in itself, otherwise if the value is big (say, when nthreads == 9), the usual conversion is applied and the type of first_val will become unsigned long long. Right?
>We want the task to be executed on a different thread (different from the calling thread)..
You mean we're doing wrong in the example below, and are not calling the task to be executed on a different thread? https://coliru.stacked-crooked.com/a/590c559f0a221b12
@TheIdeasMan:
I mean that volatile is not const, it may constantly change and we want the compiler to know this.
I mean that volatile is not const, it may constantly change and we want the compiler to know this.
But the point is: not to optimise the code, which is opposite to your original statement.
Edit:
Variables may be constvolatile , this is called cv qualified. So volatile does not mean non const .
Perhaps it is an interesting choice of word for a keyword, given it's English meanings seem to be opposite to it's meaning in C++. https://www.dictionary.com/browse/volatile
@JLBorges
noted your statement about strength reduction versus optimised away completely - cheers :+)
> I guess it'd be undesirable in situations where we also need the variable's value apart from the return value, like:
The loop can still be optimised away.
1 2 3 4 5 6 7 8 9 10
int& foo( int& sum )
{
sum = 0 ;
for( int i = 0 ; i < 1000 ; ++i ) sum += i ;
return sum ;
// the optimiser knows that the sum of the series 0+1+2+..+999 is 499500
// optimise the loop (and sum) away; apply the as-if rule and rewrite he function as
// return sum = 499500 ;
}
> and are not calling the task to be executed on a different thread?
It is left to the implementation.
The policy defaults to std::launch::async | std::launch::deferred (ie. both flags are set).
If more than one flag is set, it is implementation-defined which policy is selected. For the default (both the std::launch::async and std::launch::deferred flags are set in policy), standard recommends (but doesn't require) utilizing available concurrency, and deferring any additional tasks. https://en.cppreference.com/w/cpp/thread/async
> ... If the value it gets is small enough ...
Usual arithmetic conversions are always applied;
these conversions are based on the types of the operands, not their values.
> But the point is: not to optimise the code
Reads from and writes to volatile objects can't be optimised away.
Other optimisations may be applied to operations on volatile objects. For example, strength reduction:
1 2 3 4 5 6
void foo(volatileunsignedint& sum )
{
sum = sum / 16 ;
//optimise: strength reduction; replace division by a bitwise shift
// sum = sum >> 4U ;
}