When I run my project in Visual Studio 2022 using 4 threads (the shared data is immutable; I chose 4 because std::cout << std::thread::hardware_concurrency(); prints 4), I get almost no time improvement compared to running on the sole main thread. But when the same project is run on Coliru (an online C++ compiler) with the same number of threads, it runs much faster.
Is there any reason for that, please?
Impossible to say, unless you explain how exactly you use threads to "speed up" your computation.
Threads do not "magically" make your program run faster! You have to carefully split the work across the threads — which may be possible or not, depending on how "parallelizable" the specific task that you are trying to solve is. Also, the work must be large enough to benefit from threading; otherwise it is very well possible that the thread creation/synchronization overhead counteracts any speed up you might expect...
I'm just theorizing here, but another factor might be that online compilers are potentially bombarded with hundreds of requests every second, and the server needs to have some way of handling all these requests, by splitting up work. But as I type this, I suppose that could only make the online code slower, not faster.
It could be that, due to the different architectures, you end up getting some false sharing on your computer, while the bullet is dodged on the online compiler. https://en.wikipedia.org/wiki/False_sharing
the project is run on Coliru (online C++ compiler) using the same number of threads
Oranges.
Two entirely different setups in software and hardware, so why the timings differ is hard (if not impossible) to quantify: you are comparing two different machines.
Run the tests many times on different Windows machines, with different hardware and operating systems, and chances are each run on each machine will exhibit different timings.
Not every task is amenable to delegating to multiple threads.
The data is immutable and huge: a data structure with millions of elements. Since hardware_concurrency() prints 4 on my machine, I divide the elements into roughly 4 parts and launch a thread for each part. Even so, the running time is almost the same as using the main thread only! :|
Good to know that the problem is not with the data or the algorithm, since when I run the two projects on Coliru (one with only the main thread and the other with 4 threads), the latter needs less than half the time of the former.
But how long does a single thread need to complete the computation? Starting and synchronizing threads has a certain overhead! So, if the single thread completes very fast (e.g. less than a few seconds), then the computation probably won't benefit from multi-threading, because the overhead counteracts any benefit.
Also, how exactly do you split your elements into 4 parts? You give the first thread the elements #0 to #(N/4)-1, the second thread the elements #N/4 to #2(N/4)-1, and so on? If so, then maybe try giving the first thread the elements #0, #4, #8..., the second thread the elements #1, #5, #9..., and so on.
Last but not least, if your computation is actually bounded by memory bandwidth rather than the CPU, then using multiple threads will not resolve the memory bottleneck. It may even make things worse, because of the less "sequential" access pattern that follows from running multiple parallel threads...
Also, since C++17 the standard algorithms can run in parallel when given an execution policy (e.g. std::execution::par), and implementations are free to parallelize internally. If your code already uses them that way, adding your own threads may yield no performance gain, or possibly worse performance as already mentioned.
The data is immutable and huge, like a data structure with millions of elements.
A few million data items? If so, that is not huge. Routine testing starts at 1 million, then tens of millions, then hundreds of millions. I use a library that does non-trivial calculations (latitude/longitude to UTM grid coordinates); it can do 1 million of these in less than a second on an ordinary laptop. So try with a much bigger data set, say 100 million.