I'm sure it can, thanks.
I didn't bother optimizing it because it only runs one time ever.
I would leave the call to sanity_check_task_endpoints in because we're using floating point math to compute indices into an array, and we ought to detect rounding error if possible.
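To illustrate the kind of rounding error I mean, here is a minimal sketch. It is not the code from this thread. The names `pairs_before_row` and `row_of` are made up for the example, and it assumes the pairs (i, j) with j < i are numbered row by row. The floating-point result is treated only as a guess and is then corrected with exact integer arithmetic:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: number of pairs (r, j) with j < r < i,
// i.e. how many pairs come before row i starts.
std::uint64_t pairs_before_row(std::uint64_t i) {
    return i == 0 ? 0 : i * (i - 1) / 2;
}

// Hypothetical helper: the row i that contains linear pair index k,
// i.e. the largest i with pairs_before_row(i) <= k.
std::uint64_t row_of(std::uint64_t k) {
    // Floating-point estimate from solving i*(i-1)/2 = k for i.
    const double estimate =
        (1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(k))) / 2.0;
    auto i = static_cast<std::uint64_t>(estimate);

    // Integer correction: nudge the estimate if rounding put it off by one.
    while (pairs_before_row(i) > k) --i;
    while (pairs_before_row(i + 1) <= k) ++i;
    return i;
}
```

With this numbering, `row_of(k)` gives the row and `k - pairs_before_row(row_of(k))` gives the column. A sanity check could instead just assert that the two correction loops never run, which would flag rounding problems rather than silently fixing them.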
I would leave the call to sanity_check_task_endpoints in
Ah. I only commented it out so that the code would compile for me to make sure I had no typo/syntax errors (like unmatched brackets!). I forgot to remove the // for line 22 after I pasted the code. Doh!
Thanks mbozzi for the interesting reference and your prototyped implementation.
Since you "only" reached a speedup of 4% and the results in the paper don't look completely groundbreaking either, I think I will stick to the implemented dynamic schedule.
But I will keep this in mind for future optimization, since I will come across loops like this again and again.
I think there are some typos in the paper, though. For example, the total number of iterations for N elements should be 0.5*N*(N-1), not 0.5*N*(N+1).
Also, in eq. (7) it says one needs to round the resulting K1 up to obtain I1, whereas in (8) they round down.
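On the iteration count, here is a quick brute-force check. It is only a sketch and assumes the inner loop starts at i + 1, so each unordered pair of distinct elements is visited once. The function name `count_pair_iterations` is made up for the example. If the loop in the paper also includes the diagonal (j starting at i), then 0.5*N*(N+1) would be the right count:

```cpp
#include <cstdint>

// Hypothetical helper: counts the iterations of a triangular double loop
// over all pairs (i, j) with 0 <= i < j < n.
std::uint64_t count_pair_iterations(std::uint64_t n) {
    std::uint64_t count = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        for (std::uint64_t j = i + 1; j < n; ++j) {
            ++count;  // one iteration per unordered pair
        }
    }
    return count;
}
```

For n = 5 this counts the pairs by row as 4 + 3 + 2 + 1 = 10, which matches 0.5*5*4 and not 0.5*5*6.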
I will likely open another thread soon to talk about modularity/utility of the code written with help of this thread.