As you know, to run it at all you have to raise the default stack size to some massive number, because it overflows the default 1 MB stack. IMO, this is most likely a compiler bug (one that seems to exist in all compilers). TurboPFor has a huge switch with blocks of code; the compiler calculates the stack use of each block and adds them all together, as if several switch blocks could execute in the same call, instead of overlapping their stack slots. This results in that insane stack requirement. IMO, most projects might skip TurboPFor after seeing it overflow the stack, without even trying to figure out what happened.
Anyway, with some code restructuring TurboPFor doesn't need this massive stack to operate. I fixed it in my branch, but I'm not sure it's something that would be OK to take, as the change is massive and not trivial at all (it uses some preprocessor magic to make it work; I actually wrote most of the repetitive code using an editor macro).
This is the commit that took it from a 1+ MB stack to 40 KB:
9c116bc
Basically, it turns all the switch blocks into standalone forceinline functions. When inlining forceinline functions, the compiler doesn't add up the stack that each function could use; instead it uses the max size of them all, which results in this reduction from 1+ MB to 40 KB.
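To make the shape of that restructuring concrete, here is a minimal sketch. It is not the actual TurboPFor code: `FORCE_INLINE`, `unpack_w1` ... `unpack_w8` and `unpack8` are hypothetical names, and the tiny `tmp[]` buffers only stand in for the much larger per-case temporaries. Whether the locals really end up sharing stack depends on the compiler and flags, so treat it as an illustration of the pattern only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical force-inline macro covering MSVC and GCC/Clang. */
#if defined(_MSC_VER)
#define FORCE_INLINE static __forceinline
#else
#define FORCE_INLINE static inline __attribute__((always_inline))
#endif

/* Each former switch block becomes its own function with its own locals.
   Every kernel unpacks 8 values of one fixed bit width (LSB first). */
FORCE_INLINE void unpack_w1(const uint8_t *in, uint32_t *out) {
    uint32_t tmp[8]; /* stand-in for this case's temporaries */
    for (int i = 0; i < 8; i++) tmp[i] = (in[0] >> i) & 1u;
    for (int i = 0; i < 8; i++) out[i] = tmp[i];
}

FORCE_INLINE void unpack_w2(const uint8_t *in, uint32_t *out) {
    uint32_t tmp[8];
    for (int i = 0; i < 8; i++) tmp[i] = (in[i / 4] >> (2 * (i % 4))) & 3u;
    for (int i = 0; i < 8; i++) out[i] = tmp[i];
}

FORCE_INLINE void unpack_w4(const uint8_t *in, uint32_t *out) {
    uint32_t tmp[8];
    for (int i = 0; i < 8; i++) tmp[i] = (in[i / 2] >> (4 * (i % 2))) & 15u;
    for (int i = 0; i < 8; i++) out[i] = tmp[i];
}

FORCE_INLINE void unpack_w8(const uint8_t *in, uint32_t *out) {
    for (int i = 0; i < 8; i++) out[i] = in[i];
}

/* The switch itself only dispatches; each case body is a single call.
   Returns 0 on success, -1 for a width this sketch does not handle. */
int unpack8(unsigned width, const uint8_t *in, uint32_t *out) {
    assert(in && out);
    switch (width) {
    case 1: unpack_w1(in, out); return 0;
    case 2: unpack_w2(in, out); return 0;
    case 4: unpack_w4(in, out); return 0;
    case 8: unpack_w8(in, out); return 0;
    default: return -1;
    }
}
```

To check the effect on a real build, one option is to compile the before and after versions with GCC's `-fstack-usage` and compare the reported frame size of the dispatching function.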
Also, lots of bitunpack functions should be static: fd60325
I have lots of other improvements and real bugfixes: master...pps83:TurboPFor:master