I tried hard with GCC again.
Times are in seconds; lower is better.
Code:
137.0 no assembly
47.0 (default)
45.5 (PGO build) -mtune=ivybridge (the default here is -O3, which makes the first-pass PGO .exe crash, hence no better speed, I guess)
44.5 (PGO build) -mtune=ivybridge -O2
43.9 (PGO build) -mtune=ivybridge -funroll-loops -finline-functions -ftree-loop-vectorize -O2
39.5 LigH
So all that fiddling gets me only a small improvement, and I'm still far away from LigH's GCC builds.
I'm giving up here; I have no ideas left.
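To put "little improvement" and "far away" into numbers, here is a small sketch that turns the timings above into percentages. The labels and the helper name `percent_faster` are made up for illustration; only the second counts come from the list.

```python
# Timings in seconds, copied from the list above (lower is better).
# The dictionary keys are shortened labels, not exact build names.
timings = {
    "no assembly": 137.0,
    "default": 47.0,
    "PGO -mtune=ivybridge (default -O3)": 45.5,
    "PGO -mtune=ivybridge -O2": 44.5,
    "PGO -O2 + unroll/inline/vectorize": 43.9,
    "LigH": 39.5,
}


def percent_faster(slow, fast):
    """Percentage of run time saved when going from `slow` to `fast`."""
    return (slow - fast) / slow * 100.0


baseline = timings["default"]
for name, secs in timings.items():
    # Positive values are faster than the default build, negative are slower.
    print(f"{secs:6.1f} s  {percent_faster(baseline, secs):+7.1f}%  {name}")

# Remaining gap between the best PGO build and LigH's build.
gap = percent_faster(timings["PGO -O2 + unroll/inline/vectorize"], timings["LigH"])
print(f"LigH is still {gap:.1f}% faster than the best PGO build")
```

By this arithmetic the best PGO build saves roughly 6.6% over the default build, while LigH's build is about 10% faster again.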