11-22-2016, 10:55 PM
Hum, there is potentially some margin for improvement on newer CPUs. Texture conversion is done with a gather operation.
So far it is emulated with 8 instructions per texel lookup (bilinear needs 4 lookups).
AVX2 adds a native gather instruction. It isn't fast on my CPU (Haswell), but Skylake has interesting numbers.
Haswell: 20 uops, 9 cycles of latency
Broadwell: 10 uops, 6 cycles of latency
Skylake: 4 uops, 4 cycles of latency
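
For anyone unsure what a gather is here: a rough scalar sketch in C, assuming 32-bit table entries and 8 lanes (the width of one AVX2 register). The name gather8_u32 and the table layout are made up for illustration; this is not the actual conversion code. As far as I know, the AVX2 intrinsic _mm256_i32gather_epi32 does this same per-lane lookup in a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an 8-lane, 32-bit gather: every lane loads
 * table[idx[lane]] independently of the other lanes.
 * Hypothetical helper for illustration only. The caller must make
 * sure every index is inside the table. */
static void gather8_u32(const uint32_t *table, const int32_t idx[8],
                        uint32_t out[8])
{
    for (size_t lane = 0; lane < 8; ++lane) {
        out[lane] = table[idx[lane]];
    }
}
```

Without a native gather, each of those loads has to be extracted and inserted by hand, which is where the instruction count per lookup comes from.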