This looks like a very hard, long-term strategic question for core and plugin development:
Execution hardware is moving toward a large number of processing cores, each with a small per-core L1d cache and slow access to main memory. Given that, could new scan formats be added, such as block scans instead of full-frame line scans?
Currently, algorithms that access data vertically have to read memory with large strides, and even a single full line of an 8K frame in float32 takes up essentially the whole 32 kB L1d cache.
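To put rough numbers on this, here is a small back-of-the-envelope sketch. The figures are my assumptions, not measurements: a 32 kB L1d, 64-byte cache lines, an 8K UHD width of 7680 samples, and no allowance for cache associativity.

```python
# Assumed figures, not measurements: 32 kB L1d, 64-byte cache lines,
# 8K UHD width of 7680 samples, float32 samples.
L1D_BYTES = 32 * 1024
CACHE_LINE_BYTES = 64
FRAME_WIDTH = 7680
BYTES_PER_SAMPLE = 4


def row_bytes(width, bytes_per_sample):
    """Size in bytes of one full scan line."""
    return width * bytes_per_sample


def rows_until_l1d_full(l1d_bytes, cache_line_bytes):
    """Rows a vertical walk can touch before its cache lines fill L1d.

    The row stride is far larger than a cache line, so every row
    touched pulls in one separate cache line.
    """
    return l1d_bytes // cache_line_bytes


line = row_bytes(FRAME_WIDTH, BYTES_PER_SAMPLE)
print(line, line / L1D_BYTES)                             # 30720 0.9375
print(rows_until_l1d_full(L1D_BYTES, CACHE_LINE_BYTES))   # 512
print(BYTES_PER_SAMPLE / CACHE_LINE_BYTES)                # 0.0625
```

Under these assumptions one line is about 94% of L1d, and a vertical walk uses only 4 of every 64 bytes it loads.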
The size of the scan blocks is a new question. It looks like a block must be smaller than the L1d cache, so 32..48 kB at most, and the block aspect ratio could be anywhere from 1:1 to something natural for the frame, such as 4:3..16:9.
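As a rough illustration of what such blocks could look like, the sketch below picks the largest block of a given aspect ratio that fits a byte budget. The function name `block_dims` and the 32 kB budget are only my assumptions for the example. A real design would probably leave room for output data and round the width to a SIMD-friendly multiple.

```python
import math


def block_dims(budget_bytes, bytes_per_sample, aspect_w, aspect_h):
    """Return (width, height) in samples of the largest block with
    roughly aspect_w:aspect_h proportions that fits in budget_bytes."""
    max_samples = budget_bytes // bytes_per_sample
    # width = height * aspect_w / aspect_h, so
    # width * height <= max_samples  =>  height^2 <= max_samples * aspect_h / aspect_w
    height = math.isqrt(max_samples * aspect_h // aspect_w)
    width = height * aspect_w // aspect_h
    return width, height


budget = 32 * 1024  # assumed L1d budget in bytes
for aw, ah in ((1, 1), (4, 3), (16, 9)):
    w, h = block_dims(budget, 4, aw, ah)
    print(f"{aw}:{ah} -> {w} x {h} samples, {w * h * 4} bytes")
```

For float32 and a 32 kB budget this gives blocks of roughly 90 x 90, 104 x 78, and 119 x 67 samples, so a block is only about a hundred samples on a side.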
The scan re-formatting could be done inside each plugin, converting to block order and back, but that would cost performance. Are there perhaps already some plans in this direction?
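For clarity, this is the kind of per-plugin conversion I mean, as a minimal pure-Python sketch. The names `to_blocks` and `from_blocks` and the block ordering (blocks left to right, then top to bottom, each block stored row by row) are hypothetical, not an existing API.

```python
def to_blocks(frame, width, height, bw, bh):
    """Repack a row-major flat list into block-scan order.

    Edge blocks are simply smaller when the frame size is not a
    multiple of the block size.
    """
    out = []
    for by in range(0, height, bh):
        for bx in range(0, width, bw):
            x_end = min(bx + bw, width)
            for y in range(by, min(by + bh, height)):
                row_start = y * width
                out.extend(frame[row_start + bx:row_start + x_end])
    return out


def from_blocks(blocks, width, height, bw, bh):
    """Inverse of to_blocks: restore row-major line-scan order."""
    frame = [0] * (width * height)
    pos = 0
    for by in range(0, height, bh):
        for bx in range(0, width, bw):
            run = min(bx + bw, width) - bx
            for y in range(by, min(by + bh, height)):
                start = y * width + bx
                frame[start:start + run] = blocks[pos:pos + run]
                pos += run
    return frame


# Tiny 6 x 4 "frame" with 4 x 2 blocks, so the right-hand blocks are partial.
frame = list(range(6 * 4))
blocked = to_blocks(frame, 6, 4, 4, 2)
print(blocked[:12])  # [0, 1, 2, 3, 6, 7, 8, 9, 4, 5, 10, 11]
print(from_blocks(blocked, 6, 4, 4, 2) == frame)  # True
```

Each conversion reads and writes every sample once, so a plugin that converts in and out pays for two extra full-frame passes. That is the slowdown a native block-scan format in the core would avoid.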