Doom9's Forum - View Single Post

pinterf · 26th January 2018, 18:21

Meanwhile I accidentally observed a bug that showed up on the right side of a clip as random color blocks/lines.
It appeared only in 32-bit float test clips and only with a specific sequence of resizes.

Encode this script and look at the right side of the bottom-right clip (32-bit float).

Code:

x=ColorBarsHD.ConvertToYUV444().Trim(0,100)
Function Resize(clip c)
{
Return c.Spline64Resize(2802,1501).\
  Spline16Resize(904,487).\
  BilinearResize(402,500).\
  Spline36Resize(200,489).\
  LanczosResize(400,300).ConvertBits(8)
}
a8 = x.ConvertBits(8).Resize()
a10 = x.ConvertBits(10).Resize()
a16 = x.ConvertBits(16).Resize()
a32 = x.ConvertBits(32).Resize()
Stackvertical(StackHorizontal(a8,a10),StackHorizontal(a16,a32))

Unfortunately, it was perfect when I encoded float-only clips. The ugly part of the reproduction was that there had to be an 8-16 bit clip in the script as well (I usually run the tests for 8, 10, 16 and 32 bits, then stack the results together in 8 bits to spot any difference). Obviously something had to leave bit patterns in memory other than those of a 32-bit float clip.

Narrowing the problem down, it turned out that the SIMD code, which handles the pixels in units of 4 or 8, ran into garbage at the right side, where not every pixel of the last 4/8 unit is a visible pixel, depending on the clip width (modulo 4 or 8).
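To make the "modulo 4 or 8" remark concrete, here is a minimal sketch that counts how many lanes of the last SIMD unit fall beyond the visible width. The helper name `padding_lanes` is hypothetical and this is not the avs+ code; the widths are simply the ones used in the test script above.

```python
def padding_lanes(width, lanes):
    """Number of lanes in the last SIMD unit that lie beyond the visible width."""
    remainder = width % lanes
    return 0 if remainder == 0 else lanes - remainder


if __name__ == "__main__":
    # Widths taken from the resize chain in the test script.
    for width in (2802, 904, 402, 200, 400):
        print(width, padding_lanes(width, 4), padding_lanes(width, 8))
```

Only the widths that are not a multiple of the unit size (2802 and 402 here) leave lanes that hold whatever happened to be in memory.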

During the resizing process, the unused pixels are masked out with a zero multiplier. The existing 8-16 bit code worked fine like that, but it was not enough for 32-bit float pixels. When the leftover bits of such an undefined pixel happen to form a NaN (Not a Number), multiplying it by 0 still results in NaN.
Thus such a pixel in the resized clip turned into an undefined value (garbage).
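The effect is easy to reproduce outside of any resizer. The sketch below is only an illustration in plain Python, not the avs+ SIMD code: `masked_sum` mimics masking with a zero coefficient, and `safe_masked_sum` is one possible workaround (skipping masked inputs entirely), which is an assumption and not necessarily what the rewritten resizer does. Both function names are hypothetical.

```python
import math


def masked_sum(pixels, weights):
    """Weighted sum where unused pixels are 'masked' by a zero weight."""
    return sum(p * w for p, w in zip(pixels, weights))


def safe_masked_sum(pixels, weights):
    """Same sum, but pixels with a zero weight are never multiplied at all."""
    return sum(p * w for p, w in zip(pixels, weights) if w != 0.0)


if __name__ == "__main__":
    # Integer case: garbage in the masked lanes is harmless, garbage * 0 == 0.
    print(masked_sum([64, 128, 12345, 999], [1, 1, 0, 0]))

    # Float case: the last two lanes hold NaN, and NaN * 0.0 is still NaN.
    nan = float("nan")
    pixels = [0.25, 0.5, nan, nan]
    weights = [0.5, 0.5, 0.0, 0.0]
    print(math.isnan(masked_sum(pixels, weights)))
    print(safe_masked_sum(pixels, weights))
```

With integer pixels any leftover value times zero is zero, which is why the 8-16 bit path never showed the problem.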

So I had to fix the float resizer code. Uhh, it was old, one of my early attempts.
Since the 10-16 bit resizer parts were affected as well, they had to be touched, too.
Finally the whole 10-16 bit and float resizer code got rewritten. Unfortunately I could not foresee how much time the bug chasing would need; I could surely have ported the zimg resizers into avs+ with less effort.

Speaking of zimg:
since I had to benchmark the new code to see whether it is any better than the one in r2580, I included the z_XXX resizers in the comparison.
https://forum.doom9.org/showthread.php?t=173986

Results are interesting.
Look at the 400x2800 -> 900x400 case (16 bit).

Resizing always happens in two passes: either horizontal resizing first and then vertical, or vertical first and then horizontal. It depends.

In Avisynth - probably for quality reasons - there is a strategy: "// ensure that the intermediate area is maximal"

The Avs resizer gave 103 fps, while zimg had 402 fps. What?

Then it turned out that Avs chose the 400x2800->900x2800->900x400 sequence,
while zimg chose
400x2800->400x400->900x400.

When I split the resizing command into two resizes (first V, then H), avisynth+ gave a quite comparable result of 427 fps.
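A quick back-of-the-envelope calculation shows why the order matters so much for this case. The sketch below only compares the intermediate frame sizes of the two possible orders; pixel count is a rough proxy for cost (an assumption, since real cost also depends on filter taps and memory access patterns), and the function name `pass_orders` is hypothetical.

```python
def pass_orders(src_w, src_h, dst_w, dst_h):
    """Intermediate frame size (width, height) for both two-pass orders."""
    return {
        "H then V": (dst_w, src_h),  # width resized first, height still original
        "V then H": (src_w, dst_h),  # height resized first, width still original
    }


def area(size):
    """Pixel count of a (width, height) pair."""
    return size[0] * size[1]


if __name__ == "__main__":
    orders = pass_orders(400, 2800, 900, 400)
    for name, size in orders.items():
        print(name, f"{size[0]}x{size[1]}", area(size))

    # "Maximal intermediate area" picks the larger of the two.
    print("max area:", max(orders, key=lambda k: area(orders[k])))
    print("min area:", min(orders, key=lambda k: area(orders[k])))
```

For 400x2800 -> 900x400 the maximal-area order goes through a 900x2800 intermediate (2,520,000 pixels), while the other order goes through 400x400 (160,000 pixels), which is the sequence zimg used in this case.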
The first three columns (AVX2, SSE4, noSSE4) contain the results of the new resizers. The code was run on an i7-7700 with Avs+ x64. For the test I built specific avs+ versions that ignore the AVX2..SSE4.1 CPU flags.

EDIT: this benchmark data compares a then-current "under construction" version of avs+ with the z-lib resizer I had access to (r1a, from 2016).
Both have faster variants since then.

Code:

#32bit float           AVX2    SSE4   noSSE4   v2580  Zimg
#400x2800 -> 900x400:  103     64.3   64.3     65.9              Lanczos
#1920x1080-> 1280x720  143.1   83.7   83.2     95.7   158.4      Spline64

#16 bit                AVX2    SSE4   noSSE4   v2580  Zimg
#400x2800 -> 900x400:  103.9    84     75.8    56.2   402.6      Lanczos **see comment
#400x2800 -> 400x400->
#            900x400:  427.2   315.3  289.9    243.4  405.1      Lanczos 

#1920x1080-> 1280x720  152.2   125.0  118.9    80.6   129        Spline64
#1920x1080-> 1280x720  160.6   134.3  129.4    85.9   134.5      Lanczos
#1920x1080-> 1280x1080 240.2   190.6  185.2    113.2  191.8      Lanczos H
#1920x1080-> 1920x720  335.5   320.8  281      272.2  203.0      Lanczos V

#10 bit                AVX2    SSE4   noSSE4   v2580  Zimg
#400x2800 -> 900x400:  105.4   77.5    53.5                      Lanczos 
#1920x1080-> 1280x720  155.2   146.6  120.4    78.9   133        Spline64
#1920x1080-> 1280x720  163.6   156.2                             Lanczos

#8 bit                 AVX2    SSE4   noSSE4   Old    Zimg
#400x2800 -> 900x400:  93.8    93.4   96.2     93.8   402       **see comment
#1920x1080-> 1280x720  104.7   105.1  106.0    105.5  120.2

# ** Avisynth - unlike zimg - always orders H/V resizers for max intermediate area!

(I did not touch the 8-bit resizer code. There is no AVX2 path there, but I included the measurements anyway.)

Now it remains to be decided whether the slower H/V vs. V/H ordering strategy should be kept. Is the difference really visible, and if so, when?

Post #3909
Last edited by pinterf; 28th January 2018 at 19:57. Reason: Comment on benchmarks, new resizers exist since then