MVTools-pfmod - Page 17

feisty2 · 4th July 2017, 01:21

Your original post was "asm shit is fast", and I been saying, intrinsics are equally fast, without having to use an assembler
Obviously you realized that, then you changed your point to, "you can't automatically convert raw asm to intrinsics"
Get a room with Katie already, troll

Groucho2004 · 4th July 2017, 01:45

Quote:

Originally Posted by feisty2

Your original post was "asm shit is fast", and I been saying, intrinsics are equally fast, without having to use an assembler
Obviously you realized that, then you changed your point to, "you can't automatically convert raw asm to intrinsics"
Get a room with Katie already, troll

I didn't change my point, you missed it. Also, you didn't write "intrinsics are equally fast, without having to use an assembler", you wrote "use intrinsics instead" which is very vague and implies that this could be done instantly.

What's wrong with using an assembler? Do you realize that for example the speed of libx264 is based on its highly efficient asm code?

My point is that there is perfectly good and fast asm code in mvtools2 (32 and 64 bit). Having this converted to intrinsics would be good but it's a lot of work.

tebasuna51 · 4th July 2017, 13:04

Quote:

Originally Posted by feisty2

...
Get a room with Katie already, troll

Please guys stop that way.

The question is clear, stop the discussion.

Groucho2004 · 4th July 2017, 17:08

Quote:

Originally Posted by tebasuna51

Please guys stop that way.

The question is clear, stop the discussion.

I'm not sure to which question you're referring, Mystery's or feisty's. Either way, nothing wrong with having a discussion. I still don't quite understand feisty's troll accusation but I suppose there was some kind of misinterpretation of something I posted...

TheFluff · 4th July 2017, 17:42

The main benefit of intrinsics over handwritten assembly is that it's easier to write and maintain, as well as easier to integrate into your C++ stuff (such as templates - a lot of the VS multi-bitdepth stuff uses templated intrinsics). A minor bonus is that you don't need a separate assembler in addition to your regular compiler. However, if you already have a bunch of well tested and functioning .asm (in separate files, not some inline monstrosity pain in the rear) and that you have no intention of changing, then porting to intrinsics is just a lot of busy-work that's probably going to introduce a lot of new and exciting bugs. Not even the VS port of MVTools has gotten rid of all the .asm files, because there was simply no need. New code has been ported to intrinsics though.

Groucho2004 · 4th July 2017, 19:12

Quote:

Originally Posted by TheFluff

However, if you already have a bunch of well tested and functioning .asm (in separate files, not some inline monstrosity pain in the rear) and that you have no intention of changing, then porting to intrinsics is just a lot of busy-work that's probably going to introduce a lot of new and exciting bugs. Not even the VS port of MVTools has gotten rid of all the .asm files, because there was simply no need.

That was exactly my point. I think pinterf replaced most (if not all) inline asm with intrinsics so the remaining problem seems to be that some people have trouble producing a few .obj files using yasm/nasm. It's just bizarre.

yup · 10th July 2017, 08:40

pinterf

for update.
Now some of my scripts work stably.
But I see strange behaviour when opening scripts in VirtualDubMod: I do not see an error message if I have written a script with an error related to MVTools functions. Instead, VirtualDub hangs and does not respond.
I cannot close the video in VirtualDub, I can only close VirtualDub itself.
Job control also does not work.
Please advise.

yup.

pinterf · 11th July 2017, 09:45

Quote:

Originally Posted by feisty2

any particular reason you cant just get rid of all that asm shit and use intrinsics instead?

Because the porting is done in my free time which is limited.

Asm vs intrinsics.

SAD and SATD code (which are the most important routines regarding mvtools2 speed) written in intrinsics is _much_ slower than using existing asm, I'm talking about VS2015/2017 code generator.

I have experienced the opposite case as well when the generated code from intrinsics is faster than the original asm (experienced in FFT3DFilter and TIVTC) perhaps because of smarter instruction ordering. Even a C version can be faster than the old asm (TIVTC).

I usually have a look at the generated assembler code of the intrinsics.

There are cases when the optimizer uses too many xmm registers, so the prolog/epilog register save/restore (which we cannot control) takes significant time relative to the actual task, as experienced in 16 bit SAD intrinsics routines. I had to play with less-than-optimal loop unrolling until I found out the fastest result for a particular SAD blocksize.
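For readers who have not seen one, here is a minimal sketch of what an intrinsics SAD routine looks like for 8-bit samples. It is not the mvtools2 code: the names `sad_c` and `sad_sse2` are made up for illustration, the block width is assumed to be a multiple of 16, and how fast it runs depends entirely on the compiler's code generation, which is the point made above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <emmintrin.h>  // SSE2 intrinsics (x86/x64 only)

// Plain C++ reference: sum of absolute differences over a width x height block.
inline unsigned sad_c(const uint8_t* src, std::ptrdiff_t src_pitch,
                      const uint8_t* ref, std::ptrdiff_t ref_pitch,
                      int width, int height) {
    unsigned sum = 0;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x)
            sum += static_cast<unsigned>(std::abs(int(src[x]) - int(ref[x])));
        src += src_pitch;
        ref += ref_pitch;
    }
    return sum;
}

// SSE2 version, templated on the block size so the loops have constant bounds
// and the compiler is free to unroll them (hypothetical helper, not mvtools2 code).
template <int Width, int Height>
inline unsigned sad_sse2(const uint8_t* src, std::ptrdiff_t src_pitch,
                         const uint8_t* ref, std::ptrdiff_t ref_pitch) {
    static_assert(Width % 16 == 0, "this sketch only handles widths that are multiples of 16");
    assert(src_pitch >= Width && ref_pitch >= Width);
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < Height; ++y) {
        for (int x = 0; x < Width; x += 16) {
            const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + x));
            const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ref + x));
            // _mm_sad_epu8 produces two partial sums, one per 64-bit lane.
            acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
        }
        src += src_pitch;
        ref += ref_pitch;
    }
    // Fold the upper 64-bit lane onto the lower one and return the low 32 bits.
    const __m128i folded = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return static_cast<unsigned>(_mm_cvtsi128_si32(folded));
}
```

A call such as `sad_sse2<16, 16>(src, src_pitch, ref, ref_pitch)` should return the same value as `sad_c` for the same block. Whether it beats hand-written asm is something only the generated assembly and a benchmark can tell.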

pinterf · 11th July 2017, 10:11

Quote:

Originally Posted by MysteryX

DCT=0 is "fine" with 68% CPU usage. DCT=1 is what gives problems with multi-threading with 37% CPU usage, choppy playback, and occasional freezes -- but DCT=1 is definitely better than before the recent Pinterf fix!

DCT=1 is using integer arithmetic for 8 bit video and 8x8 block sizes.
In all other cases (such as for block size 16x16) the routines from the FFTW3 library are used.
I don't know which fftw3 version you are using (I can see 3.3.6 as the latest one at http://www.fftw.org/ ), perhaps you could try comparing different versions.

MysteryX · 12th July 2017, 05:21

Quote:

Originally Posted by pinterf

I don't know which fftw3 version are you using (i can see 3.3.6 as the latest one in http://www.fftw.org/ ), perhaps you could try comparing different versions.

I don't know which version but it is from March 2014

I'll try the latest and see how it behaves.

Still the same problem. Stuck at 37% CPU usage.

It is the libfftw3f-3.dll file in C:\Windows\SysWOW64, correct?

If I use BlkSize=8, I get 47% CPU usage.

shae · 13th July 2017, 14:04

What's supposed to happen if FFTW is missing?

QTGMC seems to work without it. Is it because MvTools2 doesn't always need it or something else?

And can it load libfftw3f-3.dll from the same directory as mvtools2.dll instead of the system dir?
AvsMeter says the FFTW DLL cannot be loaded, but maybe it only looks for it in the system directory.

real.finder · 13th July 2017, 14:45

Quote:

Originally Posted by shae

What's supposed to happen if FFTW is missing?

QTGMC seems to work without it. Is it because MvTools2 doesn't always need it or something else?

And can it load libfftw3f-3.dll from the same directory as mvtools2.dll instead of the system dir?
AvsMeter says the FFTW DLL cannot be loaded, but maybe it only looks for it in the system directory.

yes, MvTools2 doesn't always need it

you can load FFTW DLL by using this, x64 here

shae · 13th July 2017, 23:09

I think I'll just go by "it's probably fine if the script loads, doesn't crash, and the beginning of the video looks okay".

GMJCZP · 20th July 2017, 03:28

When I use this hello_hello script (# 301) in 16 bits:

Quote:

tr = 1 # Temporal radius
mt = true # Internal multithreading
lsb = false # 16-bit
thSAD = 200 # denoising strength
blksize = 16 # block size
overlap = 4 # block overlap
super = MSuper (mt=mt)
multi_vec = MAnalyse (super, mt=mt, multi=true, blksize=blksize, overlap=overlap, delta=tr)
MDegrainN (super, multi_vec, tr, mt=mt, lsb=lsb, thSAD=thSAD, thSAD2=150)

With lsb = true, delta = 1 and tr > 1, I get artifacts.

hello_hello · 20th July 2017, 04:50

GMJCZP,
Delta and TR have to be the same. From the help file:

MDeGrainN has a temporal radius given by the tr parameter, and uses a special motion vector clip.
tr
Temporal radius, > 0. Must match the mvmulti content, i.e. the delta parameter in MAnalyse.

GMJCZP · 20th July 2017, 13:27

Problem solved.

GMJCZP · 20th July 2017, 16:04

Now I have another problem: the image is not displayed correctly unless I use f3kdb and DitherPost together. I'm still a rookie at this 16-bit stuff:

Code:

Dither_convert_8_to_16()
Temporalsoften(2,1,2,mode=2,scenechange=10)
dither_resize16(720,480,kernel="spline16",invks=true,invkstaps=3,src_left=0.0,u=3,v=3)
MDegrainLight(2,lsb=true,thSAD=200)
f3kdb(range=15, grainY=0, grainC=0, keep_tv_range=True, input_depth=16, output_depth=8)
DitherPost()

# MDegrainLight
# https://forum.doom9.org/showthread.php?p=1810543#post1810543
# Original idea by hello_hello

function MDegrainLight(clip input, int "tr", bool "mt", bool "lsb", int "thSAD", int "thSAD2", int "blksize", int "overlap")
{
tr = Default(tr, 1) # Temporal radius
mt = Default(mt, true) # Internal multithreading
lsb = Default(lsb, false) # 16-bit
thSAD = Default(thSAD, 200) # Denoising strength
thSAD2 = Default(thSAD2, 150)
blksize = Default(blksize, 16) # Block size
overlap = Default(overlap, 4) # Block overlap

super = input.MSuper (mt=mt)
multi_vec = MAnalyse (super, mt=mt, multi=true, blksize=blksize, overlap=overlap, delta=tr)
input.MDegrainN (super, multi_vec, tr, mt=mt, lsb=lsb, thSAD=thSAD, thSAD2=thSAD2)
return last
}

In fact, if I use dfttest(sigma=2, tbsize=1, lsb_in=true, lsb=true, Y=true, U=true, V=true, opt=3, dither=0) instead of MDegrain, DitherPost is no longer necessary.
Can anyone please explain to me whether DitherPost is redundant here, or whether my script is correct?

Or is there a way to use only one of them, either f3kdb or DitherPost?

blaze077 · 20th July 2017, 19:04

1. Afaik, TemporalSoften does not support 16 bit stacked input, so you should apply it before the dither_convert_8_to_16 call.

2. I don't think MDegrainN takes in 16 bit stacked input. It can only output it using lsb=true. (No lsb_in parameter)

3. Ditherpost simply turns a 16 bit clip into an 8 bit clip. In your f3kdb call, you already output 8 bit video so you don't need ditherpost.

Alternatively, you could change f3kdb's output_depth to 16, and then ditherpost would work as expected.

GMJCZP · 20th July 2017, 21:07

1. Temporal soften does not have anything to do with the problem.
2. The script works perfectly as it is, the problem is if I use f3kdb and DitherPost together, I said it before.
3. I repeat it again, if I do not use DitherPost the video is poorly displayed.

Anyway thanks for the reply.

I repeat my question: is my script okay, or am I misusing DitherPost? I cannot find an f3kdb call that displays the image correctly on its own.

EDIT: I solved the problem, the MVTools documentation says:

Quote:

lsb

Generates 16-bit data when set to true. The picture made of the most significant bytes (MSB) is stacked on the top of the least significant byte (LSB) block. Hence a twice taller resulting picture. You can extract the MSB or the LSB with a simple Crop() call. This mode helps recovering the full bitdepth of temporally dithered data.

Then the definitive script looks like this:

Quote:

Dither_convert_8_to_16()
Temporalsoften(2,1,2,mode=2,scenechange=10)
dither_resize16(720,480,kernel="spline16",invks=true,invkstaps=3,src_left=0.0,u=3,v=3)
MDegrainLight(2,lsb=true,thSAD=200).Crop(0,0,0,960)
f3kdb(range=15, grainY=0, grainC=0, keep_tv_range=True, input_depth=16, output_depth=8)

# MDegrainLight
# https://forum.doom9.org/showthread.p...43#post1810543
# Original idea by hello_hello

function MDegrainLight(clip input, int "tr", bool "mt", bool "lsb", int "thSAD", int "thSAD2", int "blksize", int "overlap")
{
tr = Default(tr, 1) # Temporal radius
mt = Default(mt, true) # Internal multithreading
lsb = Default(lsb, false) # 16-bit
thSAD = Default(thSAD, 200) # Denoising strength
thSAD2 = Default(thSAD2, 150)
blksize = Default(blksize, 16) # Block size
overlap = Default(overlap, 4) # Block overlap

super = input.MSuper (mt=mt)
multi_vec = MAnalyse (super, mt=mt, multi=true, blksize=blksize, overlap=overlap, delta=tr)
input.MDegrainN (super, multi_vec, tr, mt=mt, lsb=lsb, thSAD=thSAD, thSAD2=thSAD2)
return last
}

In short, DitherPost was not necessary.
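The stacked layout described in the documentation quote above can be illustrated with a small sketch. This is not plugin code: the function names are hypothetical and a single plane with no row padding is assumed. It shows why a stacked clip is twice as tall, and that cropping the top half keeps only the MSB, which is a plain truncation to 8 bits with no dithering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack width x height 16-bit samples into a stacked 8-bit plane of
// width x (2 * height): the MSB picture on top, the LSB picture below it.
std::vector<uint8_t> to_stacked(const std::vector<uint16_t>& px, int width, int height) {
    const std::size_t half = static_cast<std::size_t>(width) * height;
    assert(px.size() == half);
    std::vector<uint8_t> out(half * 2);
    for (std::size_t i = 0; i < half; ++i) {
        out[i]        = static_cast<uint8_t>(px[i] >> 8);    // most significant byte
        out[half + i] = static_cast<uint8_t>(px[i] & 0xFF);  // least significant byte
    }
    return out;
}

// Rebuild the 16-bit samples from a stacked plane.
std::vector<uint16_t> from_stacked(const std::vector<uint8_t>& stacked) {
    const std::size_t half = stacked.size() / 2;
    std::vector<uint16_t> px(half);
    for (std::size_t i = 0; i < half; ++i)
        px[i] = static_cast<uint16_t>((stacked[i] << 8) | stacked[half + i]);
    return px;
}

// What a Crop() to the top half amounts to: only the MSB picture survives.
std::vector<uint8_t> crop_msb(const std::vector<uint8_t>& stacked) {
    return std::vector<uint8_t>(stacked.begin(), stacked.begin() + stacked.size() / 2);
}
```

For example, the sample `0x1234` ends up as `0x12` in the top half and `0x34` in the bottom half. After `crop_msb` only `0x12` is left.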

blaze077 · 20th July 2017, 21:26

I just tried to run your script and the cause is indeed your MDegrainLight function.
As you know, 16 bit stacked is double the height of the normal video (MSB and LSB).
You pass a 16 bit stacked clip to MDegrainN, but MDegrain does not have any way of knowing that you passed a 16 bit stacked clip to it. It just assumes that you gave it an 8 bit clip and processes it accordingly.
Since you pass lsb=true to MDegrainN, it tries to convert the already 16 bit stacked clip to 16 bit stacked again.
The result is that your video is now 4 times its normal height!
With the f3kdb call, the video is back to double height and with the ditherpost call, it is back to normal height.

A solution can be to call DitherPost() before all the MVTools calls (MSuper, MAnalyse and MDegrainN) inside your MDegrainLight function.
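The height bookkeeping described in this post can be written down as a toy model. It tracks only the frame height through the chain from the script, assuming, as explained above, that every step producing stacked 16-bit output doubles the height and every step producing 8-bit output from stacked input halves it. The function names are hypothetical and no real filtering is done.

```cpp
#include <array>
#include <cassert>

// A filter that outputs stacked 16-bit from what it believes is 8-bit input.
constexpr int double_height(int h) { return h * 2; }

// A filter that reads stacked 16-bit input and outputs 8-bit.
constexpr int halve_height(int h) { return h / 2; }

// Frame height after each step of the chain discussed in the thread.
std::array<int, 5> trace_heights(int real_height) {
    assert(real_height > 0);
    const int after_convert  = double_height(real_height);    // Dither_convert_8_to_16
    const int after_mdegrain = double_height(after_convert);  // MDegrainN(lsb=true) on a clip it takes for 8-bit
    const int after_f3kdb    = halve_height(after_mdegrain);  // f3kdb(input_depth=16, output_depth=8)
    const int after_post     = halve_height(after_f3kdb);     // DitherPost
    return {real_height, after_convert, after_mdegrain, after_f3kdb, after_post};
}
```

For the 720x480 target in the script, the trace is 480, 960, 1920, 960, 480. That matches the observation that the picture only looks right when f3kdb and DitherPost are both present.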
