Frankly, what I hear is very similar to the results of classic spectral denoising, even with the characteristic FFT artifacts (for Linux, there's Noise Repellent [1] available for advanced spectral denoising; there's also a ton of commercial spectral processors available).
The demonstration could use more random background noises to separate it from FFT noise suppressors (as it's the primary benefit of ML-based filters), and more varied speech to separate it from RNNoise [2] which tends to suppress breath and cut the sibilants in an unnatural manner. The latency is also important - is it as low as in RNNoise? What about the CPU load?
It's so excellent how many moats are just getting obliterated.
I have absolutely been a real snarky hater of AI, as a horrible feudal, unobservable black box that has way too much power in the world. But open source has been doing an amazing job of reading the papers & reproducing them, & it's glorious to see.
Amazing examples of a peership culture in action. Raising each other up is so divine. Share the knowledge & means.
I recently replaced an image classification pipeline that leaned heavily on classical computer vision techniques (like you'd find in OpenCV) with a neural network based approach using open models. There were about 4 years of developer effort invested in that old pipeline and I got better results with 2 months of effort invested in the new NN based system.
Later this year I plan on revamping an old NLP system with even more man-years of effort invested in it. I think I can beat it with neural networks too. The main reason I haven't started already is that open language model progress is so fast that I expect significantly better building blocks in 4 months. Using these new tools feels magical, particularly when you have experienced how much effort it took to get half-as-good results with older techniques.
The ML/DL software ecosystem is dominated by Python. Python's dependency management can be especially tricky, so try to limit the dependencies you're pulling in for the final deployed artifact if your final artifact is also Python based.
CUDA can be difficult to set up correctly on your personal development machine and if you rely on it you're also limiting the development machines that other people can use. You're limiting the deployment options and the CI options. This may differ if you work at a larger company that has a team specializing in these things, but I had to work out everything from initial proof-of-concept to final deployment. Some applications absolutely need the higher performance from GPU execution but it's worth seeing if you can get away with CPU-only execution because it avoids operational complications.
I used the ONNX runtime (as this DeepFilterNet project does, indirectly, according to a top level comment by WiSaGaN) and I was able to get adequate inference speed running on plain CPU. It's a small Python service that just wraps the inference logic with a command interface and a connection to Redis. It takes protocol buffer inputs from a Redis based queue, does a little bit of control logic followed by inference, and writes the results as protocol buffers to another Redis based queue.
The only ML libraries I have on the Python side are the ones for ONNX. This saved over a gigabyte (!) of transitive dependencies compared to the first proof-of-concept I had that relied on PyTorch for runtime inference.
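For a sense of what that looks like, here is a rough sketch of the shape of such a service (not the actual code; the queue names, the InferenceRequest/InferenceResult protobuf messages and their fields, and the model path are all made up for illustration):

    import numpy as np
    import onnxruntime as ort
    import redis

    # Hypothetical protobuf messages; the real schema is whatever your pipeline defines.
    from my_pipeline_pb2 import InferenceRequest, InferenceResult

    r = redis.Redis(host="localhost", port=6379)
    session = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])

    while True:
        # Block until a job arrives on the input queue.
        _, raw = r.blpop("inference:requests")
        req = InferenceRequest()
        req.ParseFromString(raw)

        # A bit of control logic, then inference (input shape depends on the model).
        img = np.frombuffer(req.image_bytes, dtype=np.float32).reshape(1, 3, 224, 224)
        scores = session.run(None, {"input": img})[0]

        res = InferenceResult(request_id=req.request_id)
        res.scores.extend(scores.ravel().tolist())
        r.rpush("inference:results", res.SerializeToString())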
Final advice: you actually don't need to know much theory to start doing something useful. I hadn't studied neural networks since graduate school 20 years ago so my theory is hopelessly outdated. I just started hacking together a little demo for myself and it was good enough that I was encouraged to take it all the way to production.
We recently migrated to Poetry [1] for dependency management and so far it's been a breath of fresh air - it feels like what Python deps should've always been!
You can even have dependency groups, to separate main/dev dependencies for instance. It also brings env management, and plays very nicely with Docker if you use containers.
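If it helps anyone, the groups are just sections in pyproject.toml (package names and versions below are only examples):

    [tool.poetry.dependencies]
    python = "^3.11"
    onnxruntime = "^1.16"

    [tool.poetry.group.dev.dependencies]
    pytest = "^7.4"
    black = "^23.0"

Then `poetry install --without dev` skips the dev group, which is handy for keeping the deployed container small.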
Poetry doesn't really work well if you try to ssh in. It tries to set up a keyring for some reason whenever you use it, even if you're just trying to run a project. I feel like it shouldn't be that hard to get going, but after trying to hack at it for four hours I used a workaround of telling keyring to use a null backend & then was able to download deps & run the project I'd downloaded.
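For anyone who hits the same thing, the workaround I mean is pointing the keyring library at its null backend before invoking Poetry, something like:

    export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
    poetry install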
Admittedly it's only a single nit, but it was still one of the saddest, most frustrating Python experiences I've ever had.
Not a week goes by without a new issue popping up. A dependency manager is a critical piece of infrastructure, it should not be the main developer experience bottleneck when building an application. A good package manager is out of sight, out of mind. You don't hear people complaining so much about Cargo every day.
I don't really feel like defending Poetry, I don't even use it. But in this particular case, I think the wheel format, the way people try to host it on their servers, and the entitled researchers who act like 80-year-olds and "just want things done" instead of taking a little bit of time to learn to work with their tools are to blame. None of this exists with the Cargo infra.
Isolate the prior logic into functional components. Find the inputs and outputs of each box. Identify which components ML can replace. Replace them one after the next, sometimes merging several if the ML can do it all on the GPU.
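A minimal sketch of the idea, with made-up stage names: each box keeps its input/output contract, so a classical stage can be swapped for an ML-backed one (or several merged) without touching the rest of the pipeline.

    import numpy as np

    # Each stage is a plain function with a fixed input/output contract.
    def load_image(path: str) -> np.ndarray:
        ...

    def classical_segmentation(img: np.ndarray) -> np.ndarray:
        # old OpenCV-style logic lived here
        ...

    def ml_segmentation(img: np.ndarray) -> np.ndarray:
        # drop-in replacement: same signature, backed by a neural network
        ...

    def classify(mask: np.ndarray) -> str:
        ...

    def pipeline(path: str, use_ml: bool = True) -> str:
        img = load_image(path)
        seg = ml_segmentation(img) if use_ml else classical_segmentation(img)
        return classify(seg)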
As someone who worked professionally using top of the line audio denoising for speech in cinema productions I have to say that I am underwhelmed by the results. This is very similar to what traditional algorithms would have achieved a decade ago.
Of course there might be potential for improvement there, maybe it is more performant or was developed way faster, etc. But just listening to the audible result is not too convincing yet.
Note that this isn't just a paper reproduction but a new paper in and of itself (the GitHub repo is by one of the paper's authors). This is unbelievably amazing. I wonder how it compares to RNNoise, which is also open source and also targets real-time settings.
It looks like the library in Rust is using `tract-onnx` to do the inference: https://github.com/Rikorose/DeepFilterNet/blob/2a84d2a1750a5... I am wondering whether using Python for research, training in big data centers, and Rust at the edge for efficient inference will become a trend. We do have a larger C++ community for inference right now (e.g. ggml), but Rust crates as components for building AI applications are a joy to use.
You can use the ONNX CPU runtime in Python or C++ too. It doesn't have to be Rust. And if you want GPU support, you can even run models saved in the ONNX format on Nvidia GPUs with the TensorRT runtime.
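For reference, CPU-only inference on the Python side is roughly this (model path, input name, and input shape are placeholders):

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name

    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # dummy input matching the model's shape
    outputs = sess.run(None, {input_name: x})
    print(outputs[0].shape)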
Honestly, while ggml is super cool, it started as a hobby project and you probably shouldn't use it in production. ONNX has been the de facto standard for ML inference for years. What it is missing (compared to ggml) is 2-6 bit inference, which is helpful for large-scale transformers on edge devices (and is what helped ggml gain adoption so fast).
Yeah, I've only used it with networks in ONNX format (converted from TensorFlow or Torch). I was looking for high-performance, low-latency / real-time inference, and the C and C++ APIs for OpenVINO are quite OK if you spend some time playing with them. I hope Intel keeps investing in it...
Edit: if you go through the ONNX intermediate format, be prepared to perform some 'network surgery' to clean up conversion cruft, and also to remove training-only stuff left in the network...
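One low-effort way to do a lot of that surgery is onnx-simplifier; anything it misses you can still inspect and edit by hand with the onnx package. A rough sketch (the model filenames are placeholders):

    import onnx
    from onnxsim import simplify  # pip install onnx-simplifier

    model = onnx.load("exported.onnx")

    # Fold constants and strip conversion cruft (Identity nodes, dead branches, ...).
    simplified, ok = simplify(model)
    assert ok, "simplified model failed the ONNX checker"

    # Anything left over (e.g. training-only nodes) can be inspected/edited manually.
    print([n.op_type for n in simplified.graph.node][:20])
    onnx.save(simplified, "cleaned.onnx")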
Since it does the signal processing in the Fourier domain, does this suffer from audio artefacts e.g. hissing in the output? Torch's inverse STFT uses Griffin-Lim which is probabilistic and if you don't train it sufficiently, you may sometimes get noise in the output.
Not all spectral methods have such artifacts. The type of artifacts you mention happens when you need to do phase retrieval or try to reconstruct waveforms from a mel spectrogram. DeepFilterNet does spectral masking on the complex spectrogram, so there is no need for phase retrieval.
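To make the distinction concrete: when you mask the complex STFT you keep the noisy phase, so the inverse STFT is a direct, deterministic reconstruction and no Griffin-Lim-style iteration is involved. A rough torch sketch (the mask here is a random placeholder; in a real denoiser it is predicted by the network):

    import torch

    n_fft, hop = 960, 480
    window = torch.hann_window(n_fft)

    noisy = torch.randn(16000 * 2)  # 2 s of fake audio at 16 kHz

    # Complex STFT of the noisy signal (magnitude *and* phase).
    spec = torch.stft(noisy, n_fft, hop_length=hop, window=window, return_complex=True)

    # Placeholder mask in [0, 1]; the real one comes from the model.
    mask = torch.rand_like(spec.real)

    enhanced_spec = spec * mask              # masking keeps the noisy phase
    enhanced = torch.istft(enhanced_spec, n_fft, hop_length=hop, window=window)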
I sometimes wonder if all those filters optimise for the wrong thing. Removing noise is meaningless unless overall intelligibility improves. If you remove noise at the price of the voice becoming choppy, "robotic" and unnatural, you haven't improved the situation, and in some cases you've arguably made it worse.
What further deteriorates intelligibility for most noise suppression filters is the discrepancy between the completely dry pauses and the remaining ambience "under" the voice. It would be much more interesting to see some style transfer for voice ambience as an alternative to current de-verbs.
When dealing with voice processing I advocate refraining from noise suppression filters for as long as possible, and I haven't yet seen a publicly available noise suppression filter that could change my position.
One of the challenges we face in this research problem is the lack of a reliable metric for evaluating the quality of the NN model. I recently came across the [3Quest metric](https://cdn.head-acoustics.com/fileadmin/data/global/Datashe...), which seems helpful in this regard. Does anybody have experience with this metric, maybe in comparison with Microsoft's DNSMOS?
If you (especially on behalf of any hip, popular platforms like Discord) undertake any projects to aggressively denoise or compress audio, please (PLEASE) do us people with auditory processing difficulties a favor, and include such people in your testing.
What’s wrong with the noise suppression offered by Discord? I use it for work meetings as well (via the Krisp app) but I’d hate to cause anyone distress.
Most of these codec/signal processing projects tend to strike a nerve for me due to the ignorance of how unintelligible the processed output is (as someone who has a very difficult time with this and very regularly has to ask people to repeat themselves 4-5x even in meatspace interaction), so from the particular angle my outburst came from, it's admittedly being unfair to Discord.
However, I find the aggressiveness of these things to still be a problem.
In Discord's case, I can't wrap my head around how people find those tearing/crunching sounds that result from trying to smooth out keyboard strokes desirable; particularly because it largely washes out the speaker's output feed and sounds really bizarre and out of place. I would rather just hear their keystrokes.
If you want to smooth out the really acute/jarring occurrences a bit: fine. But as it is, people seem to want to suppress/compress to the point of throwing the baby out with the bathwater. I chalk it up to their desire to squeeze out every last bit of compression for their marketing/employment/performance metrics. My 2c. /shrug
Much, much better. I really recommend DeepFilterNet; it's the most well-rounded open-source AI noise suppression tool out there. Big caveat: it won't help ASR models, e.g., Whisper.
Hey thatsadude, thanks for your input! I'm also working in this research area and it would be great to connect with you on LinkedIn. Here's my proxy LinkedIn Profile - https://bit.ly/3ChXFcm. If you're interested, could you send me a connection request? Looking forward to connecting and discussing more in the future!
Do you know of a framework to quantitatively compare the noise suppression quality produced by different algorithms? Or maybe there is an industry-standard test suite?
The gold standard is the ITU-T P.808 subjective test: https://github.com/microsoft/P.808
Of course, running a subjective test is expensive, and hence there are objective scores such as DNSMOS and UTMOS which use neural networks to predict P.808 values.
1. Speech distortion is extremely detrimental to ASR models, even when human listeners may not be able to notice it. Noise reduction models such as RNNoise and DeepFilterNet try to reduce "perceptible" noise, and doing that creates imperceptible distortion which ASR models do not like at all.
2. Many noise reduction models, such as RNNoise and DeepFilterNet, operate on raw spectrograms or ERB (equivalent rectangular bandwidth) bands. On the other hand, ASR models mostly run on mel spectrograms. This mismatch tends to create problems. I have seen many papers from Google claiming that reducing noise in the mel spectrogram domain often helps their keyword spotting.
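For reference, the front-end most ASR models actually see is something like the following (librosa used purely for illustration; the 80-band, 25 ms / 10 ms setup is roughly what Whisper-style models consume). Whatever the denoiser does in the complex-spectrogram or ERB domain only reaches the recognizer after being squashed through this transform.

    import librosa

    y, sr = librosa.load(librosa.example("trumpet"), sr=16000)

    # 80-band log-mel features at 16 kHz (n_fft=400 ~ 25 ms, hop=160 ~ 10 ms).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (80, num_frames)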
[1] https://github.com/lucianodato/noise-repellent
[2] https://github.com/werman/noise-suppression-for-voice