Seems similar to that Moshi model from 6 months ago, but this one is more refined. Moshi is a little crazy, but it was still an impressive demo of how low-latency responses, continuous listening, and interruptions can improve voice chat and make it feel more real (or more uncanny). Sometimes its "latency" is even too low, because it interrupts you before you finish.
https://www.youtube.com/watch?v=-XoEQ6oqlbE
Saying this is similar to Moshi is like saying GPT-2 is similar to GPT-4. You can't have any sort of conversation longer than 30 seconds with Moshi before it goes bananas. You can talk to this model for an hour and it stays completely coherent.
They even released some models on Hugging Face:
https://huggingface.co/collections/kyutai/moshi-v01-release-...