I think I understand now, based on Jason's answer to a different question: for zero-shot TTS, it looks like a different model is trained without the causal mask. You can see that there are two different sets of weights, one for editing and one for TTS!
As the title says, I wonder: if we do not mask the audio, namely y, how can the model know that TTS is going to be performed?