Add multi-edit capabilities to Speech Editing #94

Open
wants to merge 9 commits into
base: master

Conversation

pgosar
Contributor

@pgosar pgosar commented Apr 17, 2024

This pull request implements a heavily modified edit distance algorithm to handle multiple edits at the same time.
It also removes the need for the user to specify the edit type(s); everything is handled automatically.
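
As a rough illustration of how edit types can be recovered without user input, here is a minimal word-level edit distance with a backtrace. This is a plain Levenshtein sketch, not the modified algorithm in this PR; the function name `word_edit_ops` and the `(type, orig_index, new_index)` tuple layout are made up for the example.

```python
from typing import List, Tuple

Op = Tuple[str, int, int]  # (edit type, index in original, index in new); hypothetical layout


def word_edit_ops(orig: str, new: str) -> List[Op]:
    """Return the word-level edits that turn `orig` into `new`."""
    a, b = orig.split(), new.split()
    n, m = len(a), len(b)

    # dist[i][j] = edit distance between the first i words of a and the first j words of b.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a[i-1]
                dist[i][j - 1] + 1,         # insert b[j-1]
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )

    # Walk back from the bottom-right corner, recording every non-matching step.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitution", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("deletion", i - 1, j))
            i -= 1
        else:
            ops.append(("insertion", i, j - 1))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    print(word_edit_ops("But when I had approached so near",
                        "But had I approached so near"))
```

On the transcripts discussed further down in this thread, the sketch reports one substitution at word 1 and one deletion at word 3 (0-based), which matches the "substitute when->had, delete had" description.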

Known issues:

  1. Like the previous implementation, edits to the last index of the input sentence do not work. This looks like an issue of the model's inference, as in both my and the original implementation these changes are simply not recognized.

  2. Furthermore, multiple edit types cannot happen at the same time. For example, mixing and matching substitutions with insertions crashes in inference. This is again something I still need to look into. Is this a limitation of the model itself?

I'd appreciate some help testing any other edge cases in the speech editing jupyter notebook if anyone is interested - I believe I have them all covered but more testing can't hurt :)

I will update the Google Colab for speech editing once this is merged.

@jasonppy jasonppy self-assigned this Apr 17, 2024
@jasonppy
Owner

Thanks! Really helpful contribution!

  1. Can't edit the last index of the input utterance: yes, in edit mode the model doesn't support that. However, editing a span that contains the last index is basically zero-shot TTS, so TTS mode supports that natively. We can simply flag an error when a user tries to edit the last index and encourage them to use TTS mode.
  2. Multiple edits cannot happen at the same time: do you mean that when you call inference_one_sample it crashes if mask_interval contains multiple entries, e.g. [[15,94], [142,309]]? That shouldn't happen as long as the spans are not overlapping.
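
The two checks above could be sketched roughly as below. This is only an illustration: `check_edit_spans` is a hypothetical helper, and it assumes spans are given as inclusive `(start, end)` word indices rather than the codec-frame values that `mask_interval` holds.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) word indices; an assumption for this sketch


def check_edit_spans(spans: List[Span], num_words: int) -> List[Span]:
    """Validate edit spans and return them sorted by start index.

    Raises ValueError if a span is out of range, touches the last word
    (edit mode does not support that), or overlaps another span.
    """
    ordered = sorted(spans)
    for start, end in ordered:
        if not 0 <= start <= end < num_words:
            raise ValueError(f"span ({start}, {end}) is out of range")
        if end == num_words - 1:
            raise ValueError(
                "editing the last word is not supported in edit mode; use TTS mode instead"
            )
    # After sorting, only neighbouring spans can overlap.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("edit spans overlap")
    return ordered


if __name__ == "__main__":
    print(check_edit_spans([(3, 3), (1, 1)], num_words=7))
```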

@pgosar
Contributor Author

pgosar commented Apr 17, 2024

I meant as in multiple types of edits. If I try to do a deletion and a substitution at the same time, for example:

original: "But when I had approached so near"
new: "But had I approached so near" (substitute when->had, delete the had in the original)

The inference fails with the following error (I'll edit once it finishes running again)

However, if I want multiple different insertions, or deletions, or substitutions, everything just works as long as I don't mix and match.
For example,
new: "insertion But had I approached insertion so near" works fine, with two separate insertions.

@jasonppy
Owner

I see. For this example:
original: "But when I had approached so near"
new: "But had I approached so near" (substitute when->had, delete the had in the original)

The reason it fails is probably that I use a margin to extend each masked span. Since there is only one word ("I") between the two edited spans, the two spans end up overlapping once the margin is applied.

@pgosar
Contributor Author

pgosar commented Apr 17, 2024

I see, I am doing more testing right now and I think you're right, supplying multiple different types of edits seems to work as long as there is a sizeable gap between them.

So doing something like this on words right next to each other can only work if the margin size is small enough? Not sure if this is something I can fix - do you have any suggestions? I can probably just throw an error instead and suggest they lower the margin, along with when editing the very last word like you mentioned.

@jasonppy
Owner

Regarding the issue of spans being too close:
Approach 1: set a threshold, say 2 words, and if the gap between two spans is less than or equal to 2 words, merge them into one span.
Approach 2: margin is a hyperparameter that can be specified by the user (it defaults to 0.08 seconds); if the two spans would overlap with the specified margin, we automatically change it to a smaller value to make sure they don't overlap.

Both approaches are sensible to me.
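
A minimal sketch of approach 1, assuming spans are inclusive `(start, end)` word indices in the original transcript. The helper name `merge_close_spans` and its `max_gap_words` parameter are hypothetical, not part of the repository.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) word indices; an assumption for this sketch


def merge_close_spans(spans: List[Span], max_gap_words: int = 2) -> List[Span]:
    """Merge spans separated by at most `max_gap_words` untouched words."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        # Number of untouched words between the previous span and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap_words:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # "when" (index 1) and "had" (index 3) are separated only by "I".
    print(merge_close_spans([(1, 1), (3, 3)]))
```

The matching spans on the target-transcript side would need to be merged in the same way, so that each merged original span still lines up with one target span.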

@jasonppy
Owner

If you want to do large-scale testing, https://github.com/jasonppy/VoiceCraft/blob/master/RealEdit.txt contains 310 speech editing examples, and there are 40 two-span edit examples.

To interpret an example:

ah, but we'll talk about it because i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge.	ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in the consistency of knowledge.	7,8|12,13	8,15|20,21	insertion|substitution

| is used as the separator symbol. The above example should be interpreted as:

a|b	b|c	orig_start1,orig_end1|orig_start2,orig_end2	new_start1,new_end1|new_start2,new_end2

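where a is the original transcript, b is the intermediate transcript after the first edit, and c is the target transcript. [orig_start1,orig_end1] is the word-index span of the first span to mask, etc. The final field of the example (insertion|substitution) lists the edit type of each span.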
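
For scripted testing, a line in this layout could be parsed roughly as follows. This is a sketch under the assumption that fields are tab-separated and that `|` never appears inside a transcript; `parse_realedit_line` and the dictionary keys are names invented here.

```python
from typing import Dict, List, Tuple


def _parse_spans(field: str) -> List[Tuple[int, ...]]:
    """Turn '7,8|12,13' into [(7, 8), (12, 13)]."""
    return [tuple(int(x) for x in span.split(",")) for span in field.split("|")]


def parse_realedit_line(line: str) -> Dict[str, object]:
    """Parse one tab-separated line with |-separated spans (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-separated fields")
    before, after, orig_spans, new_spans = fields[:4]
    return {
        "original": before.split("|")[0],   # a: first transcript of the first field
        "target": after.split("|")[-1],     # c: last transcript of the second field
        "orig_spans": _parse_spans(orig_spans),
        "new_spans": _parse_spans(new_spans),
        # Fifth field (edit types) is treated as optional.
        "edit_types": fields[4].split("|") if len(fields) > 4 else [],
    }


if __name__ == "__main__":
    demo = "a b c|a x b c\ta x b c|a x b y\t1,1|2,2\t1,2|3,3\tinsertion|substitution"
    print(parse_realedit_line(demo))
```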

@pgosar
Contributor Author

pgosar commented Apr 17, 2024

Are there any drawbacks to lowering the margin I should be aware of? The cases where my algorithm breaks no longer break if I lower it to 0.02 seconds, so this should be an easy solution. I can keep lowering the margin until the spans no longer overlap, to make sure it works in all cases.

orig: But when I had
new: But I did

@jasonppy
Owner

The only drawback is that the forced alignment might not be perfect, and a larger margin gives room for such mistakes. A larger margin also lets the model modify the neighboring (unchanged) words, so the transition into the edited words is smooth.

Therefore, defaulting it to 0.02 seconds wouldn't be great.

@pgosar
Contributor Author

pgosar commented Apr 18, 2024

I used the margin fix. It regenerates the mask_interval as necessary with decreasing margins until no overlaps remain. The amount to decrease by each time is a hyperparameter, 0.01 by default.
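
The idea can be sketched like this. It is not the code in this PR: the helper names are hypothetical, spans are assumed to be `(start, end)` times in seconds from forced alignment, and the timestamps in the demo are invented.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds; an assumption for this sketch


def extend(spans: List[Span], margin: float) -> List[Span]:
    """Widen every span by `margin` seconds on both sides (not below 0)."""
    return [(max(0.0, start - margin), end + margin) for start, end in spans]


def overlaps(spans: List[Span]) -> bool:
    """True if any two neighbouring spans overlap; touching spans do not count."""
    ordered = sorted(spans)
    return any(nxt[0] < prev[1] for prev, nxt in zip(ordered, ordered[1:]))


def fit_margin(spans: List[Span], margin: float = 0.08, step: float = 0.01):
    """Shrink `margin` by `step` until the extended spans stop overlapping."""
    while margin > 0 and overlaps(extend(spans, margin)):
        # round() keeps repeated subtraction from accumulating float error.
        margin = max(0.0, round(margin - step, 6))
    extended = extend(spans, margin)
    if overlaps(extended):
        raise ValueError("spans overlap even with zero margin")
    return margin, extended


if __name__ == "__main__":
    # Two edited words separated by a short untouched word (invented timestamps).
    print(fit_margin([(0.30, 0.55), (0.645, 0.90)]))
```

Starting from the user's margin and only shrinking when needed keeps the larger default wherever the spans are far enough apart.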

@allisonth

Hi, I'm interested in testing the multi-span editing algorithm.

@pgosar
Contributor Author

pgosar commented May 5, 2024

@jasonppy should be ready to merge.

The example original and target transcripts use a pretty complex set of changes, just to show what is now possible.

@allisonth

The algorithm seems to work from my testing.
@jasonppy For more extensive testing could I get the wav files from the RealEdit dataset? I can only find the txt file mentioned above.


3 participants