Add multi-edit capabilities to Speech Editing #94

pgosar · 2024-04-17T17:50:34Z

This pull request implements a heavily modified edit distance algorithm to handle multiple edits at once.
It also removes the need for the user to specify the edit type(s); everything is handled automatically.
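
For readers who want a feel for how edit types can be inferred from the two transcripts alone, here is a minimal sketch. It is not the algorithm in this PR: it swaps in Python's standard `difflib.SequenceMatcher` on word lists, and the function name `infer_word_edits` and the tuple layout are made up for illustration.

```python
from difflib import SequenceMatcher

# Map difflib's opcode tags to the edit-type names used in this thread.
_TAG_TO_EDIT = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def infer_word_edits(original: str, target: str):
    """Return (edit_type, (orig_start, orig_end), (new_start, new_end)) tuples.

    Indices are word positions with an exclusive end, as reported by difflib.
    This is a hypothetical helper, not part of the VoiceCraft code base.
    """
    orig_words, new_words = original.split(), target.split()
    matcher = SequenceMatcher(a=orig_words, b=new_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the ranges that actually changed
            edits.append((_TAG_TO_EDIT[tag], (i1, i2), (j1, j2)))
    return edits


for edit in infer_word_edits("But when I had approached so near",
                             "But had I approached so near"):
    print(edit)
# ('substitution', (1, 2), (1, 2))
# ('deletion', (3, 4), (3, 3))
```

The caller only supplies the original and target transcripts; the edit types fall out of the alignment, which is the behaviour described above.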

Known issues:

Like the previous implementation, edits to the last index of the input sentence do not work. This looks like an issue with the model's inference, since in both my implementation and the original these changes are simply not recognized.
Furthermore, multiple edit types cannot happen at the same time. For example, mixing substitutions with insertions crashes in inference. This is something I still need to look into. Is this a limitation of the model itself?

I'd appreciate some help testing any other edge cases in the speech editing jupyter notebook if anyone is interested - I believe I have them all covered but more testing can't hurt :)

I will update the Google Colab for speech editing once this is merged.

jasonppy · 2024-04-17T19:14:02Z

Thanks! Really helpful contribution!

Can't edit the last index of the input utterance: yes, in edit mode the model doesn't support that. However, editing a span that contains the last index is basically zero-shot TTS, so TTS mode supports that natively. We can simply flag an error when a user tries to edit the last index and encourage them to use TTS mode.
Multiple edits cannot happen at the same time: do you mean that calling inference_one_sample crashes if mask_interval contains multiple entries, e.g. [[15,94], [142,309]]? That shouldn't happen as long as the spans do not overlap.
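
The two checks suggested here (reject edits that touch the last word, and require non-overlapping spans) could be sketched roughly as below. The function name `validate_word_spans` and the convention of inclusive `(start, end)` word indices are assumptions for illustration, not taken from the repository.

```python
def validate_word_spans(spans, num_words):
    """Check hypothetical (start, end) word spans, both ends inclusive.

    Raises ValueError if a span is out of range, touches the last word,
    or overlaps another span. Returns the spans sorted by start index.
    """
    ordered = sorted(spans)
    for start, end in ordered:
        if not 0 <= start <= end < num_words:
            raise ValueError(f"span ({start}, {end}) is outside the transcript")
        if end == num_words - 1:
            # Editing through the final word is effectively zero-shot TTS.
            raise ValueError("edits to the last word are not supported in edit mode; "
                             "use TTS mode instead")
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("edit spans must not overlap")
    return ordered


print(validate_word_spans([(3, 3), (1, 1)], num_words=7))
# [(1, 1), (3, 3)]
```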

pgosar · 2024-04-17T19:30:42Z

I meant multiple types of edits. If I try to do a deletion and a substitution at the same time, for example:

original: "But when I had approached so near"
new: "But had I approached so near" (substitute when->had, delete the had in the original)

The inference fails with the following error (I'll edit once it finishes running again)

However, if I want multiple insertions, multiple deletions, or multiple substitutions, everything just works as long as I don't mix and match.
For example:
new: "insertion But had I approached insertion so near" works fine, with two separate insertions

jasonppy · 2024-04-17T19:45:19Z

I see. For this example:
original: "But when I had approached so near"
new: "But had I approached so near" (substitute when->had, delete the had in the original)

The reason it fails is probably that I use a margin to extend the masked span. Since there is only one word, "I", between the two edited spans, the two spans end up overlapping once the margin is applied.
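
A small numeric sketch of that failure mode follows. The timings are invented (they are not the real alignment of this utterance), and `extend_spans` / `has_overlap` are hypothetical helpers; only the 0.08 second margin comes from this thread.

```python
def extend_spans(spans, margin, audio_dur):
    """Widen each (start, end) span in seconds by `margin` on both sides."""
    return [(max(start - margin, 0.0), min(end + margin, audio_dur))
            for start, end in spans]


def has_overlap(spans):
    """True if any two spans touch or overlap once sorted by start time."""
    ordered = sorted(spans)
    return any(nxt[0] <= cur[1] for cur, nxt in zip(ordered, ordered[1:]))


# Two edited words separated by a short word lasting 0.10 s (made-up timings).
spans = [(0.50, 0.80), (0.90, 1.20)]
print(has_overlap(spans))                                          # False
print(has_overlap(extend_spans(spans, margin=0.08, audio_dur=2.0)))  # True
print(has_overlap(extend_spans(spans, margin=0.02, audio_dur=2.0)))  # False
```

With a gap shorter than twice the margin, the widened spans collide even though the raw spans are disjoint.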

pgosar · 2024-04-17T19:58:39Z

I see. I am doing more testing right now and I think you're right: supplying multiple different types of edits seems to work as long as there is a sizeable gap between them.

So doing something like this on words right next to each other can only work if the margin is small enough? Not sure if this is something I can fix - do you have any suggestions? I can probably just throw an error and suggest they lower the margin, and also throw one when the very last word is edited, as you mentioned.

jasonppy · 2024-04-17T20:39:20Z

Regarding the issue of spans being too close:
Approach 1: set a threshold, say 2 words, and if the gap between two spans is less than or equal to 2 words, merge them into one span.
Approach 2: margin is a hyperparameter that the user can specify (it defaults to 0.08 seconds); if the two spans would overlap with the specified margin, we automatically reduce it to a smaller value so that they don't overlap.

Both approaches are sensible to me.
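
Approach 1 could look roughly like the sketch below. It assumes spans are inclusive `(start, end)` word indices and measures the gap as the number of untouched words strictly between two spans; `merge_close_spans` and `max_gap_words` are illustrative names, not existing code.

```python
def merge_close_spans(spans, max_gap_words=2):
    """Merge word spans whose gap is at most `max_gap_words` untouched words.

    Spans are hypothetical (start, end) word indices, both ends inclusive.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] - 1 <= max_gap_words:
            # Close enough (or overlapping): grow the previous span instead.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# "But when I had approached so near": "when" is word 1, "had" is word 3.
print(merge_close_spans([(1, 1), (3, 3)]))   # [(1, 3)]
print(merge_close_spans([(1, 1), (6, 6)]))   # [(1, 1), (6, 6)]
```

The trade-off is that the untouched words inside a merged span get regenerated too.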

jasonppy · 2024-04-17T20:45:05Z

If you want to do large-scale testing, https://github.com/jasonppy/VoiceCraft/blob/master/RealEdit.txt contains 310 speech editing examples, including 40 two-span edit examples.

To interpret an example:

ah, but we'll talk about it because i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge.	ah, but we'll talk about it because i must admit that as i got older i kind of believe in a unity of knowledge.|ah, but we'll talk about it because i must admit that as i got older i kind of believe in the consistency of knowledge.	7,8|12,13	8,15|20,21	insertion|substitution

| is used as the separator symbol. The above example should be interpreted as:

a|b	b|c	orig_start1,orig_end1|orig_start2,orig_end2	new_start1,new_end1|new_start2,new_end2

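where a is the original transcript, b is the transcript after the first edit, and c is the target transcript. [orig_start1,orig_end1] is the word-index range of the first span to mask, and so on. The example line also carries a final field with the edit type of each span (insertion|substitution).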
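
A parser for one such line might look like the sketch below. It assumes the fields are tab-separated exactly as in the example above, and only the two-span layout described here is covered; `RealEditExample` and `parse_realedit_line` are hypothetical names, and the sample line in the code is a shortened, invented one.

```python
from dataclasses import dataclass


@dataclass
class RealEditExample:
    original: str     # "a" in the layout above
    target: str       # "c" in the layout above
    orig_spans: list  # [(orig_start1, orig_end1), (orig_start2, orig_end2)]
    new_spans: list   # [(new_start1, new_end1), (new_start2, new_end2)]
    edit_types: list  # e.g. ["insertion", "substitution"]


def _parse_spans(field):
    """Turn "7,8|12,13" into [(7, 8), (12, 13)]."""
    return [tuple(int(x) for x in span.split(",")) for span in field.split("|")]


def parse_realedit_line(line):
    """Parse one tab-separated line in the layout shown above (an assumption)."""
    fields = line.rstrip("\n").split("\t")
    first, second, orig_field, new_field = fields[:4]
    edit_types = fields[4].split("|") if len(fields) > 4 else []
    return RealEditExample(
        original=first.split("|")[0],   # text before the first "|"
        target=second.split("|")[-1],   # text after the last "|"
        orig_spans=_parse_spans(orig_field),
        new_spans=_parse_spans(new_field),
        edit_types=edit_types,
    )


sample = "a b c|a x b c\ta x b c|a x b d\t0,1|3,3\t1,1|3,3\tinsertion|substitution"
example = parse_realedit_line(sample)
print(example.original, "->", example.target)
# a b c -> a x b d
```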

pgosar · 2024-04-17T21:11:55Z

Are there any drawbacks to lowering the margin that I should be aware of? The cases where my algorithm breaks work fine if I lower it to 0.02 seconds, so this should be an easy solution. I can keep lowering the margin until the spans no longer overlap, to make sure it works in all cases.

orig: But when I had
new: But I did

jasonppy · 2024-04-17T21:16:32Z

The only drawback is that the forced alignment might not be perfect, and a larger margin leaves room for such mistakes. A larger margin also lets the model modify the neighboring (unchanged) words, which gives a smoother transition into the changed words.

Therefore, defaulting to 0.02 seconds wouldn't be great.

pgosar · 2024-04-18T00:03:09Z

I used the margin fix. It regenerates mask_interval as necessary with decreasing margins until no overlaps remain. The amount to decrease by is a hyperparameter, 0.01 by default.
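
As a rough illustration of that loop (not the code in this PR), the sketch below works on spans in seconds and shrinks the margin in fixed steps. The function name, the `audio_dur` parameter, the choice to treat touching spans as overlapping, and the sample timings are all assumptions; only the 0.08 default margin and the 0.01 step come from this thread.

```python
def shrink_margin_until_disjoint(spans, audio_dur, margin=0.08, step=0.01):
    """Lower `margin` by `step` until the widened spans no longer overlap.

    Returns (margin_used, widened_spans). Raises RuntimeError if the raw
    spans already touch or overlap, since no margin can fix that.
    """
    ordered = sorted(spans)
    while True:
        widened = [(max(s - margin, 0.0), min(e + margin, audio_dur))
                   for s, e in ordered]
        overlap = any(nxt[0] <= cur[1] for cur, nxt in zip(widened, widened[1:]))
        if not overlap:
            return margin, widened
        if margin <= 0:
            raise RuntimeError("edit spans overlap even with zero margin")
        # Round to avoid floating-point drift from repeated subtraction.
        margin = max(round(margin - step, 6), 0.0)


# Two spans separated by a 0.09 s gap (made-up timings).
margin, widened = shrink_margin_until_disjoint([(0.50, 0.80), (0.89, 1.20)],
                                               audio_dur=2.0)
print(margin)
# 0.04
```

Starting from the user's margin and stepping down keeps as much of the smoothing margin as the gap allows, rather than dropping straight to a tiny fixed value.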

allisonth · 2024-04-19T16:25:25Z

Hi, I'm interested in testing the multi-span editing algorithm.

pgosar · 2024-05-05T22:21:56Z

@jasonppy should be ready to merge.

The example original and target transcripts use a fairly complex set of changes, just to show what is now possible.

allisonth · 2024-05-07T21:37:03Z

The algorithm seems to work from my testing.
@jasonppy For more extensive testing could I get the wav files from the RealEdit dataset? I can only find the txt file mentioned above.

pgosar added 3 commits April 17, 2024 12:33

add multi edit capability

b5f9744

remove cell outputs

2168efc

move import up

73fac7c

jasonppy self-assigned this Apr 17, 2024

fix overlapping margins

8814295

jasonppy assigned allisonth Apr 20, 2024

pgosar mentioned this pull request Apr 23, 2024

Add standalone python scripts for local usage #95

Merged

5 tasks

pgosar added 3 commits May 5, 2024 15:44

merging

da8c441

fix syntax error

00d1b11

add runtime error and fix merge regressions

dc2239c

remove ending insertion

0bf07d2

fix regex for contractions

503f3ff
