
Multiple speaker compatibility #191

Open
pgegg02 opened this issue May 23, 2024 · 16 comments

@pgegg02

pgegg02 commented May 23, 2024

Hi so first of all great work,
the diarization works great for me on audio files with less than 3 speakers. Given an audio file with more than or close to 8 speakers, results in a very good transcription, but there are still only 3 speakers assigned to more than 8 people (so it just assigns them all as speaker 0 to 2). Changing the max_num_speakers variable in the config yaml doesnt seem to change that, so does creating a num_speakers variable in the manifest file. Is there something i'm missing/ how would adjusting the speaker count work?

@MahmoudAshraf97
Owner

The speaker count in the config is used for evaluation purposes only. To tune diarization performance, you need to experiment with the speaker_embeddings parameters such as window_length_in_sec and shift_length_in_sec. diar_window_length and sigmoid_threshold might be worth a shot too.
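
As an illustration, here is a minimal sketch of overriding those knobs programmatically instead of editing the YAML by hand. The config path and the exact key names are assumptions based on NeMo's telephonic diarization config and should be checked against the YAML shipped in this repo's configs folder:

```python
# Sketch only: adjust the diarizer tuning parameters mentioned above before
# running inference. Key paths follow NeMo's diar_infer_telephonic.yaml and
# should be verified against this repo's config.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/diar_infer_telephonic.yaml")  # path is an assumption

# Multiscale embedding windows and shifts (in seconds). Smaller shifts resolve
# quicker speaker turns but increase inference time.
cfg.diarizer.speaker_embeddings.parameters.window_length_in_sec = [1.5, 1.0, 0.5]
cfg.diarizer.speaker_embeddings.parameters.shift_length_in_sec = [0.75, 0.5, 0.25]
cfg.diarizer.speaker_embeddings.parameters.multiscale_weights = [1, 1, 1]

# MSDD decoding parameters also mentioned above.
cfg.diarizer.msdd_model.parameters.sigmoid_threshold = [0.7]
cfg.diarizer.msdd_model.parameters.diar_window_length = 50
```

As reported later in this thread, lowering the smallest shift value tends to surface more speakers at the cost of inference time.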

@pgegg02
Author

pgegg02 commented May 23, 2024

Thanks, I'll give that a try :)

@famda

famda commented Jun 19, 2024

Hi,
I'm having the same issue. Can you provide some instructions on how to tune those parameters to allow a higher number of speakers by default, please?

Thanks in advance.

@MahmoudAshraf97
Owner

Hi @famda, I haven't tinkered with these before, but it's entirely trial and error.

@famda

famda commented Jun 19, 2024

No worries, I can run some tests with it. I just need to understand where to start. Can you guide me a little so I can play around with it? 😀

@pgegg02
Author

pgegg02 commented Jun 19, 2024

Just by playing around with shift_length and lowering it to about 0.25, I was able to detect 7 speakers on an audio file where it had only detected 3 before (there are 9 actual speakers in it). Going lower than that didn't make much of a difference but increased inference time dramatically (running locally on CPU), so I would start there. Changing the sigmoid_threshold didn't do much, but you can try that as well.

@MahmoudAshraf97
Owner

You should also try playing with the scale windows and weights.

@francescocassini

Hi, how do you configure speaker separation?
Which config flag needs to be set? The instructions don't mention anything about this.
Can you help me?

@MahmoudAshraf97
Owner

Hello, what exactly do you mean by separating the speakers?

@francescocassini

Sorry for my bad English!
This is what I currently have:
{
    "speaker": "Speaker 0",
    "start_time": 433460,
    "end_time": 435480,
    "text": "Adesso approfondiamo un pochino meglio. "
},
{
    "speaker": "Speaker 0",
    "start_time": 435520,
    "end_time": 437420,
    "text": "Allora, innanzitutto, da dove venite ragazzi? "
},

Both sentences are labeled "Speaker 0", but the first is spoken by a woman and the second by a man.
Do I need to set some special configuration parameter to get Speaker 0 and Speaker 1?

@01Ashish

Hi, I want to know whether I can use task='translate' in the Whisper_Transcription_+_NeMo_Diarization.ipynb file. I want to pass non-English (Hindi) audio to the model, get an English transcription via task='translate', and then perform speaker diarization.

However, because the translated transcript and the audio are in different languages (English and Hindi respectively), I haven't been able to achieve this.

Can somebody help me? How can I perform both speaker diarization and translation of the transcription?
For testing purposes I am using Whisper_Transcription_+_NeMo_Diarization.ipynb.

@MahmoudAshraf97
Owner

@francescocassini Usually the default settings work as expected, but you can check my second comment above for what to change if they don't.

@01Ashish I haven't tested the translate task yet, but as a start, you should enable word timestamps in Whisper, remove the alignment model, and see how it goes.
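
For reference, a minimal sketch of that first step, assuming the faster-whisper backend used elsewhere in this repo; the model size and file name are placeholders:

```python
# Sketch only: translate Hindi speech to English text while keeping word-level
# timestamps from Whisper itself, so no separate alignment model is required.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, _info = model.transcribe(
    "hindi_audio.wav",     # hypothetical input file
    task="translate",      # Whisper emits English text regardless of input language
    word_timestamps=True,  # word timings come directly from Whisper
)

for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f}-{word.end:.2f}\t{word.word}")
```

Those word timings could then be passed to the diarization step in place of the alignment model's output.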

@hayata-yamamoto

hayata-yamamoto commented Aug 17, 2024

I want to use the max_speaker parameter as a CLI argument, like whisperx. Do you have any plans or a solution for this?

In our case it's clear how many people speak in the audio, so I expect the model to perform well if I assign the correct max_speaker number for each run. However, I don't know how to do that.

@MahmoudAshraf97
Owner

MahmoudAshraf97 commented Aug 17, 2024

You can modify this parameter in the telephonic YAML config found in the configs folder. Can you try it on an audio file that predicts the wrong number of speakers and see if it makes a difference?

@hayata-yamamoto

hayata-yamamoto commented Aug 17, 2024

@MahmoudAshraf97
Thank you for the quick reply. Unfortunately, I haven't actually run into a wrong-speaker-count problem; I'm just thinking about how to use this repository well.

I use this repository to create transcriptions with Docker on cloud services, so changing the YAML is a bit difficult in my case. Ideally, if I could set the max_speaker value from the CLI, I could change the behavior dynamically via the docker run command.

Is that kind of option planned for this repository in the future? If there's no plan, I will just use a larger speaker count.

@MahmoudAshraf97
Owner

It's easy to add, not a big deal, but I first want to make sure it actually affects inference, because it was reported earlier that this parameter has no effect on inference, only on evaluation, which isn't used here.
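
For illustration, a hypothetical sketch of what such a flag might look like; it does not exist in the repo today, and the config path and key name are assumptions to be verified:

```python
# Hypothetical sketch only: expose a max-speakers CLI flag and write it into
# the NeMo clustering config before diarization runs.
import argparse
from omegaconf import OmegaConf

parser = argparse.ArgumentParser()
parser.add_argument("-a", "--audio", required=True, help="audio file to process")
parser.add_argument("--max-speakers", type=int, default=8,
                    help="upper bound handed to the NeMo clustering step")
args = parser.parse_args()

cfg = OmegaConf.load("configs/diar_infer_telephonic.yaml")  # path is an assumption
cfg.diarizer.clustering.parameters.max_num_speakers = args.max_speakers
# ...the rest of the transcription + diarization pipeline would use cfg from here
```

That would let the value be set per docker run invocation, as requested above, once it's confirmed that the parameter actually influences inference.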
