3.3_diarization.qmd

# 3.3 Diarization {.unnumbered}

This is a minimal example script showing how to detect speakers
in the audio track of a video file. To start, we will load in a few
modules that will be needed for the task.

```{python}
from pyannote.audio import Pipeline
from pydub import AudioSegment

import polars as pl
import tempfile
```

Next, we need to load the model.

```{python}
pipeline = Pipeline.from_pretrained("statsmaths/diarize")
```

The algorthm only takes audio inputs, so we need to convert our
video into a temporary wave file. If you have a wave file, this can
be loaded and passed directly into the model. We will pass the
audio file directly to the diarization model here as well.

```{python}
with tempfile.NamedTemporaryFile(suffix='.wav') as temp_file:
    audio = AudioSegment.from_file('video/sotu.mp4', format="mp4")
    audio.export(temp_file.name, format="wav")
    diarization = pipeline(temp_file.name)
```

The output requires a little bit of parsing before it is ready
to use. Here we produce an output dictionary.

```{python}
data = diarization.to_lab().split('\n')
data = [x.split(' ') for x in data]
data = [x for x in data if len(x) == 3]
output = {
  'start_time': [float(x[0]) for x in data],
  'end_time': [float(x[1]) for x in data],
  'speaker': [x[2] for x in data]
}
```

The output is constructed such that we can call the `from_dict`
method from **polars** to construct a data frame. If needed, this
can be saved as a CSV file with the `write_csv` method of the
resulting data frame.

```{python}
dt = pl.from_dict(output)
dt
```