@catileptic catileptic commented Apr 15, 2025

The purpose of this experiment is to see if we can make audio & video files searchable using transcriptions.

We use WhisperCpp for transcriptions.

Why WhisperCpp?

I experimented with the vanilla OpenAI implementation of Whisper, as well as Faster-Whisper. Both seem to suffer from a memory leak. In my experience, the OpenAI implementation of Whisper wouldn't even run the transcription process in a Docker container. Faster-Whisper, on the other hand, leaked memory until the container was killed.

Since OpenAI does not allow Issues to be opened in their repo, I have opened an Issue in the Faster-Whisper repo, which hasn't received any answers at the time of writing.

WhisperCpp is the only implementation that I have found to be properly documented, reliable, and extremely efficient. Kudos to the developers!

I have also "benchmarked" three different models against each other. I ran the ggml-large-v3-turbo-q8_0.bin, ggml-large-v3-q5_0.bin and ggml-medium-q8_0.bin models, with the language either set explicitly or set to "auto" (which prompts the model to detect the language). The ggml-medium-q8_0.bin model had, overall, the best transcription accuracy. These aren't proper benchmarks, since LLM technology doesn't allow that kind of rigour.

OpenAleph changes

The model and the language setting have been implemented as environment variables, to allow individual OpenAleph instances to configure these details independently of each other. The transcription timeout is also an environment variable.
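As a rough illustration of the environment-variable approach, here is a minimal sketch of how such settings could be read. The variable names (`WHISPER_MODEL`, `WHISPER_LANGUAGE`, `WHISPER_TIMEOUT`), the `TranscriptionSettings` class, and the default values are assumptions made for this example, not the names used in the PR.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionSettings:
    model: str
    language: str
    timeout_seconds: int


def load_settings(env=None):
    """Read transcription settings from a mapping, falling back to defaults."""
    env = os.environ if env is None else env
    return TranscriptionSettings(
        # The variable names and defaults below are hypothetical placeholders.
        model=env.get("WHISPER_MODEL", "ggml-medium-q8_0.bin"),
        language=env.get("WHISPER_LANGUAGE", "auto"),
        timeout_seconds=int(env.get("WHISPER_TIMEOUT", "3600")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests, for example `load_settings({"WHISPER_LANGUAGE": "de"})`.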

The ingest-file image is a multi-stage build now, since I couldn't manage to build WhisperCpp inside the python:3.11-slim image.

The duration of each transcription is logged, for debugging purposes and in order to allow users to estimate how long a transcription of their entire dataset may take based on previous transcriptions.
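The duration logging can be pictured with a small sketch like the following. The `timed_transcription` helper and the `transcribe` callable are hypothetical names for this example; the actual log format in the PR may differ.

```python
import logging
import time

log = logging.getLogger(__name__)


def timed_transcription(transcribe, path):
    """Run transcribe(path) and log the elapsed wall-clock time."""
    start = time.perf_counter()
    try:
        return transcribe(path)
    finally:
        # Logged on success and on failure, so slow failures are visible too.
        elapsed = time.perf_counter() - start
        log.info("Transcription of %s took %.1f seconds", path, elapsed)
```

Logging in a `finally` block is a design choice of this sketch: it means a transcription that runs for a long time and then fails still leaves a duration in the logs.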

The processing of an audio or video file isn't considered failed if the transcription process errors out. However, any exceptions raised during the transcription step are recorded in the processingError FTM attribute.
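The error-handling behaviour described above could look roughly like this. `FakeEntity` is a stand-in written for this example (it only mimics an `add(prop, value)` method), `transcribe_safely` is a hypothetical helper, and storing the transcript under `bodyText` is an assumption; only the `processingError` attribute comes from the description.

```python
class FakeEntity:
    """Stand-in for an FtM entity; it only records added property values."""

    def __init__(self):
        self.properties = {}

    def add(self, prop, value):
        self.properties.setdefault(prop, []).append(value)


def transcribe_safely(entity, transcribe, path):
    """Attach a transcript if possible; otherwise record the error and go on."""
    try:
        text = transcribe(path)
    except Exception as exc:
        # The file itself is still processed; only the transcript is missing.
        entity.add("processingError", f"Transcription failed: {exc}")
        return False
    # Assumption: the transcript is stored as body text on the entity.
    entity.add("bodyText", text)
    return True
```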

The transcription step is mocked and, thus, skipped during the tests, because it pushes the overall duration of the test run past a reasonable limit. A transcription test has been added and marked as skipped.
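A minimal sketch of this testing pattern, using only the standard library, is shown below. The `Transcriber` class, the `ingest_audio` function, and the test names are hypothetical and exist only to illustrate mocking the slow step and skipping the real one.

```python
import unittest
from unittest import mock


class Transcriber:
    """Hypothetical wrapper around the slow transcription call."""

    def transcribe(self, path):
        raise NotImplementedError("would invoke WhisperCpp here")


def ingest_audio(transcriber, path):
    return {"path": path, "transcript": transcriber.transcribe(path)}


class AudioIngestTest(unittest.TestCase):
    def test_ingest_with_mocked_transcription(self):
        # Replace the slow call so the rest of the pipeline is still tested.
        with mock.patch.object(
            Transcriber, "transcribe", return_value="mocked"
        ) as fake:
            result = ingest_audio(Transcriber(), "sample.wav")
        fake.assert_called_once_with("sample.wav")
        self.assertEqual(result["transcript"], "mocked")

    @unittest.skip("real transcription is too slow for the regular test run")
    def test_real_transcription(self):
        result = ingest_audio(Transcriber(), "sample.wav")
        self.assertTrue(result["transcript"])
```

Keeping the real test in the suite but skipped means it can be re-enabled locally when the transcription step itself needs checking.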

@catileptic catileptic merged commit f162611 into main Apr 21, 2025
1 check passed
@catileptic catileptic deleted the feature/whisperaicpp branch April 21, 2025 13:38