Make audio and video searchable #5
Merged
The purpose of this experiment is to see if we can make audio & video files searchable using transcriptions.
We use WhisperCpp for transcriptions.
Why WhisperCpp?
I experimented with the vanilla OpenAI implementation of Whisper, as well as Faster-Whisper. Both seem to suffer from a memory leak. In my experience, the OpenAI implementation of Whisper wouldn't even run the transcription process inside a Docker container. Faster-Whisper, on the other hand, would leak memory until the container was killed.
Since OpenAI does not allow Issues to be opened in their repo, I have opened an Issue in the Faster-Whisper repo, which hasn't received any answers at the time of writing.
WhisperCpp is the only implementation that I have found to be properly documented, reliable, and extremely efficient. Kudos to the developers!
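To make the rest of the description easier to follow, here is a minimal sketch of what a whisper.cpp transcription call could look like from Python. The binary name (`whisper-cli`), the subprocess approach, and the default timeout are assumptions for illustration, not necessarily how this PR wires it up.

```python
import subprocess
from pathlib import Path

def transcribe(audio_path: Path, model_path: Path, language: str = "auto",
               timeout: int = 3600) -> str:
    """Run the whisper.cpp CLI on a media file and return the transcript text."""
    out_prefix = audio_path.with_suffix("")
    cmd = [
        "whisper-cli",            # whisper.cpp CLI binary (assumed to be on PATH)
        "-m", str(model_path),    # e.g. ggml-medium-q8_0.bin
        "-l", language,           # explicit language code, or "auto" for detection
        "-otxt",                  # write the transcript as plain text
        "-of", str(out_prefix),   # output file prefix -> "<prefix>.txt"
        "-f", str(audio_path),    # input file (whisper.cpp expects 16 kHz WAV;
                                  # conversion via ffmpeg is omitted here)
    ]
    subprocess.run(cmd, check=True, timeout=timeout)
    return out_prefix.with_suffix(".txt").read_text()
```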
I have also "benchmarked" three different models against each other. I ran the `ggml-large-v3-turbo-q8_0.bin`, `ggml-large-v3-q5_0.bin`, and `ggml-medium-q8_0.bin` models, with the language either set explicitly or set to "auto" (which prompts the model to detect the language). The `ggml-medium-q8_0.bin` model had, overall, the best transcription accuracy. These aren't proper benchmarks, since LLM technology doesn't allow that kind of rigour.

OpenAleph changes
The model and the language setting have been implemented as environment variables, to allow individual OpenAleph instances to configure these details independently of each other. The transcription timeout is also an environment variable.
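As a sketch of what that configuration could look like, the snippet below reads the settings from the environment. The variable names and defaults are hypothetical, not the exact ones introduced by this PR.

```python
import os

# Hypothetical variable names and defaults, for illustration only.
WHISPER_MODEL = os.environ.get("INGESTORS_WHISPER_MODEL", "ggml-medium-q8_0.bin")
WHISPER_LANGUAGE = os.environ.get("INGESTORS_WHISPER_LANGUAGE", "auto")
WHISPER_TIMEOUT = int(os.environ.get("INGESTORS_WHISPER_TIMEOUT", "3600"))
```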
The `ingest-file` image is a multi-stage build now, since I couldn't manage to build WhisperCpp inside the `python:3.11-slim` image.

The duration of each transcription is logged, for debugging purposes and to allow users to estimate how long transcribing their entire dataset may take based on previous transcriptions.
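A minimal sketch of such timing, reusing the hypothetical `transcribe()` helper from above; the logger name and message wording are illustrative:

```python
import logging
import time

log = logging.getLogger(__name__)

def transcribe_with_timing(file_path, **kwargs):
    # Time the transcription so users can extrapolate to their full dataset.
    start = time.monotonic()
    try:
        return transcribe(file_path, **kwargs)
    finally:
        duration = time.monotonic() - start
        log.info("Transcription of %s took %.1f seconds", file_path, duration)
```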
The processing of an audio or video file isn't considered failed if the transcription process errors out. However, any exceptions raised during the transcription step are recorded in the `processingError` FTM attribute.

The transcription step is mocked, and thus skipped, during the tests, because it pushes the overall duration of the tests past a reasonable limit. A transcription test has been added and marked as skipped.
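The error-handling behaviour could be approximated as below; the entity API call and the broad exception handling are assumptions for the sketch, not the PR's exact code.

```python
def transcribe_safely(entity, file_path):
    """Try to transcribe a media file without failing the overall ingest."""
    try:
        return transcribe_with_timing(file_path)
    except Exception as exc:
        # The file is still processed; only the error is recorded on the entity
        # (assumed here to be an FTM entity proxy exposing add()).
        entity.add("processingError", str(exc))
        return None
```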