Make audio and video searchable #5
Merged
The purpose of this experiment is to see if we can make audio & video files searchable using transcriptions.
We use WhisperCpp for transcriptions.
Why WhisperCpp?
I experimented with the vanilla OpenAI implementation of Whisper, as well as Faster-Whisper. Both seem to suffer from a memory leak. In my experience, the OpenAI implementation of Whisper wouldn't even run the transcription process inside a Docker container. Faster-Whisper, on the other hand, would leak memory until the container was killed.
Since OpenAI does not allow Issues to be opened in their repo, I have opened an Issue in the Faster-Whisper repo, which hasn't received any answers at the time of writing.
WhisperCpp is the only implementation that I have found to be properly documented, reliable, and extremely efficient. Kudos to the developers!
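To make the rest of the description easier to follow, here is a minimal sketch of what a whisper.cpp transcription call could look like from Python. The binary name (`whisper-cli`), the subprocess approach, and the default timeout are assumptions for illustration, not necessarily how this PR wires it up.

```python
import subprocess
from pathlib import Path

def transcribe(audio_path: Path, model_path: Path, language: str = "auto",
               timeout: int = 3600) -> str:
    """Run the whisper.cpp CLI on a media file and return the transcript text."""
    out_prefix = audio_path.with_suffix("")
    cmd = [
        "whisper-cli",            # whisper.cpp CLI binary (assumed to be on PATH)
        "-m", str(model_path),    # e.g. ggml-medium-q8_0.bin
        "-l", language,           # explicit language code, or "auto" for detection
        "-otxt",                  # write the transcript as plain text
        "-of", str(out_prefix),   # output file prefix -> "<prefix>.txt"
        "-f", str(audio_path),    # input file (whisper.cpp expects 16 kHz WAV;
                                  # conversion via ffmpeg is omitted here)
    ]
    subprocess.run(cmd, check=True, timeout=timeout)
    return out_prefix.with_suffix(".txt").read_text()
```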
I have also "benchmarked" three different models against each other. I ran the `ggml-large-v3-turbo-q8_0.bin`, `ggml-large-v3-q5_0.bin`, and `ggml-medium-q8_0.bin` models, with the language either set explicitly or set to "auto" (which prompts the model to detect the language). The `ggml-medium-q8_0.bin` model had, overall, the best transcription accuracy. These aren't proper benchmarks, since LLM technology doesn't allow that kind of rigour.

OpenAleph changes
The model and the language setting have been implemented as environment variables, to allow individual OpenAleph instances to configure these details independently of each other. The transcription timeout is also an environment variable.
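As a sketch of what that configuration could look like, the snippet below reads the settings from the environment. The variable names and defaults are hypothetical, not the exact ones introduced by this PR.

```python
import os

# Hypothetical variable names and defaults, for illustration only.
WHISPER_MODEL = os.environ.get("INGESTORS_WHISPER_MODEL", "ggml-medium-q8_0.bin")
WHISPER_LANGUAGE = os.environ.get("INGESTORS_WHISPER_LANGUAGE", "auto")
WHISPER_TIMEOUT = int(os.environ.get("INGESTORS_WHISPER_TIMEOUT", "3600"))
```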
The `ingest-file` image is a multi-stage build now, since I couldn't manage to build WhisperCpp inside the `python:3.11-slim` image.

The duration of each transcription is logged, for debugging purposes and to allow users to estimate how long transcribing their entire dataset may take based on previous transcriptions.
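A minimal sketch of such timing, reusing the hypothetical `transcribe()` helper from above; the logger name and message wording are illustrative:

```python
import logging
import time

log = logging.getLogger(__name__)

def transcribe_with_timing(file_path, **kwargs):
    # Time the transcription so users can extrapolate to their full dataset.
    start = time.monotonic()
    try:
        return transcribe(file_path, **kwargs)
    finally:
        duration = time.monotonic() - start
        log.info("Transcription of %s took %.1f seconds", file_path, duration)
```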
The processing of an audio or video file isn't considered failed if the transcription process errors out. However, any exceptions raised during the transcription step are recorded in the `processingError` FTM attribute.

The transcription step is mocked, and thus skipped, during the tests, because it pushes the overall duration of the tests past a reasonable limit. A transcription test has been added and marked as skipped.
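The error-handling behaviour could be approximated as below; the entity API call and the broad exception handling are assumptions for the sketch, not the PR's exact code.

```python
def transcribe_safely(entity, file_path):
    """Try to transcribe a media file without failing the overall ingest."""
    try:
        return transcribe_with_timing(file_path)
    except Exception as exc:
        # The file is still processed; only the error is recorded on the entity
        # (assumed here to be an FTM entity proxy exposing add()).
        entity.add("processingError", str(exc))
        return None
```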