Scriberr

Scriberr is an app that wraps around whisper.cpp and pyannote. The app itself is built using SvelteKit. It consists of the following core systems:

  • Backend (SvelteKit)
  • Database (PocketBase)
  • Job Handler (BullMQ)

The app itself is built as a Single Page Application (SPA). I’ll try to populate this section with some documentation about the architecture and flow of the app.
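
To make this a little more concrete, below is a minimal sketch of how an uploaded audio file could travel through these pieces with BullMQ. It only illustrates the architecture described above; the queue name, job payload, and helper function are hypothetical and not the actual source.

```ts
import { Queue, Worker, Job } from 'bullmq';

// Redis connection shared by the queue (producer) and the worker (consumer).
const connection = { host: '127.0.0.1', port: 6379 };

// Producer side: the SvelteKit backend enqueues a job when a file is uploaded.
const transcriptionQueue = new Queue('transcription', { connection });

export async function enqueueTranscription(fileId: string) {
  await transcriptionQueue.add('transcribe', { fileId });
}

// Consumer side: the job handler picks the job up, runs the heavy work,
// and stores the result in the database.
new Worker(
  'transcription',
  async (job: Job<{ fileId: string }>) => {
    // 1. run whisper.cpp -> transcript with word-level timestamps
    // 2. run pyannote    -> speaker intervals (RTTM)
    // 3. align the two and write the final transcript to PocketBase
  },
  { connection }
);
```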

I’m not an app developer and am very new to this, so I apologize in advance for any mistakes you find in the code. I encourage folks to help make the app better. PRs are most welcome.

Diarization Algorithm

Currently, diarization is done using the most basic approach. First, whisper.cpp outputs a transcript with word-level timestamps. Then pyannote provides speaker intervals as an RTTM file. We correlate the two with a very basic check: does the current word’s utterance end within a speaker interval? If it does, the word is considered part of that interval.
By going through all words chronologically based on their timestamps, we accumulate words belonging to a speaker’s turn; when we detect a change of speaker, the previous speaker’s turn is terminated and we start accumulating words for the new speaker. Ties, where a word’s time interval is split over two speakers, are resolved by assigning the word to the current speaker.
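
In code, the alignment boils down to something like the sketch below. The types and function names are illustrative only and assume word timestamps from whisper.cpp and speaker segments parsed from the RTTM file; this is not the actual implementation.

```ts
// Illustrative types; the real field names in the app may differ.
interface Word { text: string; start: number; end: number }              // from whisper.cpp
interface SpeakerSegment { speaker: string; start: number; end: number } // from the RTTM file
interface Turn { speaker: string; text: string }

const contains = (s: SpeakerSegment, t: number) => t >= s.start && t <= s.end;

function alignWordsToSpeakers(words: Word[], segments: SpeakerSegment[]): Turn[] {
  const turns: Turn[] = [];
  let current: Turn | null = null;

  // Walk the words chronologically and bucket them into speaker turns.
  for (const word of [...words].sort((a, b) => a.start - b.start)) {
    const startSeg = segments.find(s => contains(s, word.start));
    const endSeg = segments.find(s => contains(s, word.end));

    // Base rule: a word belongs to the speaker interval its end falls into.
    let speaker = endSeg?.speaker;
    // Tie: the word is split over two speakers, so keep it with the current speaker.
    if (startSeg && endSeg && startSeg.speaker !== endSeg.speaker && current) {
      speaker = current.speaker;
    }
    if (!speaker) speaker = current?.speaker ?? 'UNKNOWN';

    if (current && current.speaker === speaker) {
      current.text += ' ' + word.text;          // same speaker: keep accumulating
    } else {
      current = { speaker, text: word.text };   // speaker change: start a new turn
      turns.push(current);
    }
  }
  return turns;
}
```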

This is a very naive approach and there’s plenty of scope for improvement. You will see errors, and they arise from multiple sources. Whisper isn’t great at detecting words correctly, especially when accents come into play. It can split a word halfway, or sometimes even catch single characters as words.
The next source of error is, of course, the diarization process itself. Although it’s pretty good, there will be a few errors as it’s a machine learning model. On top of that, the naive alignment algorithm described above doesn’t help much.

In short, this is a very crude version that surprisingly works decently in my experience. The speaker transitions see a few errors, but the errors reduce as you move to larger model sizes.

Planned Speaker Matching Algorithms

  • Greedy Algorithm
  • Word2Vec nearest neighbor
  • Without using word level timestamps

Please feel free to open an issue if you have any ideas. We basically have two sets of time intervals, and the goal is to align and correlate them as well as possible.
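
As one example of the kind of improvement that could help, a greedy variant might assign each word to the speaker interval it overlaps the most, instead of only checking where the word ends. The sketch below is just an idea under that assumption, not something implemented in the app.

```ts
// Overlap in seconds between a word [start, end] and a speaker segment [start, end].
function overlap(wStart: number, wEnd: number, sStart: number, sEnd: number): number {
  return Math.max(0, Math.min(wEnd, sEnd) - Math.max(wStart, sStart));
}

// Greedy assignment: pick the speaker segment with the largest overlap for each word.
function bestSpeaker(
  word: { start: number; end: number },
  segments: { speaker: string; start: number; end: number }[]
): string | null {
  let best: string | null = null;
  let bestOverlap = 0;
  for (const s of segments) {
    const o = overlap(word.start, word.end, s.start, s.end);
    if (o > bestOverlap) {
      bestOverlap = o;
      best = s.speaker;
    }
  }
  return best;
}
```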

Contributing

Contributions are most welcome. Feel free to open issues or PRs to help with development. I originally built this for myself and will continue developing it. However, I would love for the community to guide the direction of the app so we can bring in features useful for everyone.

Running locally

You need Node.js and npm installed.
Clone the project from GitHub and run npm install.

Install Redis. On macOS you can use Homebrew: brew install redis. You could also use Docker.

Download and extract PocketBase from here.
Start the PocketBase instance with ./pocketbase serve --http 127.0.0.1:8080 --dev.

Run npm run dev

This will start the web app, which will be accessible at http://localhost:5173.
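
Once everything is running, the SvelteKit code reaches the PocketBase instance started above over its HTTP API. With the PocketBase JavaScript SDK that looks roughly like the snippet below; the collection name is purely illustrative and not necessarily what the app uses.

```ts
import PocketBase from 'pocketbase';

// Point the SDK at the locally running PocketBase instance.
const pb = new PocketBase('http://127.0.0.1:8080');

// Hypothetical collection holding finished transcripts.
const transcripts = await pb.collection('transcripts').getFullList();
console.log(transcripts.length);
```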