Why I Built a Private Dictation App for macOS
About two years ago, I was trying to dictate a long email on my Mac. I'd been using Apple Dictation for months because it was there and it was free and I didn't want to deal with installing another app. I pressed the dictation key, started talking, and about 40 seconds in — it just stopped. Cut me off mid-sentence.
That's Apple Dictation's timeout. It cuts out after roughly 30 to 60 seconds of continuous speech. If you've used it for anything beyond a quick text message, you know exactly what I'm talking about. You re-trigger it, start again, and hope it catches everything this time. Sometimes it does. Sometimes words vanish after they appear on screen (that's another fun bug). Sometimes the accuracy just tanks for no obvious reason.
That day, I lost about three paragraphs of a carefully worded email. Not because the technology failed catastrophically, but because it failed in the most annoying way possible — silently, partially, and without warning.
I decided to look for something better.
The Cloud Problem
The first thing I tried was Wispr Flow. It's the most talked-about dictation app in tech circles, and the transcription quality is genuinely impressive. But as I dug into how it works, I got uncomfortable.
Wispr Flow sends your audio to cloud servers run by OpenAI and Meta for processing. Every word you speak gets uploaded. On top of that, it captures screenshots of your active window to "understand context." So your voice and your screen are leaving your machine.
I code for a living. I'm often dictating comments about proprietary stuff, or drafting messages about projects that aren't public yet. The idea of that audio sitting on OpenAI's servers — even temporarily — didn't sit right with me. And screenshots? Of my IDE? No thanks.
There was also the price: $15/month. That's $180 a year. For dictation. On my own computer. Using my own microphone. But the processing happens on someone else's server. It felt wrong.
I looked at a few other cloud options. Same story. Monthly subscriptions, audio uploaded to external servers, privacy policies full of vague language about "improving our services." Nope.
What About Offline Tools?
There were some offline options. Superwhisper is good — it runs Whisper models locally and the transcription quality is solid. But it had so many modes and settings that I spent more time configuring it than using it. I don't want to pick between "email mode" and "coding mode" every time I start dictating. I just want to talk.
MacWhisper is great for transcribing audio files, but it's not really designed for live dictation. It's a different use case entirely.
None of them nailed the workflow I wanted: press one key, talk, text appears where my cursor is, done. No mode selection. No window switching. No configuration dialogs. Just the simplest possible loop between my voice and my text.
Then I Found WhisperKit
In early 2024, a team at Argmax released WhisperKit — an optimized implementation of OpenAI's Whisper model specifically designed to run on Apple hardware. It uses the Neural Engine (the dedicated ML hardware in every Apple Silicon chip) for inference, which means it's fast. Really fast. Sub-second transcription on an M1 or later.
And it's completely local. The model runs on your device. No server calls. No internet required. No data going anywhere.
When I first got WhisperKit running in a test project, I said a full paragraph into my MacBook's mic and had accurate text back in about 800 milliseconds. On my machine. Offline. No account, no API key, nothing.
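That first test project was only a few lines. Here's a minimal sketch of what on-device transcription with WhisperKit looks like — the model name and file path are placeholders, and the exact call signatures have shifted a bit between WhisperKit releases, so treat this as illustrative rather than Voxpen's actual code:

```swift
import WhisperKit

Task {
    // Downloads the model once, then loads it for on-device inference.
    // "base" is a placeholder; WhisperKit supports several model sizes.
    let pipe = try await WhisperKit(model: "base")

    // Transcribe an audio file entirely locally — no network calls.
    let results = try await pipe.transcribe(audioPath: "recording.wav")
    print(results.map(\.text).joined())
}
```

No API key, no account, no upload step — which is exactly what made it feel like the right foundation.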
That was the moment I thought: this should be an app. Not a complicated app with twelve features and three pricing tiers. Just the simplest, most private dictation tool that could possibly exist.
Building Voxpen
I had a few rules for myself:
One hotkey, one action. Press Fn (the Globe key on newer Macs) to start recording. Press it again to stop. Text gets transcribed and pasted wherever your cursor is. That's the entire interaction. No floating windows to manage. No mode switches. No menus to navigate while you're trying to think and talk at the same time.
No account. No sign-up. No analytics. I hate apps that make you create an account before you can do anything. Voxpen downloads, opens, and works. There's no account system because there's no server. There's no analytics because I don't want to know what you're dictating. I genuinely don't care.
Audio never leaves the device. This wasn't a marketing decision. It was an architecture decision. There is no server component, and there is no networking code for sending audio anywhere. The app can't upload your voice, because the code to do that simply doesn't exist.
Auto-paste into any app. Most dictation tools either have their own text field where you type/dictate, or they work only in specific apps. I wanted Voxpen to work everywhere. Slack, VS Code, Chrome, Notes, Mail — wherever your cursor is, that's where text goes. It uses the system clipboard and a paste command, which means it works with any app that accepts paste.
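The clipboard-and-paste approach is simple to sketch. Roughly, the app writes the transcribed text to the system pasteboard and synthesizes a Cmd+V keystroke; posting synthetic key events requires Accessibility permission. This is an illustrative sketch, not Voxpen's exact implementation:

```swift
import AppKit
import Carbon.HIToolbox  // for the kVK_ANSI_V key code

// Put text on the pasteboard, then synthesize Cmd+V so the text
// lands wherever the user's cursor currently is.
func pasteText(_ text: String) {
    let pb = NSPasteboard.general
    pb.clearContents()
    pb.setString(text, forType: .string)

    let src = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(kVK_ANSI_V)
    let keyDown = CGEvent(keyboardEventSource: src, virtualKey: vKey, keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: src, virtualKey: vKey, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}
```

One side effect of this design: anything that accepts paste accepts dictation, including apps that were never built with dictation in mind.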
Keep it small. The app itself is under 10 MB. The Whisper model downloads separately (about 50 MB for the default model), but the app footprint stays tiny. I didn't want Voxpen to be one of those apps that quietly eats 500 MB of RAM in the background.
The Hard Parts
Building Voxpen wasn't all smooth. A few things gave me trouble.
Microphone permissions on macOS are a pain. Apple (rightly) gates microphone access behind system permissions, but the way macOS handles this for apps that need to toggle recording on and off quickly is clunky. Getting the permission flow right — where the app asks once, the user grants access, and then recording just works every time after that — took more iterations than I expected.
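The shape of the flow that finally worked is: check the current authorization status first, prompt only when it's undetermined, and never prompt again once access is granted. A sketch using AVFoundation (function name is mine, not Voxpen's):

```swift
import AVFoundation

// Ask for microphone access at most once; on every later launch,
// recording starts immediately with no prompt.
func ensureMicAccess(_ completion: @escaping (Bool) -> Void) {
    switch AVCaptureDevice.authorizationStatus(for: .audio) {
    case .authorized:
        completion(true)   // granted earlier — no prompt, just record
    case .notDetermined:
        AVCaptureDevice.requestAccess(for: .audio, completionHandler: completion)
    default:
        completion(false)  // denied or restricted — send the user to System Settings
    }
}
```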
The global hotkey system was tricky. I wanted Fn to work as the trigger from any app, without interfering with whatever else Fn does in that context. macOS doesn't make this easy. The Globe/Fn key is special — the system uses it for emoji picker, dictation, and other things. Getting Voxpen to capture it cleanly, without breaking other keyboard shortcuts, took some careful engineering.
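One workable approach (a sketch under assumptions, not necessarily what Voxpen ships) is to watch modifier-flag changes globally: the Fn/Globe key doesn't generate a normal key-down event, but it does set the `.function` modifier flag, so you can detect the press edge yourself. Global monitoring like this requires the user's Input Monitoring / Accessibility consent:

```swift
import AppKit

// Detect Fn/Globe presses system-wide by watching flagsChanged events.
// .function flips on while Fn is held; we trigger on the down edge only.
var fnWasDown = false
let monitor = NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { event in
    let fnDown = event.modifierFlags.contains(.function)
    if fnDown && !fnWasDown {
        // Fn just went down: toggle recording here.
    }
    fnWasDown = fnDown
}
```

The remaining engineering is in not swallowing the system's own uses of the key (emoji picker, built-in dictation), which is configuration and careful event handling rather than more code.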
Latency matters more than you think. When you stop talking and wait for text, every 100 milliseconds feels like a second. I spent a lot of time optimizing the pipeline: when to stop recording (silence detection), how to pass audio to WhisperKit efficiently, and how to paste text as fast as possible. The goal was sub-second from the moment you stop talking to the moment text appears. On Apple Silicon, Voxpen consistently hits that.
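Silence detection is the part of that pipeline that's easiest to show. A common technique is energy-based endpointing: compute the RMS energy of each incoming audio chunk and declare end-of-speech once it stays below a threshold for long enough. The threshold and window below are illustrative values, not Voxpen's tuned ones:

```swift
import Foundation

// Energy-based endpointing: end-of-speech is declared once the RMS of
// incoming chunks stays under a threshold for a sustained stretch.
struct SilenceDetector {
    let threshold: Float = 0.01          // RMS below this counts as silence
    let silenceNeeded: TimeInterval = 0.6 // how long silence must last
    private var silentFor: TimeInterval = 0

    // Feed one chunk of samples; returns true when it's time to stop
    // recording and hand the buffered audio to the transcriber.
    mutating func feed(_ samples: [Float], duration: TimeInterval) -> Bool {
        let meanSquare = samples.reduce(0) { $0 + $1 * $1 } / Float(max(samples.count, 1))
        let rms = meanSquare.squareRoot()
        silentFor = rms < threshold ? silentFor + duration : 0
        return silentFor >= silenceNeeded
    }
}
```

Tuning those two constants is most of the work: too aggressive and it cuts you off mid-pause (the exact Apple Dictation behavior I was escaping), too lenient and it adds dead time to every dictation.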
What's Next
Voxpen does one thing and it does it well. But there's more I want to do.
Text polishing. Sometimes you want dictated text to be cleaned up — fix grammar, remove filler words, format it properly. I'm working on an optional (still local) text polishing step that happens after transcription.
Better model options. WhisperKit keeps improving. Newer models are more accurate and faster. I want Voxpen to make it easy to switch between models based on what you need — speed vs. accuracy, smaller vs. larger.
iOS version. Honestly, the same problems exist on iPhone. Apple Dictation on iOS has the same timeout, the same accuracy issues. A private, local dictation keyboard for iOS is something I want to build.
But the core promise won't change: your voice stays on your device. Always.
If you've ever been frustrated with Apple Dictation cutting you off, or uncomfortable with cloud tools recording your voice, give Voxpen a try. It's completely free, it's private, and it takes about ten seconds to set up. Press Fn, talk, and the text appears.
That's all I ever wanted from a dictation app. No more, no less.
FAQ
Does Voxpen record or store my audio?
No. Voxpen processes audio in real time and discards it immediately after transcription. Nothing is saved to disk, nothing is sent anywhere. Your voice data exists only in memory for the few seconds it takes to transcribe.
What is WhisperKit and how accurate is it?
WhisperKit is an optimized implementation of OpenAI's Whisper speech recognition model, built by Argmax for Apple hardware. It runs directly on the Neural Engine in Apple Silicon chips, giving near-cloud accuracy without any network connection. Accuracy varies by model size, but the default model handles everyday dictation very well.
Is Voxpen open source?
Not currently. Voxpen is completely free to use, but the source code is not publicly available. The underlying speech model (Whisper) is open source, and the WhisperKit framework is open source and maintained by Argmax.
Why not just use a cloud dictation service?
Cloud services send your audio — and sometimes screenshots of your screen — to remote servers for processing. If you dictate passwords, medical notes, legal documents, or anything personal, that data leaves your machine. Voxpen keeps everything local so you don't have to think about it.