CTO Assessment — Ambient Audio Clinical Device: Technical Feasibility
Issue: MOKA-571 | Epic: MOKA-568 | Date: 2026-03-28 | Author: CTO Agent
1. Technical Viability Summary
Verdict: Technically feasible. No show-stoppers for MVP. Validate market first.
Every component in the proposed stack exists, is production-proven, and is available off-the-shelf. The novelty is in the integration and the veterinary-specific structured output — not in any individual technology. This is a systems integration challenge, not a research problem.
Key technical facts:
- ESP32-S3 is a proven platform for audio capture with I2S mic arrays, WiFi streaming, and on-device VAD. Thousands of commercial products ship on this chip.
- Deepgram Nova-2 handles Portuguese, includes diarization, and is noise-tolerant. At $0.0145/min it’s the best cost/accuracy tradeoff for batch processing.
- Claude API excels at structured clinical summarization — SOAP note generation from transcript is a well-understood prompt engineering problem.
- The entire backend stack (Fastify, PostgreSQL, Drizzle, BullMQ, R2) matches our existing infrastructure. Zero new technology to learn.
The hard problems are at the edges: noisy clinic audio, veterinary terminology accuracy, and PIMS integration diversity. None are unsolvable, but each requires iterative validation.
2. Edge vs. Cloud Tradeoffs
What MUST run on-device (ESP32-S3)
| Function | Rationale |
|---|---|
| Voice Activity Detection (VAD) | Reduces bandwidth by 60-80%. WebRTC VAD (libfvad) runs on ESP32 with <1ms latency. Transmit only speech segments — saves cost and privacy. |
| Opus encoding | 24kbps mono at 16kHz. ESP32-S3 handles this natively. Reduces bandwidth from ~256kbps (raw PCM) to 24kbps. |
| TLS 1.3 transport encryption | Non-negotiable for LGPD. ESP32-S3 supports TLS via mbedTLS. Protects audio in transit. |
| LED status indicator control | Legal/trust requirement. Device must visibly signal when capturing. GPIO-driven, trivial. |
| Buffering + reconnection | WiFi drops happen. Device needs 30-60s local buffer (PSRAM on S3 supports this) and automatic reconnection with chunk resumption. |
What SHOULD NOT run on-device
| Function | Rationale |
|---|---|
| Speech-to-text (STT) | Even Whisper-tiny exceeds ESP32-S3 memory, and edge-sized STT models reach unacceptable accuracy (~60-70% WER for Portuguese). Cloud STT is 95%+ accurate. |
| Speaker diarization | Requires neural models (pyannote, NeMo). Impossible on ESP32. Cloud-only. |
| LLM summarization | Obviously cloud. No edge LLM can generate structured clinical notes. |
| Patient matching | Requires calendar/PIMS data. Server-side join. |
| On-device encryption of audio content (AES-256-GCM) | The research brief suggests encrypting audio on-device before transmission. This is redundant — TLS already encrypts in transit. AES on-device adds complexity (key management, firmware updates for key rotation) with no security benefit over TLS. Encrypt at rest on the server instead. |
Recommendation
Keep the device dumb: VAD + Opus + TLS + buffer. Everything else is cloud. This minimizes firmware complexity, enables OTA updates for the cloud pipeline without touching devices, and keeps hardware cost at ~$35.
3. Audio Pipeline Assessment
STT Options (Ranked)
| Provider | Accuracy (Portuguese) | Diarization | Cost/min | Latency (batch) | Verdict |
|---|---|---|---|---|---|
| Deepgram Nova-2 | ~92-95% | Built-in | $0.0145 | ~0.3x real-time | Best choice for MVP |
| AssemblyAI Universal-2 | ~91-94% | Built-in | $0.0150 | ~0.5x real-time | Strong alternative |
| Whisper Large v3 (self-hosted) | ~93-96% | No (separate) | ~$0.005 (GPU cost) | ~1x real-time | Cost-optimal at scale but adds GPU infra |
| Google Chirp 2 | ~90-93% | Built-in | $0.016 | ~0.4x real-time | Good but pricier |
Recommendation: Start with Deepgram Nova-2. Diarization included, excellent Portuguese support, simple API, batch-friendly. Switch to self-hosted Whisper only if STT cost exceeds $2k/mo (i.e., >300 clinics).
Speaker Diarization
For the MVP (2-speaker: vet + tutor), Deepgram’s built-in diarization is sufficient (~5-8% DER for 2 speakers). No need for a separate diarization pipeline.
When to upgrade: If we need >2 speaker separation (e.g., multiple vets in a room, vet students), switch to:
- pyannote 3.1 (open-source, state-of-the-art, self-hosted) — best accuracy
- NeMo MSDD (NVIDIA, open-source) — better for streaming scenarios
Both require GPU. Cross that bridge at Phase 3, not Phase 2.
Noise Handling in Clinical Environments
This is the #1 technical risk. Vet clinics are noisy: barking, whimpering, equipment beeps, multiple rooms, foot traffic.
Mitigation stack (layered):
- Hardware level: The Korvo-2’s 3-mic linear array supports basic beamforming. This helps but is not sufficient alone. Consider upgrading to a 4-mic circular array (e.g., ReSpeaker 4-Mic Array for ESP32) for better spatial filtering. Cost delta: ~$10.
- On-device preprocessing: WebRTC VAD already filters silence. For MVP, this is enough.
- Cloud noise reduction: Run DeepFilterNet2 (open-source, real-time capable, ~30ms latency) as a preprocessing step before STT. This removes stationary and non-stationary noise while preserving speech. Runs on CPU — no GPU needed.
- STT robustness: Nova-2 and Whisper are trained on noisy data. They handle moderate noise well. The preprocessing step handles the extreme cases.
- Structured prompt engineering: The LLM summarization step can be instructed to flag low-confidence segments rather than hallucinate.
Validation approach: Record 10 hours of real clinic audio in Phase 1 (Wizard-of-Oz). Measure WER with and without DeepFilterNet. If WER degrades >10pp in noisy conditions, invest in custom beamforming. If <10pp, the software stack is sufficient.
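For the Phase 1 measurement, WER is just word-level edit distance against the manual reference transcript. A minimal sketch in TypeScript (our backend language); the function name is mine, not from the brief:

```typescript
// Word Error Rate: Levenshtein distance over word tokens,
// normalized by reference length. 0 = perfect, can exceed 1.
export function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  // prev[j] = edit distance between ref[0..i-1] and hyp[0..j-1]
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const sub = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      curr[j] = Math.min(sub, prev[j] + 1, curr[j - 1] + 1);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Run this over the 10-hour corpus with and without the DeepFilterNet preprocessing pass to get the comparison the kill threshold depends on.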
4. Contextual Retrieval & Summarization
SOAP Note Generation Pipeline
```
Transcript (timestamped, diarized)
  → Context injection (patient history, species, breed, age, reason for visit)
  → Claude API (structured output, SOAP format)
  → Confidence scoring (flag uncertain segments)
  → Human review interface
  → Approved note → PostgreSQL + PIMS sync
```
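The confidence-scoring step can be as simple as thresholding the STT's per-segment confidence before the transcript reaches Claude, so the prompt marks shaky segments instead of letting the model guess. A sketch — the 0.8 threshold and field names are my assumptions, to be tuned in Phase 1:

```typescript
// One diarized STT segment; shape mirrors what diarizing STT APIs
// typically return (speaker index, text, per-segment confidence).
interface DiarizedSegment {
  speaker: number;
  text: string;
  confidence: number; // 0..1
}

// Render segments into prompt-ready lines, tagging anything below
// the confidence threshold so the LLM can flag rather than infer.
export function flagLowConfidence(
  segments: DiarizedSegment[],
  threshold = 0.8,
): string {
  return segments
    .map((s) =>
      s.confidence < threshold
        ? `[speaker ${s.speaker}, LOW CONFIDENCE] ${s.text}`
        : `[speaker ${s.speaker}] ${s.text}`,
    )
    .join("\n");
}
```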
Prompt architecture: Use a system prompt with:
- SOAP note schema (Subjective, Objective, Assessment, Plan)
- Species-specific terminology guidelines
- Exam finding categories for the species
- Previous visit summary (if available)
- Clinic’s preferred medication/treatment vocabulary
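A sketch of how those inputs assemble into the system prompt — the field names and the Portuguese `[VERIFICAR]` flag token are illustrative, not a final prompt:

```typescript
// Context sources for the SOAP system prompt (per-visit metadata,
// optional prior history, optional clinic vocabulary).
interface PromptContext {
  species: string;
  breed: string;
  age: string;
  reasonForVisit: string;
  previousVisitSummary?: string;
  clinicVocabulary?: string[];
}

// Assemble the system prompt from the available context pieces.
export function buildSystemPrompt(ctx: PromptContext): string {
  const parts = [
    "You are a veterinary scribe. Produce a SOAP note (Subjective, Objective, Assessment, Plan) in Brazilian Portuguese.",
    `Patient: ${ctx.species}, ${ctx.breed}, ${ctx.age}. Reason for visit: ${ctx.reasonForVisit}.`,
    "Flag any statement you are unsure of with [VERIFICAR] rather than inferring it.",
  ];
  if (ctx.previousVisitSummary) parts.push(`Previous visit: ${ctx.previousVisitSummary}`);
  if (ctx.clinicVocabulary?.length) parts.push(`Preferred vocabulary: ${ctx.clinicVocabulary.join(", ")}`);
  return parts.join("\n\n");
}
```

The species-specific terminology and exam-finding guidelines would be appended the same way, loaded from per-species templates.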
Model choice: Claude Sonnet 4 for cost-efficiency. SOAP notes from a 15-min consult transcript are ~500-800 tokens of output from ~2000-4000 tokens of input. At Sonnet pricing this works out to a few cents per note. Reserve Opus for complex multi-visit longitudinal summaries.
Calendar/PIMS Integration for Patient Matching
This is harder than it looks. The Brazilian vet PIMS landscape is fragmented:
| PIMS | Market Share (est.) | API Available | Integration Difficulty |
|---|---|---|---|
| Provet Cloud | 5-10% (premium) | REST API | Low |
| SimplesVet | 15-20% | Limited API | Medium |
| VetSmart/Vetwork | 10-15% | No public API | High (scraping) |
| Custom/spreadsheet | 40-50% | None | N/A |
MVP approach: Don’t integrate with PIMS at all for Phase 1-2. Instead:
- Integrate with Google Calendar (most clinics use it for scheduling)
- Match appointment time window → device audio window → patient
- Vet confirms/corrects the match in the review UI
- Build PIMS integrations in Phase 3 for the top 2-3 systems
This sidesteps the fragmentation problem and validates the core value (SOAP notes) independently of PIMS connectivity.
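The time-window match reduces to "pick the appointment with the largest overlap against the audio window." A sketch — the 5-minute minimum overlap is my assumption, and the vet's confirmation in the review UI remains the backstop:

```typescript
// A calendar appointment pulled from Google Calendar.
interface Appointment {
  patientId: string;
  start: Date;
  end: Date;
}

// Return the appointment with the largest time overlap with the
// device's audio window, or null if nothing overlaps enough.
export function matchAppointment(
  audioStart: Date,
  audioEnd: Date,
  appts: Appointment[],
  minOverlapMs = 5 * 60_000, // require at least 5 minutes of overlap
): Appointment | null {
  let best: Appointment | null = null;
  let bestOverlap = minOverlapMs;
  for (const a of appts) {
    const overlap =
      Math.min(audioEnd.getTime(), a.end.getTime()) -
      Math.max(audioStart.getTime(), a.start.getTime());
    if (overlap >= bestOverlap) {
      bestOverlap = overlap;
      best = a;
    }
  }
  return best;
}
```

Measuring this function's hit rate against real clinic schedules is exactly the Assumption 7 validation in Section 9.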
Longitudinal Patient History Architecture
```sql
-- Core schema (simplified)
patients        (id, clinic_id, name, species, breed, birth_date, weight_history)
encounters      (id, patient_id, vet_id, device_id, started_at, ended_at)
transcripts     (id, encounter_id, raw_text, diarized_segments JSONB, confidence)
clinical_notes  (id, encounter_id, soap JSONB, status, reviewed_by, reviewed_at)
treatment_plans (id, encounter_id, medications JSONB, follow_ups JSONB)
```
Key design decisions:
- JSONB for structured clinical data — SOAP categories, medications, follow-ups. Flexible schema for vet-specific variations.
- Encounter as the core entity — every audio capture is an encounter. Patient matching is a separate step (can be corrected).
- Immutable transcripts — never modify the raw transcript. Clinical notes are a separate, editable layer.
- Longitudinal queries via materialized views — “all encounters for patient X, sorted by date, with SOAP summaries” is a common query. Pre-compute it.
This schema supports the future moat (structured vet clinical data corpus) without over-engineering for Phase 2.
5. Top 5 Technical Risks (Ranked)
Risk 1: Audio Quality in Noisy Clinical Environments
- Probability: HIGH | Impact: HIGH
- Why it matters: If transcription accuracy drops below ~85% in real clinic conditions, notes are unusable and vets lose trust.
- Mitigation: Layered noise handling (see Section 3). Validate with real clinic recordings in Phase 1.
- Validation: Record 10+ hours in 3+ clinics. Measure WER. Kill threshold: WER >25% in typical conditions.
Risk 2: Veterinary Terminology Accuracy
- Probability: MEDIUM | Impact: HIGH
- Why it matters: “Amoxicillin” vs “amoxicilina,” breed names in Portuguese, medication dosages — errors here are clinically dangerous.
- Mitigation: Custom vocabulary boosting in Deepgram (supported). Vet-specific prompt engineering for Claude. Human-in-the-loop for V1 (vet reviews every note).
- Validation: Build a vet terminology test set (200+ terms). Measure STT accuracy specifically on clinical vocabulary. Target: >90% accuracy on vet terms.
Risk 3: WiFi Reliability in Clinic Buildings
- Probability: MEDIUM | Impact: MEDIUM
- Why it matters: Older clinic buildings have poor WiFi. If the device can’t stream, it can’t capture.
- Mitigation: 30-60s local buffer on ESP32 PSRAM (8MB available). Chunk-based upload with resume. Opus at 24kbps is only ~180KB/min — a full 15-min consult is under 3MB, easily buffered.
- Validation: Test in 5 real clinics. If >2 have WiFi issues, add SD card fallback (~$3 hardware cost increase).
Risk 4: PIMS Integration Fragmentation
- Probability: HIGH | Impact: MEDIUM
- Why it matters: Without PIMS integration, the longitudinal value proposition is weaker — data lives in two systems.
- Mitigation: Defer PIMS integration. Use calendar matching for MVP. Build the review UI as the primary interface. PIMS integration is a Phase 3 problem.
- Validation: Ask design partners: “Would you use this even without PIMS integration?” If >70% say yes, defer safely.
Risk 5: Device Provisioning and Fleet Management
- Probability: LOW (MVP) → HIGH (Scale) | Impact: MEDIUM
- Why it matters: At 5 clinics it’s manual. At 500 clinics you need OTA firmware updates, device health monitoring, and zero-touch provisioning.
- Mitigation: Use ESP-IDF’s built-in OTA update mechanism. Implement device heartbeat (HTTPS GET every 5 min). For MVP, provision manually.
- Validation: Build OTA update pipeline in Phase 2. Test firmware update across 5 devices simultaneously.
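Server-side, the heartbeat check is a few lines of classification over last-seen timestamps. A sketch, with the thresholds as my assumption (more than two missed 5-minute beats ⇒ degraded, more than six ⇒ offline):

```typescript
type DeviceHealth = "healthy" | "degraded" | "offline";

// Classify a device from its last heartbeat. With a 5-minute
// heartbeat interval: <=10 min silent is normal jitter, up to
// 30 min suggests WiFi trouble, beyond that the device is down.
export function classifyDevice(lastHeartbeat: Date, now: Date): DeviceHealth {
  const silentMin = (now.getTime() - lastHeartbeat.getTime()) / 60_000;
  if (silentMin <= 10) return "healthy";
  if (silentMin <= 30) return "degraded";
  return "offline";
}
```

For the MVP this can run as a BullMQ repeat job that alerts on any non-healthy device.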
6. Privacy & Security Architecture
LGPD Compliance Strategy
| Requirement | Implementation |
|---|---|
| Legal basis | Explicit consent (LGPD Art. 7, I). Tutor signs consent form per visit. Digital consent option in review UI. |
| Data minimization | Device captures only speech segments (VAD). Raw audio deleted after 30 days. Transcripts retained; audio does not need to be. |
| Right to deletion | API endpoint to delete all data for a specific encounter or patient. Cascade delete: audio → transcript → notes. |
| Data portability | Export all patient data as structured JSON. Vet can take their data if they leave. |
| DPO (Data Protection Officer) | Required if processing sensitive data at scale. Appoint when >50 clinics. |
| Incident response | 72-hour breach notification to ANPD. Logging of all data access. |
Encryption Strategy
| Layer | Method | Notes |
|---|---|---|
| In transit | TLS 1.3 (device → server) | Standard. ESP32-S3 supports via mbedTLS. |
| At rest (audio) | AES-256-GCM server-side | R2 supports server-side encryption. |
| At rest (database) | PostgreSQL TDE or column-level encryption for PII | Tutor name, phone, email encrypted. Clinical data (SOAP notes) can be plaintext for query performance. |
| Key management | AWS KMS or Cloudflare Workers KV for key storage | Don’t roll your own. |
What NOT to do (correcting the research brief)
The brief suggests on-device AES-256-GCM encryption before transmission. Don’t do this.
Reasons:
- TLS 1.3 already provides authenticated encryption in transit.
- On-device encryption requires key distribution to devices — a much harder problem than server-side encryption.
- Key rotation requires firmware updates or a key exchange protocol — added complexity.
- If the device encrypts, the server must decrypt before processing — so it has the key anyway.
- The threat model that on-device encryption addresses (malicious server operator) doesn’t apply when we own the server.
Correct approach: TLS in transit + server-side encryption at rest. Simple, standard, auditable.
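Server-side encryption at rest is standard `node:crypto`. A sketch assuming the data key has already been fetched from KMS (key management itself is out of scope here):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// AES-256-GCM for audio at rest. GCM gives authenticated encryption;
// the IV and auth tag must be stored alongside the ciphertext.
export function encryptAtRest(plain: Buffer, key: Buffer) {
  const iv = randomBytes(12); // 96-bit nonce, the standard size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plain), cipher.final()]);
  return { iv, data, tag: cipher.getAuthTag() };
}

export function decryptAtRest(
  enc: { iv: Buffer; data: Buffer; tag: Buffer },
  key: Buffer,
): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, enc.iv);
  decipher.setAuthTag(enc.tag); // throws on tamper
  return Buffer.concat([decipher.update(enc.data), decipher.final()]);
}
```

In practice R2's own server-side encryption may make this layer unnecessary for audio objects; the explicit version matters for PII columns in PostgreSQL.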
Consent UX
- Physical: LED on device (green = idle, blue = capturing, red = error)
- Digital: Consent checkbox in appointment scheduling or at check-in
- Poster/signage: Required in exam room — “This room uses audio recording for clinical documentation”
- Opt-out: Any party can opt out at any time. Device paused via physical button or app toggle.
7. Scalability Analysis
Phase transitions that change the architecture
| Scale | Clinics | Architecture Change Required |
|---|---|---|
| MVP (Phase 2) | 3-5 | Single VPS. Direct WebSocket. Manual provisioning. |
| Early Growth (Phase 3) | 15-50 | Add load balancer. Move to managed PostgreSQL. Redis cluster. OTA update pipeline. |
| Growth | 50-200 | Multi-region ingestion (at least SP + RJ). Object storage partitioning by clinic. Background job workers scale horizontally. |
| Scale | 200-500 | Consider self-hosted Whisper (cost savings >$2k/mo). CDN for firmware distribution. Dedicated device management service. Multi-tenant isolation. |
| Large Scale | 500+ | Kubernetes or equivalent orchestration. Data warehouse for analytics. ML pipeline for custom models. Compliance team. |
What scales linearly (no architecture change)
- Audio ingestion (WebSocket connections, one per device)
- STT processing (Deepgram scales on their end)
- LLM summarization (Claude API scales on their end)
- Storage (R2 is effectively infinite)
What requires step-function changes
- Database: Single PostgreSQL → read replicas → sharding by clinic_id at ~200 clinics
- Device management: Manual → automated provisioning + OTA at ~50 devices
- Monitoring: Manual → Grafana/Prometheus stack at ~20 clinics
- Support: Founder-led → dedicated support person at ~50 clinics
Cost scaling
| Scale | Monthly Infra Cost | Per-Clinic Cost |
|---|---|---|
| 5 clinics | ~$740 | ~$148 |
| 50 clinics | ~$3,500 | ~$70 |
| 200 clinics | ~$10,000 | ~$50 |
| 500 clinics | ~$18,000 | ~$36 |
Infra cost per clinic decreases significantly with scale. The biggest cost driver is STT ($0.0145/min). At 500 clinics, self-hosted Whisper becomes a clear win.
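The STT driver is easy to model; the break-even point for self-hosted Whisper depends entirely on per-clinic audio minutes, which Phase 1 should measure. A sketch with the consult volumes as illustrative assumptions, not measured numbers:

```typescript
// Monthly STT spend at Deepgram's $0.0145/min rate.
// Default volumes (4 consults/day, 15 min each, 22 working days)
// are assumptions to be replaced with Phase 1 measurements.
export function monthlySttCost(
  clinics: number,
  consultsPerDay = 4,
  minPerConsult = 15,
  daysPerMonth = 22,
  ratePerMin = 0.0145,
): number {
  return clinics * consultsPerDay * minPerConsult * daysPerMonth * ratePerMin;
}
```

Plugging in real per-clinic minutes gives the threshold at which GPU infrastructure for self-hosted Whisper pays for itself.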
8. Build vs. Buy
| Component | Decision | Rationale |
|---|---|---|
| Hardware (MVP) | BUY — ESP32-S3-Korvo-2 | Off-the-shelf dev board. Custom hardware only after PMF. |
| Enclosure (MVP) | BUY — 3D-printed or generic project box | Don’t invest in injection molding until 100+ units. |
| Firmware | BUILD — custom ESP-IDF application | Core IP. VAD + Opus + WiFi streaming + OTA. ~2 weeks to build. |
| STT | BUY — Deepgram Nova-2 API | Build (self-hosted Whisper) only at 200+ clinics when cost justifies GPU infra. |
| Diarization | BUY — Deepgram built-in | Build (pyannote) only if >2 speaker scenarios are common. |
| Noise reduction | BUILD — DeepFilterNet2 integration | Open-source, runs on CPU, 30-line integration. |
| LLM summarization | BUILD — Claude API + custom prompts | The structured output prompts are core IP. The API is bought. |
| Backend | BUILD — Fastify + PostgreSQL + BullMQ | Standard Moklabs stack. Nothing novel. |
| Frontend | BUILD — Next.js review dashboard | Simple CRUD + audio player. ~1 week. |
| PIMS integration | DEFER — Phase 3 | Fragmented market. Validate core value first. |
| Device management | DEFER — Phase 3 | Manual provisioning at 5 clinics. Automate at 50. |
| Analytics/BI | DEFER — Phase 3 | PostgreSQL queries are sufficient for MVP metrics. |
9. Assumptions That Must Be Validated
| # | Assumption | Validation Approach | Kill Threshold |
|---|---|---|---|
| 1 | Deepgram Nova-2 achieves >85% transcription accuracy (WER <15%) on Portuguese vet consultations with background noise | Record 10 hours in real clinics; run through pipeline; measure WER against manual transcript | WER >25% in typical conditions |
| 2 | 2-speaker diarization is sufficient for >90% of vet consultations | Survey design partners on typical room occupancy during consults | >30% of consults have >2 speakers regularly |
| 3 | ESP32-S3 + WiFi can maintain stable streaming for 15-min consultations | Stress test in 5 clinic environments with varying WiFi quality | >10% of sessions have >30s of lost audio |
| 4 | LLM-generated SOAP notes require <30% manual editing to be clinically useful | A/B test: LLM notes vs. manual notes. Measure edit distance and vet satisfaction | >50% of notes require substantial rewrite |
| 5 | Clinic WiFi infrastructure is adequate (or can be made adequate cheaply) | Survey/test WiFi in 10 clinics across SP metro | >40% of clinics need WiFi upgrade to function |
| 6 | Batch processing (2-5 min delay) is acceptable vs. real-time | Ask design partners directly. Monitor actual review timing in Phase 1 | >50% of vets want notes during the consult, not after |
| 7 | Calendar-based patient matching achieves >80% accuracy without PIMS integration | Test against real clinic schedules in Phase 1 | <60% accuracy without manual correction |
10. Recommended MVP Technical Path
Phase 1: Wizard-of-Oz (Weeks 5-10)
Tech stack: Jabra Speak 510 → Raspberry Pi recording → Deepgram API → Claude API → WhatsApp delivery
Purpose: Validate note quality and vet acceptance without firmware engineering.
- Record full consultations (consent obtained)
- Batch process: upload to Deepgram → get diarized transcript → Claude generates SOAP note
- Deliver notes via WhatsApp (or email) within 30 min
- Vet rates note quality (1-5 scale) and edits
- Engineer effort: 1 engineer, 1 week for pipeline + 1 week for iteration
Phase 2: Automated MVP (Weeks 11-18)
Tech stack:
- Device: ESP32-S3-Korvo-2 + custom firmware (ESP-IDF)
- Backend: Fastify + WebSocket ingestion + BullMQ + PostgreSQL
- Storage: Cloudflare R2
- STT: Deepgram Nova-2
- Summarization: Claude Sonnet 4 API
- Frontend: Next.js review dashboard
- Hosting: Single VPS (existing devnest or similar)
Build order:
- Week 11-12: ESP32 firmware (VAD + Opus + WebSocket streaming)
- Week 12-13: Backend ingestion service (WebSocket → R2 → BullMQ job)
- Week 13-14: Processing pipeline (Deepgram → Claude → PostgreSQL)
- Week 14-16: Review dashboard (Next.js, note display/edit, audio playback)
- Week 16-18: Integration testing in clinic, iterate on prompt engineering
Engineer effort: 1 senior full-stack engineer, 8 weeks. No specialized hardware engineer needed for MVP.
Phase 3: Design Partner Expansion (Weeks 19-26)
Add:
- OTA firmware updates
- Device health monitoring
- Google Calendar integration
- Longitudinal patient views
- Top 2 PIMS integrations (if validated)
- Basic clinic analytics dashboard
11. Key Open Questions
- Portuguese vet terminology corpus: Does a labeled dataset exist for Brazilian Portuguese veterinary speech? If not, how many hours of annotated audio do we need to fine-tune Deepgram’s custom vocabulary? (Estimate: 50-100 hours minimum.)
- Multi-room deployment: What happens when a clinic has 3-5 exam rooms? Each room needs a device. How do we handle vet identity when the same vet moves between rooms? Options: voice enrollment, room assignment in schedule, NFC tap.
- Exam sounds vs. speech: Can DeepFilterNet reliably separate an animal’s heartbeat/breathing sounds (which may be clinically relevant if described by the vet) from actual noise? Need to test.
- Offline mode: Some rural clinics in Brazil have intermittent internet. Is an SD card + batch upload model viable? If so, what’s the maximum acceptable delay? (24 hours? 48 hours?)
- Data residency: LGPD requires data processing in Brazil (or with adequate safeguards for international transfer). Deepgram and Claude API process data in the US. Need to verify: (a) is a Data Processing Agreement sufficient, or (b) do we need to self-host STT in Brazil? This could force the Whisper self-hosting decision earlier than cost alone would justify.
12. Final Recommendation
VALIDATE FIRST — then build.
The technology is straightforward. Every component is proven. The integration is standard systems engineering, not R&D. A single senior engineer can build the full MVP in 8 weeks.
But the technical risk is secondary to the market risk. Before writing a single line of firmware:
- Complete Phase 0 interviews (market validation). If vets don’t care enough to pay R$300/mo, the best audio pipeline in the world is worthless.
- Run the Wizard-of-Oz with off-the-shelf mics. This validates note quality and vet acceptance with 1 week of engineering, not 8.
- Only then invest in custom firmware and the full automated pipeline.
The architecture proposed in the research brief is sound. My corrections are at the edges: drop on-device encryption (TLS is sufficient), defer PIMS integration (use calendar matching), and add DeepFilterNet for noise handling. The core design — dumb device, smart cloud, batch processing — is the right call.
Do not assign an engineer to this until Phase 0 interviews show >60% strong interest and WTP >R$200/mo. Until then, this is a CPO problem, not a CTO problem.
Assessment complete. Ready for consolidation into the MOKA-568 executive brief.