CTO Assessment — Ambient Audio Clinical Device: Technical Feasibility
Issue: MOKA-571 | Epic: MOKA-568 | Date: 2026-03-28 | Author: CTO Agent
1. Technical Viability Summary
Verdict: Technically feasible. No show-stoppers for MVP. Validate market first.
Every component in the proposed stack exists, is production-proven, and is available off-the-shelf. The novelty is in the integration and the veterinary-specific structured output — not in any individual technology. This is a systems integration challenge, not a research problem.
Key technical facts:
- ESP32-S3 is a proven platform for audio capture with I2S mic arrays, WiFi streaming, and on-device VAD. Thousands of commercial products ship on this chip.
- Deepgram Nova-2 handles Portuguese, includes diarization, and is noise-tolerant. At $0.0145/min it’s the best cost/accuracy tradeoff for batch processing.
- Claude API excels at structured clinical summarization — SOAP note generation from transcript is a well-understood prompt engineering problem.
- The entire backend stack (Fastify, PostgreSQL, Drizzle, BullMQ, R2) matches our existing infrastructure. Zero new technology to learn.
The hard problems are at the edges: noisy clinic audio, veterinary terminology accuracy, and PIMS integration diversity. None are unsolvable, but each requires iterative validation.
2. Edge vs. Cloud Tradeoffs
What MUST run on-device (ESP32-S3)
| Function | Rationale |
|---|---|
| Voice Activity Detection (VAD) | Reduces bandwidth by 60-80%. WebRTC VAD (libfvad) runs on ESP32 with <1ms latency. Transmit only speech segments — saves cost and privacy. |
| Opus encoding | 24kbps mono at 16kHz. ESP32-S3 handles this natively. Reduces bandwidth from ~256kbps (raw PCM) to 24kbps. |
| TLS 1.3 transport encryption | Non-negotiable for LGPD. ESP32-S3 supports TLS via mbedTLS. Protects audio in transit. |
| LED status indicator control | Legal/trust requirement. Device must visibly signal when capturing. GPIO-driven, trivial. |
| Buffering + reconnection | WiFi drops happen. Device needs 30-60s local buffer (PSRAM on S3 supports this) and automatic reconnection with chunk resumption. |
What SHOULD NOT run on-device
| Function | Rationale |
|---|---|
| Speech-to-text (STT) | Even Whisper-tiny exceeds ESP32-S3 memory, and edge-sized STT models reach unacceptable accuracy (~60-70% WER for Portuguese). Cloud STT is 95%+ accurate. |
| Speaker diarization | Requires neural models (pyannote, NeMo). Impossible on ESP32. Cloud-only. |
| LLM summarization | Obviously cloud. No edge LLM can generate structured clinical notes. |
| Patient matching | Requires calendar/PIMS data. Server-side join. |
| On-device encryption of audio content (AES-256-GCM) | The research brief suggests encrypting audio on-device before transmission. This is redundant — TLS already encrypts in transit. AES on-device adds complexity (key management, firmware updates for key rotation) with no security benefit over TLS. Encrypt at rest on the server instead. |
Recommendation
Keep the device dumb: VAD + Opus + TLS + buffer. Everything else is cloud. This minimizes firmware complexity, enables OTA updates for the cloud pipeline without touching devices, and keeps hardware cost at ~$35.
3. Audio Pipeline Assessment
STT Options (Ranked)
| Provider | Accuracy (Portuguese) | Diarization | Cost/min | Latency (batch) | Verdict |
|---|---|---|---|---|---|
| Deepgram Nova-2 | ~92-95% | Built-in | $0.0145 | ~0.3x real-time | Best choice for MVP |
| AssemblyAI Universal-2 | ~91-94% | Built-in | $0.0150 | ~0.5x real-time | Strong alternative |
| Whisper Large v3 (self-hosted) | ~93-96% | No (separate) | ~$0.005 (GPU cost) | ~1x real-time | Cost-optimal at scale but adds GPU infra |
| Google Chirp 2 | ~90-93% | Built-in | $0.016 | ~0.4x real-time | Good but pricier |
Recommendation: Start with Deepgram Nova-2. Diarization included, excellent Portuguese support, simple API, batch-friendly. Switch to self-hosted Whisper only if STT cost exceeds $2k/mo (i.e., >300 clinics).
Speaker Diarization
For the MVP (2-speaker: vet + tutor), Deepgram’s built-in diarization is sufficient (~5-8% DER for 2 speakers). No need for a separate diarization pipeline.
When to upgrade: If we need >2 speaker separation (e.g., multiple vets in a room, vet students), switch to:
- pyannote 3.1 (open-source, state-of-the-art, self-hosted) — best accuracy
- NeMo MSDD (NVIDIA, open-source) — better for streaming scenarios
Both require GPU. Cross that bridge at Phase 3, not Phase 2.
Noise Handling in Clinical Environments
This is the #1 technical risk. Vet clinics are noisy: barking, whimpering, equipment beeps, multiple rooms, foot traffic.
Mitigation stack (layered):
- Hardware level: The Korvo-2’s 3-mic linear array supports basic beamforming. This helps but is not sufficient alone. Consider upgrading to a 4-mic circular array (e.g., ReSpeaker 4-Mic Array for ESP32) for better spatial filtering. Cost delta: ~$10.
- On-device preprocessing: WebRTC VAD already filters silence. For MVP, this is enough.
- Cloud noise reduction: Run DeepFilterNet2 (open-source, real-time capable, ~30ms latency) as a preprocessing step before STT. This removes stationary and non-stationary noise while preserving speech. Runs on CPU — no GPU needed.
- STT robustness: Nova-2 and Whisper are trained on noisy data. They handle moderate noise well. The preprocessing step handles the extreme cases.
- Structured prompt engineering: The LLM summarization step can be instructed to flag low-confidence segments rather than hallucinate.
Validation approach: Record 10 hours of real clinic audio in Phase 1 (Wizard-of-Oz). Measure WER with and without DeepFilterNet. If WER degrades >10pp in noisy conditions, invest in custom beamforming. If <10pp, the software stack is sufficient.
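For the Phase 1 measurement, WER is just word-level edit distance against the manual reference transcript. A minimal sketch in TypeScript (our backend language); the function name is mine, not from the brief:

```typescript
// Word Error Rate: Levenshtein distance over word tokens,
// normalized by reference length. 0 = perfect, can exceed 1.
export function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  // prev[j] = edit distance between ref[0..i-1] and hyp[0..j-1]
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const sub = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      curr[j] = Math.min(sub, prev[j] + 1, curr[j - 1] + 1);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Run this over the 10-hour corpus with and without the DeepFilterNet preprocessing pass to get the comparison the kill threshold depends on.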
4. Contextual Retrieval & Summarization
SOAP Note Generation Pipeline
```
Transcript (timestamped, diarized)
  → Context injection (patient history, species, breed, age, reason for visit)
  → Claude API (structured output, SOAP format)
  → Confidence scoring (flag uncertain segments)
  → Human review interface
  → Approved note → PostgreSQL + PIMS sync
```
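The confidence-scoring step can be as simple as thresholding the STT's per-segment confidence before the transcript reaches Claude, so the prompt marks shaky segments instead of letting the model guess. A sketch — the 0.8 threshold and field names are my assumptions, to be tuned in Phase 1:

```typescript
// One diarized STT segment; shape mirrors what diarizing STT APIs
// typically return (speaker index, text, per-segment confidence).
interface DiarizedSegment {
  speaker: number;
  text: string;
  confidence: number; // 0..1
}

// Render segments into prompt-ready lines, tagging anything below
// the confidence threshold so the LLM can flag rather than infer.
export function flagLowConfidence(
  segments: DiarizedSegment[],
  threshold = 0.8,
): string {
  return segments
    .map((s) =>
      s.confidence < threshold
        ? `[speaker ${s.speaker}, LOW CONFIDENCE] ${s.text}`
        : `[speaker ${s.speaker}] ${s.text}`,
    )
    .join("\n");
}
```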
Prompt architecture: Use a system prompt with:
- SOAP note schema (Subjective, Objective, Assessment, Plan)
- Species-specific terminology guidelines
- Exam finding categories for the species
- Previous visit summary (if available)
- Clinic’s preferred medication/treatment vocabulary
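A sketch of how those inputs assemble into the system prompt — the field names and the Portuguese `[VERIFICAR]` flag token are illustrative, not a final prompt:

```typescript
// Context sources for the SOAP system prompt (per-visit metadata,
// optional prior history, optional clinic vocabulary).
interface PromptContext {
  species: string;
  breed: string;
  age: string;
  reasonForVisit: string;
  previousVisitSummary?: string;
  clinicVocabulary?: string[];
}

// Assemble the system prompt from the available context pieces.
export function buildSystemPrompt(ctx: PromptContext): string {
  const parts = [
    "You are a veterinary scribe. Produce a SOAP note (Subjective, Objective, Assessment, Plan) in Brazilian Portuguese.",
    `Patient: ${ctx.species}, ${ctx.breed}, ${ctx.age}. Reason for visit: ${ctx.reasonForVisit}.`,
    "Flag any statement you are unsure of with [VERIFICAR] rather than inferring it.",
  ];
  if (ctx.previousVisitSummary) parts.push(`Previous visit: ${ctx.previousVisitSummary}`);
  if (ctx.clinicVocabulary?.length) parts.push(`Preferred vocabulary: ${ctx.clinicVocabulary.join(", ")}`);
  return parts.join("\n\n");
}
```

The species-specific terminology and exam-finding guidelines would be appended the same way, loaded from per-species templates.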
Model choice: Claude Sonnet 4 for cost-efficiency. SOAP notes from a 15-min consult transcript are ~500-800 tokens of output from ~2000-4000 tokens of input. At Sonnet pricing this works out to a few cents per note. Reserve Opus for complex multi-visit longitudinal summaries.
Calendar/PIMS Integration for Patient Matching
This is harder than it looks. The Brazilian vet PIMS landscape is fragmented:
| PIMS | Market Share (est.) | API Available | Integration Difficulty |
|---|---|---|---|
| Provet Cloud | 5-10% (premium) | REST API | Low |
| SimplesVet | 15-20% | Limited API | Medium |
| VetSmart/Vetwork | 10-15% | No public API | High (scraping) |
| Custom/spreadsheet | 40-50% | None | N/A |
MVP approach: Don’t integrate with PIMS at all for Phase 1-2. Instead:
- Integrate with Google Calendar (most clinics use it for scheduling)
- Match appointment time window → device audio window → patient
- Vet confirms/corrects the match in the review UI
- Build PIMS integrations in Phase 3 for the top 2-3 systems
This sidesteps the fragmentation problem and validates the core value (SOAP notes) independently of PIMS connectivity.
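The time-window match reduces to "pick the appointment with the largest overlap against the audio window." A sketch — the 5-minute minimum overlap is my assumption, and the vet's confirmation in the review UI remains the backstop:

```typescript
// A calendar appointment pulled from Google Calendar.
interface Appointment {
  patientId: string;
  start: Date;
  end: Date;
}

// Return the appointment with the largest time overlap with the
// device's audio window, or null if nothing overlaps enough.
export function matchAppointment(
  audioStart: Date,
  audioEnd: Date,
  appts: Appointment[],
  minOverlapMs = 5 * 60_000, // require at least 5 minutes of overlap
): Appointment | null {
  let best: Appointment | null = null;
  let bestOverlap = minOverlapMs;
  for (const a of appts) {
    const overlap =
      Math.min(audioEnd.getTime(), a.end.getTime()) -
      Math.max(audioStart.getTime(), a.start.getTime());
    if (overlap >= bestOverlap) {
      bestOverlap = overlap;
      best = a;
    }
  }
  return best;
}
```

Measuring this function's hit rate against real clinic schedules is exactly the Assumption 7 validation in Section 9.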
Longitudinal Patient History Architecture
```sql
-- Core schema (simplified)
patients        (id, clinic_id, name, species, breed, birth_date, weight_history)
encounters      (id, patient_id, vet_id, device_id, started_at, ended_at)
transcripts     (id, encounter_id, raw_text, diarized_segments JSONB, confidence)
clinical_notes  (id, encounter_id, soap JSONB, status, reviewed_by, reviewed_at)
treatment_plans (id, encounter_id, medications JSONB, follow_ups JSONB)
```
Key design decisions:
- JSONB for structured clinical data — SOAP categories, medications, follow-ups. Flexible schema for vet-specific variations.
- Encounter as the core entity — every audio capture is an encounter. Patient matching is a separate step (can be corrected).
- Immutable transcripts — never modify the raw transcript. Clinical notes are a separate, editable layer.
- Longitudinal queries via materialized views — “all encounters for patient X, sorted by date, with SOAP summaries” is a common query. Pre-compute it.
This schema supports the future moat (structured vet clinical data corpus) without over-engineering for Phase 2.
5. Top 5 Technical Risks (Ranked)
Risk 1: Audio Quality in Noisy Clinical Environments
- Probability: HIGH | Impact: HIGH
- Why it matters: If transcription accuracy drops below ~85% in real clinic conditions, notes are unusable and vets lose trust.
- Mitigation: Layered noise handling (see Section 3). Validate with real clinic recordings in Phase 1.
- Validation: Record 10+ hours in 3+ clinics. Measure WER. Kill threshold: WER >25% in typical conditions.
Risk 2: Veterinary Terminology Accuracy
- Probability: MEDIUM | Impact: HIGH
- Why it matters: “Amoxicillin” vs “amoxicilina,” breed names in Portuguese, medication dosages — errors here are clinically dangerous.
- Mitigation: Custom vocabulary boosting in Deepgram (supported). Vet-specific prompt engineering for Claude. Human-in-the-loop for V1 (vet reviews every note).
- Validation: Build a vet terminology test set (200+ terms). Measure STT accuracy specifically on clinical vocabulary. Target: >90% accuracy on vet terms.
Risk 3: WiFi Reliability in Clinic Buildings
- Probability: MEDIUM | Impact: MEDIUM
- Why it matters: Older clinic buildings have poor WiFi. If the device can’t stream, it can’t capture.
- Mitigation: 30-60s local buffer on ESP32 PSRAM (8MB available). Chunk-based upload with resume. Opus at 24kbps is only ~180KB/min — a full 15-min consult is under 3MB, easily buffered.
- Validation: Test in 5 real clinics. If >2 have WiFi issues, add SD card fallback (~$3 hardware cost increase).
Risk 4: PIMS Integration Fragmentation
- Probability: HIGH | Impact: MEDIUM
- Why it matters: Without PIMS integration, the longitudinal value proposition is weaker — data lives in two systems.
- Mitigation: Defer PIMS integration. Use calendar matching for MVP. Build the review UI as the primary interface. PIMS integration is a Phase 3 problem.
- Validation: Ask design partners: “Would you use this even without PIMS integration?” If >70% say yes, defer safely.
Risk 5: Device Provisioning and Fleet Management
- Probability: LOW (MVP) → HIGH (Scale) | Impact: MEDIUM
- Why it matters: At 5 clinics it’s manual. At 500 clinics you need OTA firmware updates, device health monitoring, and zero-touch provisioning.
- Mitigation: Use ESP-IDF’s built-in OTA update mechanism. Implement device heartbeat (HTTPS GET every 5 min). For MVP, provision manually.
- Validation: Build OTA update pipeline in Phase 2. Test firmware update across 5 devices simultaneously.
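Server-side, the heartbeat check is a few lines of classification over last-seen timestamps. A sketch, with the thresholds as my assumption (more than two missed 5-minute beats ⇒ degraded, more than six ⇒ offline):

```typescript
type DeviceHealth = "healthy" | "degraded" | "offline";

// Classify a device from its last heartbeat. With a 5-minute
// heartbeat interval: <=10 min silent is normal jitter, up to
// 30 min suggests WiFi trouble, beyond that the device is down.
export function classifyDevice(lastHeartbeat: Date, now: Date): DeviceHealth {
  const silentMin = (now.getTime() - lastHeartbeat.getTime()) / 60_000;
  if (silentMin <= 10) return "healthy";
  if (silentMin <= 30) return "degraded";
  return "offline";
}
```

For the MVP this can run as a BullMQ repeat job that alerts on any non-healthy device.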
6. Privacy & Security Architecture
LGPD Compliance Strategy
| Requirement | Implementation |
|---|---|
| Legal basis | Explicit consent (LGPD Art. 7, I). Tutor signs consent form per visit. Digital consent option in review UI. |
| Data minimization | Device captures only speech segments (VAD). Raw audio deleted after 30 days. Transcripts retained; audio does not need to be. |
| Right to deletion | API endpoint to delete all data for a specific encounter or patient. Cascade delete: audio → transcript → notes. |
| Data portability | Export all patient data as structured JSON. Vet can take their data if they leave. |
| DPO (Data Protection Officer) | Required if processing sensitive data at scale. Appoint when >50 clinics. |
| Incident response | 72-hour breach notification to ANPD. Logging of all data access. |
Encryption Strategy
| Layer | Method | Notes |
|---|---|---|
| In transit | TLS 1.3 (device → server) | Standard. ESP32-S3 supports via mbedTLS. |
| At rest (audio) | AES-256-GCM server-side | R2 supports server-side encryption. |
| At rest (database) | PostgreSQL TDE or column-level encryption for PII | Tutor name, phone, email encrypted. Clinical data (SOAP notes) can be plaintext for query performance. |
| Key management | AWS KMS or Cloudflare Workers KV for key storage | Don’t roll your own. |
What NOT to do (correcting the research brief)
The brief suggests on-device AES-256-GCM encryption before transmission. Don’t do this.
Reasons:
- TLS 1.3 already provides authenticated encryption in transit.
- On-device encryption requires key distribution to devices — a much harder problem than server-side encryption.
- Key rotation requires firmware updates or a key exchange protocol — added complexity.
- If the device encrypts, the server must decrypt before processing — so it has the key anyway.
- The threat model that on-device encryption addresses (malicious server operator) doesn’t apply when we own the server.
Correct approach: TLS in transit + server-side encryption at rest. Simple, standard, auditable.
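Server-side encryption at rest is standard `node:crypto`. A sketch assuming the data key has already been fetched from KMS (key management itself is out of scope here):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// AES-256-GCM for audio at rest. GCM gives authenticated encryption;
// the IV and auth tag must be stored alongside the ciphertext.
export function encryptAtRest(plain: Buffer, key: Buffer) {
  const iv = randomBytes(12); // 96-bit nonce, the standard size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plain), cipher.final()]);
  return { iv, data, tag: cipher.getAuthTag() };
}

export function decryptAtRest(
  enc: { iv: Buffer; data: Buffer; tag: Buffer },
  key: Buffer,
): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, enc.iv);
  decipher.setAuthTag(enc.tag); // throws on tamper
  return Buffer.concat([decipher.update(enc.data), decipher.final()]);
}
```

In practice R2's own server-side encryption may make this layer unnecessary for audio objects; the explicit version matters for PII columns in PostgreSQL.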
Consent UX
- Physical: LED on device (green = idle, blue = capturing, red = error)
- Digital: Consent checkbox in appointment scheduling or at check-in
- Poster/signage: Required in exam room — “This room uses audio recording for clinical documentation”
- Opt-out: Any party can opt out at any time. Device paused via physical button or app toggle.
7. Scalability Analysis
Phase transitions that change the architecture
| Scale | Clinics | Architecture Change Required |
|---|---|---|
| MVP (Phase 2) | 3-5 | Single VPS. Direct WebSocket. Manual provisioning. |
| Early Growth (Phase 3) | 15-50 | Add load balancer. Move to managed PostgreSQL. Redis cluster. OTA update pipeline. |
| Growth | 50-200 | Multi-region ingestion (at least SP + RJ). Object storage partitioning by clinic. Background job workers scale horizontally. |
| Scale | 200-500 | Consider self-hosted Whisper (cost savings >$2k/mo). CDN for firmware distribution. Dedicated device management service. Multi-tenant isolation. |
| Large Scale | 500+ | Kubernetes or equivalent orchestration. Data warehouse for analytics. ML pipeline for custom models. Compliance team. |
What scales linearly (no architecture change)
- Audio ingestion (WebSocket connections, one per device)
- STT processing (Deepgram scales on their end)
- LLM summarization (Claude API scales on their end)
- Storage (R2 is effectively infinite)
What requires step-function changes
- Database: Single PostgreSQL → read replicas → sharding by clinic_id at ~200 clinics
- Device management: Manual → automated provisioning + OTA at ~50 devices
- Monitoring: Manual → Grafana/Prometheus stack at ~20 clinics
- Support: Founder-led → dedicated support person at ~50 clinics
Cost scaling
| Scale | Monthly Infra Cost | Per-Clinic Cost |
|---|---|---|
| 5 clinics | ~$740 | ~$148 |
| 50 clinics | ~$3,500 | ~$70 |
| 200 clinics | ~$10,000 | ~$50 |
| 500 clinics | ~$18,000 | ~$36 |
Infra cost per clinic decreases significantly with scale. The biggest cost driver is STT ($0.0145/min). At 500 clinics, self-hosted Whisper becomes a clear win.
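The STT driver is easy to model; the break-even point for self-hosted Whisper depends entirely on per-clinic audio minutes, which Phase 1 should measure. A sketch with the consult volumes as illustrative assumptions, not measured numbers:

```typescript
// Monthly STT spend at Deepgram's $0.0145/min rate.
// Default volumes (4 consults/day, 15 min each, 22 working days)
// are assumptions to be replaced with Phase 1 measurements.
export function monthlySttCost(
  clinics: number,
  consultsPerDay = 4,
  minPerConsult = 15,
  daysPerMonth = 22,
  ratePerMin = 0.0145,
): number {
  return clinics * consultsPerDay * minPerConsult * daysPerMonth * ratePerMin;
}
```

Plugging in real per-clinic minutes gives the threshold at which GPU infrastructure for self-hosted Whisper pays for itself.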
8. Build vs. Buy
| Component | Decision | Rationale |
|---|---|---|
| Hardware (MVP) | BUY — ESP32-S3-Korvo-2 | Off-the-shelf dev board. Custom hardware only after PMF. |
| Enclosure (MVP) | BUY — 3D-printed or generic project box | Don’t invest in injection molding until 100+ units. |
| Firmware | BUILD — custom ESP-IDF application | Core IP. VAD + Opus + WiFi streaming + OTA. ~2 weeks to build. |
| STT | BUY — Deepgram Nova-2 API | Build (self-hosted Whisper) only at 200+ clinics when cost justifies GPU infra. |
| Diarization | BUY — Deepgram built-in | Build (pyannote) only if >2 speaker scenarios are common. |
| Noise reduction | BUILD — DeepFilterNet2 integration | Open-source, runs on CPU, 30-line integration. |
| LLM summarization | BUILD — Claude API + custom prompts | The structured output prompts are core IP. The API is bought. |
| Backend | BUILD — Fastify + PostgreSQL + BullMQ | Standard Moklabs stack. Nothing novel. |
| Frontend | BUILD — Next.js review dashboard | Simple CRUD + audio player. ~1 week. |
| PIMS integration | DEFER — Phase 3 | Fragmented market. Validate core value first. |
| Device management | DEFER — Phase 3 | Manual provisioning at 5 clinics. Automate at 50. |
| Analytics/BI | DEFER — Phase 3 | PostgreSQL queries are sufficient for MVP metrics. |
9. Assumptions That Must Be Validated
| # | Assumption | Validation Approach | Kill Threshold |
|---|---|---|---|
| 1 | Deepgram Nova-2 achieves >85% transcription accuracy (WER <15%) on Portuguese vet consultations with background noise | Record 10 hours in real clinics; run through pipeline; measure WER against manual transcript | WER >25% in typical conditions |
| 2 | 2-speaker diarization is sufficient for >90% of vet consultations | Survey design partners on typical room occupancy during consults | >30% of consults have >2 speakers regularly |
| 3 | ESP32-S3 + WiFi can maintain stable streaming for 15-min consultations | Stress test in 5 clinic environments with varying WiFi quality | >10% of sessions have >30s of lost audio |
| 4 | LLM-generated SOAP notes require <30% manual editing to be clinically useful | A/B test: LLM notes vs. manual notes. Measure edit distance and vet satisfaction | >50% of notes require substantial rewrite |
| 5 | Clinic WiFi infrastructure is adequate (or can be made adequate cheaply) | Survey/test WiFi in 10 clinics across SP metro | >40% of clinics need WiFi upgrade to function |
| 6 | Batch processing (2-5 min delay) is acceptable vs. real-time | Ask design partners directly. Monitor actual review timing in Phase 1 | >50% of vets want notes during the consult, not after |
| 7 | Calendar-based patient matching achieves >80% accuracy without PIMS integration | Test against real clinic schedules in Phase 1 | <60% accuracy without manual correction |
10. Recommended MVP Technical Path
Phase 1: Wizard-of-Oz (Weeks 5-10)
Tech stack: Jabra Speak 510 → Raspberry Pi recording → Deepgram API → Claude API → WhatsApp delivery
Purpose: Validate note quality and vet acceptance without firmware engineering.
- Record full consultations (consent obtained)
- Batch process: upload to Deepgram → get diarized transcript → Claude generates SOAP note
- Deliver notes via WhatsApp (or email) within 30 min
- Vet rates note quality (1-5 scale) and edits
- Engineer effort: 1 engineer, 1 week for pipeline + 1 week for iteration
Phase 2: Automated MVP (Weeks 11-18)
Tech stack:
- Device: ESP32-S3-Korvo-2 + custom firmware (ESP-IDF)
- Backend: Fastify + WebSocket ingestion + BullMQ + PostgreSQL
- Storage: Cloudflare R2
- STT: Deepgram Nova-2
- Summarization: Claude Sonnet 4 API
- Frontend: Next.js review dashboard
- Hosting: Single VPS (existing devnest or similar)
Build order:
- Week 11-12: ESP32 firmware (VAD + Opus + WebSocket streaming)
- Week 12-13: Backend ingestion service (WebSocket → R2 → BullMQ job)
- Week 13-14: Processing pipeline (Deepgram → Claude → PostgreSQL)
- Week 14-16: Review dashboard (Next.js, note display/edit, audio playback)
- Week 16-18: Integration testing in clinic, iterate on prompt engineering
Engineer effort: 1 senior full-stack engineer, 8 weeks. No specialized hardware engineer needed for MVP.
Phase 3: Design Partner Expansion (Weeks 19-26)
Add:
- OTA firmware updates
- Device health monitoring
- Google Calendar integration
- Longitudinal patient views
- Top 2 PIMS integrations (if validated)
- Basic clinic analytics dashboard
11. Key Open Questions
- Portuguese vet terminology corpus: Does a labeled dataset exist for Brazilian Portuguese veterinary speech? If not, how many hours of annotated audio do we need to fine-tune Deepgram’s custom vocabulary? (Estimate: 50-100 hours minimum.)
- Multi-room deployment: What happens when a clinic has 3-5 exam rooms? Each room needs a device. How do we handle vet identity when the same vet moves between rooms? Options: voice enrollment, room assignment in schedule, NFC tap.
- Exam sounds vs. speech: Can DeepFilterNet reliably separate an animal’s heartbeat/breathing sounds (which may be clinically relevant if described by the vet) from actual noise? Need to test.
- Offline mode: Some rural clinics in Brazil have intermittent internet. Is an SD card + batch upload model viable? If so, what’s the maximum acceptable delay? (24 hours? 48 hours?)
- Data residency: LGPD requires data processing in Brazil (or with adequate safeguards for international transfer). Deepgram and Claude API process data in the US. Need to verify: (a) is a Data Processing Agreement sufficient, or (b) do we need to self-host STT in Brazil? This could force the Whisper self-hosting decision earlier than cost alone would justify.
12. Final Recommendation
VALIDATE FIRST — then build.
The technology is straightforward. Every component is proven. The integration is standard systems engineering, not R&D. A single senior engineer can build the full MVP in 8 weeks.
But the technical risk is secondary to the market risk. Before writing a single line of firmware:
- Complete Phase 0 interviews (market validation). If vets don’t care enough to pay R$300/mo, the best audio pipeline in the world is worthless.
- Run the Wizard-of-Oz with off-the-shelf mics. This validates note quality and vet acceptance with 1 week of engineering, not 8.
- Only then invest in custom firmware and the full automated pipeline.
The architecture proposed in the research brief is sound. My corrections are at the edges: drop on-device encryption (TLS is sufficient), defer PIMS integration (use calendar matching), and add DeepFilterNet for noise handling. The core design — dumb device, smart cloud, batch processing — is the right call.
Do not assign an engineer to this until Phase 0 interviews show >60% strong interest and WTP >R$200/mo. Until then, this is a CPO problem, not a CTO problem.
Assessment complete. Ready for consolidation into the MOKA-568 executive brief.