CTO Assessment — Ambient Audio Clinical Device: Technical Feasibility

Issue: MOKA-571 | Epic: MOKA-568 | Date: 2026-03-28 | Author: CTO Agent


1. Technical Viability Summary

Verdict: Technically feasible. No show-stoppers for MVP. Validate market first.

Every component in the proposed stack exists, is production-proven, and is available off-the-shelf. The novelty is in the integration and the veterinary-specific structured output — not in any individual technology. This is a systems integration challenge, not a research problem.

Key technical facts:

  • ESP32-S3 is a proven platform for audio capture with I2S mic arrays, WiFi streaming, and on-device VAD. Thousands of commercial products ship on this chip.
  • Deepgram Nova-2 handles Portuguese, includes diarization, and is noise-tolerant. At $0.0145/min it’s the best cost/accuracy tradeoff for batch processing.
  • Claude API excels at structured clinical summarization — SOAP note generation from transcript is a well-understood prompt engineering problem.
  • The entire backend stack (Fastify, PostgreSQL, Drizzle, BullMQ, R2) matches our existing infrastructure. Zero new technology to learn.

The hard problems are at the edges: noisy clinic audio, veterinary terminology accuracy, and PIMS integration diversity. None are unsolvable, but each requires iterative validation.


2. Edge vs. Cloud Tradeoffs

What MUST run on-device (ESP32-S3)

| Function | Rationale |
|---|---|
| Voice Activity Detection (VAD) | Reduces bandwidth by 60-80%. WebRTC VAD (libfvad) runs on ESP32 with <1ms latency. Transmit only speech segments — saves cost and privacy. |
| Opus encoding | 24kbps mono at 16kHz. ESP32-S3 handles this natively. Reduces bandwidth from ~256kbps (raw PCM) to 24kbps. |
| TLS 1.3 transport encryption | Non-negotiable for LGPD. ESP32-S3 supports TLS via mbedTLS. Protects audio in transit. |
| LED status indicator control | Legal/trust requirement. Device must visibly signal when capturing. GPIO-driven, trivial. |
| Buffering + reconnection | WiFi drops happen. Device needs a 30-60s local buffer (PSRAM on the S3 supports this) and automatic reconnection with chunk resumption. |
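
The chunk-resumption half of this can be sketched on the server side. A minimal version in TypeScript, assuming the device tags each Opus chunk with a monotonically increasing sequence number (these names are illustrative, not an existing API):

```typescript
// Server-side resume bookkeeping for one device session. Chunks can arrive
// out of order after a reconnect, so track the full set of received sequence
// numbers, not just a counter.
type SessionState = { received: Set<number> };

// Highest N such that chunks 0..N have all arrived; -1 if chunk 0 is missing.
function highestContiguousSeq(state: SessionState): number {
  let seq = -1;
  while (state.received.has(seq + 1)) seq++;
  return seq;
}

function onChunk(state: SessionState, seq: number): void {
  state.received.add(seq);
}

// On reconnect, tell the device which sequence number to resume from.
function resumeFrom(state: SessionState): number {
  return highestContiguousSeq(state) + 1;
}
```

On reconnect the server replies with `resumeFrom(state)` and the device re-sends from its PSRAM ring buffer starting at that sequence.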

What SHOULD NOT run on-device

| Function | Rationale |
|---|---|
| Speech-to-text (STT) | Whisper-tiny runs on ESP32-S3 but with unacceptable accuracy (roughly 60-70% for Portuguese). Cloud STT is 95%+ accurate. |
| Speaker diarization | Requires neural models (pyannote, NeMo). Infeasible on ESP32. Cloud-only. |
| LLM summarization | Obviously cloud. No edge LLM can generate structured clinical notes. |
| Patient matching | Requires calendar/PIMS data. Server-side join. |
| On-device encryption of audio content (AES-256-GCM) | The research brief suggests encrypting audio on-device before transmission. This is redundant — TLS already encrypts in transit. AES on-device adds complexity (key management, firmware updates for key rotation) with no security benefit over TLS. Encrypt at rest on the server instead. |

Recommendation

Keep the device dumb: VAD + Opus + TLS + buffer. Everything else is cloud. This minimizes firmware complexity, enables OTA updates for the cloud pipeline without touching devices, and keeps hardware cost at ~$35.


3. Audio Pipeline Assessment

STT Options (Ranked)

| Provider | Accuracy (Portuguese) | Diarization | Cost/min | Latency (batch) | Verdict |
|---|---|---|---|---|---|
| Deepgram Nova-2 | ~92-95% | Built-in | $0.0145 | ~0.3x real-time | Best choice for MVP |
| AssemblyAI Universal-2 | ~91-94% | Built-in | $0.0150 | ~0.5x real-time | Strong alternative |
| Whisper Large v3 (self-hosted) | ~93-96% | No (separate) | ~$0.005 (GPU cost) | ~1x real-time | Cost-optimal at scale but adds GPU infra |
| Google Chirp 2 | ~90-93% | Built-in | $0.016 | ~0.4x real-time | Good but pricier |

Recommendation: Start with Deepgram Nova-2. Diarization included, excellent Portuguese support, simple API, batch-friendly. Switch to self-hosted Whisper only if STT cost exceeds $2k/mo (i.e., >300 clinics).
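
For orientation, a batch transcription request against Deepgram's pre-recorded audio endpoint might look like the sketch below. Parameter names should be verified against the current Deepgram docs; the `transcribe` helper and the presigned-URL flow are assumptions, not existing code:

```typescript
// Batch STT call against Deepgram's pre-recorded audio endpoint (sketch).
const DEEPGRAM_URL = "https://api.deepgram.com/v1/listen";

function buildSttUrl(base: string = DEEPGRAM_URL): string {
  const params = new URLSearchParams({
    model: "nova-2",
    language: "pt-BR",
    diarize: "true",      // built-in speaker diarization
    smart_format: "true", // punctuation, numerals, etc.
  });
  return `${base}?${params}`;
}

// Send a stored audio file by URL reference (an R2 presigned URL, for example).
async function transcribe(audioUrl: string, apiKey: string): Promise<unknown> {
  const res = await fetch(buildSttUrl(), {
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: audioUrl }),
  });
  if (!res.ok) throw new Error(`Deepgram error: ${res.status}`);
  return res.json(); // diarized words sit under results.channels[0].alternatives[0]
}
```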

Speaker Diarization

For the MVP (2-speaker: vet + tutor), Deepgram’s built-in diarization is sufficient (~5-8% DER for 2 speakers). No need for a separate diarization pipeline.

When to upgrade: If we need >2 speaker separation (e.g., multiple vets in a room, vet students), switch to:

  • pyannote 3.1 (open-source, state-of-art, self-hosted) — best accuracy
  • NeMo MSDD (NVIDIA, open-source) — better for streaming scenarios

Both require GPU. Cross that bridge at Phase 3, not Phase 2.

Noise Handling in Clinical Environments

This is the #1 technical risk. Vet clinics are noisy: barking, whimpering, equipment beeps, multiple rooms, foot traffic.

Mitigation stack (layered):

  1. Hardware level: The Korvo-2’s 3-mic linear array supports basic beamforming. This helps but is not sufficient alone. Consider upgrading to a 4-mic circular array (e.g., ReSpeaker 4-Mic Array for ESP32) for better spatial filtering. Cost delta: ~$10.
  2. On-device preprocessing: WebRTC VAD already filters silence. For MVP, this is enough.
  3. Cloud noise reduction: Run DeepFilterNet2 (open-source, real-time capable, ~30ms latency) as a preprocessing step before STT. This removes stationary and non-stationary noise while preserving speech. Runs on CPU — no GPU needed.
  4. STT robustness: Nova-2 and Whisper are trained on noisy data. They handle moderate noise well. The preprocessing step handles the extreme cases.
  5. Structured prompt engineering: The LLM summarization step can be instructed to flag low-confidence segments rather than hallucinate.

Validation approach: Record 10 hours of real clinic audio in Phase 1 (Wizard-of-Oz). Measure WER with and without DeepFilterNet. If WER degrades >10pp in noisy conditions, invest in custom beamforming. If <10pp, the software stack is sufficient.
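
The WER measurement in this validation loop is a word-level Levenshtein distance between the pipeline output and the manual transcript; a minimal scorer:

```typescript
// Word error rate: (substitutions + insertions + deletions) / reference length,
// computed via word-level Levenshtein distance.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // DP table: d[i][j] = edit distance between ref[0..i) and hyp[0..j).
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}
```

A score of 0.25 means one word in four is wrong; the kill threshold above corresponds to `wer(...) > 0.25` in typical conditions.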


4. Contextual Retrieval & Summarization

SOAP Note Generation Pipeline

Transcript (timestamped, diarized)
  → Context injection (patient history, species, breed, age, reason for visit)
  → Claude API (structured output, SOAP format)
  → Confidence scoring (flag uncertain segments)
  → Human review interface
  → Approved note → PostgreSQL + PIMS sync
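
The pipeline above can be expressed as a single job handler with the external calls injected, which keeps the orchestration testable without Deepgram, Claude, or a live database. All names here are placeholders, not existing code:

```typescript
// Orchestration of one encounter's processing job (a BullMQ worker would
// call processEncounter with real implementations of these dependencies).
interface Deps {
  transcribe(audioUrl: string): Promise<string>;       // Deepgram step
  fetchContext(encounterId: string): Promise<string>;  // patient history, species, breed
  summarize(transcript: string, context: string): Promise<{ soap: object; confidence: number }>;
  save(encounterId: string, note: { soap: object; confidence: number }): Promise<void>;
}

async function processEncounter(deps: Deps, encounterId: string, audioUrl: string) {
  const transcript = await deps.transcribe(audioUrl);
  const context = await deps.fetchContext(encounterId);
  const note = await deps.summarize(transcript, context);
  await deps.save(encounterId, note); // note then awaits human review in the dashboard
  return note;
}
```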

Prompt architecture: Use a system prompt with:

  • SOAP note schema (Subjective, Objective, Assessment, Plan)
  • Species-specific terminology guidelines
  • Exam finding categories for the species
  • Previous visit summary (if available)
  • Clinic’s preferred medication/treatment vocabulary

Model choice: Claude Sonnet 4 for cost-efficiency. SOAP notes from a 15-min consult transcript are ~500-800 tokens of output from ~2,000-4,000 tokens of input, which works out to a few cents per note at current Sonnet pricing. Reserve Opus for complex multi-visit longitudinal summaries.

Calendar/PIMS Integration for Patient Matching

This is harder than it looks. Brazilian vet PIMS landscape is fragmented:

| PIMS | Market Share (est.) | API Available | Integration Difficulty |
|---|---|---|---|
| Provet Cloud | 5-10% (premium) | REST API | Low |
| SimplesVet | 15-20% | Limited API | Medium |
| VetSmart/Vetwork | 10-15% | No public API | High (scraping) |
| Custom/spreadsheet | 40-50% | None | N/A |

MVP approach: Don’t integrate with PIMS at all for Phase 1-2. Instead:

  1. Integrate with Google Calendar (most clinics use it for scheduling)
  2. Match appointment time window → device audio window → patient
  3. Vet confirms/corrects the match in the review UI
  4. Build PIMS integrations in Phase 3 for the top 2-3 systems

This sidesteps the fragmentation problem and validates the core value (SOAP notes) independently of PIMS connectivity.
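
Step 2 reduces to picking the appointment whose time window overlaps the audio capture window the most; a minimal sketch (types and names are illustrative):

```typescript
// Time-window matching: choose the calendar appointment with the largest
// overlap against the device's audio capture window.
interface Appointment { patientId: string; start: number; end: number } // epoch ms

function overlapMs(aStart: number, aEnd: number, bStart: number, bEnd: number): number {
  return Math.max(0, Math.min(aEnd, bEnd) - Math.max(aStart, bStart));
}

function matchPatient(
  appointments: Appointment[],
  audioStart: number,
  audioEnd: number,
): string | null {
  let best: Appointment | null = null;
  let bestOverlap = 0;
  for (const appt of appointments) {
    const o = overlapMs(appt.start, appt.end, audioStart, audioEnd);
    if (o > bestOverlap) { best = appt; bestOverlap = o; }
  }
  // Zero overlap returns null and falls through to manual confirmation.
  return best ? best.patientId : null;
}
```

A null result (or a wrong match) is handled by the vet in the review UI, which is step 3 above.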

Longitudinal Patient History Architecture

-- Core schema (simplified)
patients (id, clinic_id, name, species, breed, birth_date, weight_history)
encounters (id, patient_id, vet_id, device_id, started_at, ended_at)
transcripts (id, encounter_id, raw_text, diarized_segments JSONB, confidence)
clinical_notes (id, encounter_id, soap JSONB, status, reviewed_by, reviewed_at)
treatment_plans (id, encounter_id, medications JSONB, follow_ups JSONB)

Key design decisions:

  • JSONB for structured clinical data — SOAP categories, medications, follow-ups. Flexible schema for vet-specific variations.
  • Encounter as the core entity — every audio capture is an encounter. Patient matching is a separate step (can be corrected).
  • Immutable transcripts — never modify the raw transcript. Clinical notes are a separate, editable layer.
  • Longitudinal queries via materialized views — “all encounters for patient X, sorted by date, with SOAP summaries” is a common query. Pre-compute it.

This schema supports the future moat (structured vet clinical data corpus) without over-engineering for Phase 2.


5. Top 5 Technical Risks (Ranked)

Risk 1: Audio Quality in Noisy Clinical Environments

  • Probability: HIGH | Impact: HIGH
  • Why it matters: If transcription accuracy drops below ~85% in real clinic conditions, notes are unusable and vets lose trust.
  • Mitigation: Layered noise handling (see Section 3). Validate with real clinic recordings in Phase 1.
  • Validation: Record 10+ hours in 3+ clinics. Measure WER. Kill threshold: WER >25% in typical conditions.

Risk 2: Veterinary Terminology Accuracy

  • Probability: MEDIUM | Impact: HIGH
  • Why it matters: “Amoxicillin” vs “amoxicilina,” breed names in Portuguese, medication dosages — errors here are clinically dangerous.
  • Mitigation: Custom vocabulary boosting in Deepgram (supported). Vet-specific prompt engineering for Claude. Human-in-the-loop for V1 (vet reviews every note).
  • Validation: Build a vet terminology test set (200+ terms). Measure STT accuracy specifically on clinical vocabulary. Target: >90% accuracy on vet terms.

Risk 3: WiFi Reliability in Clinic Buildings

  • Probability: MEDIUM | Impact: MEDIUM
  • Why it matters: Older clinic buildings have poor WiFi. If the device can’t stream, it can’t capture.
  • Mitigation: 30-60s local buffer in ESP32 PSRAM (8MB available). Chunk-based upload with resume. Opus at 24kbps is ~180KB/min, so a 15-min consult is under 3MB, easily buffered.
  • Validation: Test in 5 real clinics. If >2 have WiFi issues, add SD card fallback (~$3 hardware cost increase).
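
A quick sanity check on the buffering arithmetic, assuming 24kbps Opus mono:

```typescript
// Buffering math for 24 kbps Opus mono.
const OPUS_KBPS = 24;
const bytesPerSecond = (OPUS_KBPS * 1000) / 8;         // 3,000 B/s
const bytesPerMinute = bytesPerSecond * 60;            // 180,000 B (~176 KiB) per minute
const consultBytes = bytesPerMinute * 15;              // ~2.7 MB per 15-min consult
const psramBytes = 8 * 1024 * 1024;                    // 8 MiB PSRAM on the S3
const bufferableMinutes = psramBytes / bytesPerMinute; // ~46 minutes, worst case
```

So a full consult fits comfortably in PSRAM even if WiFi is down for the whole visit; the 30-60s buffer target is conservative.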

Risk 4: PIMS Integration Fragmentation

  • Probability: HIGH | Impact: MEDIUM
  • Why it matters: Without PIMS integration, the longitudinal value proposition is weaker — data lives in two systems.
  • Mitigation: Defer PIMS integration. Use calendar matching for MVP. Build the review UI as the primary interface. PIMS integration is a Phase 3 problem.
  • Validation: Ask design partners: “Would you use this even without PIMS integration?” If >70% say yes, defer safely.

Risk 5: Device Provisioning and Fleet Management

  • Probability: LOW (MVP) → HIGH (Scale) | Impact: MEDIUM
  • Why it matters: At 5 clinics it’s manual. At 500 clinics you need OTA firmware updates, device health monitoring, and zero-touch provisioning.
  • Mitigation: Use ESP-IDF’s built-in OTA update mechanism. Implement device heartbeat (HTTPS GET every 5 min). For MVP, provision manually.
  • Validation: Build OTA update pipeline in Phase 2. Test firmware update across 5 devices simultaneously.

6. Privacy & Security Architecture

LGPD Compliance Strategy

| Requirement | Implementation |
|---|---|
| Legal basis | Explicit consent (LGPD Art. 7, I). Tutor signs consent form per visit. Digital consent option in review UI. |
| Data minimization | Device captures only speech segments (VAD). Raw audio deleted after 30 days. Transcripts retained; audio does not need to be. |
| Right to deletion | API endpoint to delete all data for a specific encounter or patient. Cascade delete: audio → transcript → notes. |
| Data portability | Export all patient data as structured JSON. Vet can take their data if they leave. |
| DPO (Data Protection Officer) | Required if processing sensitive data at scale. Appoint when >50 clinics. |
| Incident response | 72-hour breach notification to ANPD. Logging of all data access. |
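
The right-to-deletion cascade might be orchestrated as below, with the R2 and PostgreSQL operations injected; every function name here is a placeholder, not existing code:

```typescript
// LGPD deletion cascade for one encounter: audio, then transcript, then notes,
// then an audit-log entry.
interface DeletionDeps {
  deleteAudio(encounterId: string): Promise<void>;      // R2 object(s)
  deleteTranscript(encounterId: string): Promise<void>; // transcripts row
  deleteNotes(encounterId: string): Promise<void>;      // clinical_notes rows
  logAccess(event: string, encounterId: string): Promise<void>; // audit trail
}

async function eraseEncounter(deps: DeletionDeps, encounterId: string): Promise<void> {
  // Order matters: remove the raw audio first, so a partial failure never
  // leaves sensitive audio behind after its index rows are already gone.
  await deps.deleteAudio(encounterId);
  await deps.deleteTranscript(encounterId);
  await deps.deleteNotes(encounterId);
  await deps.logAccess("encounter.erased", encounterId);
}
```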

Encryption Strategy

| Layer | Method | Notes |
|---|---|---|
| In transit | TLS 1.3 (device → server) | Standard. ESP32-S3 supports via mbedTLS. |
| At rest (audio) | AES-256-GCM server-side | R2 supports server-side encryption. |
| At rest (database) | Column-level encryption for PII (community PostgreSQL has no native TDE) | Tutor name, phone, email encrypted. Clinical data (SOAP notes) can be plaintext for query performance. |
| Key management | AWS KMS or Cloudflare Workers KV for key storage | Don’t roll your own. |
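
Column-level PII encryption can use Node's built-in crypto; a sketch with AES-256-GCM (in practice the key comes from KMS, never from application config, and the 32-byte key below is a stand-in):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Column-level AES-256-GCM for PII fields (tutor name, phone, email).
function encryptPii(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce, the recommended size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store iv + auth tag + ciphertext together in one column.
  return Buffer.concat([iv, tag, ct]).toString("base64");
}

function decryptPii(stored: string, key: Buffer): string {
  const buf = Buffer.from(stored, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ct = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // decryption throws if ciphertext was tampered with
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```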

What NOT to do (correcting the research brief)

The brief suggests on-device AES-256-GCM encryption before transmission. Don’t do this.

Reasons:

  1. TLS 1.3 already provides authenticated encryption in transit.
  2. On-device encryption requires key distribution to devices — a much harder problem than server-side encryption.
  3. Key rotation requires firmware updates or a key exchange protocol — added complexity.
  4. If the device encrypts, the server must decrypt before processing — so it has the key anyway.
  5. The threat model that on-device encryption addresses (malicious server operator) doesn’t apply when we own the server.

Correct approach: TLS in transit + server-side encryption at rest. Simple, standard, auditable.

Consent & Transparency

  • Physical: LED on device (green = idle, blue = capturing, red = error)
  • Digital: Consent checkbox in appointment scheduling or at check-in
  • Poster/signage: Required in exam room — “This room uses audio recording for clinical documentation”
  • Opt-out: Any party can opt out at any time. Device paused via physical button or app toggle.

7. Scalability Analysis

Phase transitions that change the architecture

| Scale | Clinics | Architecture Change Required |
|---|---|---|
| MVP (Phase 2) | 3-5 | Single VPS. Direct WebSocket. Manual provisioning. |
| Early Growth (Phase 3) | 15-50 | Add load balancer. Move to managed PostgreSQL. Redis cluster. OTA update pipeline. |
| Growth | 50-200 | Multi-region ingestion (at least SP + RJ). Object storage partitioning by clinic. Background job workers scale horizontally. |
| Scale | 200-500 | Consider self-hosted Whisper (cost savings >$2k/mo). CDN for firmware distribution. Dedicated device management service. Multi-tenant isolation. |
| Large Scale | 500+ | Kubernetes or equivalent orchestration. Data warehouse for analytics. ML pipeline for custom models. Compliance team. |

What scales linearly (no architecture change)

  • Audio ingestion (WebSocket connections, one per device)
  • STT processing (Deepgram scales on their end)
  • LLM summarization (Claude API scales on their end)
  • Storage (R2 is effectively infinite)

What requires step-function changes

  • Database: Single PostgreSQL → read replicas → sharding by clinic_id at ~200 clinics
  • Device management: Manual → automated provisioning + OTA at ~50 devices
  • Monitoring: Manual → Grafana/Prometheus stack at ~20 clinics
  • Support: Founder-led → dedicated support person at ~50 clinics

Cost scaling

| Scale | Monthly Infra Cost | Per-Clinic Cost |
|---|---|---|
| 5 clinics | ~$740 | ~$148 |
| 50 clinics | ~$3,500 | ~$70 |
| 200 clinics | ~$10,000 | ~$50 |
| 500 clinics | ~$18,000 | ~$36 |

Infra cost per clinic decreases significantly with scale. The biggest cost driver is STT ($0.0145/min). At 500 clinics, self-hosted Whisper becomes a clear win.
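
A back-of-envelope check on that break-even point. Consults per day and VAD-trimmed speech minutes per consult are guessed inputs to vary, not measured data:

```typescript
// STT break-even: at what clinic count does Deepgram spend cross the floor
// where self-hosted Whisper starts to win?
const STT_COST_PER_MIN = 0.0145; // Deepgram Nova-2, batch
const SELF_HOST_FLOOR = 2000;    // $/mo threshold from Section 3

function sttCostPerClinic(
  consultsPerDay: number,
  speechMinPerConsult: number, // minutes of speech retained after VAD trimming
  daysPerMonth = 22,
): number {
  return consultsPerDay * speechMinPerConsult * daysPerMonth * STT_COST_PER_MIN;
}

function breakEvenClinics(consultsPerDay: number, speechMinPerConsult: number): number {
  return Math.ceil(SELF_HOST_FLOOR / sttCostPerClinic(consultsPerDay, speechMinPerConsult));
}
```

With ~3 consults/day and ~7 minutes of retained speech each, break-even lands near 300 clinics, consistent with the threshold in Section 3.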


8. Build vs. Buy

| Component | Decision | Rationale |
|---|---|---|
| Hardware (MVP) | BUY — ESP32-S3-Korvo-2 | Off-the-shelf dev board. Custom hardware only after PMF. |
| Enclosure (MVP) | BUY — 3D-printed or generic project box | Don’t invest in injection molding until 100+ units. |
| Firmware | BUILD — custom ESP-IDF application | Core IP. VAD + Opus + WiFi streaming + OTA. ~2 weeks to build. |
| STT | BUY — Deepgram Nova-2 API | Build (self-hosted Whisper) only at 200+ clinics when cost justifies GPU infra. |
| Diarization | BUY — Deepgram built-in | Build (pyannote) only if >2 speaker scenarios are common. |
| Noise reduction | BUILD — DeepFilterNet2 integration | Open-source, runs on CPU, 30-line integration. |
| LLM summarization | BUILD — Claude API + custom prompts | The structured output prompts are core IP. The API is bought. |
| Backend | BUILD — Fastify + PostgreSQL + BullMQ | Standard Moklabs stack. Nothing novel. |
| Frontend | BUILD — Next.js review dashboard | Simple CRUD + audio player. ~1 week. |
| PIMS integration | DEFER — Phase 3 | Fragmented market. Validate core value first. |
| Device management | DEFER — Phase 3 | Manual provisioning at 5 clinics. Automate at 50. |
| Analytics/BI | DEFER — Phase 3 | PostgreSQL queries are sufficient for MVP metrics. |

9. Assumptions That Must Be Validated

| # | Assumption | Validation Approach | Kill Threshold |
|---|---|---|---|
| 1 | Deepgram Nova-2 achieves >85% word accuracy (<15% WER) on Portuguese vet consultations with background noise | Record 10 hours in real clinics; run through pipeline; measure WER against manual transcript | WER >25% in typical conditions |
| 2 | 2-speaker diarization is sufficient for >90% of vet consultations | Survey design partners on typical room occupancy during consults | >30% of consults have >2 speakers regularly |
| 3 | ESP32-S3 + WiFi can maintain stable streaming for 15-min consultations | Stress test in 5 clinic environments with varying WiFi quality | >10% of sessions have >30s of lost audio |
| 4 | LLM-generated SOAP notes require <30% manual editing to be clinically useful | A/B test: LLM notes vs. manual notes. Measure edit distance and vet satisfaction | >50% of notes require substantial rewrite |
| 5 | Clinic WiFi infrastructure is adequate (or can be made adequate cheaply) | Survey/test WiFi in 10 clinics across SP metro | >40% of clinics need WiFi upgrade to function |
| 6 | Batch processing (2-5 min delay) is acceptable vs. real-time | Ask design partners directly. Monitor actual review timing in Phase 1 | >50% of vets want notes during the consult, not after |
| 7 | Calendar-based patient matching achieves >80% accuracy without PIMS integration | Test against real clinic schedules in Phase 1 | <60% accuracy without manual correction |

10. Phased Build Plan

Phase 1: Wizard-of-Oz (Weeks 5-10)

Tech stack: Jabra Speak 510 → Raspberry Pi recording → Deepgram API → Claude API → WhatsApp delivery

Purpose: Validate note quality and vet acceptance without firmware engineering.

  • Record full consultations (consent obtained)
  • Batch process: upload to Deepgram → get diarized transcript → Claude generates SOAP note
  • Deliver notes via WhatsApp (or email) within 30 min
  • Vet rates note quality (1-5 scale) and edits
  • Engineer effort: 1 engineer, 1 week for pipeline + 1 week for iteration

Phase 2: Automated MVP (Weeks 11-18)

Tech stack:

  • Device: ESP32-S3-Korvo-2 + custom firmware (ESP-IDF)
  • Backend: Fastify + WebSocket ingestion + BullMQ + PostgreSQL
  • Storage: Cloudflare R2
  • STT: Deepgram Nova-2
  • Summarization: Claude Sonnet 4 API
  • Frontend: Next.js review dashboard
  • Hosting: Single VPS (existing devnest or similar)

Build order:

  1. Week 11-12: ESP32 firmware (VAD + Opus + WebSocket streaming)
  2. Week 12-13: Backend ingestion service (WebSocket → R2 → BullMQ job)
  3. Week 13-14: Processing pipeline (Deepgram → Claude → PostgreSQL)
  4. Week 14-16: Review dashboard (Next.js, note display/edit, audio playback)
  5. Week 16-18: Integration testing in clinic, iterate on prompt engineering

Engineer effort: 1 senior full-stack engineer, 8 weeks. No specialized hardware engineer needed for MVP.

Phase 3: Design Partner Expansion (Weeks 19-26)

Add:

  • OTA firmware updates
  • Device health monitoring
  • Google Calendar integration
  • Longitudinal patient views
  • Top 2 PIMS integrations (if validated)
  • Basic clinic analytics dashboard

11. Key Open Questions

  1. Portuguese vet terminology corpus: Does a labeled dataset exist for Brazilian Portuguese veterinary speech? If not, how many hours of annotated audio do we need to fine-tune Deepgram’s custom vocabulary? (Estimate: 50-100 hours minimum.)

  2. Multi-room deployment: What happens when a clinic has 3-5 exam rooms? Each room needs a device. How do we handle vet identity when the same vet moves between rooms? Options: voice enrollment, room assignment in schedule, NFC tap.

  3. Exam sounds vs. speech: Can DeepFilterNet reliably separate an animal’s heartbeat/breathing sounds (which may be clinically relevant if described by the vet) from actual noise? Need to test.

  4. Offline mode: Some rural clinics in Brazil have intermittent internet. Is an SD card + batch upload model viable? If so, what’s the maximum acceptable delay? (24 hours? 48 hours?)

  5. Data residency: LGPD requires data processing in Brazil (or with adequate safeguards for international transfer). Deepgram and Claude API process data in the US. Need to verify: (a) is a Data Processing Agreement sufficient, or (b) do we need to self-host STT in Brazil? This could force the Whisper self-hosting decision earlier than cost alone would justify.


12. Final Recommendation

VALIDATE FIRST — then build.

The technology is straightforward. Every component is proven. The integration is standard systems engineering, not R&D. A single senior engineer can build the full MVP in 8 weeks.

But the technical risk is secondary to the market risk. Before writing a single line of firmware:

  1. Complete Phase 0 interviews (market validation). If vets don’t care enough to pay R$300/mo, the best audio pipeline in the world is worthless.
  2. Run the Wizard-of-Oz with off-the-shelf mics. This validates note quality and vet acceptance with 1 week of engineering, not 8.
  3. Only then invest in custom firmware and the full automated pipeline.

The architecture proposed in the research brief is sound. My corrections are at the edges: drop on-device encryption (TLS is sufficient), defer PIMS integration (use calendar matching), and add DeepFilterNet for noise handling. The core design — dumb device, smart cloud, batch processing — is the right call.

Do not assign an engineer to this until Phase 0 interviews show >60% strong interest and WTP >R$200/mo. Until then, this is a CPO problem, not a CTO problem.


Assessment complete. Ready for consolidation into the MOKA-568 executive brief.