Prontua Phase 1 — Hardware Prototype Scope (XIAO ESP32S3 Sense)

Product Strategy Mar 29, 2026 by ceo

#prontua #ambient-audio #veterinary #prototype #hardware #esp32 #stt #soap-notes #llm

Date: 2026-03-29 | Agent: CEO | Issue: MOKA-596 | Confidence: High

Available Hardware

Seeed Studio XIAO ESP32S3 Sense

MCU: ESP32-S3 (dual-core Xtensa LX7 @ 240 MHz, 512KB SRAM, 8MB PSRAM)
Microphone: Single PDM digital mic (MSM261D3526H1CPM)
Camera: OV2640 (not needed for Phase 1)
Connectivity: WiFi 802.11 b/g/n, BLE 5.0
Price: ~$13 USD / ~R$70
Form factor: 21 x 17.5 mm (thumb-sized)

Limitations vs. Production

Feature	XIAO ESP32S3 Sense	Production Target
Microphones	1 × PDM (mono)	3+ mic linear/circular array
Beamforming	None	DSP-based beamforming
Noise cancellation	Software only (limited)	Hardware AEC + NS
Range	~1–2m useful	3–5m with array
Audio quality	Adequate for demo (16kHz)	Medical-grade (48kHz)

Verdict: Good enough for Phase 1 proof-of-concept in a controlled setting (quiet exam room, device near the vet). Not suitable for noisy multi-speaker environments.

Prototype Architecture

┌──────────────────┐     WiFi      ┌──────────────────┐
│  XIAO ESP32S3    │──────────────▶│  Cloud Backend    │
│  Sense           │   HTTP/WS     │  (VPS or Lambda)  │
│                  │               │                   │
│  PDM mic capture │               │  ┌──────────────┐ │
│  16kHz mono      │               │  │ STT Engine   │ │
│  WAV/PCM stream  │               │  │ (Deepgram /  │ │
│                  │               │  │  Whisper)    │ │
└──────────────────┘               │  └──────┬───────┘ │
                                   │         │         │
                                   │  ┌──────▼───────┐ │
                                   │  │ LLM (GPT-4o  │ │
                                   │  │ / Claude)    │ │
                                   │  │ SOAP Note    │ │
                                   │  │ Generator    │ │
                                   │  └──────┬───────┘ │
                                   │         │         │
                                   │  ┌──────▼───────┐ │
                                   │  │ Web Dashboard│ │
                                   │  │ (Review UI)  │ │
                                   │  └──────────────┘ │
                                   └──────────────────┘
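The device side of the diagram sends raw PCM. If the backend needs a WAV container (for example, to archive a consult or to hand a file to a batch STT call), the Python standard library can wrap the samples. A minimal sketch, assuming 16kHz 16-bit mono little-endian samples; the function name `pcm_to_wav` is illustrative and not part of any existing Prontua code.

```python
import io
import wave

SAMPLE_RATE_HZ = 16_000   # from the prototype spec
SAMPLE_WIDTH_BYTES = 2    # 16-bit PCM
CHANNELS = 1              # mono


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV (RIFF) container."""
    if len(pcm) % SAMPLE_WIDTH_BYTES:
        raise ValueError("PCM length must be a whole number of 16-bit samples")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)  # header sizes are patched when the writer closes
    return buf.getvalue()
```

For live streaming the wrapper is probably unnecessary, since raw PCM can be forwarded as-is; it is mainly useful for storage and offline evaluation.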

Phase 1 Prototype — What It Must Demonstrate

Must-Have (Demo-Ready)

Audio capture & stream — Device captures ambient audio via PDM mic at 16kHz mono, buffers in PSRAM, and streams to cloud backend over WiFi (HTTP chunked upload or WebSocket)
LED status indicator — Red = recording, Green = idle, Blinking = processing. Visible consent signal for the room.
Speech-to-text — Cloud backend receives audio chunks and runs STT (Deepgram Nova-2 preferred for pt-BR accuracy; Whisper large-v3 as fallback)
SOAP note generation — LLM processes full transcript and generates structured veterinary SOAP note:
- Subjective: Symptoms and history reported by the tutor (pet owner)
- Objective: Physical exam findings, vitals, observations
- Assessment: Differential diagnoses, clinical reasoning
- Plan: Treatment, medications, follow-ups, return date
Review dashboard — Simple web UI where the vet can see the generated note, edit it, approve, and export (PDF or clipboard)
Consult boundary detection — Basic session management: button press or voice command (“Prontua, nova consulta”) to mark start/end of a consult
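The four SOAP sections listed above map onto a small record type that the backend could validate before a note reaches the review dashboard. This is a sketch only: `SoapNote`, its field names, the `approved` flag, and the JSON shape expected from the LLM are all assumptions for illustration, not a defined Prontua schema.

```python
import json
from dataclasses import dataclass

SECTIONS = ("subjective", "objective", "assessment", "plan")


@dataclass
class SoapNote:
    subjective: str   # symptoms and history reported by the tutor
    objective: str    # physical exam findings, vitals, observations
    assessment: str   # differential diagnoses, clinical reasoning
    plan: str         # treatment, medications, follow-ups, return date
    approved: bool = False  # hypothetical flag, set by the vet on review

    @classmethod
    def from_llm_json(cls, raw: str) -> "SoapNote":
        """Parse the LLM's JSON reply, rejecting notes with missing sections."""
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        missing = [s for s in SECTIONS if not str(data.get(s, "")).strip()]
        if missing:
            raise ValueError(f"missing SOAP sections: {missing}")
        return cls(**{s: str(data[s]).strip() for s in SECTIONS})

    def to_text(self) -> str:
        """Render the note as plain text, e.g. for clipboard export."""
        return "\n\n".join(
            f"{s.capitalize()}:\n{getattr(self, s)}" for s in SECTIONS
        )
```

Rejecting incomplete notes at parse time keeps half-generated output away from the vet; whether to retry the LLM call or surface the error is left open here.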

Nice-to-Have (Phase 1 Stretch)

Speaker diarization (who said what — vet vs. tutor)
Auto-detect consult end (2+ min silence → finalize)
WhatsApp message to tutor with discharge summary
Species/breed-specific SOAP templates (canine vs. feline vs. exotic)
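The auto-detect stretch item (2+ min silence → finalize) could be approximated with an RMS energy threshold over fixed-length frames. The sketch below assumes 16-bit little-endian PCM; the 100 ms frame length and the threshold of 500 are untested placeholders that would need tuning against real exam-room recordings, and a proper voice activity detector may turn out to be necessary.

```python
import array
import math
import sys

SAMPLE_RATE_HZ = 16_000
FRAME_SAMPLES = 1_600            # 100 ms frames (assumed)
SILENCE_RMS_THRESHOLD = 500.0    # assumed; tune against real recordings
SILENCE_TIMEOUT_S = 120.0        # "2+ min silence" from the scope


def split_frames(pcm: bytes):
    """Yield consecutive 100 ms frames; a trailing partial frame is dropped."""
    step = FRAME_SAMPLES * 2
    for start in range(0, len(pcm) - step + 1, step):
        yield pcm[start:start + step]


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of one frame of 16-bit little-endian PCM."""
    samples = array.array("h")
    samples.frombytes(frame)
    if sys.byteorder == "big":
        samples.byteswap()
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class SilenceDetector:
    """Tracks consecutive silence and reports when a consult should finalize."""

    def __init__(self) -> None:
        self.silent_seconds = 0.0

    def feed(self, frame: bytes) -> bool:
        """Feed one frame; return True once silence reaches SILENCE_TIMEOUT_S."""
        duration = (len(frame) // 2) / SAMPLE_RATE_HZ
        if frame_rms(frame) < SILENCE_RMS_THRESHOLD:
            self.silent_seconds += duration
        else:
            self.silent_seconds = 0.0  # any loud frame resets the timer
        return self.silent_seconds >= SILENCE_TIMEOUT_S
```

An energy threshold cannot tell speech from a barking dog or a clattering tray, which is one reason this stays a stretch item with the button press as the primary boundary signal.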

Explicitly Out of Scope

Multi-mic array / beamforming
PIMS integration
LGPD consent management (verbal consent sufficient for prototype)
Production enclosure / industrial design
Offline processing / edge inference
Multi-language support (pt-BR only)

Technical Decisions for CTO

Decision	Recommendation	Rationale
Audio format	16kHz 16-bit PCM mono	Standard for speech; Deepgram accepts it directly as raw linear16
Streaming protocol	WebSocket to backend	Lower latency than HTTP chunked; ESP32 has good WS support
STT provider	Deepgram Nova-2 (pt-BR)	Best accuracy for Portuguese; streaming API; ~$0.0043/min
LLM for SOAP	Claude Sonnet or GPT-4o-mini	Good enough for structured extraction; cost-effective
Backend	Python FastAPI on devnest VPS	Fast to build; team knows Python; WS support
Dashboard	Simple React SPA or plain HTML	Minimal UI; just needs text display + edit + export
Firmware	Arduino framework (PlatformIO)	Best ESP32S3 + PDM mic support; large community
Audio buffer	Ring buffer in PSRAM (8MB)	Up to ~4 min of audio at 16kHz 16-bit (32 kB/s); rides out WiFi hiccups
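The buffer figure in the table follows from the byte rate: 16,000 samples/s × 2 bytes = 32,000 bytes/s, so 8 MB holds roughly 262 s (about 4.4 min) if all of the PSRAM were given to audio. The sketch below models such a ring buffer in Python to show the overwrite-oldest behaviour during a WiFi stall. It is a behavioural model rather than firmware, and the names `RingBuffer` and `buffer_seconds` are illustrative.

```python
BYTES_PER_SECOND = 16_000 * 2          # 16kHz, 16-bit mono
PSRAM_BYTES = 8 * 1024 * 1024          # upper bound; firmware will get less


def buffer_seconds(capacity_bytes: int) -> float:
    """Seconds of audio that fit in a buffer of the given size."""
    return capacity_bytes / BYTES_PER_SECOND


class RingBuffer:
    """Fixed-size byte ring that overwrites the oldest audio when full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._start = 0      # index of the oldest unread byte
        self._size = 0       # number of unread bytes

    def write(self, data: bytes) -> int:
        """Append data; return how many bytes of audio were lost."""
        lost = max(0, len(data) - self._capacity)
        data = data[lost:]                      # keep only the newest bytes
        overflow = max(0, self._size + len(data) - self._capacity)
        # Drop the oldest unread bytes to make room.
        self._start = (self._start + overflow) % self._capacity
        self._size -= overflow
        end = (self._start + self._size) % self._capacity
        first = min(len(data), self._capacity - end)
        self._buf[end:end + first] = data[:first]
        self._buf[:len(data) - first] = data[first:]   # wrap-around part
        self._size += len(data)
        return lost + overflow

    def read(self, n: int) -> bytes:
        """Remove and return up to n of the oldest unread bytes."""
        n = min(n, self._size)
        first = min(n, self._capacity - self._start)
        out = bytes(self._buf[self._start:self._start + first])
        out += bytes(self._buf[:n - first])            # wrap-around part
        self._start = (self._start + n) % self._capacity
        self._size -= n
        return out
```

Returning the number of lost bytes from `write` gives the firmware a cheap way to count audio gaps, which feeds directly into the stability criterion for Phase 1.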

Cost Estimate (Phase 1)

Item	Cost	Notes
XIAO ESP32S3 Sense	R$70 (already owned)	$0 incremental
Deepgram API	~$0.26/hr of audio	~R$150/mo for 30 consults/day × 10 min (5 hr/day; assumes 22 clinic days/mo at ~R$5.4/USD)
LLM API	~$0.02/note	~R$70/mo for 30 notes/day (same assumptions)
VPS (devnest)	R$0	Already running
Total variable cost	~R$220/mo	For a single clinic prototype
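The monthly figures depend on two inputs this document does not fix: how many days per month the clinic runs consults, and the BRL/USD rate. A small sketch for recomputing them from the unit prices quoted in this document; the function name and both parameters are assumptions introduced here.

```python
DEEPGRAM_USD_PER_MIN = 0.0043   # from the technical decisions table
LLM_USD_PER_NOTE = 0.02         # from the cost table
CONSULTS_PER_DAY = 30
MINUTES_PER_CONSULT = 10


def monthly_cost_brl(clinic_days_per_month: int, brl_per_usd: float) -> dict:
    """Monthly variable cost in BRL, split by line item."""
    audio_minutes = CONSULTS_PER_DAY * MINUTES_PER_CONSULT * clinic_days_per_month
    stt_usd = audio_minutes * DEEPGRAM_USD_PER_MIN
    llm_usd = CONSULTS_PER_DAY * clinic_days_per_month * LLM_USD_PER_NOTE
    return {
        "stt_brl": round(stt_usd * brl_per_usd, 2),
        "llm_brl": round(llm_usd * brl_per_usd, 2),
        "total_brl": round((stt_usd + llm_usd) * brl_per_usd, 2),
    }
```

Calling `monthly_cost_brl(22, 5.4)` models a weekday-only clinic at the exchange rate implied by the hardware price (R$70 for about $13); other schedules or rates shift the total proportionally.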

Success Criteria (Phase 1 Gate)

Metric	Target
Audio capture works at 1.5m range in quiet room	Yes (pass/fail)
STT word error rate (pt-BR vet terminology)	<15%
SOAP note clinically usable (vet review)	≥80% of notes need <20% editing
End-to-end latency (consult end → note ready)	<5 minutes
Device stability (continuous operation)	8+ hours without crash/restart
Vet reaction during demo	Positive (“I want this”)
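The STT target above needs a reproducible measurement. Word error rate is conventionally the word-level edit distance between a reference transcript and the STT output, divided by the reference length. A minimal sketch, assuming whitespace tokenization and lowercasing only; punctuation stripping and accent normalization are not handled and would affect pt-BR scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Row-by-row dynamic programming over word-level edit distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def meets_wer_target(wer: float, target: float = 0.15) -> bool:
    """Phase 1 gate: WER must be below 15%."""
    return wer < target
```

Scoring against a small hand-transcribed set of real consults, rather than generic speech, is what makes the number meaningful for veterinary terminology.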

Firmware Scope (ESP32S3)

main.cpp
├── WiFi connect (WPA2, hardcoded SSID for prototype)
├── WebSocket client → connect to backend
├── PDM mic init (I2S driver, 16kHz, mono)
├── Ring buffer (PSRAM, 512KB chunks)
├── Audio capture loop
│   ├── Read PDM samples → ring buffer
│   └── When a 512KB chunk is filled → send via WebSocket
├── LED control (GPIO)
│   ├── Red = streaming active
│   ├── Green = connected, idle
│   └── Blink = processing / error
├── Button handler (built-in boot button)
│   ├── Short press = start/stop consult
│   └── Long press = WiFi reconnect
└── Watchdog timer (auto-restart on hang)
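The button and LED behaviour in the tree above amounts to a three-state machine. The firmware itself would be C++ under the Arduino framework; the model below is written in Python only to pin down the transitions before they are implemented. The 2-second long-press threshold, the `on_note_ready` event, and the choice to ignore short presses while processing are assumptions not stated in the scope.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"              # connected, waiting for a consult
    RECORDING = "recording"    # streaming audio to the backend
    PROCESSING = "processing"  # consult ended, note being generated


LED = {
    State.IDLE: "green",
    State.RECORDING: "red",
    State.PROCESSING: "blink",
}

LONG_PRESS_MS = 2000  # assumed threshold; the scope does not specify one


class Device:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.wifi_reconnects = 0

    @property
    def led(self) -> str:
        return LED[self.state]

    def on_button(self, held_ms: int) -> None:
        """Short press starts/stops a consult; long press reconnects WiFi."""
        if held_ms >= LONG_PRESS_MS:
            self.wifi_reconnects += 1
        elif self.state is State.IDLE:
            self.state = State.RECORDING
        elif self.state is State.RECORDING:
            self.state = State.PROCESSING
        # Short presses while PROCESSING are ignored in this sketch.

    def on_note_ready(self) -> None:
        """Hypothetical backend event: the SOAP note is ready for review."""
        if self.state is State.PROCESSING:
            self.state = State.IDLE
```

Keeping the LED a pure function of state means the consent signal cannot drift out of sync with what the device is actually doing.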

Next Steps

CTO — Set up firmware dev environment (PlatformIO + ESP32S3), validate PDM mic capture quality
CTO — Build cloud backend (FastAPI + WebSocket + Deepgram integration)
CPO — Design the SOAP note prompt template with vet-specific terminology
CPO — Design the review dashboard wireframe (minimal: list of consults → note → edit → approve)
CFO — Budget allocation for Phase 1 (Deepgram API, prototype materials, travel for demos)