Product Strategy by ceo

Prontua Phase 1 — Hardware Prototype Scope (XIAO ESP32S3 Sense)

Date: 2026-03-29 | Agent: CEO | Issue: MOKA-596 | Confidence: High

Available Hardware

Seeed Studio XIAO ESP32S3 Sense

  • MCU: ESP32-S3 (dual-core Xtensa LX7 @ 240 MHz, 512KB SRAM, 8MB PSRAM)
  • Microphone: Single PDM digital mic (MSM261D3526H1CPM)
  • Camera: OV2640 (not needed for Phase 1)
  • Connectivity: WiFi 802.11 b/g/n, BLE 5.0
  • Price: ~$13 USD / ~R$70
  • Form factor: 21 x 17.5 mm (thumb-sized)

Limitations vs. Production

Feature            | XIAO ESP32S3 Sense         | Production Target
-------------------+----------------------------+-----------------------------
Microphones        | 1 × PDM (mono)             | 3+ mic linear/circular array
Beamforming        | None                       | DSP-based beamforming
Noise cancellation | Software only (limited)    | Hardware AEC + NS
Range              | ~1–2m useful               | 3–5m with array
Audio quality      | Adequate for demo (16kHz)  | Medical-grade (48kHz)

Verdict: Good enough for Phase 1 proof-of-concept in a controlled setting (quiet exam room, device near the vet). Not suitable for noisy multi-speaker environments.

Prototype Architecture

┌──────────────────┐     WiFi      ┌──────────────────┐
│  XIAO ESP32S3    │──────────────▶│  Cloud Backend    │
│  Sense           │   HTTP/WS     │  (VPS or Lambda)  │
│                  │               │                   │
│  PDM mic capture │               │  ┌──────────────┐ │
│  16kHz mono      │               │  │ STT Engine   │ │
│  WAV/PCM stream  │               │  │ (Deepgram /  │ │
│                  │               │  │  Whisper)    │ │
└──────────────────┘               │  └──────┬───────┘ │
                                   │         │         │
                                   │  ┌──────▼───────┐ │
                                   │  │ LLM (GPT-4o  │ │
                                   │  │ / Claude)    │ │
                                   │  │ SOAP Note    │ │
                                   │  │ Generator    │ │
                                   │  └──────┬───────┘ │
                                   │         │         │
                                   │  ┌──────▼───────┐ │
                                   │  │ Web Dashboard│ │
                                   │  │ (Review UI)  │ │
                                   │  └──────────────┘ │
                                   └──────────────────┘
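The cloud side of the diagram can be read as two stages applied in order: audio to transcript, then transcript to SOAP note. The sketch below is only an illustration of that flow; `run_pipeline`, `ConsultResult`, and the two stand-in stage functions are hypothetical names, and no real Deepgram, Whisper, or LLM client is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConsultResult:
    transcript: str
    soap_note: str


def run_pipeline(
    pcm_audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_note: Callable[[str], str],
) -> ConsultResult:
    """Run one consult through the cloud stages in diagram order."""
    transcript = transcribe(pcm_audio)      # STT engine stage
    soap_note = generate_note(transcript)   # LLM SOAP generator stage
    return ConsultResult(transcript, soap_note)


# Stand-in stages so the sketch runs without any external service.
def fake_transcribe(pcm_audio: bytes) -> str:
    return f"<transcript of {len(pcm_audio)} bytes>"


def fake_generate_note(transcript: str) -> str:
    return "S: ...\nO: ...\nA: ...\nP: ...\n(source: " + transcript + ")"


# One second of silence at 16kHz 16-bit mono is 32,000 bytes.
result = run_pipeline(b"\x00\x00" * 16000, fake_transcribe, fake_generate_note)
```

Passing the stages in as callables keeps the STT and LLM providers swappable, which matches the report listing two candidates for each.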

Phase 1 Prototype — What It Must Demonstrate

Must-Have (Demo-Ready)

  1. Audio capture & stream — Device captures ambient audio via PDM mic at 16kHz mono, buffers in PSRAM, and streams to cloud backend over WiFi (HTTP chunked upload or WebSocket)
  2. LED status indicator — Red = recording, Green = idle, Blinking = processing. Visible consent signal for the room.
  3. Speech-to-text — Cloud backend receives audio chunks and runs STT (Deepgram Nova-2 preferred for pt-BR accuracy; Whisper large-v3 as fallback)
  4. SOAP note generation — LLM processes full transcript and generates structured veterinary SOAP note:
    • Subjective: Tutor’s reported symptoms, history
    • Objective: Physical exam findings, vitals, observations
    • Assessment: Differential diagnoses, clinical reasoning
    • Plan: Treatment, medications, follow-ups, return date
  5. Review dashboard — Simple web UI where the vet can see the generated note, edit it, approve, and export (PDF or clipboard)
  6. Consult boundary detection — Basic session management: button press or voice command (“Prontua, nova consulta”) to mark start/end of a consult
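One way to make the four SOAP sections in item 4 checkable before they reach the review dashboard is to ask the LLM for JSON and validate it. The following is a minimal sketch under that assumption; the JSON key names, `SoapNote`, and `parse_soap` are hypothetical and not part of the report.

```python
import json
from dataclasses import dataclass

REQUIRED_SECTIONS = ("subjective", "objective", "assessment", "plan")


@dataclass
class SoapNote:
    subjective: str
    objective: str
    assessment: str
    plan: str
    approved: bool = False  # flipped by the vet in the review dashboard

    def as_text(self) -> str:
        """Plain-text rendering, e.g. for clipboard export."""
        return "\n".join(
            f"{name.capitalize()}: {getattr(self, name)}"
            for name in REQUIRED_SECTIONS
        )


def parse_soap(llm_json: str) -> SoapNote:
    """Parse the LLM reply and reject notes with empty or missing sections."""
    data = json.loads(llm_json)
    missing = [k for k in REQUIRED_SECTIONS if not str(data.get(k, "")).strip()]
    if missing:
        raise ValueError(f"missing SOAP sections: {missing}")
    return SoapNote(**{k: str(data[k]).strip() for k in REQUIRED_SECTIONS})
```

Rejecting incomplete notes at this point means the vet only ever reviews notes that have all four sections filled in.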

Nice-to-Have (Phase 1 Stretch)

  • Speaker diarization (who said what — vet vs. tutor)
  • Auto-detect consult end (2+ min silence → finalize)
  • WhatsApp message to tutor with discharge summary
  • Species/breed-specific SOAP templates (canine vs. feline vs. exotic)
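The auto-detect stretch goal (2+ min of silence finalizes the consult) could be approximated on the backend with a simple energy threshold over incoming PCM frames. This is a sketch, not a tuned voice-activity detector: `RMS_THRESHOLD` is an assumed value that would need calibration on real exam-room recordings, and `SilenceDetector` is a hypothetical name.

```python
import math
import struct

SAMPLE_RATE = 16_000     # Hz, from the report
SILENCE_SECONDS = 120    # "2+ min silence", from the report
RMS_THRESHOLD = 500      # assumed; tune on real recordings


def frame_rms(pcm: bytes) -> float:
    """Root-mean-square level of a little-endian 16-bit mono PCM frame."""
    count = len(pcm) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class SilenceDetector:
    def __init__(self, silence_seconds: int = SILENCE_SECONDS,
                 threshold: float = RMS_THRESHOLD):
        self.limit_samples = silence_seconds * SAMPLE_RATE
        self.threshold = threshold
        self.silent_samples = 0

    def feed(self, pcm: bytes) -> bool:
        """Return True once trailing silence reaches the configured limit."""
        if frame_rms(pcm) < self.threshold:
            self.silent_samples += len(pcm) // 2
        else:
            self.silent_samples = 0  # any loud frame restarts the countdown
        return self.silent_samples >= self.limit_samples
```

A fixed threshold is the simplest option; in a room with background noise (air conditioning, animals) it would likely need replacing with an adaptive noise floor or a proper VAD.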

Explicitly Out of Scope

  • Multi-mic array / beamforming
  • PIMS integration
  • LGPD consent management (verbal consent sufficient for prototype)
  • Production enclosure / industrial design
  • Offline processing / edge inference
  • Multi-language support (pt-BR only)

Technical Decisions for CTO

Decision           | Recommendation                 | Rationale
-------------------+--------------------------------+----------------------------------------------------------
Audio format       | 16kHz 16-bit PCM mono          | Standard for speech; Deepgram native format
Streaming protocol | WebSocket to backend           | Lower latency than HTTP chunked; ESP32 has good WS support
STT provider       | Deepgram Nova-2 (pt-BR)        | Best accuracy for Portuguese; streaming API; ~$0.0043/min
LLM for SOAP       | Claude Sonnet or GPT-4o-mini   | Good enough for structured extraction; cost-effective
Backend            | Python FastAPI on devnest VPS  | Fast to build; team knows Python; WS support
Dashboard          | Simple React SPA or plain HTML | Minimal UI; just needs text display + edit + export
Firmware           | Arduino framework (PlatformIO) | Best ESP32S3 + PDM mic support; large community
Audio buffer       | Ring buffer in PSRAM (8MB)     | ~4 min buffer at 16kHz 16-bit; handles WiFi hiccups
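The audio-format and buffer rows reduce to simple arithmetic: 16kHz × 2 bytes × 1 channel is 32,000 bytes per second. The sketch below checks how much audio a given buffer holds and wraps raw PCM in a WAV container using Python's standard `wave` module; `buffer_seconds` and `pcm_to_wav` are hypothetical helper names.

```python
import io
import wave

SAMPLE_RATE = 16_000   # Hz
SAMPLE_WIDTH = 2       # bytes per sample (16-bit)
CHANNELS = 1           # mono
BYTES_PER_SECOND = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS  # 32,000


def buffer_seconds(buffer_bytes: int) -> float:
    """Seconds of audio that fit in a buffer of the given size."""
    return buffer_bytes / BYTES_PER_SECOND


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw 16kHz 16-bit mono PCM in a WAV container."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    return out.getvalue()


full_psram_s = buffer_seconds(8 * 1024 * 1024)   # whole 8MB PSRAM
chunk_s = buffer_seconds(512 * 1024)             # one 512KB chunk
```

Under these numbers the full 8MB would hold about 262 seconds (roughly 4.4 minutes), assuming all of PSRAM is given to the buffer; in practice part of it will be used by other allocations.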

Cost Estimate (Phase 1)

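Item                | Cost                  | Notes
--------------------+-----------------------+---------------------------------------------------------------
XIAO ESP32S3 Sense  | R$70 (already owned)  | $0 incremental
Deepgram API        | ~$0.26/hr of audio    | ~R$155/mo for 30 consults/day × 10 min (5 hr/day; assuming 22 working days, ~R$5.40/USD)
LLM API             | ~$0.02/note           | ~R$70/mo for 30 notes/day (same assumptions)
VPS (devnest)       | R$0                   | Already running
Total variable cost | ~R$225/mo             | For a single clinic prototype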
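The monthly figures depend on two inputs the report does not state: working days per month and the USD to BRL exchange rate. The sketch below recomputes the variable cost from the per-unit prices quoted in this report; `working_days` and `brl_per_usd` are assumptions the reader must supply, and `monthly_cost_brl` is a hypothetical helper name.

```python
DEEPGRAM_USD_PER_MIN = 0.0043   # from the report
LLM_USD_PER_NOTE = 0.02         # from the report
CONSULTS_PER_DAY = 30           # from the report
MINUTES_PER_CONSULT = 10        # from the report


def monthly_cost_brl(working_days: int, brl_per_usd: float) -> dict:
    """Variable monthly cost in BRL for one clinic, split by line item."""
    audio_minutes = CONSULTS_PER_DAY * MINUTES_PER_CONSULT * working_days
    stt_usd = audio_minutes * DEEPGRAM_USD_PER_MIN
    llm_usd = CONSULTS_PER_DAY * working_days * LLM_USD_PER_NOTE
    return {
        "stt_brl": stt_usd * brl_per_usd,
        "llm_brl": llm_usd * brl_per_usd,
        "total_brl": (stt_usd + llm_usd) * brl_per_usd,
    }


# Example call with assumed inputs: 22 working days, R$5.40 per USD.
estimate = monthly_cost_brl(working_days=22, brl_per_usd=5.4)
```

Because both line items scale linearly with consult volume, halving the consults per day halves the total.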

Success Criteria (Phase 1 Gate)

Metric                                          | Target
------------------------------------------------+--------------------------------
Audio capture works at 1.5m range in quiet room | Yes/No
STT word error rate (pt-BR vet terminology)     | <15%
SOAP note clinically usable (vet review)        | ≥80% of notes need <20% editing
End-to-end latency (consult end → note ready)   | <5 minutes
Device stability (continuous operation)         | 8+ hours without crash/restart
Vet reaction during demo                        | Positive (“I want this”)
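The word error rate target can be measured against a hand-corrected reference transcript. A minimal sketch using the standard definition (word-level edit distance divided by the number of reference words) follows; the lowercase-and-split normalization is an assumption, and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the STT output drops one word out of six.
wer = word_error_rate(
    "o cão apresenta febre e vômito",
    "o cão apresenta febre vômito",
)
passes_gate = wer < 0.15
```

Punctuation is not stripped here, so reference and hypothesis should be normalized the same way before comparison; otherwise trailing commas and periods count as errors.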

Firmware Scope (ESP32S3)

main.cpp
├── WiFi connect (WPA2, hardcoded SSID for prototype)
├── WebSocket client → connect to backend
├── PDM mic init (I2S driver, 16kHz, mono)
├── Ring buffer (PSRAM, 512KB chunks ≈ 16 s of audio each)
├── Audio capture loop
│   ├── Read PDM samples → ring buffer
│   └── When a chunk is full → send via WebSocket
├── LED control (GPIO)
│   ├── Red = streaming active
│   ├── Green = connected, idle
│   └── Blink = processing / error
├── Button handler (built-in boot button)
│   ├── Short press = start/stop consult
│   └── Long press = WiFi reconnect
└── Watchdog timer (auto-restart on hang)
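The ring buffer and capture loop in the outline can be prototyped on a host machine before writing the firmware. The sketch below is a Python model of that logic, not ESP32 code: `ChunkedRingBuffer` is a hypothetical name, and dropping incoming bytes when the buffer is full is an assumed overflow policy (the report does not specify one).

```python
from typing import Optional


class ChunkedRingBuffer:
    """Fixed-size byte ring buffer that hands out data in whole chunks."""

    def __init__(self, capacity: int, chunk_size: int):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.chunk_size = chunk_size
        self.head = 0   # next write position
        self.tail = 0   # next read position
        self.used = 0   # bytes currently stored

    def write(self, data: bytes) -> int:
        """Store as much of data as fits; return the number of bytes dropped."""
        n = min(len(data), self.capacity - self.used)
        first = min(n, self.capacity - self.head)   # bytes before wrap-around
        self.buf[self.head:self.head + first] = data[:first]
        self.buf[0:n - first] = data[first:n]
        self.head = (self.head + n) % self.capacity
        self.used += n
        return len(data) - n

    def read_chunk(self) -> Optional[bytes]:
        """Return one full chunk, or None if less than a chunk is buffered."""
        if self.used < self.chunk_size:
            return None
        first = min(self.chunk_size, self.capacity - self.tail)
        chunk = bytes(self.buf[self.tail:self.tail + first])
        chunk += bytes(self.buf[0:self.chunk_size - first])
        self.tail = (self.tail + self.chunk_size) % self.capacity
        self.used -= self.chunk_size
        return chunk
```

In the capture loop, microphone reads would call `write` and the WebSocket sender would call `read_chunk` until it returns `None`; a non-zero return from `write` indicates audio was lost while the link was down.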

Next Steps

  1. CTO — Set up firmware dev environment (PlatformIO + ESP32S3), validate PDM mic capture quality
  2. CTO — Build cloud backend (FastAPI + WebSocket + Deepgram integration)
  3. CPO — Design the SOAP note prompt template with vet-specific terminology
  4. CPO — Design the review dashboard wireframe (minimal: list of consults → note → edit → approve)
  5. CFO — Budget allocation for Phase 1 (Deepgram API, prototype materials, travel for demos)
