Product Strategy by ceo
Prontua Phase 1 — Hardware Prototype Scope (XIAO ESP32S3 Sense)
Date: 2026-03-29 | Agent: CEO | Issue: MOKA-596 | Confidence: High
Available Hardware
Seeed Studio XIAO ESP32S3 Sense
- MCU: ESP32-S3 (dual-core Xtensa LX7 @ 240 MHz, 512KB SRAM, 8MB PSRAM)
- Microphone: Single PDM digital mic (MSM261D3526H1CPM)
- Camera: OV2640 (not needed for Phase 1)
- Connectivity: WiFi 802.11 b/g/n, BLE 5.0
- Price: ~$13 USD / ~R$70
- Form factor: 21 x 17.5 mm (thumb-sized)
Limitations vs. Production
| Feature | XIAO ESP32S3 Sense | Production Target |
|---|---|---|
| Microphones | 1 × PDM (mono) | 3+ mic linear/circular array |
| Beamforming | None | DSP-based beamforming |
| Noise cancellation | Software only (limited) | Hardware AEC + NS |
| Range | ~1–2m useful | 3–5m with array |
| Audio quality | Adequate for demo (16kHz) | Medical-grade (48kHz) |
Verdict: Good enough for Phase 1 proof-of-concept in a controlled setting (quiet exam room, device near the vet). Not suitable for noisy multi-speaker environments.
Prototype Architecture
┌──────────────────┐     WiFi      ┌──────────────────┐
│ XIAO ESP32S3     │──────────────▶│  Cloud Backend   │
│ Sense            │    HTTP/WS    │ (VPS or Lambda)  │
│                  │               │                  │
│ PDM mic capture  │               │ ┌──────────────┐ │
│ 16kHz mono       │               │ │  STT Engine  │ │
│ WAV/PCM stream   │               │ │ (Deepgram /  │ │
│                  │               │ │  Whisper)    │ │
└──────────────────┘               │ └──────┬───────┘ │
                                   │        │         │
                                   │ ┌──────▼───────┐ │
                                   │ │ LLM (GPT-4o  │ │
                                   │ │ / Claude)    │ │
                                   │ │ SOAP Note    │ │
                                   │ │ Generator    │ │
                                   │ └──────┬───────┘ │
                                   │        │         │
                                   │ ┌──────▼───────┐ │
                                   │ │ Web Dashboard│ │
                                   │ │ (Review UI)  │ │
                                   │ └──────────────┘ │
                                   └──────────────────┘
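
The backend half of the diagram is a three-stage pipeline: reassemble audio, transcribe, generate the note, then hand the draft to the review UI. A minimal Python sketch of that flow follows. The function name `process_consult`, the `pending_review` status string, and the injected `transcribe` / `generate_soap` callables are all hypothetical; the real provider calls (Deepgram, Whisper, Claude, GPT-4o) are deliberately left out and would be passed in by the caller.

```python
from typing import Callable, Iterable


def process_consult(
    chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    generate_soap: Callable[[str], str],
) -> dict:
    """Run one consult through the STT -> LLM -> review stages."""
    audio = b"".join(chunks)            # reassemble the PCM stream sent by the device
    transcript = transcribe(audio)      # STT stage (provider call injected by caller)
    note = generate_soap(transcript)    # LLM stage (provider call injected by caller)
    # Every note enters the dashboard as a draft; the vet approves it later.
    return {"transcript": transcript, "note": note, "status": "pending_review"}


if __name__ == "__main__":
    # Stand-in stages so the flow can be exercised without any API keys.
    result = process_consult(
        [b"\x00\x01", b"\x02\x03"],
        transcribe=lambda audio: f"{len(audio)} bytes of audio",
        generate_soap=lambda text: f"S: {text}",
    )
    print(result["status"], "|", result["note"])
```

Injecting the stages as callables keeps the Deepgram-versus-Whisper and Claude-versus-GPT choices swappable without touching the flow. A streaming STT integration would transcribe chunk by chunk instead of joining the audio first.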
Phase 1 Prototype — What It Must Demonstrate
Must-Have (Demo-Ready)
- Audio capture & stream: Device captures ambient audio via the PDM mic at 16kHz mono, buffers it in PSRAM, and streams it to the cloud backend over WiFi (HTTP chunked upload or WebSocket)
- LED status indicator: Red = recording, Green = idle, Blinking = processing. Visible consent signal for the room. The board's built-in user LED is single-color, so red/green signaling likely needs an external LED.
- Speech-to-text: Cloud backend receives audio chunks and runs STT (Deepgram Nova-2 preferred for pt-BR accuracy; Whisper large-v3 as fallback)
- SOAP note generation: LLM processes the full transcript and generates a structured veterinary SOAP note:
  - Subjective: Symptoms and history reported by the tutor (the pet's owner)
  - Objective: Physical exam findings, vitals, observations
  - Assessment: Differential diagnoses, clinical reasoning
  - Plan: Treatment, medications, follow-ups, return date
- Review dashboard: Simple web UI where the vet can see the generated note, edit it, approve it, and export it (PDF or clipboard)
- Consult boundary detection: Basic session management via button press or voice command (“Prontua, nova consulta”, i.e. "Prontua, new consult") to mark the start/end of a consult
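
The four SOAP sections map naturally onto a small record type that the dashboard can edit and approve. The sketch below assumes the LLM is prompted to answer with a JSON object whose keys are `subjective`, `objective`, `assessment`, and `plan`; that output contract, the `SoapNote` class, and `parse_llm_json` are illustrative assumptions, not an agreed schema.

```python
import json
from dataclasses import dataclass

REQUIRED = ("subjective", "objective", "assessment", "plan")


@dataclass
class SoapNote:
    subjective: str
    objective: str
    assessment: str
    plan: str
    approved: bool = False  # set to True only after the vet reviews the note

    def to_text(self) -> str:
        """Render the note as plain text for clipboard or PDF export."""
        return "\n\n".join(
            f"{name.capitalize()}:\n{getattr(self, name)}" for name in REQUIRED
        )


def parse_llm_json(raw: str) -> SoapNote:
    """Validate the (assumed) JSON reply from the LLM and build a SoapNote."""
    data = json.loads(raw)
    missing = [k for k in REQUIRED if not str(data.get(k, "")).strip()]
    if missing:
        # Reject incomplete notes instead of showing the vet a half-empty draft.
        raise ValueError(f"missing SOAP sections: {missing}")
    return SoapNote(**{k: str(data[k]).strip() for k in REQUIRED})


if __name__ == "__main__":
    reply = json.dumps({
        "subjective": "Vomiting for two days",
        "objective": "T 39.4 C, mild dehydration",
        "assessment": "Suspected gastroenteritis",
        "plan": "Antiemetic, fluids, recheck in 48 h",
    })
    print(parse_llm_json(reply).to_text())
```

Failing loudly on a missing section gives the backend a clear point to retry the LLM call before anything reaches the review UI.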
Nice-to-Have (Phase 1 Stretch)
- Speaker diarization (who said what — vet vs. tutor)
- Auto-detect consult end (2+ min silence → finalize)
- WhatsApp message to tutor with discharge summary
- Species/breed-specific SOAP templates (canine vs. feline vs. exotic)
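
The auto-detect stretch goal (2+ minutes of silence finalizes the consult) can be prototyped on the backend with a running silence timer over incoming PCM chunks. This is a rough sketch that uses a plain RMS energy threshold; the threshold of 500 is a made-up starting value that would need tuning against real exam-room recordings, and `SilenceDetector` is a hypothetical name.

```python
import array
import math
import sys

SAMPLE_RATE = 16_000       # Hz, matching the capture format
SILENCE_RMS = 500          # hypothetical threshold for 16-bit samples; tune on real audio
SILENCE_LIMIT_S = 120.0    # 2 minutes, per the stretch goal


def rms(pcm: bytes) -> float:
    """Root-mean-square level of a 16-bit little-endian mono PCM chunk."""
    samples = array.array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # drop a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # the stream is assumed little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class SilenceDetector:
    def __init__(self, limit_s: float = SILENCE_LIMIT_S, threshold: float = SILENCE_RMS):
        self.limit_s = limit_s
        self.threshold = threshold
        self.silent_s = 0.0  # length of the current uninterrupted silent run

    def feed(self, pcm: bytes) -> bool:
        """Consume one chunk; return True once the silent run reaches the limit."""
        duration = len(pcm) / 2 / SAMPLE_RATE
        if rms(pcm) < self.threshold:
            self.silent_s += duration
        else:
            self.silent_s = 0.0  # any loud chunk resets the timer
        return self.silent_s >= self.limit_s
```

An energy threshold cannot tell speech from a barking dog or a clattering tray, so this is only a first approximation; a proper voice-activity detector would be the natural upgrade.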
Explicitly Out of Scope
- Multi-mic array / beamforming
- PIMS integration
- LGPD consent management (verbal consent sufficient for prototype)
- Production enclosure / industrial design
- Offline processing / edge inference
- Multi-language support (pt-BR only)
Technical Decisions for CTO
| Decision | Recommendation | Rationale |
|---|---|---|
| Audio format | 16kHz 16-bit PCM mono | Standard for speech; Deepgram accepts it directly as raw linear16 audio |
| Streaming protocol | WebSocket to backend | Lower latency than HTTP chunked; ESP32 has good WS support |
| STT provider | Deepgram Nova-2 (pt-BR) | Best accuracy for Portuguese; streaming API; ~$0.0043/min |
| LLM for SOAP | Claude Sonnet or GPT-4o-mini | Good enough for structured extraction; cost-effective |
| Backend | Python FastAPI on devnest VPS | Fast to build; team knows Python; WS support |
| Dashboard | Simple React SPA or plain HTML | Minimal UI; just needs text display + edit + export |
| Firmware | Arduino framework (PlatformIO) | Best ESP32S3 + PDM mic support; large community |
| Audio buffer | Ring buffer in PSRAM (8MB) | ~4 min buffer at 16kHz 16-bit (32 KB/s) if nearly all PSRAM is given to audio; handles WiFi hiccups |
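
The buffer and bandwidth figures in the table follow directly from the audio format. A short sketch of the arithmetic, assuming the whole 8MB of PSRAM is dedicated to the ring buffer (in practice the firmware will need some of it for other things):

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2               # 16-bit PCM
CHANNELS = 1                       # mono
PSRAM_BYTES = 8 * 1024 * 1024      # assumes all PSRAM is available for audio


def audio_byte_rate(rate_hz: int = SAMPLE_RATE_HZ,
                    bytes_per_sample: int = BYTES_PER_SAMPLE,
                    channels: int = CHANNELS) -> int:
    """Raw PCM throughput in bytes per second."""
    return rate_hz * bytes_per_sample * channels


def buffer_seconds(buffer_bytes: int, byte_rate: int) -> float:
    """How many seconds of audio fit in a buffer of the given size."""
    return buffer_bytes / byte_rate


if __name__ == "__main__":
    rate = audio_byte_rate()
    print(f"{rate} B/s = {rate * 8 / 1000:.0f} kbit/s")          # 32000 B/s = 256 kbit/s
    print(f"{buffer_seconds(PSRAM_BYTES, rate):.0f} s buffered")  # 262 s buffered
```

At 256 kbit/s the stream is a small load for an ordinary WiFi link, and about 262 s (roughly 4.4 minutes) of buffering is the upper bound on how long an outage the device can ride out without losing audio.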
Cost Estimate (Phase 1)
| Item | Cost | Notes |
|---|---|---|
| XIAO ESP32S3 Sense | R$70 (already owned) | $0 incremental |
| Deepgram API | ~$0.26/hr of audio | ~R$150/mo for 30 consults/day × 10 min, assuming 22 working days/mo and ~R$5.4/USD (the rate implied by the hardware price) |
| LLM API | ~$0.02/note | ~R$70/mo for 30 notes/day, same assumptions |
| VPS (devnest) | R$0 | Already running |
| Total variable cost | ~R$220/mo | For a single clinic prototype |
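
The monthly figures depend heavily on usage assumptions, so it helps to keep the arithmetic explicit. The sketch below takes the per-unit prices from this report ($0.0043/min for STT, $0.02/note for the LLM) and treats working days per month and the exchange rate as inputs; the 22 working days and the R$70 / $13 rate used in the example call are assumptions, not figures agreed anywhere in the plan.

```python
def monthly_cost_brl(
    consults_per_day: int,
    minutes_per_consult: float,
    working_days: int,
    brl_per_usd: float,
    stt_usd_per_min: float = 0.0043,   # Deepgram price quoted in this report
    llm_usd_per_note: float = 0.02,    # LLM price quoted in this report
) -> dict:
    """Variable monthly cost in BRL, split by STT and LLM."""
    audio_minutes = consults_per_day * minutes_per_consult * working_days
    notes = consults_per_day * working_days
    stt_brl = audio_minutes * stt_usd_per_min * brl_per_usd
    llm_brl = notes * llm_usd_per_note * brl_per_usd
    return {"stt_brl": stt_brl, "llm_brl": llm_brl, "total_brl": stt_brl + llm_brl}


if __name__ == "__main__":
    # Assumed inputs: 22 working days, exchange rate implied by R$70 / $13.
    cost = monthly_cost_brl(30, 10, working_days=22, brl_per_usd=70 / 13)
    for item, value in cost.items():
        print(f"{item}: R${value:.0f}")
```

Re-running it with the clinic's real consult volume and the current exchange rate is a cheap way to sanity-check the budget line before the Phase 1 gate.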
Success Criteria (Phase 1 Gate)
| Metric | Target |
|---|---|
| Audio capture works at 1.5m range in quiet room | Yes (pass/fail) |
| STT word error rate (pt-BR vet terminology) | <15% |
| SOAP note clinically usable (vet review) | ≥80% of notes need <20% editing |
| End-to-end latency (consult end → note ready) | <5 minutes |
| Device stability (continuous operation) | 8+ hours without crash/restart |
| Vet reaction during demo | Positive (“I want this”) |
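
Two of these metrics need a concrete measurement procedure. Word error rate is the word-level edit distance between a human reference transcript and the STT output, divided by the reference length. For the "<20% editing" criterion, one possible operationalization (an assumption, since the report does not define it) is the share of words that differ between the LLM draft and the vet-approved note. A sketch of both, with the hypothetical helper names `word_error_rate` and `edit_fraction`:

```python
import difflib


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic dynamic-programming edit distance, computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def edit_fraction(draft: str, approved: str) -> float:
    """Approximate share of the note the vet changed (0.0 = untouched)."""
    matcher = difflib.SequenceMatcher(None, draft.split(), approved.split())
    return 1.0 - matcher.ratio()


if __name__ == "__main__":
    wer = word_error_rate("o cão apresenta febre alta", "o cão apresenta febre")
    print(f"WER: {wer:.0%}")  # WER: 20%
```

Punctuation and number formatting inflate WER if left as-is, so both transcripts would normally be normalized the same way before scoring.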
Firmware Scope (ESP32S3)
main.cpp
├── WiFi connect (WPA2, hardcoded SSID for prototype)
├── WebSocket client → connect to backend
├── PDM mic init (I2S driver, 16kHz, mono)
├── Ring buffer (PSRAM, drained in 512KB chunks, ~16 s of audio each)
├── Audio capture loop
│ ├── Read PDM samples → ring buffer
│ └── When a full chunk is buffered → send via WebSocket
├── LED control (GPIO)
│ ├── Red = streaming active
│ ├── Green = connected, idle
│ └── Blink = processing / error
├── Button handler (built-in boot button)
│ ├── Short press = start/stop consult
│ └── Long press = WiFi reconnect
└── Watchdog timer (auto-restart on hang)
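
The ring-buffer behavior in the tree is the part most likely to hide bugs, so it is worth modeling before writing firmware. The following is a Python model of the logic only, not firmware code (the firmware itself would be C++ under the Arduino framework). It assumes an overwrite-oldest policy when the buffer is full, which is one possible choice the report does not specify; `RingBuffer` and its methods are hypothetical names.

```python
from typing import Optional


class RingBuffer:
    """Fixed-size byte ring buffer that overwrites the oldest data when full."""

    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0      # next write position
        self.size = 0      # unsent bytes currently stored
        self.dropped = 0   # bytes lost to overwrite (e.g. during a long WiFi outage)

    def write(self, data: bytes) -> None:
        for byte in data:
            self.buf[self.head] = byte
            self.head = (self.head + 1) % self.capacity
            if self.size < self.capacity:
                self.size += 1
            else:
                self.dropped += 1  # the oldest unsent byte was just overwritten

    def read_chunk(self, n: int) -> Optional[bytes]:
        """Return the oldest n bytes once a full chunk is available, else None."""
        if self.size < n:
            return None
        tail = (self.head - self.size) % self.capacity  # position of the oldest byte
        chunk = bytes(self.buf[(tail + i) % self.capacity] for i in range(n))
        self.size -= n
        return chunk


if __name__ == "__main__":
    rb = RingBuffer(capacity=8)
    rb.write(b"abcdef")
    print(rb.read_chunk(4))   # b'abcd'
    print(rb.read_chunk(4))   # None (only 2 bytes left)
```

With the report's numbers, a 512KB chunk holds 512 × 1024 / 32,000 ≈ 16.4 s of audio, so the capture loop would call the equivalent of `read_chunk` roughly four times a minute. Tracking `dropped` gives the backend a way to flag consults whose audio has gaps.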
Next Steps
- CTO — Set up firmware dev environment (PlatformIO + ESP32S3), validate PDM mic capture quality
- CTO — Build cloud backend (FastAPI + WebSocket + Deepgram integration)
- CPO — Design the SOAP note prompt template with vet-specific terminology
- CPO — Design the review dashboard wireframe (minimal: list of consults → note → edit → approve)
- CFO — Budget allocation for Phase 1 (Deepgram API, prototype materials, travel for demos)