first stab at solving for menus and real STT

2026-06-16 08:56:28 +00:00 · 2026-04-16 15:40:28 -05:00
parent efe4dfd04e
commit fe1e11653f
19 changed files with 799 additions and 19 deletions
--- a/OpenJibo/docs/development-plan.md
+++ b/OpenJibo/docs/development-plan.md
@@ -55,6 +55,20 @@ Right now the strongest implemented vertical slice beyond basic listen completio

 That should remain the model for future websocket work: capture first, fixture second, parity third.

+The latest live captures also support a second discovery track:
+
+- menu-driven `CLIENT_NLU` parity for clock, timer, and alarm flows
+- richer transcript-bearing `CLIENT_ASR` discovery beyond jokes
+- buffered-audio preservation for eventual real ASR in `.NET`
+
+Near-term ASR work should stay staged:
+
+1. preserve and replay the websocket audio payloads honestly
+2. validate a local tool-based decode/transcribe loop in `.NET`
+3. compare that against Azure-hosted STT before choosing a default production path
+
+That keeps Node as the reverse-engineering oracle while letting the long-term `.NET` cloud gain real STT seams without pretending they are finished.
+
 ## Speech, Animation, And ESML

 The current joke flow is only a small foothold into Jibo expressiveness.
--- a/OpenJibo/docs/protocol-inventory.md
+++ b/OpenJibo/docs/protocol-inventory.md
@@ -108,6 +108,65 @@ What remains intentionally unclaimed for that slice:
 - whether additional websocket messages appear in other successful skill paths
 - whether any timing gaps besides the observed 75 ms `EOS -> SKILL_ACTION` delay matter

+### Latest Live Capture Additions From April 16, 2026
+
+The newest repo-root websocket capture at [captures/websocket/20260416.events.ndjson](/C:/Projects/JiboExperiments/captures/websocket/20260416.events.ndjson) adds more grounded websocket discovery without implying broad protocol coverage.
+
+Observed `CLIENT_ASR` transcript-bearing turns now include:
+
+- `tell me a joke`
+- `do a dance`
+- `surprise me`
+- `personal report`
+- `tell me about the weather`
+- `tell me about my calendar`
+- `what does my commute look like`
+- `tell me about the news`
+
+Observed menu-driven `CLIENT_NLU` intents now include:
+
+- `loadMenu`
+- `askForTime`
+- `askForDate`
+- `start`
+- `timerValue`
+- `set`
+- `alarmValue`
+
+Observed entity/rule shapes from those menu flows include:
+
+- `askForTime` with `entities.domain = "clock"` and `rules = ["clock/clock_menu"]`
+- `askForDate` with the same `clock` menu rule family
+- `timerValue` with timer duration entities
+- `alarmValue` with alarm time entities such as `ampm` and `time`
+
+Current `.NET` parity for that new slice is still intentionally partial:
+
+- menu-side `CLIENT_NLU` replies now preserve the observed inbound intent/rules/entities in the synthetic outbound `LISTEN` payload
+- `askForTime` and `askForDate` are now fixture-backed as mapped menu intents
+- `do a dance` is now recognized as a distinct chat/dance intent in the current synthetic path
+
+Still unknown:
+
+- whether `surprise me`, `personal report`, weather, calendar, commute, and news should map to richer skill-specific websocket payloads
+- whether menu-side clock/timer/alarm flows require additional websocket messages beyond the currently observed `LISTEN` and `EOS`
+- how much of those flows is actually completed robot-side versus merely acknowledged by the cloud
+
+### Buffered Audio / ASR Direction
+
+The `.NET` hosted implementation now has two STT lanes:
+
+- existing synthetic transcript-hint replay for fixture-driven parity work
+- a new opt-in local buffered-audio path that preserves websocket Ogg/Opus frames and can invoke external `ffmpeg` plus `whisper.cpp`
+
+That local tool-based path is intentionally experimental and disabled by default. Its purpose is to let us iterate on real buffered-audio decoding in `.NET` without changing the stable cloud-first architecture or claiming production ASR parity yet.
+
+Future provider options still under consideration:
+
+- local decode/transcribe in `.NET` using preserved websocket audio plus external tools
+- Azure Speech as a hosted STT option for the long-term cloud path
+- direct managed Opus decode later if a library proves stable enough for the hosted deployment target
+
 Current raw-audio fallback behavior remains explicitly synthetic:

 - when a buffered-audio turn can be resolved through the synthetic transcript-hint seam, `.NET` now auto-finalizes and emits `LISTEN` + `EOS` + `SKILL_ACTION`