mirror of
https://kevinblog.sytes.net/Code/Jibo-Revival-Group/JiboExperiments.git
synced 2026-06-16 08:56:28 +00:00
first stab at solving for menus and real STT
@@ -55,6 +55,20 @@ Right now the strongest implemented vertical slice beyond basic listen completio
That should remain the model for future websocket work: capture first, fixture second, parity third.

The latest live captures also support a second discovery track:

- menu-driven `CLIENT_NLU` parity for clock, timer, and alarm flows
- richer transcript-bearing `CLIENT_ASR` discovery beyond jokes
- buffered-audio preservation for eventual real ASR in `.NET`

Near-term ASR work should stay staged:

1. preserve and replay the websocket audio payloads honestly
2. validate a local tool-based decode/transcribe loop in `.NET`
3. compare that against Azure-hosted STT before choosing a default production path

That keeps Node as the reverse-engineering oracle while letting the long-term `.NET` cloud gain real STT seams without pretending they are finished.

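Stage 1 of the staged plan above (preserving and replaying websocket audio payloads) could look roughly like the following. This is a minimal Python sketch for illustration, not the project's `.NET` or Node code. The record fields `turn_id`, `seq`, `size`, and `audio_b64` are hypothetical names chosen here; the real capture schema may differ.

```python
from __future__ import annotations

import base64
import json
from typing import Iterable, Iterator


def record_frames(frames: Iterable[bytes], turn_id: str) -> list[str]:
    """Serialize binary websocket frames as NDJSON lines, one frame per line."""
    lines = []
    for seq, frame in enumerate(frames):
        record = {
            "turn_id": turn_id,  # hypothetical field name
            "seq": seq,          # preserves the original frame order
            "size": len(frame),  # lets replay detect truncated payloads
            "audio_b64": base64.b64encode(frame).decode("ascii"),
        }
        lines.append(json.dumps(record))
    return lines


def replay_frames(lines: Iterable[str]) -> Iterator[bytes]:
    """Yield the original frame bytes in recorded order, checking each size."""
    records = sorted((json.loads(line) for line in lines), key=lambda r: r["seq"])
    for record in records:
        frame = base64.b64decode(record["audio_b64"])
        if len(frame) != record["size"]:
            raise ValueError(f"frame {record['seq']} does not match its recorded size")
        yield frame


if __name__ == "__main__":
    original = [b"OggS\x00\x02", b"\x01\x02\x03", b""]
    stored = record_frames(original, turn_id="demo-turn")
    assert list(replay_frames(stored)) == original
```

Storing the bytes untouched (base64 only, no decoding or resampling) keeps the capture byte-for-byte faithful, so later decode experiments always start from what the robot actually sent.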
## Speech, Animation, And ESML

The current joke flow is only a small foothold into Jibo expressiveness.

@@ -108,6 +108,65 @@ What remains intentionally unclaimed for that slice:
- whether additional websocket messages appear in other successful skill paths
- whether any timing gaps besides the observed 75 ms `EOS -> SKILL_ACTION` delay matter

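One way to check timing gaps such as the `EOS -> SKILL_ACTION` delay across a capture is to scan the NDJSON events and measure the time from each first event to the next matching second event. The sketch below is an assumption-laden illustration: the field names `type` and `ts_ms` are hypothetical, and the sample timestamps are invented to reproduce a 75 ms gap rather than copied from a real capture.

```python
from __future__ import annotations

import json
from typing import Iterable


def gaps_between(lines: Iterable[str], first: str, second: str) -> list[float]:
    """Return the delay in ms from each `first` event to the next `second` event."""
    gaps = []
    pending = None  # timestamp of the most recent unmatched `first` event
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the NDJSON file
        event = json.loads(line)
        if event["type"] == first:       # `type` is a hypothetical field name
            pending = event["ts_ms"]     # `ts_ms` is a hypothetical field name
        elif event["type"] == second and pending is not None:
            gaps.append(event["ts_ms"] - pending)
            pending = None
    return gaps


if __name__ == "__main__":
    # Invented sample events, not taken from a real capture.
    sample = [
        '{"type": "LISTEN", "ts_ms": 1000}',
        '{"type": "EOS", "ts_ms": 1010}',
        '{"type": "SKILL_ACTION", "ts_ms": 1085}',
    ]
    print(gaps_between(sample, "EOS", "SKILL_ACTION"))  # [75]
```

Running the same function over other event pairs would show whether any further delays are consistent enough to be worth reproducing.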
### Latest Live Capture Additions From April 16, 2026

The newest repo-root websocket capture at [captures/websocket/20260416.events.ndjson](/C:/Projects/JiboExperiments/captures/websocket/20260416.events.ndjson) adds more grounded websocket discovery without implying broad protocol coverage.

Observed `CLIENT_ASR` transcript-bearing turns now include:

- `tell me a joke`
- `do a dance`
- `surprise me`
- `personal report`
- `tell me about the weather`
- `tell me about my calendar`
- `what does my commute look like`
- `tell me about the news`

Observed menu-driven `CLIENT_NLU` intents now include:

- `loadMenu`
- `askForTime`
- `askForDate`
- `start`
- `timerValue`
- `set`
- `alarmValue`

Observed entity/rule shapes from those menu flows include:

- `askForTime` with `entities.domain = "clock"` and `rules = ["clock/clock_menu"]`
- `askForDate` with the same `clock` menu rule family
- `timerValue` with timer duration entities
- `alarmValue` with alarm time entities such as `ampm` and `time`

Current `.NET` parity for that new slice is still intentionally partial:

- menu-side `CLIENT_NLU` replies now preserve the observed inbound intent/rules/entities in the synthetic outbound `LISTEN` payload
- `askForTime` and `askForDate` are now fixture-backed as mapped menu intents
- `do a dance` is now recognized as a distinct chat/dance intent in the current synthetic path

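To make the "preserve the observed inbound intent/rules/entities" behavior concrete, here is a small Python sketch of echoing a menu-side `CLIENT_NLU` message into a synthetic `LISTEN` reply. Only the `askForTime` intent, the `clock/clock_menu` rule, and the `domain = "clock"` entity come from the observations above. The envelope layout (`type` plus a nested `data` object) and the function name `build_listen_reply` are assumptions made for illustration and may not match the real wire format.

```python
from __future__ import annotations

import copy
from typing import Any


def build_listen_reply(client_nlu: dict[str, Any]) -> dict[str, Any]:
    """Copy intent, rules, and entities from an inbound CLIENT_NLU message
    into a synthetic LISTEN reply without altering them."""
    if client_nlu.get("type") != "CLIENT_NLU":
        raise ValueError("expected a CLIENT_NLU message")
    data = client_nlu["data"]  # hypothetical envelope layout
    return {
        "type": "LISTEN",
        "data": {
            "intent": data["intent"],
            # Deep copies keep the reply independent of the inbound message.
            "rules": copy.deepcopy(data.get("rules", [])),
            "entities": copy.deepcopy(data.get("entities", {})),
        },
    }


if __name__ == "__main__":
    inbound = {
        "type": "CLIENT_NLU",
        "data": {
            "intent": "askForTime",
            "rules": ["clock/clock_menu"],
            "entities": {"domain": "clock"},
        },
    }
    reply = build_listen_reply(inbound)
    assert reply["data"]["rules"] == ["clock/clock_menu"]
    assert reply["data"]["entities"] == {"domain": "clock"}
```

Echoing the inbound values verbatim, rather than re-deriving them, keeps the synthetic path from inventing rule or entity names that were never observed.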
Still unknown:

- whether `surprise me`, `personal report`, weather, calendar, commute, and news should map to richer skill-specific websocket payloads
- whether menu-side clock/timer/alarm flows require additional websocket messages beyond the currently observed `LISTEN` and `EOS`
- how much of those flows is actually completed robot-side versus merely acknowledged by the cloud

### Buffered Audio / ASR Direction

The `.NET` hosted implementation now has two STT lanes:

- existing synthetic transcript-hint replay for fixture-driven parity work
- a new opt-in local buffered-audio path that preserves websocket Ogg/Opus frames and can invoke external `ffmpeg` plus `whisper.cpp`

That local tool-based path is intentionally experimental and disabled by default. Its purpose is to let us iterate on real buffered-audio decoding in `.NET` without changing the stable cloud-first architecture or claiming production ASR parity yet.

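The opt-in decode/transcribe loop described above can be pictured as two external commands run in sequence. The sketch below is written in Python for illustration only and is not the `.NET` implementation. It assumes the preserved frames have already been written out as a valid `.ogg` file, that the whisper.cpp binary is named `whisper-cli` (older builds ship it as `main`), and that a model file exists at a hypothetical path. The `LocalSttOptions` type and its field names are invented for this example.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class LocalSttOptions:
    enabled: bool = False              # opt-in: stays off unless explicitly enabled
    ffmpeg_path: str = "ffmpeg"
    whisper_path: str = "whisper-cli"  # assumed binary name
    model_path: str = "models/ggml-base.en.bin"  # hypothetical model location


def ffmpeg_command(opts: LocalSttOptions, ogg: Path, wav: Path) -> list[str]:
    """Decode Ogg/Opus to 16 kHz mono 16-bit PCM WAV for whisper.cpp."""
    return [opts.ffmpeg_path, "-y", "-i", str(ogg),
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav)]


def whisper_command(opts: LocalSttOptions, wav: Path) -> list[str]:
    """Transcribe the WAV file, printing text without timestamps."""
    return [opts.whisper_path, "-m", opts.model_path, "-f", str(wav), "-nt"]


def transcribe(opts: LocalSttOptions, ogg: Path) -> Optional[str]:
    """Return a transcript, or None when the local lane is disabled."""
    if not opts.enabled:
        return None
    wav = ogg.with_suffix(".wav")
    subprocess.run(ffmpeg_command(opts, ogg, wav), check=True, capture_output=True)
    result = subprocess.run(whisper_command(opts, wav), check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()
```

Keeping command construction separate from execution makes the lane easy to test without either tool installed, and the `enabled` default mirrors the disabled-by-default stance of the experimental path.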
Future provider options still under consideration:

- local decode/transcribe in `.NET` using preserved websocket audio plus external tools
- Azure Speech as a hosted STT option for the long-term cloud path
- direct managed Opus decode later if a library proves stable enough for the hosted deployment target

Current raw-audio fallback behavior remains explicitly synthetic:

- when a buffered-audio turn can be resolved through the synthetic transcript-hint seam, `.NET` now auto-finalizes and emits `LISTEN` + `EOS` + `SKILL_ACTION`

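The synthetic fallback described above amounts to a lookup followed by a fixed three-message sequence. The Python sketch below illustrates that idea under several assumptions: the hint table contents, the intent labels `joke` and `dance`, the payload fields, and the choice to emit nothing for an unresolved hint are all hypothetical and are not taken from the `.NET` code.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical hint table: normalized transcript -> intent label.
HINTED_INTENTS = {
    "tell me a joke": "joke",
    "do a dance": "dance",
}


def finalize_turn(transcript_hint: Optional[str]) -> list[dict]:
    """Return the outbound message sequence for a buffered-audio turn.

    An empty list means the hint could not be resolved, so nothing is emitted
    (an assumption; the real unresolved behavior is not documented here).
    """
    if transcript_hint is None:
        return []
    text = transcript_hint.strip().lower()
    intent = HINTED_INTENTS.get(text)
    if intent is None:
        return []
    return [
        # Payload fields below are illustrative placeholders.
        {"type": "LISTEN", "data": {"asr": {"text": text}, "intent": intent}},
        {"type": "EOS"},
        {"type": "SKILL_ACTION", "data": {"intent": intent}},
    ]


if __name__ == "__main__":
    messages = finalize_turn("Do a dance")
    assert [m["type"] for m in messages] == ["LISTEN", "EOS", "SKILL_ACTION"]
    assert finalize_turn("surprise me") == []
```

Because the transcript comes from a hint rather than decoded audio, this path says nothing about real ASR accuracy; it only keeps the message ordering testable against fixtures.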