From summer
Adds facial animation and lipsync to characters: phoneme-driven mouth movement and emotional expressions (smile, frown, surprise). Works with Rhubarb Lip Sync, ARKit visemes, and audio from TTS.
Slash command: /summer:facial-and-lipsync
File patterns: **/*.gd, **/*.tscn, **/*.tres
Half generative, half authored. The generative half: an audio file (TTS via summer_generate_audio with capability: "text_to_speech") goes through a phoneme-extraction tool (Summer does NOT wrap one — you run it externally), out comes a viseme timeline — a list of { phoneme, start_time, duration } triples. The authored half: the character's face mesh must have BlendShapes named for the standard viseme set, or the timeline has nothing to drive. Without both halves, you get a talking robot.
The 2026 production stack:
- .wav/.mp3 from summer_generate_audio({capability: "text_to_speech", ...}) or imported VO.
- summer_* MCP for this in the current engine.
- smile, brow_raise, eye_squint etc., authored in code or at clip-edit time.
- summer:animation/generate-motion.

summer_generate_audio({capability: "text_to_speech", ...}). Lipsync without audio has nothing to sync to.

summer_inspect_node "./World/NPC/Head" # or wherever the head MeshInstance3D lives
Look for mesh.blend_shape_count > 0 and a list of names. The Meshy ARKit-52 set (default for Meshy character heads) has names like jawOpen, mouthClose, mouthFunnel, mouthPucker, mouthLeft, mouthRight, mouthSmile_L, mouthSmile_R, mouthFrown_L, mouthFrown_R, browInnerUp, browOuterUp_L, browDown_L, eyeBlink_L, eyeWide_L, eyeSquint_L (mirrored on R).
If the head has zero BlendShapes, stop. Tell the user: "This head mesh has no BlendShapes — facial animation is impossible without re-meshing. Options: regenerate the character with the face_blendshapes: arkit option (summer_generate_3d({ kind: \"image-to-3d\", options: { rig: true, face_blendshapes: \"arkit\" } })), use an emotional body-language overlay instead, or hand off the character to a 3D artist for shape-key authoring." Don't proceed.
Note: the face_blendshapes option is a Meshy passthrough; if your target backend doesn't support it, the rig will use the default skeleton — adjust manually in the Meshy dashboard.
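The same check can also be run from a script inside the project. The following is a minimal sketch, not part of Summer's tooling: `missing_blend_shapes` is a hypothetical helper, the `$Head` path and the `REQUIRED_SHAPES` list are assumptions to adjust per character, and the only engine call it relies on is `MeshInstance3D.find_blend_shape_by_name`, which returns -1 when a name is absent.

```gdscript
extends Node3D

# Example list only (assumption): the shapes your expressions and bake expect.
const REQUIRED_SHAPES := ["jawOpen", "mouthSmile_L", "mouthSmile_R", "browInnerUp"]

# Hypothetical helper: returns the names from `required` that the head mesh lacks.
static func missing_blend_shapes(head: MeshInstance3D, required: Array) -> Array:
    var missing: Array = []
    for shape_name in required:
        # find_blend_shape_by_name returns -1 when the mesh has no such shape.
        if head.find_blend_shape_by_name(shape_name) == -1:
            missing.append(shape_name)
    return missing

func _ready() -> void:
    var head: MeshInstance3D = $Head  # assumed node path
    var missing := missing_blend_shapes(head, REQUIRED_SHAPES)
    if not missing.is_empty():
        push_warning("Head is missing BlendShapes: %s" % str(missing))
```

An empty result means every listed shape exists; a result equal to the full list usually means the mesh has no BlendShapes at all, which is the stop condition described above.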
If the user has VO already, use it. If not, generate via TTS. The voiceId is an ElevenLabs voice ID — the user has to pick one from the ElevenLabs voice library (https://elevenlabs.io/app/voice-library) or upload/clone their own. There is no "voice name" string; it's always an ID like 21m00Tcm4TlvDq8ikWAM (Rachel) or a custom-cloned ID.
summer_generate_audio({
  capability: "text_to_speech",
  text: "Welcome to the village, traveler.",
  voiceId: "<elevenlabs_voice_id>"
})
// returns { jobId, ... }; poll with summer_check_job for the audio asset
If the user hasn't picked a voice yet, stop and ask: "Which ElevenLabs voice ID should I use? You can browse the library at elevenlabs.io/app/voice-library and copy the ID, or paste a custom-clone ID from your account."
Summer does not wrap a phoneme-extraction MCP tool. Run it externally, then import the result. Recommended: Rhubarb Lip Sync — an open-source CLI built for game-dev lipsync. Mouth shapes A–H map cleanly to viseme shapes.
Install once:
# macOS
brew install rhubarb-lip-sync
# Windows / Linux: download release from https://github.com/DanielSWolf/rhubarb-lip-sync
Run per VO line:
rhubarb -f json -o welcome_traveler.json welcome_traveler.wav
Output JSON shape:
{
  "metadata": { "duration": 2.34 },
  "mouthCues": [
    { "start": 0.00, "end": 0.08, "value": "B" },
    { "start": 0.08, "end": 0.18, "value": "C" },
    { "start": 0.18, "end": 0.25, "value": "D" },
    { "start": 0.25, "end": 0.30, "value": "B" },
    { "start": 0.30, "end": 0.42, "value": "A" },
    { "start": 0.42, "end": 0.50, "value": "B" },
    { "start": 2.30, "end": 2.34, "value": "X" }
  ]
}
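The viseme_* names used below are the 15-shape Oculus-style viseme set, which is not part of the ARKit-52 list (jawOpen, mouthFunnel, ...). Confirm during the BlendShape check that the head exposes them, or remap each viseme onto ARKit shapes.
Rhubarb mouth-shape → viseme cheat sheet (apply during the bake step):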
| Rhubarb | ARKit viseme | Notes |
|---|---|---|
| A | viseme_PP | closed (P, B, M) |
| B | viseme_kk | slightly open, neutral |
| C | viseme_E | open, lips spread (EH, IH) |
| D | viseme_aa | wide open (AA, AH) |
| E | viseme_O | rounded (OW, ER) |
| F | viseme_U | small rounded (UW, OO) |
| G | viseme_FF | F, V (lower lip + teeth) |
| H | viseme_RR | L, R (tongue raised) |
| X | viseme_sil | silence / closed neutral |
Manual fallback (no Rhubarb available): for short lines you can hand-type a mouthCues array by listening to the clip and tagging vowel/consonant boundaries. Painful past ~3s of audio; use Rhubarb for anything longer.
Cloud fallback: a Whisper-phoneme model on Replicate or a Hugging Face inference endpoint will emit ARPAbet phonemes with timestamps. Then map ARPAbet → ARKit visemes via the table in the Reference card section. More accurate than Rhubarb on noisy audio; slower and costs cents per minute.
Cost: Rhubarb is free + ~5s CPU per 30s clip locally. Cloud fallback ~$0.02 / minute, ~5s wall-clock for a 30s clip.
Convert Rhubarb's mouthCues into a Godot Animation resource — one track per BlendShape, keyframes at each viseme transition. This makes lipsync replayable via the same AnimationPlayer/AnimationTree as body motion.
# scripts/lipsync_baker.gd: run once per VO line at edit time

# Rhubarb mouth shape -> viseme BlendShape name.
const RHUBARB_TO_VISEME := {
    "A": "viseme_PP", "B": "viseme_kk", "C": "viseme_E",
    "D": "viseme_aa", "E": "viseme_O", "F": "viseme_U",
    "G": "viseme_FF", "H": "viseme_RR", "X": "viseme_sil",
}

static func bake_from_rhubarb(rhubarb_json_path: String, head_path: NodePath) -> Animation:
    var f := FileAccess.open(rhubarb_json_path, FileAccess.READ)
    if f == null:
        push_error("Cannot open %s" % rhubarb_json_path)
        return null
    var parsed = JSON.parse_string(f.get_as_text())
    if not (parsed is Dictionary) or not parsed.has("mouthCues"):
        push_error("Unexpected Rhubarb JSON in %s" % rhubarb_json_path)
        return null
    var cues: Array = parsed["mouthCues"]
    var anim := Animation.new()
    anim.length = float(parsed["metadata"]["duration"])
    var visemes := ["viseme_aa", "viseme_E", "viseme_I", "viseme_O", "viseme_U",
            "viseme_PP", "viseme_FF", "viseme_TH", "viseme_DD", "viseme_kk",
            "viseme_CH", "viseme_SS", "viseme_nn", "viseme_RR", "viseme_sil"]
    # One BlendShape track per viseme, addressed as "<head path>:<shape name>".
    var tracks := {}
    for v in visemes:
        var idx := anim.add_track(Animation.TYPE_BLEND_SHAPE)
        anim.track_set_path(idx, NodePath(str(head_path) + ":" + v))
        tracks[v] = idx
    # For each cue, set the active viseme to 1.0 and others to 0.0 at cue.start
    for cue in cues:
        var active_viseme: String = RHUBARB_TO_VISEME.get(cue["value"], "viseme_sil")
        for v in visemes:
            var weight: float = 1.0 if v == active_viseme else 0.0
            anim.track_insert_key(tracks[v], float(cue["start"]), weight)
    return anim
Bake once, save into the character's AnimationLibrary as dialogue_<line_id>, and play via the AnimationTree.
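Storing the baked clip can be sketched as below. This is an illustration under assumptions: `store_dialogue_clip` is a hypothetical helper, and it targets the AnimationPlayer's default library (the one with the empty name), which may not be where your project keeps dialogue clips.

```gdscript
# Hypothetical helper: stores a baked clip in the player's default ("") library
# under the dialogue_<line_id> name and returns that name.
static func store_dialogue_clip(player: AnimationPlayer, line_id: String, anim: Animation) -> StringName:
    var library: AnimationLibrary
    if player.has_animation_library(&""):
        library = player.get_animation_library(&"")
    else:
        library = AnimationLibrary.new()
        player.add_animation_library(&"", library)
    var clip_name := StringName("dialogue_" + line_id)
    if library.has_animation(clip_name):
        library.remove_animation(clip_name)  # re-baking replaces the old clip
    library.add_animation(clip_name, anim)
    return clip_name
```

Track paths in the baked Animation are resolved relative to the player's root_node, so the head_path passed to the bake should be written from that node. To keep the result across sessions, the library can be written to disk with ResourceSaver.save(library, path), where the path is yours to choose.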
If you used the Whisper/ARPAbet cloud fallback instead of Rhubarb, swap bake_from_rhubarb for a variant that consumes { phoneme, start, duration } triples and applies the ARPAbet → viseme table from the Reference card.
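The front half of such a variant might look like the sketch below, which turns phoneme triples into viseme cues that a bake loop can key exactly like Rhubarb cues. The lookup follows this skill's reference table. Where that table lists a phoneme in two rows (S, Z, N, IH), the row chosen here is an assumption. `phonemes_to_viseme_cues` and the `{ phoneme, start, duration }` key names are also assumptions about your extractor's output.

```gdscript
# ARPAbet -> viseme lookup based on the reference table in this skill.
# Phonemes listed in two rows (S, Z, N, IH) are assigned to one row here (assumption).
const ARPABET_TO_VISEME := {
    "P": "viseme_PP", "B": "viseme_PP", "M": "viseme_PP",
    "F": "viseme_FF", "V": "viseme_FF",
    "TH": "viseme_TH", "DH": "viseme_TH",
    "T": "viseme_DD", "D": "viseme_DD",
    "K": "viseme_kk", "G": "viseme_kk", "NG": "viseme_kk",
    "CH": "viseme_CH", "JH": "viseme_CH", "SH": "viseme_CH", "ZH": "viseme_CH",
    "S": "viseme_SS", "Z": "viseme_SS",
    "N": "viseme_nn", "L": "viseme_nn",
    "R": "viseme_RR", "ER": "viseme_RR",
    "AA": "viseme_aa", "AH": "viseme_aa", "AE": "viseme_aa",
    "EH": "viseme_E", "EY": "viseme_E",
    "IY": "viseme_I", "IH": "viseme_I",
    "OW": "viseme_O", "AO": "viseme_O",
    "UW": "viseme_U", "UH": "viseme_U",
}

# Converts [{ phoneme, start, duration }] into [{ viseme, start, end }].
static func phonemes_to_viseme_cues(phonemes: Array) -> Array:
    var cues: Array = []
    for p in phonemes:
        # Strip ARPAbet stress digits: "AH0" -> "AH".
        var symbol: String = str(p["phoneme"]).to_upper().rstrip("012")
        var start := float(p["start"])
        cues.append({
            # Phonemes the table does not cover fall back to viseme_sil.
            "viseme": ARPABET_TO_VISEME.get(symbol, "viseme_sil"),
            "start": start,
            "end": start + float(p["duration"]),
        })
    return cues
```

The fallback to viseme_sil mirrors the default in the Rhubarb baker. For phonemes the table leaves out, adding explicit entries is likely to look better than closing the mouth.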
Add a OneShot node Lipsync that overlays the viseme animation as an additive layer over the base face. Fire from the dialogue system:
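@onready var tree: AnimationTree = $AnimationTree
@onready var audio: AudioStreamPlayer3D = $VoicePlayer

func say(line_id: String) -> void:
    var clip_id := "dialogue_" + line_id
    # The clip is a property of the AnimationNodeAnimation feeding the OneShot, not a
    # tree parameter. Assumes the tree root is a blend tree and that the animation
    # node is named "LipsyncClip" (hypothetical name; match your own tree).
    var blend_tree := tree.tree_root as AnimationNodeBlendTree
    var clip_node := blend_tree.get_node("LipsyncClip") as AnimationNodeAnimation
    clip_node.animation = clip_id
    audio.stream = load("res://audio/" + line_id + ".ogg")
    # Start audio and the OneShot in the same call so both begin on the same frame.
    audio.play()
    tree.set("parameters/Lipsync/request", AnimationNodeOneShot.ONE_SHOT_REQUEST_FIRE)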
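Sync is preserved as long as both calls fire on the same frame. ~16ms (one frame at 60 FPS) is the drift this skill treats as the threshold of perception. The animation advances on the engine's frame clock while audio is mixed separately with a small, roughly constant output latency, so noticeable drift mostly appears if the engine hitches mid-line.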
A second OneShot or persistent additive track for expressions. Key the relevant BlendShapes (mouthSmile_L, mouthSmile_R, browInnerUp, etc.) at design time:
@onready var head: MeshInstance3D = $Head

# Sets one BlendShape by name; warns instead of erroring if the mesh lacks it.
func _set_shape(shape_name: String, value: float) -> void:
    var idx := head.find_blend_shape_by_name(shape_name)
    if idx == -1:
        push_warning("Missing BlendShape: " + shape_name)
        return
    head.set_blend_shape_value(idx, value)

func smile(intensity: float) -> void:
    _set_shape("mouthSmile_L", intensity)
    _set_shape("mouthSmile_R", intensity)

func surprise(intensity: float) -> void:
    _set_shape("browInnerUp", intensity)
    _set_shape("browOuterUp_L", intensity * 0.7)
    _set_shape("browOuterUp_R", intensity * 0.7)
    _set_shape("eyeWide_L", intensity)
    _set_shape("eyeWide_R", intensity)
    _set_shape("jawOpen", intensity * 0.3)
Driven from gameplay events; orthogonal to the lipsync layer.
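When a gameplay event fires, snapping an expression to full intensity tends to look abrupt. One option, sketched here as an assumption rather than something this skill prescribes, is to ramp the weight with a Tween. `smile_over_time` is a hypothetical name and the `$Head` path is assumed.

```gdscript
extends Node3D

@onready var head: MeshInstance3D = $Head  # assumed node path

# Sets both smile shapes, skipping any the mesh lacks.
func smile(intensity: float) -> void:
    for shape_name in ["mouthSmile_L", "mouthSmile_R"]:
        var idx := head.find_blend_shape_by_name(shape_name)
        if idx != -1:
            head.set_blend_shape_value(idx, intensity)

# Ramps the smile from 0 to `target` over `seconds` instead of snapping.
func smile_over_time(target: float, seconds: float) -> void:
    var tween := create_tween()
    # tween_method calls smile() every frame with the interpolated value.
    tween.tween_method(smile, 0.0, target, seconds)
```

For example, `smile_over_time(0.8, 0.25)` on a quest-complete event eases into the smile over a quarter second. Because it writes only the smile shapes, it leaves the viseme tracks alone.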
| Viseme | Triggered by phonemes (ARPAbet) | Mouth shape |
|---|---|---|
| viseme_sil | silence | closed neutral |
| viseme_PP | P, B, M | lips pressed |
| viseme_FF | F, V | lower lip + upper teeth |
| viseme_TH | TH, DH | tongue tip + teeth |
| viseme_DD | T, D, N, S, Z | tongue + alveolar |
| viseme_kk | K, G, NG | back-tongue, mouth slightly open |
| viseme_CH | CH, JH, SH, ZH | lips rounded forward |
| viseme_SS | S, Z (sometimes split from DD) | teeth nearly closed |
| viseme_nn | N, L | tongue-tip + open mouth |
| viseme_RR | R, ER | mouth slightly rounded |
| viseme_aa | AA, AH, AE | wide open |
| viseme_E | EH, EY, IH | mid open + lips spread |
| viseme_I | IY, IH | narrow + spread |
| viseme_O | OW, AO | rounded |
| viseme_U | UW, UH | rounded + small |
If the user has a CMUDict-style phoneme list and wants to map manually, this is the table to apply.
- viseme_sil keyframes at silence intervals. The bake step must scan the audio for silence (RMS below threshold for 100ms+) and insert sil keys, OR the phoneme extractor must emit silence markers. Rhubarb emits X cues for silence (mapped above to viseme_sil); cloud Whisper-phoneme models often don't, so add a silence-detection pass if you go that route.
- Animation.INTERPOLATION_LINEAR (default) and ensure each viseme's weight ramps from previous to current. Default bake does this; if you wrote custom keyframes with NEAREST interp, switch.
- Timer.
- Dictionary and apply it during bake. Only do this once per character.
- BlendShape track interpolation = NEAREST (loses smoothness, gains 30% perf). Reserve for mobile / very crowded crowd scenes.
- summer:animation/procedural-animation) and you're at ~95%. The remaining 5% is brow articulation tied to dialogue sentiment, which is bespoke per scene.
- _process with hand-coded curves. The Animation track is sample-accurate and re-uses the AnimationTree's interpolation; manual code drifts and stutters.
- play() audio in _ready and fire the OneShot in _process, you'll see ~16ms drift at line start. Both calls in the same function, same frame.
- jawOpen from the audio's RMS envelope instead, which is much simpler.
- --recognizer phonetic (language-agnostic, less accurate) or use a multilingual Whisper-phoneme model on Replicate / HF. Cross-language lipsync (extract as English on Spanish audio) gives ~70% accuracy, bad enough that subtitles are needed regardless.

If Rhubarb won't install on the user's platform, run a Whisper-phoneme model on Replicate (e.g., cjwbw/whisper-phoneme) or aeneas-align for forced alignment. Download the JSON, swap the bake function for one that consumes ARPAbet phonemes (table in Reference card). Same end result: a replayable BlendShape Animation resource.
For projects that can't use any cloud or third-party tool, Godot 4.5's AudioStreamGenerator with hand-rolled vowel/consonant detection from RMS + zero-crossings gives ~50% accuracy — enough for a stylized character but not photoreal.
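The RMS half of that approach, which is also what a silence-detection pass needs when a phoneme timeline lacks silence markers, can be sketched as follows. This is a minimal illustration under assumptions: `find_silences` is a hypothetical helper, the threshold and window defaults are starting guesses to tune per voice, and decoding your audio into a mono PackedFloat32Array in the -1.0..1.0 range is left to the caller.

```gdscript
# Returns [{ start, end }] intervals (seconds) where windowed RMS stays below
# `threshold` for at least `min_silence` seconds.
static func find_silences(samples: PackedFloat32Array, sample_rate: int,
        threshold: float = 0.02, min_silence: float = 0.1, window: float = 0.01) -> Array:
    var silences: Array = []
    var win := maxi(1, int(window * sample_rate))
    var run_start := -1.0  # start time of the current quiet run; -1.0 when not in one
    var i := 0
    while i < samples.size():
        var end_i := mini(i + win, samples.size())
        var sum_sq := 0.0
        for j in range(i, end_i):
            sum_sq += samples[j] * samples[j]
        var rms := sqrt(sum_sq / float(end_i - i))
        var t := float(i) / sample_rate
        if rms < threshold:
            if run_start < 0.0:
                run_start = t
        elif run_start >= 0.0:
            # A loud window ends the quiet run; keep it only if it was long enough.
            if t - run_start >= min_silence:
                silences.append({ "start": run_start, "end": t })
            run_start = -1.0
        i = end_i
    # Close a quiet run that lasts to the end of the clip.
    var total := float(samples.size()) / sample_rate
    if run_start >= 0.0 and total - run_start >= min_silence:
        silences.append({ "start": run_start, "end": total })
    return silences
```

Each returned interval can be keyed as a viseme_sil cue during the bake. The same per-window RMS value, scaled and clamped to 0..1, is one way to drive jawOpen directly for stylized characters.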
- summer:audio/generate-voice (which wraps summer_generate_audio({capability: "text_to_speech", ...})).
- summer:ai-and-npcs/design-npc.
- summer:animation/animation-tree.
- summer:animation/procedural-animation.

- summer:audio/generate-voice: TTS upstream of this skill.
- summer:animation/animation-tree: wire the lipsync OneShot into the character's tree.
- summer:animation/procedural-animation: eye blinks, saccades, head idle.
- references/mcp-tools-reference.md: summer_generate_audio schema (TTS).

npx claudepluginhub summerengine/summer-engine-agent --plugin summer