Auditory Emotional Intelligence

Echo Threads

Echo Threads does not interpret emotion. It preserves the structure through which emotion travels.

A swarm-based auditory intelligence engine that translates vocal signal — pitch, cadence, grit, silence — into structured emotional architecture. Not labels. Not predictions. Signal integrity.

Echo Threads waveform and spectral analysis — emotional intensity becomes visible in harmonic structure.

The Core Idea

Most emotion detection AI collapses meaning. It hears a voice crack and labels it "sad." It hears volume and labels it "angry." It steals authorship of interpretation from the listener.

Echo Threads does something different. It translates emotional signal — not emotional meaning. It preserves the raw architecture of feeling so that humans and AI can interpret for themselves.

"This is the whole point of music. It gives emotional signal. You bring your own meaning."

The Pipeline

Three specialist layers, each with clear epistemic boundaries. No layer oversteps its role.

Audio Signal → Spectral Decomposition → Prosody & Pitch Tracking → Astrocyte Detection (Prosody · Arousal · Pause Salience) → Salience Scoring → Notation & Signal Output → consciousness interprets

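The flow above can be sketched in code. This is a minimal illustrative sketch, not the actual Echo Threads API — the function names, frame size, and thresholds are all assumptions:

```python
# Illustrative sketch of the layered flow: decompose -> track -> detect -> score.
# All names and parameters here are assumptions, not the real ET3 interface.

def decompose(audio):
    """Spectral decomposition stand-in: split raw samples into fixed frames."""
    frame = 512
    return [audio[i:i + frame] for i in range(0, len(audio), frame)]

def track_prosody(frames):
    """Prosody & pitch tracking stand-in: one scalar feature per frame
    (mean absolute energy substitutes for a real F0/prosody extractor)."""
    return [sum(abs(s) for s in f) / max(len(f), 1) for f in frames]

def detect_astrocyte_events(features, threshold=2.0):
    """Astrocyte detection: flag frames that deviate from a slow-moving baseline."""
    events = []
    baseline = features[0]
    for i, x in enumerate(features):
        if baseline and x > threshold * baseline:
            events.append(i)
        baseline = 0.9 * baseline + 0.1 * x  # exponential moving baseline
    return events

def score_salience(events):
    """Salience scoring: emit where events occurred; interpretation stays
    with the listener, per the epistemic boundaries below."""
    return {"events": events, "count": len(events)}
```

Note that the final layer reports *where* signal concentrates, never what it means — that boundary is the design, not an implementation detail.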

Voice as Emotional Architecture

L3 experiential notation from Marcus King's "Sucker" — isolated vocals processed through the full ET3 pipeline. Each symbol represents a detected vocal event: pitch movement, timbral texture, stability shifts. This is what a voice looks like when you map its emotional architecture.

~~~~~ (DROP) (grit) (grit) ~ (tight) (DROP) (tight) (tight) (UP) (tight) (DROP) (UP) (grit) (grit) (grit) (UP) (grit) (grit) (DROP) (grit) (DROP) (grit) (grit) (DROP) ~~~~~ (UP) (grit) (DROP) (grit) (grit) (UP) (grit) (grit) (DROP) (grit) ~ (grit) (grit) (UP) (grit) (DROP) (grit) (grit) (grit) ~ (UP) (grit) (DROP) (grit) (DROP) (grit) (grit) (grit) ~~~~ (UP) (grit) (DROP) (grit) (grit) (grit) (tight) (tight) ~ (DROP) (tight) (tight)
  • grit — timbral texture / rasp
  • UP / DROP — pitch movement direction
  • tight — constricted vocal production
  • ~~~~~ — relative quiet / breath
  • ★ — high salience moment (multiple signals converging)
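The legend suggests a direct mapping from detected events to glyphs. A hedged sketch of that rendering step, assuming hypothetical event-type names (`rasp`, `pitch_rise`, …) rather than the real ET3 event schema:

```python
# Illustrative mapping from detected vocal events to the notation glyphs above.
# Event-type names and the salience cutoff are assumptions for this sketch.

GLYPHS = {
    "rasp": "(grit)",          # timbral texture / rasp
    "pitch_rise": "(UP)",      # upward pitch movement
    "pitch_fall": "(DROP)",    # downward pitch movement
    "constricted": "(tight)",  # constricted vocal production
    "breath": "~",             # relative quiet / breath
}

def render_notation(events):
    """Render a sequence of detected events as one notation line.
    High-salience events (multiple signals converging) get a star."""
    out = []
    for ev in events:
        glyph = GLYPHS.get(ev["type"], "?")
        if ev.get("salience", 0.0) >= 0.9:
            glyph = "★" + glyph
        out.append(glyph)
    return " ".join(out)
```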

Listening reveals structure the ear feels but cannot name.

  • 958 moments of salience
  • 86.2% voiced signal
  • Pitch range: D3 → B7
  • Peak leap: +48.2 semitones

Signal Telemetry

The coordinator identifies convergence points — moments where multiple vocal signals spike simultaneously. Not emotions. Where meaning concentrates.

⏱ 45.0s–50.0s

85 converging events · 49 prosodic variance spikes, predominantly rising

Magnitude: 2.26σ

⏱ 5.7s–6.5s

Peak salience: 1.000 · Pitch leaps from baseline to +48.2 semitones

Emotional surge marker — voice reaching beyond its own range

⏱ 13.0s–14.6s

Rapid oscillation: +20.4 → −13.9 → +21.5 → −17.2 semitones

Tension instability — expressive volatility in the signal

From a full 224-second song: 22,429 frames extracted, 7,236 prosodic events detected, 43 significant moments identified. 86.2% voiced frame coverage.
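Semitone figures like these follow from the standard twelve-tone log-frequency relation; a minimal sketch:

```python
import math

def semitones_from_baseline(f0_hz, baseline_hz):
    """Distance of a pitch from the baseline in semitones (12-TET):
    12 * log2(f0 / baseline). Positive means above the baseline."""
    return 12.0 * math.log2(f0_hz / baseline_hz)
```

For example, a pitch one octave above baseline (a doubling of frequency) comes out to exactly +12 semitones.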


Tested Across Domains

The same engine processes music and dramatic performance with zero code changes.

🎵

Music: Marcus King — "Sucker"

Isolated vocals via Demucs source separation. Full song processed end-to-end through the ET3 pipeline.

  • 9,520 voiced frames analyzed
  • 958 moments that matter identified
  • Pitch range: D3 – B7, baseline ~B4
  • Dominant texture: grit (characteristic rasp)
🎭

Cadence Reveals Intent Beyond Text

Much Ado About Nothing — the Beatrice-Benedick scene. "Kill Claudio." Cadence arc: cold command → grief fracture → moral rupture → vow.

  • Beatrice's escalation lives in delivery pacing, not wording
  • Emotional acceleration detected from vocal cadence alone
  • Pause compression maps dramatic intent
  • Echo Threads reads performance, not text

Epistemic Boundaries

Every layer has a job. No layer oversteps.

What the system says ✓

  • "Pitch fell 2.11σ below baseline"
  • "49 variance spikes, predominantly rising"
  • "F0 = 0.0 Hz, confidence = 1.0" (silence as state)

What the system never says ✗

  • "The narrator is sad"
  • "This moment is exciting"
  • "The speaker feels angry"

"The coordinator doesn't decide what emotions mean. It presents integrated signal data to consciousness — who interprets for themselves."
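Statements in the first column can be produced directly from frame statistics. A minimal sketch — the function name and output wording are illustrative, not the coordinator's actual format:

```python
from statistics import mean, stdev

def describe_pitch(f0, baseline_window):
    """Emit a signal statement of the permitted kind: deviation in sigma
    units relative to a window of recent F0 samples. No emotion labels."""
    mu, sigma = mean(baseline_window), stdev(baseline_window)
    z = (f0 - mu) / sigma
    direction = "rose" if z > 0 else "fell"
    side = "above" if z > 0 else "below"
    return f"Pitch {direction} {abs(z):.2f}σ {side} baseline"
```

The point of the sketch is what it cannot say: the return value carries magnitude and direction, and nothing else.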

Architecture: Swarm Intelligence

Echo Threads uses a swarm-based architecture: multiple specialist detectors feed normalized signals through astrocyte layers into a lightweight coordinator. The design proves itself at small scale before expanding to full consciousness monitoring.

Prosody Astrocyte

Monitors pitch contour. 30-second rolling baseline, 60-second narrator calibration. Emits variance spikes, sustained shifts, and monotone events.

Implemented
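A rolling-baseline spike detector of the kind the prosody astrocyte describes could look like this. The class name, frame rate, and thresholds are assumptions for this sketch, not the implemented astrocyte:

```python
from collections import deque
from statistics import mean, stdev

class ProsodyBaseline:
    """Sketch of a rolling pitch baseline: keep the last `window_s`
    seconds of F0 samples and flag spikes against that window."""

    def __init__(self, window_s=30.0, frame_s=0.1, z_spike=2.0):
        self.samples = deque(maxlen=int(round(window_s / frame_s)))
        self.z_spike = z_spike

    def update(self, f0):
        """Feed one voiced F0 sample; return a variance-spike event or None."""
        event = None
        if len(self.samples) >= 10:  # crude stand-in for the calibration period
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(f0 - mu) / sigma >= self.z_spike:
                event = {
                    "type": "variance_spike",
                    "z": (f0 - mu) / sigma,
                    "direction": "rising" if f0 > mu else "falling",
                }
        self.samples.append(f0)
        return event
```

Because the baseline window slides, the detector adapts to each narrator's range rather than judging against a fixed reference.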

Arousal Astrocyte

SpeechBrain emotion model + energy analysis. Monitors activation level independent of valence.

Spec Complete

Pause Salience Astrocyte

Silence duration, breath patterns, dramatic pause detection. Treats silence as confident state, not missing data.

Spec Complete
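Treating silence as a confident state, as the pause salience spec describes, might look like this sketch — the names, energy floor, and pause threshold are illustrative:

```python
# Sketch: silence is reported as a confident state (F0 = 0.0 Hz,
# confidence = 1.0), never as missing data. Parameters are assumptions.

def classify_frame(energy, energy_floor=0.01):
    """Classify one frame: below the energy floor we report confident silence."""
    if energy < energy_floor:
        return {"f0_hz": 0.0, "confidence": 1.0, "state": "silence"}
    return {"state": "voiced"}

def dramatic_pauses(frames, frame_s=0.1, min_pause_s=0.8):
    """Group consecutive silence frames into pauses; report the long ones."""
    pauses, run = [], 0
    for f in frames:
        if f["state"] == "silence":
            run += 1
        else:
            if run * frame_s >= min_pause_s:
                pauses.append(run * frame_s)
            run = 0
    if run * frame_s >= min_pause_s:
        pauses.append(run * frame_s)
    return pauses
```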

Development Status

Phase 1 — Prosody ✓

  • OpenSMILE detector
  • Prosody astrocyte
  • Event coordinator
  • Proven on real audio

Frozen Feb 10, 2026

Phase 2 — Lyrics + Astrocytes

  • Forced lyric alignment (Gentle/MFA)
  • Arousal astrocyte implementation
  • Pause salience astrocyte
  • Multi-astrocyte coordination

Next milestone

Phase 3 — LLM Coordinator

  • Small reasoning model (~1.5B)
  • Richer synthesis of multi-astrocyte signals
  • Prompted only, zero fine-tuning
  • Preserves interpretive freedom

Future

Why This Matters

For Mental Health

Therapists could see vocal patterns across sessions — not as diagnosis, but as signal. "Your voice has been tighter the last three weeks" is information that preserves agency.

For Crisis Detection

Baseline drift detection could flag when someone's vocal patterns shift significantly over time — without ever labeling what the shift "means."

For Accessibility

Neurodivergent users who struggle to identify or articulate emotional states could benefit from having vocal signal translated into visible architecture.

For Music & Performance

Artists and producers could visualize what a voice is doing structurally — where it reaches, where it breaks, where it chooses restraint.

"This isn't just a detector firing. This is a voice being described in time — with shape, restraint, emphasis, and release intact."

— Independent architecture review, February 2026

Most AI shows labels, categories, predictions. This shows cadence, tension, restraint, release. Human signal integrity.

Interested in Echo Threads for research, clinical, or creative applications?

Get in Touch