Gideon committed on
Commit 3a9fd79 · 0 Parent(s)

VoiceKit MCP Server

Files changed (4)
  1. .gitignore +6 -0
  2. README.md +102 -0
  3. app.py +2454 -0
  4. requirements.txt +3 -0
.gitignore ADDED
@@ -0,0 +1,6 @@
+ .env
+ __pycache__/
+ *.pyc
+ gradio_temp/
+ .claude/
+ modal_app.py
README.md ADDED
@@ -0,0 +1,102 @@
+ ---
+ title: VoiceKit MCP Server
+ emoji: 🎙️
+ colorFrom: indigo
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 6.0.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ tags:
+ - building-mcp-track-creative
+ ---
+
+ # VoiceKit MCP Server
+
+ **Voice analysis toolkit exposing 6 MCP tools for AI assistants.**
+
+ VoiceKit provides comprehensive voice processing capabilities through the Model Context Protocol (MCP), enabling Claude and other AI assistants to analyze, compare, transcribe, and process audio.
+
+ ## Purpose
+
+ VoiceKit bridges the gap between AI assistants and advanced voice analysis. It allows:
+ - **Voice comparison** for mimicry games, pronunciation practice
+ - **Audio transcription** in multiple languages
+ - **Acoustic analysis** for voice coaching, music production
+ - **Background removal** for clean audio extraction
+
+ ## MCP Endpoint
+
+ ```
+ https://MCP-1st-Birthday-voicekit.hf.space/gradio_api/mcp/sse
+ ```
+
+ ## Quick Start
+
+ Add to your `claude_desktop_config.json`:
+
+ ```json
+ {
+   "mcpServers": {
+     "voicekit": {
+       "url": "https://MCP-1st-Birthday-voicekit.hf.space/gradio_api/mcp/sse"
+     }
+   }
+ }
+ ```
+
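For readers who would rather script the Quick Start step than edit the file by hand, the snippet below is a minimal sketch of merging the `voicekit` entry into an existing configuration. The helper name `add_voicekit_server` and the sample `other` server entry are assumptions made up for illustration and are not part of this commit; the URL is the endpoint quoted in the README.

```python
import json

# Endpoint as quoted in the README of this commit.
VOICEKIT_URL = "https://MCP-1st-Birthday-voicekit.hf.space/gradio_api/mcp/sse"


def add_voicekit_server(config: dict, url: str = VOICEKIT_URL) -> dict:
    """Return a copy of `config` with a `voicekit` entry under `mcpServers`.

    Hypothetical helper: existing server entries are kept untouched and the
    input dictionary is not modified.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["voicekit"] = {"url": url}
    merged["mcpServers"] = servers
    return merged


# Placeholder config with one unrelated, made-up server entry.
existing = {"mcpServers": {"other": {"command": "example"}}}
updated = add_voicekit_server(existing)
print(json.dumps(updated, indent=2))
```

Working on a copy keeps the sketch safe to run against a config that already lists other servers; writing the result back to disk is left out on purpose.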
+ ## Available Tools (6)
+
+ ### Primitive Tools
+
+ | Tool | Purpose | Input | Output |
+ |------|---------|-------|--------|
+ | `extract_embedding` | Get voice fingerprint | Audio file | 768-dim Wav2Vec2 vector |
+ | `compare_voices` | Measure voice similarity | 2 audio files | Similarity score (0-1) |
+ | `analyze_acoustic_features` | Analyze voice characteristics | Audio file | Pitch, energy, rhythm, tempo |
+ | `transcribe_audio` | Speech-to-text | Audio + language | Transcribed text |
+ | `isolate_voice` | Remove background noise/music | Audio file | Clean voice audio |
+
+ ### Composite Tool
+
+ | Tool | Purpose | Input | Output |
+ |------|---------|-------|--------|
+ | `analyze_voice_similarity` | Full voice analysis | 2 audios + text | 5 metrics + overall score |
+
+ ## Use Cases
+
+ ### Voice Mimicry Game
+ ```
+ User: "Compare my voice to this movie clip"
+ Claude: [uses analyze_voice_similarity] → Returns pronunciation, tone, pitch, rhythm, energy scores
+ ```
+
+ ### Audio Transcription
+ ```
+ User: "What does this Korean audio say?"
+ Claude: [uses transcribe_audio with language="ko"] → Returns Korean text
+ ```
+
+ ### Clean Audio Extraction
+ ```
+ User: "Remove the background music from this meme"
+ Claude: [uses isolate_voice] → Returns isolated voice track
+ ```
+
+ ## Architecture
+
+ ```
+ ┌─────────────────┐     MCP/SSE      ┌─────────────────┐     API      ┌─────────────────┐
+ │ Claude Desktop  │ ◄──────────────► │    HF Space     │ ◄──────────► │    Modal GPU    │
+ │  (MCP Client)   │                  │    (Gradio)     │              │   (Inference)   │
+ └─────────────────┘                  └─────────────────┘              └─────────────────┘
+ ```
+
+ - **Frontend**: Gradio 6 MCP Server on Hugging Face Spaces
+ - **Backend**: Modal serverless GPU for ML inference
+ - **Models**: Wav2Vec2, ElevenLabs Scribe STT, Voice Isolator
+
+ ## Demo
+
+ Try each tool directly in the tabs on the Space UI!
app.py ADDED
@@ -0,0 +1,2454 @@
+ """
+ VoiceKit - MCP Server for Voice Analysis
+
+ 6 MCP tools for voice processing (all accept base64 audio):
+ - Embedding extraction, voice comparison, acoustic analysis
+ - Speech-to-text, voice isolation, similarity analysis
+
+ MCP Endpoint: https://MCP-1st-Birthday-voicekit-test.hf.space/gradio_api/mcp/sse
+ """
+
+ import gradio as gr
+ import base64
+ import os
+ import json
+ import tempfile
+ import math
+
+ # Keep Gradio temp files in ./gradio_temp under the current working directory
+ GRADIO_TEMP_DIR = os.path.join(os.getcwd(), "gradio_temp")
+ os.makedirs(GRADIO_TEMP_DIR, exist_ok=True)
+ os.environ['GRADIO_TEMP_DIR'] = GRADIO_TEMP_DIR
+ tempfile.tempdir = GRADIO_TEMP_DIR
+
+ # Modal connection (requires MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in HF Secrets)
+ try:
+     import modal
+     AudioAnalyzer = modal.Cls.from_name("voice-semantle", "AudioAnalyzer")
+     analyzer = AudioAnalyzer()
+     modal_available = True
+     print("Modal connected!")
+ except Exception as e:
+     modal_available = False
+     analyzer = None
+     print(f"Modal not available: {e}")
+
+
+ def file_to_base64(file_path: str) -> str:
+     """Convert file path to base64 string"""
+     if not file_path:
+         return ""
+     with open(file_path, "rb") as f:
+         return base64.b64encode(f.read()).decode()
+
+
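Because every tool takes its audio as a base64 string, a caller has to encode the file before the call and decode any audio that comes back. The snippet below is a minimal sketch of that round trip using only the standard library. The names `encode_file` and `decode_to_bytes` are hypothetical, and the payload is a few placeholder bytes rather than real audio.

```python
import base64
import os
import tempfile


def encode_file(path: str) -> str:
    """Read a file and return its contents as a base64 text string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def decode_to_bytes(audio_base64: str) -> bytes:
    """Turn a base64 text string back into the original bytes."""
    return base64.b64decode(audio_base64)


# Placeholder bytes standing in for an audio file (not a valid WAV).
payload = b"RIFF....WAVEfmt "
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(payload)
    sample_path = tmp.name

encoded = encode_file(sample_path)
roundtrip = decode_to_bytes(encoded)
os.remove(sample_path)  # clean up the temporary file
print(roundtrip == payload)
```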
+ # ============================================================================
+ # MCP Tools (all accept base64 directly)
+ # ============================================================================
+
+ def extract_embedding(audio_base64: str) -> str:
+     """
+     Extract voice embedding using Wav2Vec2.
+
+     Computes a 768-dimensional vector representing voice characteristics.
+     Useful for voice comparison, speaker identification, etc.
+
+     Args:
+         audio_base64: Audio file as base64 encoded string
+
+     Returns:
+         JSON string with embedding_preview (first 5 values), embedding_length, model, dim
+     """
+     if not modal_available:
+         return json.dumps({"error": "Modal not available. Please set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in HF Secrets."})
+     if not audio_base64:
+         return json.dumps({"error": "No audio provided"})
+
+     try:
+         result = analyzer.extract_embedding.remote(audio_base64)
+         if "embedding" in result:
+             result["embedding_preview"] = result["embedding"][:5] + ["..."]
+             result["embedding_length"] = len(result["embedding"])
+             del result["embedding"]
+         return json.dumps(result, ensure_ascii=False, indent=2)
+     except Exception as e:
+         return json.dumps({"error": str(e)})
+
+
+ def match_voice(audio1_base64: str, audio2_base64: str) -> str:
+     """
+     Compare similarity between two voices.
+
+     Extracts Wav2Vec2 embeddings and calculates cosine similarity.
+     Useful for checking if the same person spoke with similar tone.
+
+     Args:
+         audio1_base64: First audio as base64 encoded string
+         audio2_base64: Second audio as base64 encoded string
+
+     Returns:
+         similarity (0-1), tone_score (0-100)
+     """
+     if not modal_available:
+         return json.dumps({"error": "Modal not available. Please set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in HF Secrets."})
+     if not audio1_base64 or not audio2_base64:
+         return json.dumps({"error": "Both audio files required"})
+
+     try:
+         result = analyzer.compare_voices.remote(audio1_base64, audio2_base64)
+         return json.dumps(result, ensure_ascii=False, indent=2)
+     except Exception as e:
+         return json.dumps({"error": str(e)})
+
+
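The `match_voice` docstring says the comparison is a cosine similarity over Wav2Vec2 embeddings. The backend that does this is not part of the commit, so the following is only a sketch of that calculation on plain Python lists. The function names are hypothetical, and the mapping from cosine similarity to a 0-100 `tone_score` (clamp to [0, 1], then scale) is an assumption rather than the backend's confirmed formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare; treat as dissimilar
    return dot / (norm_a * norm_b)


def tone_score(similarity):
    """Assumed mapping: clamp similarity to [0, 1] and scale to 0-100."""
    return round(max(0.0, min(1.0, similarity)) * 100)


# Parallel vectors point the same way; orthogonal vectors share nothing.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(tone_score(same), tone_score(orthogonal))
```

Cosine similarity ignores vector length, which is why the two parallel vectors score as identical even though one is twice as long.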
+ def analyze_acoustics(audio_base64: str) -> str:
+     """
+     Analyze acoustic features of audio.
+
+     Extracts pitch, energy, rhythm, tempo, and spectral characteristics.
+     Useful for understanding voice expressiveness and characteristics.
+
+     Args:
+         audio_base64: Audio file as base64 encoded string
+
+     Returns:
+         pitch, energy, rhythm, tempo, spectral information
+     """
+     if not modal_available:
+         return json.dumps({"error": "Modal not available. Please set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in HF Secrets."})
+     if not audio_base64:
+         return json.dumps({"error": "No audio provided"})
+
+     try:
+         result = analyzer.analyze_acoustic_features.remote(audio_base64)
+         return json.dumps(result, ensure_ascii=False, indent=2)
+     except Exception as e:
+         return json.dumps({"error": str(e)})
+
+
+ def transcribe_audio(audio_base64: str, language: str = "en") -> str:
+     """
+     Convert audio to text (Speech-to-Text).
+
+     Uses ElevenLabs Scribe v1 model for high-quality speech recognition.
+     Supports various languages.
+
+     Args:
+         audio_base64: Audio file as base64 encoded string
+         language: Language code (e.g., "en", "ko", "ja"). Default is "en"
+
+     Returns:
+         text, language, model
+     """
+     if not modal_available:
+         return json.dumps({"error": "Modal not available. Please set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in HF Secrets."})
+     if not audio_base64:
+         return json.dumps({"error": "No audio provided"})
+
+     try:
+         result = analyzer.transcribe_audio.remote(audio_base64, language)
+         return json.dumps(result, ensure_ascii=False, indent=2)
+     except Exception as e:
+         return json.dumps({"error": str(e)})
+
+
+ def isolate_voice(audio_base64: str) -> str:
+     """
+     Remove background music (BGM) and extract voice only.
+
+     Uses ElevenLabs Voice Isolator to remove music, noise, etc.
+     Useful for memes, songs, and other audio with background sounds.
+
+     Args:
+         audio_base64: Audio file as base64 encoded string
+
+     Returns:
+         isolated_audio_base64, metadata (bgm_detected, sizes, duration)
+     """
+     if not modal_available:
+         return json.dumps({"error": "Modal not available. Please set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in HF Secrets."})
+     if not audio_base64:
+         return json.dumps({"error": "No audio provided"})
+
+     try:
+         result = analyzer.isolate_voice.remote(audio_base64)
+         return json.dumps(result, ensure_ascii=False, indent=2)
+     except Exception as e:
+         return json.dumps({"error": str(e)})
+
+
+ def grade_voice(
+     user_audio_base64: str,
+     reference_audio_base64: str,
+     reference_text: str = "",
+     category: str = "meme"
+ ) -> str:
+     """
+     Comprehensively compare and analyze user voice with reference voice.
+
+     Evaluates with 5 metrics:
+     - pronunciation: Pronunciation accuracy (STT-based)
+     - tone: Voice timbre similarity (Wav2Vec2 embedding)
+     - pitch: Pitch matching
+     - rhythm: Rhythm sense
+     - energy: Energy expressiveness
+
+     Args:
+         user_audio_base64: User audio as base64 encoded string
+         reference_audio_base64: Reference audio as base64 encoded string
+         reference_text: Reference text (optional, enables pronunciation scoring)
+         category: Category (meme, song, movie) - determines weights
+
+     Returns:
+         JSON string with pitch, rhythm, energy, pronunciation, transcript, overall, user_text
+     """
+     if not modal_available:
+         return json.dumps({"error": "Modal not available. Please set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in HF Secrets."})
+     if not user_audio_base64 or not reference_audio_base64:
+         return json.dumps({"error": "Both user and reference audio required"})
+
+     try:
+         result = analyzer.analyze_audio.remote(
+             user_audio_base64=user_audio_base64,
+             reference_audio_base64=reference_audio_base64,
+             reference_text=reference_text if reference_text else None,
+             challenge_id="mcp_analysis",
+             category=category,
+         )
+         # Simplify output for backend/API use
+         metrics = result.get("metrics", {})
+         simple_result = {
+             "pitch": metrics.get("pitch", 0),
+             "rhythm": metrics.get("rhythm", 0),
+             "energy": metrics.get("energy", 0),
+             "pronunciation": metrics.get("pronunciation", 0),
+             "transcript": metrics.get("transcript", 0),
+             "overall": result.get("overall_score", 0),
+             "user_text": result.get("user_text", "")
+         }
+         return json.dumps(simple_result, ensure_ascii=False, indent=2)
+     except Exception as e:
+         return json.dumps({"error": str(e)})
+
+
+ # ============================================================================
+ # Demo Functions for UI
+ # ============================================================================
+
+ def demo_acoustic_analysis(audio_file):
+     """Acoustic Analysis - Analyze pitch, energy, rhythm, tempo"""
+     if not audio_file:
+         return create_acoustic_empty()
+
+     audio_b64 = file_to_base64(audio_file)
+     result_json = analyze_acoustics(audio_b64)
+
+     try:
+         result = json.loads(result_json)
+         if "error" in result:
+             return f'''<div style="color: #ef4444; padding: 20px; background: #fee; border-radius: 12px; border: 1px solid #fca5a5;">
+                 <strong>Error in result:</strong><br>{result.get("error", "Unknown error")}
+             </div>'''
+         return create_acoustic_visualization(result)
+     except Exception as e:
+         return f'''<div style="color: #ef4444; padding: 20px; background: #fee; border-radius: 12px; border: 1px solid #fca5a5;">
+             <strong>Parsing Error:</strong> {str(e)}<br><br>
+             <strong>Raw Result (first 500 chars):</strong><br>
+             <code style="display: block; padding: 10px; background: white; border-radius: 4px; overflow-x: auto; font-size: 12px;">{result_json[:500]}</code>
+         </div>'''
+
+
+ def demo_transcribe_audio(audio_file, language):
+     """Audio Transcription"""
+     if not audio_file:
+         return create_transcription_empty()
+
+     audio_b64 = file_to_base64(audio_file)
+     result_json = transcribe_audio(audio_b64, language)
+
+     try:
+         result = json.loads(result_json)
+         if "error" in result:
+             return create_transcription_empty()
+         text = result.get("text", "")
+         return create_transcription_visualization(text)
+     except Exception:
+         return create_transcription_empty()
+
+
+ def demo_clean_extraction(audio_file):
+     """Clean Audio Extraction - returns audio file only"""
+     if not audio_file:
+         return None
+
+     audio_b64 = file_to_base64(audio_file)
+     result_json = isolate_voice(audio_b64)
+
+     try:
+         result = json.loads(result_json)
+         if "error" in result:
+             return None
+
+         # Convert isolated audio base64 back to file
+         import tempfile
+         isolated_audio_bytes = base64.b64decode(result["isolated_audio_base64"])
+         with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
+             tmp.write(isolated_audio_bytes)
+             isolated_audio_path = tmp.name
+
+         return isolated_audio_path
+     except Exception:
+         return None
+
+
+ def demo_extract_embedding(audio_file):
+     """Extract Embedding - extract voice fingerprint"""
+     if not audio_file:
+         return create_embedding_empty()
+
+     audio_b64 = file_to_base64(audio_file)
+     result_json = extract_embedding(audio_b64)
+
+     try:
+         result = json.loads(result_json)
+         if "error" in result:
+             return f'''<div style="color: #ef4444; padding: 20px; background: #fee; border-radius: 12px; border: 1px solid #fca5a5;">
+                 <strong>Error in result:</strong><br>{result.get("error", "Unknown error")}
+             </div>'''
+         return create_embedding_visualization(result)
+     except Exception as e:
+         return f'''<div style="color: #ef4444; padding: 20px; background: #fee; border-radius: 12px; border: 1px solid #fca5a5;">
+             <strong>Parsing Error:</strong> {str(e)}<br><br>
+             <strong>Raw Result (first 500 chars):</strong><br>
+             <code style="display: block; padding: 10px; background: white; border-radius: 4px; overflow-x: auto; font-size: 12px;">{result_json[:500]}</code>
+         </div>'''
+
+
+ def demo_match_voice(audio1, audio2):
+     """Compare Voices - compare two voice similarities"""
+     if not audio1 or not audio2:
+         return create_compare_empty()
+
+     audio1_b64 = file_to_base64(audio1)
+     audio2_b64 = file_to_base64(audio2)
+     result_json = match_voice(audio1_b64, audio2_b64)
+
+     try:
+         result = json.loads(result_json)
+         if "error" in result:
+             return create_compare_empty()
+         return create_compare_visualization(result)
+     except Exception:
+         return create_compare_empty()
+
+
+ def demo_voice_similarity(user_audio, ref_audio):
+     """Voice Similarity - comprehensive 5-metric analysis"""
+     if not user_audio or not ref_audio:
+         return create_similarity_empty()
+
+     user_b64 = file_to_base64(user_audio)
+     ref_b64 = file_to_base64(ref_audio)
+     result_json = grade_voice(user_b64, ref_b64, "", "meme")
+
+     try:
+         result = json.loads(result_json)
+         if "error" in result:
+             return create_similarity_empty()
+         return create_similarity_visualization(result)
+     except Exception:
+         return create_similarity_empty()
+
+
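Each tool returns a JSON string and signals failure through an `error` key, which is why every demo wrapper repeats the same parse-and-check steps. The snippet below is a hedged sketch of how that pattern could be factored into one place. `parse_tool_result` is a hypothetical helper that does not exist in this commit, and the sample strings are made up to mirror the shapes shown in the tool functions.

```python
import json
from typing import Optional, Tuple


def parse_tool_result(result_json: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (result, None) on success or (None, message) on failure."""
    try:
        result = json.loads(result_json)
    except (TypeError, ValueError) as exc:  # JSONDecodeError is a ValueError
        return None, f"Could not parse tool output: {exc}"
    if not isinstance(result, dict):
        return None, "Tool output is not a JSON object"
    if "error" in result:
        return None, str(result["error"])
    return result, None


# Illustrative inputs: a success payload, a tool-level error, and bad JSON.
ok, ok_err = parse_tool_result('{"text": "hello", "language": "en"}')
failed, failed_err = parse_tool_result('{"error": "No audio provided"}')
broken, broken_err = parse_tool_result("not json")
print(ok, failed_err, broken_err)
```

Returning the message alongside the result lets a caller choose between showing the error and falling back to an empty state, which the wrappers above currently decide case by case.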
+ # ============================================================================
+ # Visualization Functions
+ # ============================================================================
+
+ def create_acoustic_empty():
+     """Empty state for acoustic analysis"""
+     return """
+     <div style="
+         background: rgba(10, 10, 26, 0.4);
+         border: 1px solid rgba(124, 58, 237, 0.2);
+         border-radius: 16px;
+         padding: 30px 20px;
+         text-align: center;
+         height: 100%;
+         display: flex;
+         flex-direction: column;
+         align-items: center;
+         justify-content: center;
+     ">
+         <div style="margin-bottom: 12px; opacity: 0.5;">
+             <svg width="48" height="48" viewBox="0 0 24 24" fill="none" style="margin: 0 auto; display: block;">
+                 <path d="M22 10C22 10 20 4 17 4C14 4 12 16 9 16C6 16 4 10 2 10" stroke="#7c3aed" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
+                 <g transform="translate(13, 11)">
+                     <circle cx="5" cy="5" r="4" stroke="#7c3aed" stroke-width="1.5"/>
+                     <path d="M8 8L11 11" stroke="#7c3aed" stroke-width="1.5" stroke-linecap="round"/>
+                 </g>
+             </svg>
+         </div>
+         <div style="color: #a5b4fc; font-size: 12px; line-height: 1.5;">
+             Upload audio to analyze acoustic features
+         </div>
+     </div>
+     """
+
+
+ def create_acoustic_visualization(result):
+     """Acoustic analysis visualization with radar chart"""
+     pitch = result.get("pitch", {})
+     energy = result.get("energy", {})
+     rhythm = result.get("rhythm", {})
+     tempo = result.get("tempo", 0)
+     spectral = result.get("spectral", {})
+
+     # Use pre-calculated scores from Modal backend (already 0-100)
+     pitch_norm = pitch.get("score", 0)
+     energy_norm = energy.get("score", 0)
+     rhythm_norm = rhythm.get("score", 0)
+     spectral_norm = spectral.get("score", 0)
+
+     # Tempo: normalize BPM to 0-100 (60-180 BPM range)
+     tempo_bpm = tempo
+     tempo_norm = min(100, max(0, (tempo_bpm - 60) / 120 * 100)) if tempo_bpm > 0 else 0
+
+     # Radar chart calculation
+     center_x, center_y = 150, 150
+     radius = 110
+
+     # 5 metrics in order: Pitch(top), Energy(top-right), Rhythm(bottom-right), Tempo(bottom-left), Spectral(top-left)
+     metrics = [
+         ("Pitch", pitch_norm, -90),  # 0° - 90° = -90° (top)
+         ("Energy", energy_norm, -18),  # 72° - 90° = -18° (top-right)
+         ("Rhythm", rhythm_norm, 54),  # 144° - 90° = 54° (bottom-right)
+         ("Tempo", tempo_norm, 126),  # 216° - 90° = 126° (bottom-left)
+         ("Spectral", spectral_norm, 198)  # 288° - 90° = 198° (top-left)
+     ]
+
+     # Calculate polygon points for data
+     data_points = []
+     for _, value, angle_deg in metrics:
+         angle_rad = math.radians(angle_deg)
+         point_radius = (value / 100) * radius
+         x = center_x + point_radius * math.cos(angle_rad)
+         y = center_y + point_radius * math.sin(angle_rad)
+         data_points.append(f"{x:.2f},{y:.2f}")
+
+     # Background concentric pentagons (20, 40, 60, 80, 100)
+     def create_pentagon_points(scale):
+         points = []
+         for _, _, angle_deg in metrics:
+             angle_rad = math.radians(angle_deg)
+             r = radius * scale
+             x = center_x + r * math.cos(angle_rad)
+             y = center_y + r * math.sin(angle_rad)
+             points.append(f"{x:.2f},{y:.2f}")
+         return " ".join(points)
+
+     background_pentagons = ""
+     for scale in [0.2, 0.4, 0.6, 0.8, 1.0]:
+         background_pentagons += f'<polygon points="{create_pentagon_points(scale)}" fill="none" stroke="rgba(124, 58, 237, 0.15)" stroke-width="1"/>'
+
+     # Axis lines from center to vertices
+     axis_lines = ""
+     for _, _, angle_deg in metrics:
+         angle_rad = math.radians(angle_deg)
+         x = center_x + radius * math.cos(angle_rad)
+         y = center_y + radius * math.sin(angle_rad)
+         axis_lines += f'<line x1="{center_x}" y1="{center_y}" x2="{x:.2f}" y2="{y:.2f}" stroke="rgba(124, 58, 237, 0.3)" stroke-width="1"/>'
+
+     # Labels at vertices
+     labels = ""
+     for label, value, angle_deg in metrics:
+         angle_rad = math.radians(angle_deg)
+         # Position label outside the pentagon
+         label_radius = radius + 25
+         x = center_x + label_radius * math.cos(angle_rad)
+         y = center_y + label_radius * math.sin(angle_rad)
+         labels += f'''<text x="{x:.2f}" y="{y:.2f}" text-anchor="middle" dominant-baseline="middle" fill="#a5b4fc" font-size="11" font-weight="600">
+             {label}
+             <tspan x="{x:.2f}" dy="12" fill="#a855f7" font-size="13" font-weight="700">{int(value)}</tspan>
+         </text>'''
+
+     return f"""
+     <div style="
+         background: rgba(10, 10, 26, 0.6);
+         border: 1px solid rgba(124, 58, 237, 0.3);
+         border-radius: 16px;
+         padding: 20px;
+         display: flex;
+         align-items: center;
+         justify-content: center;
+     ">
+         <svg width="300" height="300" viewBox="0 0 300 300">
+             <!-- Background pentagons -->
+             {background_pentagons}
+
+             <!-- Axis lines -->
+             {axis_lines}
+
+             <!-- Data polygon -->
+             <polygon points="{' '.join(data_points)}"
+                      fill="rgba(124, 58, 237, 0.3)"
+                      stroke="#a855f7"
+                      stroke-width="2"/>
+
+             <!-- Data points -->
+             {''.join([f'<circle cx="{pt.split(",")[0]}" cy="{pt.split(",")[1]}" r="4" fill="#a855f7"/>' for pt in data_points])}
+
+             <!-- Labels -->
+             {labels}
+         </svg>
+     </div>
+     """
+
+
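The radar chart places each metric on a pentagon by turning a 0-100 score and an angle into SVG coordinates, after tempo has been mapped from the 60-180 BPM range onto the same scale. The snippet below is a standalone sketch of that arithmetic, reusing the constants from `create_acoustic_visualization` (center 150,150 and radius 110). The helper names `normalize_tempo` and `radar_point` are made up for illustration.

```python
import math

# Constants copied from create_acoustic_visualization.
CENTER_X, CENTER_Y, RADIUS = 150, 150, 110


def normalize_tempo(bpm):
    """Map 60-180 BPM linearly onto 0-100, clamping values outside the range."""
    if bpm <= 0:
        return 0
    return min(100, max(0, (bpm - 60) / 120 * 100))


def radar_point(value, angle_deg):
    """Convert a 0-100 score on a given axis angle into SVG (x, y).

    SVG's y axis grows downward, so -90 degrees points to the top of the chart.
    """
    angle_rad = math.radians(angle_deg)
    r = (value / 100) * RADIUS
    return (CENTER_X + r * math.cos(angle_rad),
            CENTER_Y + r * math.sin(angle_rad))


top = radar_point(100, -90)       # full score on the "Pitch" axis
mid_tempo = normalize_tempo(120)  # halfway through the 60-180 BPM range
print(f"{top[0]:.2f},{top[1]:.2f}", mid_tempo)
```

A full score on the top axis lands one radius above the center, and a score of 0 collapses onto the center, which is why an all-zero result draws no visible polygon.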
+ def create_mimicry_empty():
+     """Empty state for voice mimicry game"""
+     return """
+     <div style="
+         background: rgba(10, 10, 26, 0.4);
+         border: 1px solid rgba(124, 58, 237, 0.2);
+         border-radius: 16px;
+         padding: 30px 20px;
+         text-align: center;
+         height: 100%;
+         display: flex;
+         flex-direction: column;
+         align-items: center;
+         justify-content: center;
+     ">
+         <div style="margin-bottom: 12px; opacity: 0.5;">
+             <svg width="48" height="48" viewBox="0 0 24 24" fill="none" style="margin: 0 auto; display: block;">
+                 <defs>
+                     <linearGradient id="micGradEmpty" x1="0%" y1="0%" x2="100%" y2="100%">
+                         <stop offset="0%" style="stop-color:#8b5cf6"/>
+                         <stop offset="100%" style="stop-color:#6366f1"/>
+                     </linearGradient>
+                 </defs>
+                 <path d="M12 14c1.66 0 3-1.34 3-3V5c0-1.66-1.34-3-3-3S9 3.34 9 5v6c0 1.66 1.34 3 3 3z" fill="url(#micGradEmpty)"/>
+                 <path d="M17 11c0 2.76-2.24 5-5 5s-5-2.24-5-5H5c0 3.53 2.61 6.43 6 6.92V21h2v-3.08c3.39-.49 6-3.39 6-6.92h-2z" fill="url(#micGradEmpty)"/>
+             </svg>
+         </div>
+         <div style="color: #a5b4fc; font-size: 12px; line-height: 1.5;">
+             Upload reference and your voice to see similarity scores
+         </div>
+     </div>
+     """
+
+
+ def create_mimicry_visualization(result):
+     """Voice mimicry score visualization with progress bars"""
+     pronunciation = result.get("pronunciation", 0)
+     tone = result.get("transcript", 0)  # Tone score
+     pitch = result.get("pitch", 0)
+     rhythm = result.get("rhythm", 0)
+     energy = result.get("energy", 0)
+
+     def create_progress_bar(label, value):
+         return f"""
+         <div style="display: flex; align-items: center; gap: 12px; margin-bottom: 10px;">
+             <div style="flex: 1;">
+                 <div style="font-size: 11px; color: #cbd5e1; margin-bottom: 4px;">{label}</div>
+                 <div style="
+                     height: 6px;
+                     background: rgba(124, 58, 237, 0.2);
+                     border-radius: 3px;
+                     overflow: hidden;
+                 ">
+                     <div style="
+                         height: 100%;
+                         width: {value}%;
+                         background: linear-gradient(90deg, #6366f1, #22d3ee);
+                         border-radius: 3px;
+                     "></div>
+                 </div>
+             </div>
+             <div style="
+                 font-size: 14px;
+                 font-weight: 700;
+                 color: #22d3ee;
+                 min-width: 32px;
+                 text-align: right;
+             ">{value}</div>
+         </div>
+         """
+
+     return f"""
+     <div style="
+         background: rgba(10, 10, 26, 0.6);
+         border: 1px solid rgba(124, 58, 237, 0.3);
+         border-radius: 16px;
+         padding: 20px;
+         height: 100%;
+         display: flex;
+         flex-direction: column;
+     ">
+         <div style="
+             display: flex;
+             align-items: center;
+             gap: 10px;
+             margin-bottom: 16px;
+             padding-bottom: 14px;
+             border-bottom: 1px solid rgba(124, 58, 237, 0.2);
+         ">
+             <div style="
+                 width: 40px;
+                 height: 40px;
+                 border-radius: 10px;
+                 background: linear-gradient(135deg, #7c3aed, #6366f1);
+                 display: flex;
+                 align-items: center;
+                 justify-content: center;
+                 flex-shrink: 0;
+             ">
+                 <svg width="24" height="24" viewBox="0 0 24 24" fill="none">
+                     <circle cx="12" cy="12" r="10" fill="rgba(255, 255, 255, 0.2)" stroke="white" stroke-width="1.5"/>
+                     <text x="12" y="16" text-anchor="middle" font-size="10" fill="white" font-weight="bold">AI</text>
+                 </svg>
+             </div>
+             <div style="flex: 1; min-width: 0;">
+                 <div style="font-size: 10px; color: #a5b4fc; text-transform: uppercase; letter-spacing: 1px;">CLAUDE</div>
+                 <div style="font-size: 11px; color: #cbd5e1; line-height: 1.4;">
+                     Wow, that voice input, takes analytical skills of course but I'll handle it
+                 </div>
+             </div>
+         </div>
+
+         <div style="flex: 1;">
+             {create_progress_bar("Pronunciation", pronunciation)}
+             {create_progress_bar("Tone", tone)}
+             {create_progress_bar("Pitch", pitch)}
+             {create_progress_bar("Rhythm", rhythm)}
+             {create_progress_bar("Energy", energy)}
+         </div>
+     </div>
+     """
+
+
630
+ def create_transcription_empty():
631
+ """Empty state for transcription"""
632
+ return """
633
+ <div style="
634
+ background: rgba(10, 10, 26, 0.4);
635
+ border: 1px solid rgba(124, 58, 237, 0.2);
636
+ border-radius: 12px;
637
+ padding: 20px;
638
+ text-align: center;
639
+ color: #a5b4fc;
640
+ font-size: 13px;
641
+ ">
642
+ Upload audio to transcribe
643
+ </div>
644
+ """
645
+
646
+
647
+ def create_transcription_visualization(text):
+     """Simple text display for transcription result"""
+     return f"""
650
+ <div style="
651
+ background: rgba(10, 10, 26, 0.6);
652
+ border: 1px solid rgba(124, 58, 237, 0.3);
653
+ border-radius: 12px;
654
+ padding: 20px;
655
+ color: #e2e8f0;
656
+ font-size: 20px;
657
+ line-height: 1.6;
658
+ white-space: pre-wrap;
659
+ word-wrap: break-word;
660
+ ">
661
+ {text if text else "Transcription completed"}
662
+ </div>
663
+ """
664
+
665
+
666
+ def create_embedding_empty():
+     """Empty state for embedding extraction"""
+     return """
669
+ <div style="
670
+ background: rgba(10, 10, 26, 0.4);
671
+ border: 1px solid rgba(124, 58, 237, 0.2);
672
+ border-radius: 16px;
673
+ padding: 30px 20px;
674
+ text-align: center;
675
+ height: 100%;
676
+ display: flex;
677
+ flex-direction: column;
678
+ align-items: center;
679
+ justify-content: center;
680
+ ">
681
+ <div style="margin-bottom: 12px; opacity: 0.5;">
682
+ <svg width="48" height="48" viewBox="0 0 24 24" fill="none" style="margin: 0 auto; display: block;">
683
+ <path d="M21 16V8L12 4L3 8V16L12 20L21 16Z" stroke="#A855F7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
684
+ <path d="M12 4V12M12 12V20M12 12L21 8M12 12L3 8" stroke="#A855F7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
685
+ <circle cx="12" cy="12" r="2" fill="#A855F7"/>
686
+ </svg>
687
+ </div>
688
+ <div style="color: #a5b4fc; font-size: 12px; line-height: 1.5;">
689
+ Upload audio to extract voice embedding
690
+ </div>
691
+ </div>
692
+ """
693
+
694
+
695
+ def create_embedding_visualization(result):
+     """Embedding extraction visualization"""
+     model = result.get("model", "Wav2Vec2")
+     dim = result.get("embedding_length", result.get("dim", 768))
+     preview = result.get("embedding_preview", [])
+
+     # Filter only numeric values to avoid format errors with strings like "..."
+     if preview:
+         numeric_preview = [v for v in preview if isinstance(v, (int, float))]
+         preview_str = ", ".join([f"{v:.4f}" for v in numeric_preview]) if numeric_preview else "..."
+     else:
+         preview_str = "..."
+
+     return f"""
709
+ <div style="
710
+ background: rgba(10, 10, 26, 0.6);
711
+ border: 1px solid rgba(124, 58, 237, 0.3);
712
+ border-radius: 16px;
713
+ padding: 20px;
714
+ height: 100%;
715
+ display: flex;
716
+ flex-direction: column;
717
+ ">
718
+ <div style="display: flex; justify-content: space-between; align-items: center; margin-bottom: 10px; padding: 10px; background: rgba(124, 58, 237, 0.1); border-radius: 8px;">
719
+ <div style="font-size: 16px; color: #cbd5e1; font-weight: 600;">Model</div>
720
+ <div style="font-size: 18px; font-weight: 700; color: #22d3ee;">{model}</div>
721
+ </div>
722
+ <div style="display: flex; justify-content: space-between; align-items: center; margin-bottom: 10px; padding: 10px; background: rgba(124, 58, 237, 0.1); border-radius: 8px;">
723
+ <div style="font-size: 16px; color: #cbd5e1; font-weight: 600;">Dimensions</div>
724
+ <div style="font-size: 18px; font-weight: 700; color: #22d3ee;">{dim}</div>
725
+ </div>
726
+ <div style="padding: 10px; background: rgba(124, 58, 237, 0.1); border-radius: 8px;">
727
+ <div style="font-size: 16px; color: #cbd5e1; font-weight: 600; margin-bottom: 8px;">Preview</div>
728
+ <div style="font-size: 14px; font-family: monospace; color: #22d3ee; overflow: hidden; text-overflow: ellipsis; white-space: nowrap;">
729
+ [{preview_str}]
730
+ </div>
731
+ </div>
732
+ </div>
733
+ """
734
+
735
+
736
+ def create_compare_empty():
+     """Empty state for voice comparison"""
+     return """
739
+ <div style="
740
+ background: rgba(10, 10, 26, 0.4);
741
+ border: 1px solid rgba(124, 58, 237, 0.2);
742
+ border-radius: 16px;
743
+ padding: 30px 20px;
744
+ text-align: center;
745
+ height: 100%;
746
+ display: flex;
747
+ flex-direction: column;
748
+ align-items: center;
749
+ justify-content: center;
750
+ ">
751
+ <div style="margin-bottom: 12px; opacity: 0.5;">
752
+ <svg width="48" height="48" viewBox="0 0 24 24" fill="none" style="margin: 0 auto; display: block;">
753
+ <path d="M2 10V14" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
754
+ <path d="M5 8V16" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
755
+ <path d="M8 11V13" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
756
+ <path d="M22 10V14" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
757
+ <path d="M19 7V17" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
758
+ <path d="M16 11V13" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
759
+ <path d="M10 12H14" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
760
+ </svg>
761
+ </div>
762
+ <div style="color: #a5b4fc; font-size: 12px; line-height: 1.5;">
763
+ Upload two audio files to compare voices
764
+ </div>
765
+ </div>
766
+ """
767
+
768
+
769
+ def create_compare_visualization(result):
+     """Voice comparison visualization with similarity score"""
+     similarity = result.get("similarity", 0)
+     tone_score = result.get("tone_score", 0)
+
+     # Convert similarity to percentage
+     similarity_pct = int(similarity * 100)
+
+     # Color based on similarity - Purple theme matching VOICE SIMILARITY
+     if similarity_pct >= 80:
+         color = "#c084fc"  # Light purple (high score)
+     elif similarity_pct >= 60:
+         color = "#a855f7"  # Medium purple (medium score)
+     else:
+         color = "#7c3aed"  # Dark purple (low score)
+
+     return f"""
786
+ <div style="
787
+ background: rgba(10, 10, 26, 0.6);
788
+ border: 1px solid rgba(124, 58, 237, 0.3);
789
+ border-radius: 16px;
790
+ padding: 20px;
791
+ height: 100%;
792
+ display: flex;
793
+ align-items: flex-end;
794
+ justify-content: center;
795
+ padding-bottom: 40px;
796
+ ">
797
+ <div style="
798
+ width: 160px;
799
+ height: 160px;
800
+ border-radius: 50%;
801
+ background: conic-gradient({color} 0deg {similarity_pct * 3.6}deg, rgba(124, 58, 237, 0.2) {similarity_pct * 3.6}deg 360deg);
802
+ display: flex;
803
+ align-items: center;
804
+ justify-content: center;
805
+ ">
806
+ <div style="
807
+ width: 130px;
808
+ height: 130px;
809
+ border-radius: 50%;
810
+ background: rgba(10, 10, 26, 0.9);
811
+ display: flex;
812
+ flex-direction: column;
813
+ align-items: center;
814
+ justify-content: center;
815
+ ">
816
+ <span style="font-size: 40px; font-weight: 700; color: {color};">{similarity_pct}</span>
817
+ <span style="font-size: 11px; color: #a5b4fc; letter-spacing: 0.5px;">SIMILARITY</span>
818
+ </div>
819
+ </div>
820
+ </div>
821
+ """
822
+
823
+
824
+ def create_similarity_empty():
+     """Empty state for voice similarity analysis"""
+     return """
827
+ <div style="
828
+ background: rgba(10, 10, 26, 0.4);
829
+ border: 1px solid rgba(124, 58, 237, 0.2);
830
+ border-radius: 16px;
831
+ padding: 30px 20px;
832
+ text-align: center;
833
+ height: 100%;
834
+ display: flex;
835
+ flex-direction: column;
836
+ align-items: center;
837
+ justify-content: center;
838
+ ">
839
+ <div style="margin-bottom: 12px; opacity: 0.5;">
840
+ <svg width="48" height="48" viewBox="0 0 24 24" fill="none" style="margin: 0 auto; display: block;">
841
+ <circle cx="12" cy="12" r="9" stroke="#A855F7" stroke-width="1" opacity="0.3"/>
842
+ <path d="M12 5L18 9L16.5 18H7.5L6 9L12 5Z" fill="#A855F7" fill-opacity="0.4" stroke="#A855F7" stroke-width="2" stroke-linejoin="round"/>
843
+ <circle cx="12" cy="5" r="1.5" fill="#A855F7"/>
844
+ </svg>
845
+ </div>
846
+ <div style="color: #a5b4fc; font-size: 12px; line-height: 1.5;">
847
+ Upload audio files for comprehensive similarity analysis
848
+ </div>
849
+ </div>
850
+ """
851
+
852
+
853
+ def create_similarity_visualization(result):
+     """Voice similarity visualization with radar chart"""
+     overall = result.get("overall", 0)
+
+     pronunciation = result.get("pronunciation", 0)
+     transcript = result.get("transcript", 0)
+     pitch = result.get("pitch", 0)
+     rhythm = result.get("rhythm", 0)
+     energy = result.get("energy", 0)
+
+     # Color based on overall score - Purple theme
+     if overall >= 80:
+         color = "#c084fc"  # Light purple (high score)
+     elif overall >= 60:
+         color = "#a855f7"  # Medium purple (medium score)
+     else:
+         color = "#7c3aed"  # Dark purple (low score)
+
+     # Radar chart calculation
+     center_x, center_y = 150, 150
+     radius = 110
+
+     # 5 metrics in order: Pronunciation(top), Transcript(top-right), Pitch(bottom-right), Energy(bottom-left), Rhythm(top-left)
+     metrics = [
+         ("Pronunciation", pronunciation, -90),  # 0° - 90° = -90° (top)
+         ("Transcript", transcript, -18),        # 72° - 90° = -18° (top-right)
+         ("Pitch", pitch, 54),                   # 144° - 90° = 54° (bottom-right)
+         ("Energy", energy, 126),                # 216° - 90° = 126° (bottom-left)
+         ("Rhythm", rhythm, 198)                 # 288° - 90° = 198° (top-left)
+     ]
+
+     # Calculate polygon points for data
+     data_points = []
+     for _, value, angle_deg in metrics:
+         angle_rad = math.radians(angle_deg)
+         point_radius = (value / 100) * radius
+         x = center_x + point_radius * math.cos(angle_rad)
+         y = center_y + point_radius * math.sin(angle_rad)
+         data_points.append(f"{x:.2f},{y:.2f}")
+
+     # Background concentric pentagons (20, 40, 60, 80, 100)
+     def create_pentagon_points(scale):
+         points = []
+         for _, _, angle_deg in metrics:
+             angle_rad = math.radians(angle_deg)
+             r = radius * scale
+             x = center_x + r * math.cos(angle_rad)
+             y = center_y + r * math.sin(angle_rad)
+             points.append(f"{x:.2f},{y:.2f}")
+         return " ".join(points)
+
+     background_pentagons = ""
+     for scale in [0.2, 0.4, 0.6, 0.8, 1.0]:
+         background_pentagons += f'<polygon points="{create_pentagon_points(scale)}" fill="none" stroke="rgba(124, 58, 237, 0.15)" stroke-width="1"/>'
+
+     # Axis lines from center to vertices
+     axis_lines = ""
+     for _, _, angle_deg in metrics:
+         angle_rad = math.radians(angle_deg)
+         x = center_x + radius * math.cos(angle_rad)
+         y = center_y + radius * math.sin(angle_rad)
+         axis_lines += f'<line x1="{center_x}" y1="{center_y}" x2="{x:.2f}" y2="{y:.2f}" stroke="rgba(124, 58, 237, 0.3)" stroke-width="1"/>'
+
+     # Labels at vertices
+     labels = ""
+     for label, value, angle_deg in metrics:
+         angle_rad = math.radians(angle_deg)
+         # Position label outside the pentagon
+         label_radius = radius + 25
+         x = center_x + label_radius * math.cos(angle_rad)
+         y = center_y + label_radius * math.sin(angle_rad)
+         labels += f'''<text x="{x:.2f}" y="{y:.2f}" text-anchor="middle" dominant-baseline="middle" fill="#a5b4fc" font-size="11" font-weight="600">
+             {label}
+             <tspan x="{x:.2f}" dy="12" fill="#a855f7" font-size="13" font-weight="700">{value}</tspan>
+         </text>'''
+
+     return f"""
930
+ <div style="
931
+ background: rgba(10, 10, 26, 0.6);
932
+ border: 1px solid rgba(124, 58, 237, 0.3);
933
+ border-radius: 16px;
934
+ padding: 20px;
935
+ display: flex;
936
+ align-items: center;
937
+ gap: 30px;
938
+ ">
939
+ <!-- Left: Overall Score Donut -->
940
+ <div style="flex: 1; display: flex; align-items: center; justify-content: center;">
941
+ <div style="
942
+ width: 160px;
943
+ height: 160px;
944
+ border-radius: 50%;
945
+ background: conic-gradient({color} 0deg {overall * 3.6}deg, rgba(124, 58, 237, 0.2) {overall * 3.6}deg 360deg);
946
+ display: flex;
947
+ align-items: center;
948
+ justify-content: center;
949
+ ">
950
+ <div style="
951
+ width: 130px;
952
+ height: 130px;
953
+ border-radius: 50%;
954
+ background: rgba(10, 10, 26, 0.9);
955
+ display: flex;
956
+ flex-direction: column;
957
+ align-items: center;
958
+ justify-content: center;
959
+ ">
960
+ <span style="font-size: 40px; font-weight: 700; color: {color};">{overall}</span>
961
+ <span style="font-size: 11px; color: #a5b4fc; letter-spacing: 0.5px;">OVERALL</span>
962
+ </div>
963
+ </div>
964
+ </div>
965
+
966
+ <!-- Right: Radar Chart -->
967
+ <div style="flex: 1; display: flex; align-items: center; justify-content: center;">
968
+ <svg width="300" height="300" viewBox="0 0 300 300">
969
+ <!-- Background pentagons -->
970
+ {background_pentagons}
971
+
972
+ <!-- Axis lines -->
973
+ {axis_lines}
974
+
975
+ <!-- Data polygon -->
976
+ <polygon points="{' '.join(data_points)}"
977
+ fill="rgba(124, 58, 237, 0.3)"
978
+ stroke="#a855f7"
979
+ stroke-width="2"/>
980
+
981
+ <!-- Data points -->
982
+ {''.join([f'<circle cx="{pt.split(",")[0]}" cy="{pt.split(",")[1]}" r="4" fill="#a855f7"/>' for pt in data_points])}
983
+
984
+ <!-- Labels -->
985
+ {labels}
986
+ </svg>
987
+ </div>
988
+ </div>
989
+ """
990
+
991
+
992
+ # Clean audio functions removed - using gr.Audio component directly
993
+
994
+
995
+
996
+
997
+ # ============================================================================
998
+ # Gradio Interface with Dark Theme
999
+ # ============================================================================
1000
+
1001
+ custom_css = """
1002
+ /* ===== FORCE DARK MODE FOR BOTH LIGHT AND DARK THEMES ===== */
1003
+ :root, html, body, .gradio-container, .main, .app, [data-testid="block-container"] {
1004
+ background: linear-gradient(180deg, #0a0a1a 0%, #0f0f23 100%) !important;
1005
+ background-color: #0a0a1a !important;
1006
+ color: #ffffff !important;
1007
+ }
1008
+
1009
+ /* Force dark background on all containers */
1010
+ .gradio-container, .gradio-container *, .main, .main *, .app, .app * {
1011
+ background-color: transparent !important;
1012
+ }
1013
+
1014
+ /* Force light text colors for the dark theme */
1015
+ .gradio-container label, .gradio-container p, .gradio-container span,
1016
+ .gradio-container div:not(.card):not(.terminal-window):not([class*="button"]) {
1017
+ color: #ffffff !important;
1018
+ }
1019
+
1020
+ /* ===== GLOBAL STYLES ===== */
1021
+ body {
1022
+ background: linear-gradient(180deg, #0a0a1a 0%, #0f0f23 100%) !important;
1023
+ color: #ffffff !important;
1024
+ font-family: system-ui, -apple-system, sans-serif;
1025
+ }
1026
+
1027
+ .gradio-container {
1028
+ max-width: 100% !important;
1029
+ width: 100% !important;
1030
+ padding: 0px 16px 20px 16px !important;
1031
+ background: transparent !important;
1032
+ margin: 0 !important;
1033
+ }
1034
+
1035
+ .gradio-container > .main,
1036
+ .gradio-container .main,
1037
+ .main {
1038
+ max-width: 100% !important;
1039
+ width: 100% !important;
1040
+ padding-left: 0 !important;
1041
+ padding-right: 0 !important;
1042
+ margin: 0 auto !important;
1043
+ }
1044
+
1045
+ .contain {
1046
+ max-width: 100% !important;
1047
+ padding: 0 !important;
1048
+ }
1049
+
1050
+ /* Force full width on all Gradio internal containers */
1051
+ .gradio-container > div,
1052
+ .gradio-container > div > div,
1053
+ #component-0,
1054
+ .wrap,
1055
+ .app,
1056
+ .contain,
1057
+ footer,
1058
+ .gradio-row,
1059
+ .gradio-column,
1060
+ .svelte-1gfkn6j,
1061
+ [class*="svelte-"] {
1062
+ max-width: 100% !important;
1063
+ }
1064
+
1065
+ .gradio-row {
1066
+ max-width: 100% !important;
1067
+ width: 100% !important;
1068
+ margin: 0 !important;
1069
+ padding: 0 !important;
1070
+ }
1071
+
1072
+ /* ===== HEADER (FLOATING, NO CARD) ===== */
1073
+ .header-main {
1074
+ display: flex;
1075
+ justify-content: space-between;
1076
+ align-items: center;
1077
+ margin-bottom: 0;
1078
+ padding: 0;
1079
+ }
1080
+
1081
+ .header-left {
1082
+ display: flex;
1083
+ align-items: center;
1084
+ gap: 16px;
1085
+ }
1086
+
1087
+ .header-icon {
1088
+ font-size: 48px;
1089
+ filter: drop-shadow(0 4px 12px rgba(99, 102, 241, 0.6));
1090
+ }
1091
+
1092
+ .header-title {
1093
+ font-size: 42px;
1094
+ font-weight: 900;
1095
+ color: #e0e7ff;
1096
+ margin: 0;
1097
+ letter-spacing: -0.5px;
1098
+ }
1099
+
1100
+ .header-subtitle {
1101
+ color: #c7d2fe;
1102
+ font-size: 20px;
1103
+ font-weight: 700;
1104
+ margin-left: 6px;
1105
+ }
1106
+
1107
+ /* ===== CARD STYLES ===== */
1108
+ .card {
1109
+ background: rgba(15, 15, 35, 0.8);
1110
+ backdrop-filter: blur(20px);
1111
+ border: 1px solid rgba(124, 58, 237, 0.3);
1112
+ border-radius: 20px;
1113
+ padding: 30px;
1114
+ box-shadow: 0 8px 32px rgba(0, 0, 0, 0.4);
1115
+ transition: all 0.3s ease;
1116
+ height: 100%;
1117
+ display: flex;
1118
+ flex-direction: column;
1119
+ }
1120
+
1121
+ .card:hover {
1122
+ border-color: rgba(124, 58, 237, 0.5);
1123
+ box-shadow: 0 12px 48px rgba(124, 58, 237, 0.3);
1124
+ }
1125
+
1126
+ /* Ensure columns in top row have equal height */
1127
+ .gradio-row:first-of-type .gradio-column {
1128
+ display: flex !important;
1129
+ flex-direction: column !important;
1130
+ }
1131
+
1132
+ .gradio-row:first-of-type .gradio-column > div {
1133
+ flex: 1 !important;
1134
+ display: flex !important;
1135
+ flex-direction: column !important;
1136
+ }
1137
+
1138
+ /* Set minimum height for top row cards */
1139
+ .gradio-row:first-of-type .card {
1140
+ min-height: 550px;
1141
+ }
1142
+
1143
+ .card-title {
1144
+ font-size: 16px;
1145
+ font-weight: 700;
1146
+ color: #a5b4fc;
1147
+ text-transform: uppercase;
1148
+ letter-spacing: 1px;
1149
+ margin-bottom: 20px;
1150
+ display: flex;
1151
+ align-items: center;
1152
+ }
1153
+
1154
+ /* ===== ROW SPACING ===== */
1155
+ .gradio-row {
1156
+ gap: 24px !important;
1157
+ }
1158
+
1159
+ /* ===== QUICK START - CODE BLOCK (TERMINAL/IDE STYLE) ===== */
1160
+ .terminal-window {
1161
+ background: #1a1b26;
1162
+ border: 1px solid rgba(124, 58, 237, 0.3);
1163
+ border-radius: 12px;
1164
+ overflow: hidden;
1165
+ margin-bottom: 16px;
1166
+ box-shadow: 0 8px 32px rgba(0, 0, 0, 0.6);
1167
+ }
1168
+
1169
+ .terminal-header {
1170
+ background: #16161e;
1171
+ padding: 12px 16px;
1172
+ display: flex;
1173
+ align-items: center;
1174
+ justify-content: space-between;
1175
+ border-bottom: 1px solid rgba(124, 58, 237, 0.2);
1176
+ }
1177
+
1178
+ .terminal-dots {
1179
+ display: flex;
1180
+ gap: 8px;
1181
+ }
1182
+
1183
+ .terminal-dot {
1184
+ width: 12px;
1185
+ height: 12px;
1186
+ border-radius: 50%;
1187
+ }
1188
+
1189
+ .terminal-dot.red {
1190
+ background: #ff5f56 !important;
1191
+ box-shadow: 0 0 8px rgba(255, 95, 86, 0.8) !important;
1192
+ }
1193
+
1194
+ .terminal-dot.yellow {
1195
+ background: #ffbd2e !important;
1196
+ box-shadow: 0 0 8px rgba(255, 189, 46, 0.8) !important;
1197
+ }
1198
+
1199
+ .terminal-dot.green {
1200
+ background: #27c93f !important;
1201
+ box-shadow: 0 0 8px rgba(39, 201, 63, 0.8) !important;
1202
+ }
1203
+
1204
+ .terminal-title {
1205
+ font-size: 12px;
1206
+ color: #6b7280;
1207
+ font-family: 'SF Mono', 'Monaco', 'Consolas', monospace;
1208
+ font-weight: 500;
1209
+ }
1210
+
1211
+ .terminal-body {
1212
+ background: #1a1b26;
1213
+ padding: 0;
1214
+ display: flex;
1215
+ }
1216
+
1217
+ .line-numbers {
1218
+ background: #16161e;
1219
+ padding: 16px 12px;
1220
+ border-right: 1px solid rgba(124, 58, 237, 0.15);
1221
+ user-select: none;
1222
+ text-align: right;
1223
+ min-width: 48px;
1224
+ }
1225
+
1226
+ .line-num {
1227
+ display: block;
1228
+ color: #4a5568;
1229
+ font-family: 'SF Mono', 'Monaco', 'Consolas', monospace;
1230
+ font-size: 14px;
1231
+ line-height: 1.8;
1232
+ }
1233
+
1234
+ .code-content {
1235
+ flex: 1;
1236
+ padding: 16px 20px;
1237
+ overflow-x: auto;
1238
+ }
1239
+
1240
+ .code-line {
1241
+ display: block;
1242
+ white-space: pre;
1243
+ font-family: 'SF Mono', 'Monaco', 'Consolas', monospace;
1244
+ font-size: 14px;
1245
+ line-height: 1.8;
1246
+ color: #a9b1d6;
1247
+ }
1248
+
1249
+ .json-key {
1250
+ color: #7dcfff;
1251
+ font-weight: 500;
1252
+ }
1253
+
1254
+ .json-string {
1255
+ color: #9ece6a;
1256
+ }
1257
+
1258
+ .json-bracket {
1259
+ color: #bb9af7;
1260
+ font-weight: 600;
1261
+ }
1262
+
1263
+ .json-colon {
1264
+ color: #c0caf5;
1265
+ }
1266
+
1267
+ .json-comma {
1268
+ color: #c0caf5;
1269
+ }
1270
+
1271
+ .copy-button {
1272
+ width: 100%;
1273
+ background: linear-gradient(135deg, #7c3aed, #6366f1) !important;
1274
+ border: none !important;
1275
+ border-radius: 12px !important;
1276
+ padding: 14px 24px !important;
1277
+ font-weight: 700 !important;
1278
+ font-size: 13px !important;
1279
+ color: white !important;
1280
+ text-transform: uppercase;
1281
+ letter-spacing: 1px;
1282
+ cursor: pointer;
1283
+ box-shadow: 0 4px 16px rgba(124, 58, 237, 0.4) !important;
1284
+ transition: all 0.3s ease !important;
1285
+ display: flex;
1286
+ align-items: center;
1287
+ justify-content: center;
1288
+ gap: 8px;
1289
+ }
1290
+
1291
+ .copy-button:hover {
1292
+ transform: translateY(-2px) !important;
1293
+ box-shadow: 0 6px 24px rgba(124, 58, 237, 0.6) !important;
1294
+ }
1295
+
1296
+ /* ===== TOOLS TABLE ===== */
1297
+ .tools-table {
1298
+ width: 100%;
1299
+ border-collapse: separate;
1300
+ border-spacing: 0;
1301
+ background: rgba(10, 10, 26, 0.6);
1302
+ border-radius: 12px;
1303
+ overflow: hidden;
1304
+ border: 1px solid rgba(124, 58, 237, 0.3);
1305
+ margin-bottom: 0;
1306
+ flex: 1;
1307
+ }
1308
+
1309
+ .tools-table th {
1310
+ background: rgba(124, 58, 237, 0.2);
1311
+ color: #a5b4fc;
1312
+ font-weight: 700;
1313
+ font-size: 16px;
1314
+ text-transform: uppercase;
1315
+ letter-spacing: 1.5px;
1316
+ padding: 20px 14px;
1317
+ text-align: left;
1318
+ border-bottom: 1px solid rgba(124, 58, 237, 0.3);
1319
+ }
1320
+
1321
+ .tools-table td {
1322
+ padding: 20px 14px;
1323
+ color: #cbd5e1;
1324
+ font-size: 16px;
1325
+ border-bottom: 1px solid rgba(99, 102, 241, 0.1);
1326
+ }
1327
+
1328
+ .tools-table tr:last-child td {
1329
+ border-bottom: none;
1330
+ }
1331
+
1332
+ .tools-table tr:hover {
1333
+ background: rgba(124, 58, 237, 0.08);
1334
+ }
1335
+
1336
+ .tool-name {
1337
+ color: #22d3ee;
1338
+ font-family: 'SF Mono', 'Monaco', 'Consolas', monospace;
1339
+ font-weight: 600;
1340
+ font-size: 13px;
1341
+ vertical-align: middle;
1342
+ }
1343
+
1344
+ /* ===== COMPOSITE SECTION ===== */
1345
+ .composite-section {
1346
+ background: rgba(10, 10, 26, 0.8);
1347
+ border: 1px solid rgba(124, 58, 237, 0.3);
1348
+ border-radius: 12px;
1349
+ padding: 20px;
1350
+ }
1351
+
1352
+ .composite-header {
1353
+ font-size: 11px;
1354
+ font-weight: 700;
1355
+ color: #a5b4fc;
1356
+ text-transform: uppercase;
1357
+ letter-spacing: 1.5px;
1358
+ margin-bottom: 12px;
1359
+ }
1360
+
1361
+ .composite-content {
1362
+ color: #cbd5e1;
1363
+ font-size: 12px;
1364
+ line-height: 1.6;
1365
+ margin-bottom: 16px;
1366
+ }
1367
+
1368
+ .try-demo-button {
1369
+ width: 100%;
1370
+ background: transparent !important;
1371
+ border: 2px solid #7c3aed !important;
1372
+ border-radius: 12px !important;
1373
+ padding: 12px 24px !important;
1374
+ font-weight: 700 !important;
1375
+ font-size: 12px !important;
1376
+ color: #7c3aed !important;
1377
+ text-transform: uppercase;
1378
+ letter-spacing: 1px;
1379
+ cursor: pointer;
1380
+ transition: all 0.3s ease !important;
1381
+ }
1382
+
1383
+ .try-demo-button:hover {
1384
+ background: rgba(124, 58, 237, 0.1) !important;
1385
+ border-color: #7c3aed !important;
1386
+ color: #8b5cf6 !important;
1387
+ }
1388
+
1389
+ /* ===== BUTTONS ===== */
1390
+ button[variant="primary"] {
1391
+ background: linear-gradient(135deg, #7c3aed, #6366f1) !important;
1392
+ border: none !important;
1393
+ border-radius: 12px !important;
1394
+ padding: 14px 32px !important;
1395
+ font-weight: 700 !important;
1396
+ font-size: 14px !important;
1397
+ color: white !important;
1398
+ box-shadow: 0 4px 20px rgba(124, 58, 237, 0.4) !important;
1399
+ transition: all 0.3s ease !important;
1400
+ }
1401
+
1402
+ button[variant="primary"]:hover {
1403
+ transform: translateY(-2px) !important;
1404
+ box-shadow: 0 8px 32px rgba(124, 58, 237, 0.6) !important;
1405
+ }
1406
+
1407
+ /* ===== AUDIO COMPONENT ===== */
1408
+ .gradio-audio {
1409
+ background: rgba(30, 27, 75, 0.6) !important;
1410
+ border: 1px solid rgba(124, 58, 237, 0.3) !important;
1411
+ border-radius: 12px !important;
1412
+ }
1413
+
1414
+ /* ===== TEXTBOX ===== */
1415
+ textarea {
1416
+ background: rgba(30, 27, 75, 0.6) !important;
1417
+ border: 1px solid rgba(124, 58, 237, 0.3) !important;
1418
+ border-radius: 12px !important;
1419
+ color: #e0e7ff !important;
1420
+ font-size: 13px !important;
1421
+ }
1422
+
1423
+ /* ===== DROPDOWN ===== */
1424
+ select {
1425
+ background: rgba(30, 27, 75, 0.6) !important;
1426
+ border: 1px solid rgba(124, 58, 237, 0.3) !important;
1427
+ border-radius: 12px !important;
1428
+ color: #e0e7ff !important;
1429
+ }
1430
+
1431
+ /* ===== LABELS ===== */
1432
+ label {
1433
+ color: #a5b4fc !important;
1434
+ font-weight: 600 !important;
1435
+ font-size: 12px !important;
1436
+ text-transform: uppercase;
1437
+ letter-spacing: 0.5px;
1438
+ }
1439
+
1440
+ /* ===== HTML OUTPUT ===== */
1441
+ .gradio-html {
1442
+ background: transparent !important;
1443
+ border: none !important;
1444
+ }
1445
+
1446
+ /* ===== DEMO ROW LAYOUT ===== */
1447
+ .demo-row {
1448
+ display: flex !important;
1449
+ gap: 24px !important;
1450
+ align-items: stretch !important;
1451
+ }
1452
+
1453
+ /* Only apply card style to the outer column (demo-card-column) */
1454
+ .demo-card-column {
1455
+ display: flex !important;
1456
+ flex-direction: column !important;
1457
+ height: 700px !important;
1458
+ min-height: 700px !important;
1459
+ max-height: 700px !important;
1460
+ background: rgba(15, 15, 35, 0.8) !important;
1461
+ backdrop-filter: blur(20px) !important;
1462
+ border: 1px solid rgba(124, 58, 237, 0.3) !important;
1463
+ border-radius: 20px !important;
1464
+ padding: 4px 4px 2px 4px !important;
1465
+ box-shadow: 0 8px 32px rgba(0, 0, 0, 0.4) !important;
1466
+ transition: all 0.3s ease !important;
1467
+ gap: 2px !important;
1468
+ overflow-y: auto !important;
1469
+ }
1470
+
1471
+ .demo-card-column:hover {
1472
+ border-color: rgba(124, 58, 237, 0.5) !important;
1473
+ box-shadow: 0 12px 48px rgba(124, 58, 237, 0.3) !important;
1474
+ }
1475
+
1476
+ /* Remove any border/background from inner elements */
1477
+ .demo-card-column > div,
1478
+ .demo-card-column > div > div,
1479
+ .demo-row > div > div {
1480
+ background: transparent !important;
1481
+ border: none !important;
1482
+ box-shadow: none !important;
1483
+ padding: 0 !important;
1484
+ border-radius: 0 !important;
1485
+ }
1486
+
1487
+ /* Remove card background from inner HTML - we use column background instead */
1488
+ .demo-row .card {
1489
+ background: transparent !important;
1490
+ backdrop-filter: none !important;
1491
+ border: none !important;
1492
+ border-radius: 0 !important;
1493
+ padding: 0 !important;
1494
+ box-shadow: none !important;
1495
+ margin-bottom: 12px !important;
1496
+ }
1497
+
1498
+ .demo-row .card:hover {
1499
+ border: none !important;
1500
+ box-shadow: none !important;
1501
+ }
1502
+
1503
+ /* Ensure all inner components have transparent background */
1504
+ .demo-row .gradio-audio,
1505
+ .demo-row .gradio-dropdown,
1506
+ .demo-row .gradio-textbox,
1507
+ .demo-row .gradio-button {
1508
+ background: transparent !important;
1509
+ }
1510
+
1511
+ /* Keep every child except the last at its natural height (no flex grow or shrink) */
1512
+ .demo-card-column > div:not(:last-child) {
1513
+ flex: 0 0 auto !important;
1514
+ }
1515
+
1516
+ /* Adjust spacing for input elements in demo cards */
1517
+ .demo-row .gradio-audio {
1518
+ margin-top: 6px !important;
1519
+ margin-bottom: 0px !important;
1520
+ max-height: 50px !important;
1521
+ min-height: 40px !important;
1522
+ height: 45px !important;
1523
+ }
1524
+
1525
+ /* Target all child elements of audio component */
1526
+ .demo-row .gradio-audio > div,
1527
+ .demo-row .gradio-audio .wrap,
1528
+ .demo-row .gradio-audio .upload-container,
1529
+ .demo-row .gradio-audio .record-container,
1530
+ .demo-row .gradio-audio * {
1531
+ max-height: 50px !important;
1532
+ }
1533
+
1534
+ /* Audio player specific height reduction */
1535
+ .demo-row .gradio-audio audio {
1536
+ height: 26px !important;
1537
+ max-height: 26px !important;
1538
+ min-height: 26px !important;
1539
+ }
1540
+
1541
+ /* Upload/record button container height */
1542
+ .demo-row .gradio-audio .upload-container,
1543
+ .demo-row .gradio-audio .record-container {
1544
+ min-height: 38px !important;
1545
+ max-height: 38px !important;
1546
+ padding: 4px !important;
1547
+ }
1548
+
1549
+ /* Audio component buttons */
1550
+ .demo-row .gradio-audio button {
1551
+ height: 28px !important;
1552
+ min-height: 28px !important;
1553
+ max-height: 28px !important;
1554
+ padding: 4px 10px !important;
1555
+ font-size: 10px !important;
1556
+ }
1557
+
1558
+ /* Hide text nodes in audio upload area - keep icons */
1559
+ .demo-row .gradio-audio .upload-text {
1560
+ display: none !important;
1561
+ }
1562
+
1563
+ .demo-row .gradio-audio .placeholder {
1564
+ display: none !important;
1565
+ }
1566
+
1567
+ .demo-row .gradio-audio span:not(:has(svg)) {
1568
+ font-size: 0 !important;
1569
+ }
1570
+
1571
+ .demo-row .gradio-audio p {
1572
+ display: none !important;
1573
+ }
1574
+
1575
+ /* Hide "Drop Audio Here", "- or -", "Click to Upload" text */
1576
+ .demo-row .gradio-audio .upload-container span,
1577
+ .demo-row .gradio-audio .upload-container p {
1578
+ font-size: 0 !important;
1579
+ line-height: 0 !important;
1580
+ }
1581
+
1582
+ /* Keep SVG icons visible */
1583
+ .demo-row .gradio-audio svg {
1584
+ font-size: initial !important;
1585
+ }
1586
+
1587
+ /* ADDITIONAL METHODS: Hide all text in audio upload area */
1588
+ .demo-row .gradio-audio label {
1589
+ font-size: 0 !important;
1590
+ }
1591
+
1592
+ .demo-row .gradio-audio label span:not(:has(svg)) {
1593
+ display: none !important;
1594
+ }
1595
+
1596
+ .demo-row .gradio-audio .file-preview {
1597
+ font-size: 0 !important;
1598
+ }
1599
+
1600
+ .demo-row .gradio-audio .file-preview span {
1601
+ font-size: 0 !important;
1602
+ display: none !important;
1603
+ }
1604
+
1605
+ .demo-row .gradio-audio [data-testid="upload-text"],
1606
+ .demo-row .gradio-audio [data-testid="file-preview-text"],
1607
+ .demo-row .gradio-audio .upload-text,
1608
+ .demo-row .gradio-audio .file-preview-text {
1609
+ display: none !important;
1610
+ visibility: hidden !important;
1611
+ font-size: 0 !important;
1612
+ }
1613
+
1614
+ /* Target all text nodes (more aggressive) */
1615
+ .demo-row .gradio-audio *:not(svg):not(path):not(circle):not(rect):not(line) {
1616
+ color: transparent !important;
1617
+ }
1618
+
1619
+ .demo-row .gradio-audio button {
1620
+ color: white !important;
1621
+ }
1622
+
1623
+ /* Ensure icons remain visible */
1624
+ .demo-row .gradio-audio svg,
1625
+ .demo-row .gradio-audio svg * {
1626
+ color: initial !important;
1627
+ fill: currentColor !important;
1628
+ stroke: currentColor !important;
1629
+ }
1630
+
1631
+ /* NUCLEAR OPTION: Hide everything in label, then show only necessary elements */
1632
+ .demo-row .gradio-audio label > div > div {
1633
+ display: none !important;
1634
+ }
1635
+
1636
+ .demo-row .gradio-audio label::before {
1637
+ content: '' !important;
1638
+ }
1639
+
1640
+ .demo-row .gradio-audio label * {
1641
+ visibility: hidden !important;
1642
+ }
1643
+
1644
+ .demo-row .gradio-audio label svg {
1645
+ visibility: visible !important;
1646
+ }
1647
+
1648
+ .demo-row .gradio-audio label button {
1649
+ visibility: visible !important;
1650
+ }
1651
+
1652
+ .demo-row .gradio-audio label audio {
1653
+ visibility: visible !important;
1654
+ }
1655
+
1656
+ /* Force hide any text content */
1657
+ .demo-row .gradio-audio label > div::after,
1658
+ .demo-row .gradio-audio label > div::before {
1659
+ content: '' !important;
1660
+ display: none !important;
1661
+ }
1662
+
1663
+ /* Additional override for upload text elements */
1664
+ .demo-row .gradio-audio [class*="upload"],
1665
+ .demo-row .gradio-audio [class*="placeholder"],
1666
+ .demo-row .gradio-audio [class*="text"] {
1667
+ font-size: 0 !important;
1668
+ line-height: 0 !important;
1669
+ width: 0 !important;
1670
+ height: 0 !important;
1671
+ opacity: 0 !important;
1672
+ visibility: hidden !important;
1673
+ position: absolute !important;
1674
+ left: -9999px !important;
1675
+ }
1676
+
1677
+ /* NUCLEAR OPTION 2: Complete removal of label content */
1678
+ .demo-row .gradio-audio label.block {
1679
+ display: none !important;
1680
+ }
1681
+
1682
+ .demo-row .gradio-audio .file-upload {
1683
+ display: none !important;
1684
+ }
1685
+
1686
+ /* Hide direct span children of the label that contain no button, audio, or svg */
1687
+ .demo-row .gradio-audio label > span:not(:has(button)):not(:has(audio)):not(:has(svg)) {
1688
+ display: none !important;
1689
+ }
1690
+
1691
+ /* Gradio 6.0 specific selectors - upload area */
1692
+ .demo-row .gradio-audio [data-testid="upload-button"],
1693
+ .demo-row .gradio-audio [data-testid="file-upload"],
1694
+ .demo-row .gradio-audio .upload-area {
1695
+ display: none !important;
1696
+ }
1697
+
1698
+ /* Hide paragraphs and .text spans/divs inside the audio component label */
1699
+ .demo-row .gradio-audio label p,
1700
+ .demo-row .gradio-audio label span.text,
1701
+ .demo-row .gradio-audio label div.text {
1702
+ display: none !important;
1703
+ }
1704
+
1705
+ /* More aggressive text hiding - blank out all ::before/::after generated content */
1706
+ .demo-row .gradio-audio *::before,
1707
+ .demo-row .gradio-audio *::after {
1708
+ content: '' !important;
1709
+ display: none !important;
1710
+ }
1711
+
1712
+ /* Make sure only buttons and audio players are visible */
1713
+ .demo-row .gradio-audio > label > div > div:not(:has(button)):not(:has(audio)) {
1714
+ display: none !important;
1715
+ }
1716
+
1717
+ /* Gradio Blocks specific - Hide wrapper divs that contain text */
1718
+ .demo-row .gradio-audio .wrap > div:not(:has(button)):not(:has(audio)):not(:has(svg)) {
1719
+ display: none !important;
1720
+ }
1721
+
1722
+ /* Override for Gradio 6.x structure */
1723
+ .demo-row .gradio-audio [class*="svelte-"] span:not(:has(svg)):not(:has(button)) {
1724
+ display: none !important;
1725
+ }
1726
+
1727
+ .demo-row .gradio-dropdown,
1728
+ .demo-row .gradio-textbox {
1729
+ margin-bottom: 2px !important;
1730
+ }
1731
+
1732
+ .demo-row .gradio-row {
1733
+ margin-bottom: 2px !important;
1734
+ }
1735
+
1736
+ /* IMPORTANT: Button alignment - push buttons to bottom with margin-top: auto */
1737
+ .demo-row .gradio-button {
1738
+ margin-top: auto !important;
1739
+ margin-bottom: 0px !important;
1740
+ flex-shrink: 0 !important;
1741
+ }
1742
+
1743
+ /* Output area should not push button down - set flex: 1 */
1744
+ .demo-row .gradio-html {
1745
+ flex: 1 !important;
1746
+ margin-bottom: 0 !important;
1747
+ display: flex !important;
1748
+ flex-direction: column !important;
1749
+ max-height: 200px !important;
1750
+ overflow-y: auto !important;
1751
+ }
1752
+
1753
+ /* Output audio component (clean_audio_output) height limit */
1754
+ .demo-row .gradio-audio[data-testid="audio-output"],
1755
+ .demo-row > div:last-child .gradio-audio {
1756
+ max-height: 120px !important;
1757
+ min-height: 60px !important;
1758
+ height: auto !important;
1759
+ margin-bottom: 0px !important;
1760
+ }
1761
+
1762
+
1763
+ /* ===== CUSTOM ACTION BUTTONS (DEMO CARDS) ===== */
1764
+ .custom-action-btn,
1765
+ .custom-action-btn button,
1766
+ .custom-action-btn button[data-testid="button"],
1767
+ button.custom-action-btn,
1768
+ .demo-row .custom-action-btn,
1769
+ .demo-row .custom-action-btn button {
1770
+ width: 100% !important;
1771
+ min-width: 100% !important;
1772
+ max-width: 100% !important;
1773
+ background: linear-gradient(135deg, #6366f1, #7c3aed) !important;
1774
+ border: none !important;
1775
+ border-radius: 12px !important;
1776
+ padding: 8px 16px !important;
1777
+ height: 38px !important;
1778
+ min-height: 38px !important;
1779
+ max-height: 38px !important;
1780
+ font-weight: 700 !important;
1781
+ font-size: 16px !important;
1782
+ letter-spacing: 1.5px !important;
1783
+ text-transform: uppercase !important;
1784
+ color: white !important;
1785
+ box-shadow: 0 4px 20px rgba(124, 58, 237, 0.4) !important;
1786
+ transition: all 0.3s ease !important;
1787
+ }
1788
+
1789
+ .custom-action-btn:hover,
1790
+ .custom-action-btn button:hover,
1791
+ .custom-action-btn button[data-testid="button"]:hover,
1792
+ button.custom-action-btn:hover,
1793
+ .demo-row .custom-action-btn:hover,
1794
+ .demo-row .custom-action-btn button:hover {
1795
+ transform: translateY(-2px) !important;
1796
+ box-shadow: 0 8px 32px rgba(124, 58, 237, 0.6) !important;
1797
+ background: linear-gradient(135deg, #6366f1, #7c3aed) !important;
1798
+ }
1799
+
1800
+ /* ===== DECORATIVE ELEMENTS ===== */
1801
+ .diamond-decoration {
1802
+ position: fixed;
1803
+ bottom: 40px;
1804
+ right: 40px;
1805
+ width: 80px;
1806
+ height: 80px;
1807
+ border: 2px solid rgba(124, 58, 237, 0.2);
1808
+ transform: rotate(45deg);
1809
+ pointer-events: none;
1810
+ z-index: 1;
1811
+ }
1812
+
1813
+ .star-decoration {
1814
+ display: none;
1815
+ }
1816
+ """
1817
+
1818
+ with gr.Blocks() as demo:
1819
+ # Inject custom CSS and decorative elements (position: fixed, so they take no layout space)
1820
+ gr.HTML(f"""
1821
+ <style>{custom_css}</style>
1822
+ <div class="diamond-decoration"></div>
1823
+ <div class="star-decoration">
1824
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none">
1825
+ <path d="M12 2l3.09 6.26L22 9.27l-5 4.87 1.18 6.88L12 17.77l-6.18 3.25L7 14.14 2 9.27l6.91-1.01L12 2z" fill="#a5b4fc" opacity="0.4"/>
1826
+ </svg>
1827
+ </div>
1828
+ <script>
1829
+ // JavaScript to completely remove upload text from Audio components in demo-row
1830
+ function removeAudioUploadText() {{
1831
+ // Find all audio components in demo-row
1832
+ const demoRows = document.querySelectorAll('.demo-row');
1833
+ demoRows.forEach(row => {{
1834
+ const audioComponents = row.querySelectorAll('.gradio-audio');
1835
+ audioComponents.forEach(audio => {{
1836
+ // METHOD 1: Remove ALL text nodes (most aggressive)
1837
+ const walker = document.createTreeWalker(
1838
+ audio,
1839
+ NodeFilter.SHOW_TEXT,
1840
+ null,
1841
+ false
1842
+ );
1843
+
1844
+ const textNodesToRemove = [];
1845
+ while(walker.nextNode()) {{
1846
+ const node = walker.currentNode;
1847
+ // Only keep text whose direct parent is a button or audio element
1848
+ const parentTag = node.parentElement?.tagName?.toLowerCase();
1849
+ if (parentTag !== 'button' && parentTag !== 'audio') {{
1850
+ textNodesToRemove.push(node);
1851
+ }}
1852
+ }}
1853
+
1854
+ textNodesToRemove.forEach(node => {{
1855
+ if (node.parentNode) {{
1856
+ node.parentNode.removeChild(node);
1857
+ }}
1858
+ }});
1859
+
1860
+ // METHOD 2: Hide elements by class/data attributes
1861
+ const elementsToHide = audio.querySelectorAll(
1862
+ '[class*="upload"], [class*="placeholder"], [class*="text"], ' +
1863
+ '[data-testid*="upload"], [data-testid*="file"], ' +
1864
+ 'label.block, .file-upload, p, span:not(:has(button)):not(:has(svg))'
1865
+ );
1866
+ elementsToHide.forEach(el => {{
1867
+ el.style.display = 'none';
1868
+ el.style.visibility = 'hidden';
1869
+ el.style.fontSize = '0';
1870
+ el.style.lineHeight = '0';
1871
+ el.style.width = '0';
1872
+ el.style.height = '0';
1873
+ el.style.opacity = '0';
1874
+ el.style.position = 'absolute';
1875
+ el.style.left = '-9999px';
1876
+ }});
1877
+
1878
+ // METHOD 3: Remove label.block entirely if it exists
1879
+ const labelBlocks = audio.querySelectorAll('label.block');
1880
+ labelBlocks.forEach(label => {{
1881
+ // Only remove if it doesn't contain button or audio
1882
+ if (!label.querySelector('button') && !label.querySelector('audio')) {{
1883
+ label.remove();
1884
+ }}
1885
+ }});
1886
+
1887
+ // METHOD 4: Clear innerHTML of divs that don't contain buttons, audio, or SVG
1888
+ const allDivs = audio.querySelectorAll('div');
1889
+ allDivs.forEach(div => {{
1890
+ if (!div.querySelector('button') && !div.querySelector('audio') && !div.querySelector('svg')) {{
1891
+ // Check if div only contains text
1892
+ const hasOnlyText = Array.from(div.childNodes).every(node =>
1893
+ node.nodeType === Node.TEXT_NODE ||
1894
+ (node.nodeType === Node.ELEMENT_NODE && !node.querySelector('button, audio, svg'))
1895
+ );
1896
+ if (hasOnlyText) {{
1897
+ div.innerHTML = '';
1898
+ }}
1899
+ }}
1900
+ }});
1901
+ }});
1902
+ }});
1903
+ }}
1904
+
1905
+ // Run immediately
1906
+ removeAudioUploadText();
1907
+
1908
+ // Run after DOM changes (MutationObserver)
1909
+ const observer = new MutationObserver(() => {{
1910
+ removeAudioUploadText();
1911
+ }});
1912
+
1913
+ // Start observing after a short delay to ensure Gradio has loaded
1914
+ setTimeout(() => {{
1915
+ observer.observe(document.body, {{
1916
+ childList: true,
1917
+ subtree: true
1918
+ }});
1919
+ }}, 500);
1920
+
1921
+ // Also run on window load
1922
+ window.addEventListener('load', removeAudioUploadText);
1923
+
1924
+ // Run periodically for the first 5 seconds (catch late renders)
1925
+ let attempts = 0;
1926
+ const interval = setInterval(() => {{
1927
+ removeAudioUploadText();
1928
+ attempts++;
1929
+ if (attempts >= 10) {{
1930
+ clearInterval(interval);
1931
+ }}
1932
+ }}, 500);
1933
+ </script>
1934
+ """)
1935
+
1936
+ # ==================== HEADER (FLOATING) ====================
1937
+ gr.HTML("""
1938
+ <div class="header-main">
1939
+ <div class="header-left">
1940
+ <span class="header-icon">
1941
+ <svg width="72" height="72" viewBox="0 0 52 52" fill="none">
1942
+ <defs>
1943
+ <linearGradient id="logoGradHeader" x1="0%" y1="0%" x2="100%" y2="100%">
1944
+ <stop offset="0%" style="stop-color:#7c3aed"/>
1945
+ <stop offset="100%" style="stop-color:#6366f1"/>
1946
+ </linearGradient>
1947
+ </defs>
1948
+ <!-- Left: Microphone (rounded capsule + stand) -->
1949
+ <!-- Microphone capsule (rounded rect) -->
1950
+ <rect x="8" y="12" width="9" height="14" rx="4.5" fill="url(#logoGradHeader)"/>
1951
+ <!-- Microphone grill lines (horizontal detail) -->
1952
+ <line x1="9" y1="16" x2="14" y2="16" stroke="#000000" stroke-width="0.8" stroke-linecap="round"/>
1953
+ <line x1="9" y1="19.5" x2="14" y2="19.5" stroke="#000000" stroke-width="0.8" stroke-linecap="round"/>
1954
+ <line x1="9" y1="23" x2="14" y2="23" stroke="#000000" stroke-width="0.8" stroke-linecap="round"/>
1955
+ <!-- Arc stand -->
1956
+ <path d="M6.5 26c0 2.5 2.2 5 6 5s6-2.5 6-5" stroke="url(#logoGradHeader)" stroke-width="2" fill="none" stroke-linecap="round"/>
1957
+ <!-- Pole -->
1958
+ <rect x="11.5" y="31" width="2" height="5" fill="url(#logoGradHeader)"/>
1959
+ <!-- Base -->
1960
+ <rect x="7.5" y="36" width="9" height="2" rx="1" fill="url(#logoGradHeader)"/>
1961
+
1962
+ <!-- Right: Audio Wave Bars (4 vertical bars with different heights) -->
1963
+ <rect x="28" y="18" width="3" height="16" rx="1.5" fill="url(#logoGradHeader)" opacity="0.9"/>
1964
+ <rect x="34" y="14" width="3" height="24" rx="1.5" fill="url(#logoGradHeader)" opacity="0.95"/>
1965
+ <rect x="40" y="20" width="3" height="12" rx="1.5" fill="url(#logoGradHeader)" opacity="0.85"/>
1966
+ <rect x="46" y="22" width="3" height="8" rx="1.5" fill="url(#logoGradHeader)" opacity="0.8"/>
1967
+ </svg>
1968
+ </span>
1969
+ <div>
1970
+ <span class="header-title">VoiceKit</span>
1971
+ <span class="header-subtitle">MCP Server</span>
1972
+ </div>
1973
+ </div>
1974
+ </div>
1975
+ """)
1976
+
1977
+ # ==================== TOP ROW: QUICK START + AVAILABLE TOOLS ====================
1978
+ with gr.Row(equal_height=True):
1979
+ # QUICK START CARD
1980
+ with gr.Column(scale=1):
1981
+ gr.HTML("""
1982
+ <div class="card" style="min-height: 550px;">
1983
+ <div class="card-title">
1984
+ <svg width="18" height="18" viewBox="0 0 24 24" fill="#7c3aed" style="margin-right: 8px;">
1985
+ <path d="M19.14 12.94c.04-.31.06-.63.06-.94 0-.31-.02-.63-.06-.94l2.03-1.58c.18-.14.23-.41.12-.61l-1.92-3.32c-.12-.22-.37-.29-.59-.22l-2.39.96c-.5-.38-1.03-.7-1.62-.94l-.36-2.54c-.04-.24-.24-.41-.48-.41h-3.84c-.24 0-.43.17-.47.41l-.36 2.54c-.59.24-1.13.57-1.62.94l-2.39-.96c-.22-.08-.47 0-.59.22L2.74 8.87c-.12.21-.08.47.12.61l2.03 1.58c-.04.31-.06.63-.06.94s.02.63.06.94l-2.03 1.58c-.18.14-.23.41-.12.61l1.92 3.32c.12.22.37.29.59.22l2.39-.96c.5.38 1.03.7 1.62.94l.36 2.54c.05.24.24.41.48.41h3.84c.24 0 .44-.17.47-.41l.36-2.54c.59-.24 1.13-.56 1.62-.94l2.39.96c.22.08.47 0 .59-.22l1.92-3.32c.12-.22.07-.47-.12-.61l-2.01-1.58zM12 15.6c-1.98 0-3.6-1.62-3.6-3.6s1.62-3.6 3.6-3.6 3.6 1.62 3.6 3.6-1.62 3.6-3.6 3.6z"/>
1986
+ </svg>
1987
+ QUICK START
1988
+ </div>
1989
+
1990
+ <div class="terminal-window">
1991
+ <!-- Terminal Header with Dots and Filename -->
1992
+ <div class="terminal-header">
1993
+ <div class="terminal-dots">
1994
+ <div class="terminal-dot red"></div>
1995
+ <div class="terminal-dot yellow"></div>
1996
+ <div class="terminal-dot green"></div>
1997
+ </div>
1998
+ <div class="terminal-title">claude_desktop_config.json</div>
1999
+ <div style="width: 60px;"></div> <!-- Spacer for center alignment -->
2000
+ </div>
2001
+
2002
+ <!-- Terminal Body with Line Numbers and Code -->
2003
+ <div class="terminal-body">
2004
+ <div class="line-numbers">
2005
+ <div class="line-num">1</div>
2006
+ <div class="line-num">2</div>
2007
+ <div class="line-num">3</div>
2008
+ <div class="line-num">4</div>
2009
+ <div class="line-num">5</div>
2010
+ <div class="line-num">6</div>
2011
+ <div class="line-num">7</div>
2012
+ <div class="line-num">8</div>
2013
+ <div class="line-num">9</div>
2014
+ <div class="line-num">10</div>
2015
+ <div class="line-num">11</div>
2016
+ <div class="line-num">12</div>
2017
+ </div>
2018
+ <div class="code-content">
2019
+ <div class="code-line"><span class="json-bracket">{</span></div>
2020
+ <div class="code-line"> <span class="json-key">"mcpServers"</span><span class="json-colon">:</span> <span class="json-bracket">{</span></div>
2021
+ <div class="code-line"> <span class="json-key">"voicekit"</span><span class="json-colon">:</span> <span class="json-bracket">{</span></div>
2022
+ <div class="code-line"> <span class="json-key">"command"</span><span class="json-colon">:</span> <span class="json-string">"npx"</span><span class="json-comma">,</span></div>
2023
+ <div class="code-line"> <span class="json-key">"args"</span><span class="json-colon">:</span> <span class="json-bracket">[</span></div>
2024
+ <div class="code-line"> <span class="json-string">"-y"</span><span class="json-comma">,</span></div>
2025
+ <div class="code-line"> <span class="json-string">"mcp-remote"</span><span class="json-comma">,</span></div>
2026
+ <div class="code-line"> <span class="json-string">"https://mcp-1st-birthday-voicekit-test.hf.space/gradio_api/mcp/sse"</span></div>
2027
+ <div class="code-line"> <span class="json-bracket">]</span></div>
2028
+ <div class="code-line"> <span class="json-bracket">}</span></div>
2029
+ <div class="code-line"> <span class="json-bracket">}</span></div>
2030
+ <div class="code-line"><span class="json-bracket">}</span></div>
2031
+ </div>
2032
+ </div>
2033
+ </div>
2034
+
2035
+ <button class="copy-button" onclick="navigator.clipboard.writeText(JSON.stringify({mcpServers:{voicekit:{command:'npx',args:['-y','mcp-remote','https://mcp-1st-birthday-voicekit-test.hf.space/gradio_api/mcp/sse']}}},null,2))">
2036
+ <svg width="16" height="16" viewBox="0 0 24 24" fill="white" style="display: inline-block; vertical-align: middle;">
2037
+ <rect x="9" y="9" width="13" height="13" rx="2" ry="2" fill="white"/>
2038
+ <path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1" fill="none" stroke="white" stroke-width="2"/>
2039
+ </svg>
2040
+ COPY CONFIG
2041
+ </button>
2042
+ </div>
2043
+ """)
2044
+
2045
+ # AVAILABLE TOOLS CARD
2046
+ with gr.Column(scale=1):
2047
+ gr.HTML("""
2048
+ <div class="card" style="min-height: 550px;">
2049
+ <div class="card-title">
2050
+ <svg width="18" height="18" viewBox="0 0 24 24" fill="#7c3aed" style="margin-right: 8px;">
2051
+ <path d="M22.7 19l-9.1-9.1c.9-2.3.4-5-1.5-6.9-2-2-5-2.4-7.4-1.3L9 6 6 9 1.6 4.7C.4 7.1.9 10.1 2.9 12.1c1.9 1.9 4.6 2.4 6.9 1.5l9.1 9.1c.4.4 1 .4 1.4 0l2.3-2.3c.5-.4.5-1.1.1-1.4z"/>
2052
+ </svg>
2053
+ AVAILABLE TOOLS
2054
+ </div>
2055
+ <table class="tools-table">
2056
+ <thead>
2057
+ <tr>
2058
+ <th>TOOL</th>
2059
+ <th>PURPOSE</th>
2060
+ <th>INPUT</th>
2061
+ <th>OUTPUT</th>
2062
+ </tr>
2063
+ </thead>
2064
+ <tbody>
2065
+ <tr>
2066
+ <td>
2067
+ <div style="display: flex; align-items: center; gap: 12px;">
2068
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2069
+ <path d="M21 16V8L12 4L3 8V16L12 20L21 16Z" stroke="#A855F7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2070
+ <path d="M12 4V12M12 12V20M12 12L21 8M12 12L3 8" stroke="#A855F7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2071
+ <circle cx="12" cy="12" r="2" fill="#A855F7"/>
2072
+ <circle cx="16.5" cy="10" r="1.5" fill="#A855F7"/>
2073
+ <circle cx="7.5" cy="14" r="1.5" fill="#A855F7"/>
2074
+ <path d="M12 12L16.5 10M12 12L7.5 14" stroke="#A855F7" stroke-width="1.5" stroke-linecap="round"/>
2075
+ </svg>
2076
+ <span class="tool-name">extract_embedding</span>
2077
+ </div>
2078
+ </td>
2079
+ <td>Extract 768-dim voice fingerprint</td>
2080
+ <td>audio_base64</td>
2081
+ <td>embedding, model, dim</td>
2082
+ </tr>
2083
+ <tr>
2084
+ <td>
2085
+ <div style="display: flex; align-items: center; gap: 12px;">
2086
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2087
+ <path d="M2 10V14" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2088
+ <path d="M5 8V16" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2089
+ <path d="M8 11V13" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2090
+ <path d="M22 10V14" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2091
+ <path d="M19 7V17" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2092
+ <path d="M16 11V13" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2093
+ <path d="M10 12H14" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2094
+ <path d="M10 12L11.5 10.5" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2095
+ <path d="M10 12L11.5 13.5" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2096
+ <path d="M14 12L12.5 10.5" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2097
+ <path d="M14 12L12.5 13.5" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2098
+ </svg>
2099
+ <span class="tool-name">match_voice</span>
2100
+ </div>
2101
+ </td>
2102
+ <td>Compare the similarity of two voices</td>
2103
+ <td>audio1_base64, audio2_base64</td>
2104
+ <td>similarity, tone_score</td>
2105
+ </tr>
2106
+ <tr>
2107
+ <td>
2108
+ <div style="display: flex; align-items: center; gap: 12px;">
2109
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2110
+ <path d="M22 10C22 10 20 4 17 4C14 4 12 16 9 16C6 16 4 10 2 10" stroke="#A855F7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2111
+ <g transform="translate(13, 11)">
2112
+ <circle cx="5" cy="5" r="4" stroke="#A855F7" stroke-width="1.5"/>
2113
+ <path d="M8 8L11 11" stroke="#A855F7" stroke-width="1.5" stroke-linecap="round"/>
2114
+ </g>
2115
+ </svg>
2116
+ <span class="tool-name">analyze_acoustics</span>
2117
+ </div>
2118
+ </td>
2119
+ <td>Analyze pitch, energy, rhythm, tempo</td>
2120
+ <td>audio_base64</td>
2121
+ <td>pitch, energy, rhythm, tempo</td>
2122
+ </tr>
2123
+ <tr>
2124
+ <td>
2125
+ <div style="display: flex; align-items: center; gap: 12px;">
2126
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2127
+ <path d="M2 12C2 12 4 5 7 5C10 5 11 19 14 19C15.5 19 16.5 15 16.5 15" stroke="#A855F7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2128
+ <path d="M19 7H22" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2129
+ <path d="M19 12H22" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2130
+ <path d="M19 17H22" stroke="#A855F7" stroke-width="2" stroke-linecap="round"/>
2131
+ </svg>
2132
+ <span class="tool-name">transcribe_audio</span>
2133
+ </div>
2134
+ </td>
2135
+ <td>Convert speech to text</td>
2136
+ <td>audio_base64, language</td>
2137
+ <td>text, language, model</td>
2138
+ </tr>
2139
+ <tr>
2140
+ <td>
2141
+ <div style="display: flex; align-items: center; gap: 12px;">
2142
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2143
+ <path d="M12 5V19" stroke="#A855F7" stroke-width="2.5" stroke-linecap="round"/>
2144
+ <path d="M9 8V16" stroke="#A855F7" stroke-width="2.5" stroke-linecap="round"/>
2145
+ <path d="M15 8V16" stroke="#A855F7" stroke-width="2.5" stroke-linecap="round"/>
2146
+ <path d="M5 4H3V20H5" stroke="#A855F7" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/>
2147
+ <path d="M19 4H21V20H19" stroke="#A855F7" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/>
2148
+ </svg>
2149
+ <span class="tool-name">isolate_voice</span>
2150
+ </div>
2151
+ </td>
2152
+ <td>Remove background music/noise</td>
2153
+ <td>audio_base64</td>
2154
+ <td>isolated_audio_base64, metadata</td>
2155
+ </tr>
2156
+ <tr>
2157
+ <td>
2158
+ <div style="display: flex; align-items: center; gap: 12px;">
2159
+ <svg width="24" height="24" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2160
+ <circle cx="12" cy="12" r="9" stroke="#A855F7" stroke-width="1" opacity="0.3"/>
2161
+ <path d="M12 3V21" stroke="#A855F7" stroke-width="1" opacity="0.3"/>
2162
+ <path d="M4.2 7.5L19.8 16.5" stroke="#A855F7" stroke-width="1" opacity="0.3"/>
2163
+ <path d="M19.8 7.5L4.2 16.5" stroke="#A855F7" stroke-width="1" opacity="0.3"/>
2164
+ <path d="M12 5L18 9L16.5 18H7.5L6 9L12 5Z" fill="#A855F7" fill-opacity="0.4" stroke="#A855F7" stroke-width="2" stroke-linejoin="round"/>
2165
+ <circle cx="12" cy="5" r="1.5" fill="#A855F7"/>
2166
+ <circle cx="18" cy="9" r="1.5" fill="#A855F7"/>
2167
+ <circle cx="16.5" cy="18" r="1.5" fill="#A855F7"/>
2168
+ <circle cx="7.5" cy="18" r="1.5" fill="#A855F7"/>
2169
+ <circle cx="6" cy="9" r="1.5" fill="#A855F7"/>
2170
+ </svg>
2171
+ <span class="tool-name">grade_voice</span>
2172
+ </div>
2173
+ </td>
2174
+ <td>5-metric comprehensive analysis</td>
2175
+ <td>user_audio, reference_audio, text, category</td>
2176
+ <td>overall, metrics, feedback</td>
2177
+ </tr>
2178
+ </tbody>
2179
+ </table>
2180
+ </div>
2181
+ """)
2182
+
2183
+ # ==================== FIRST ROW: 3 DEMO CARDS ====================
2184
+ with gr.Row(equal_height=True, elem_classes="demo-row"):
2185
+ # EXTRACT EMBEDDING
2186
+ with gr.Column(scale=1, elem_classes="demo-card-column"):
2187
+ gr.HTML("""
2188
+ <div style="display: flex; align-items: center; gap: 6px; margin-bottom: 8px; padding-left: 18px; padding-top: 10px;">
2189
+ <svg width="20" height="20" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2190
+ <path d="M21 16V8L12 4L3 8V16L12 20L21 16Z" stroke="#7c3aed" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2191
+ <path d="M12 4V12M12 12V20M12 12L21 8M12 12L3 8" stroke="#7c3aed" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2192
+ <circle cx="12" cy="12" r="2" fill="#7c3aed"/>
2193
+ <circle cx="16.5" cy="10" r="1.5" fill="#7c3aed"/>
2194
+ <circle cx="7.5" cy="14" r="1.5" fill="#7c3aed"/>
2195
+ <path d="M12 12L16.5 10M12 12L7.5 14" stroke="#7c3aed" stroke-width="1.5" stroke-linecap="round"/>
2196
+ </svg>
2197
+ <div style="font-size: 16px; font-weight: 700; color: #a5b4fc; text-transform: uppercase; letter-spacing: 1px;">
2198
+ EXTRACT EMBEDDING
2199
+ </div>
2200
+ </div>
2201
+ """)
2202
+ embedding_audio = gr.Audio(
2203
+ type="filepath",
2204
+ label="Audio Input",
2205
+ show_label=False,
2206
+ format="wav"
2207
+ )
2208
+ embedding_btn = gr.Button("EXTRACT", variant="primary", size="lg", elem_classes="custom-action-btn")
2209
+ embedding_output = gr.HTML(value=create_embedding_empty())
2210
+
2211
+ embedding_btn.click(
2212
+ demo_extract_embedding,
2213
+ inputs=[embedding_audio],
2214
+ outputs=[embedding_output],
2215
+ api_visibility="private"
2216
+ )
2217
+
2218
+ # COMPARE VOICES
2219
+ with gr.Column(scale=1, elem_classes="demo-card-column"):
2220
+ gr.HTML("""
2221
+ <div style="display: flex; align-items: center; gap: 6px; margin-bottom: 8px; padding-left: 18px; padding-top: 10px;">
2222
+ <svg width="20" height="20" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2223
+ <path d="M2 10V14" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2224
+ <path d="M5 8V16" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2225
+ <path d="M8 11V13" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2226
+ <path d="M22 10V14" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2227
+ <path d="M19 7V17" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2228
+ <path d="M16 11V13" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2229
+ <path d="M10 12H14" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2230
+ <path d="M10 12L11.5 10.5" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2231
+ <path d="M10 12L11.5 13.5" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2232
+ <path d="M14 12L12.5 10.5" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2233
+ <path d="M14 12L12.5 13.5" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2234
+ </svg>
2235
+ <div style="font-size: 16px; font-weight: 700; color: #a5b4fc; text-transform: uppercase; letter-spacing: 1px;">
2236
+ COMPARE VOICES
2237
+ </div>
2238
+ </div>
2239
+ """)
2240
+ with gr.Row():
2241
+ compare_audio1 = gr.Audio(
2242
+ type="filepath",
2243
+ label="Audio 1",
2244
+ show_label=False,
2245
+ format="wav"
2246
+ )
2247
+ compare_audio2 = gr.Audio(
2248
+ type="filepath",
2249
+ label="Audio 2",
2250
+ show_label=False,
2251
+ format="wav"
2252
+ )
2253
+ compare_btn = gr.Button("COMPARE", variant="primary", size="lg", elem_classes="custom-action-btn")
2254
+ compare_output = gr.HTML(value=create_compare_empty())
2255
+
2256
+ compare_btn.click(
2257
+ demo_match_voice,
2258
+ inputs=[compare_audio1, compare_audio2],
2259
+ outputs=[compare_output],
2260
+ api_visibility="private"
2261
+ )
2262
+
2263
+ # ACOUSTIC ANALYSIS
2264
+ with gr.Column(scale=1, elem_classes="demo-card-column"):
2265
+ gr.HTML("""
2266
+ <div style="display: flex; align-items: center; gap: 6px; margin-bottom: 8px; padding-left: 18px; padding-top: 10px;">
2267
+ <svg width="20" height="20" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2268
+ <path d="M22 10C22 10 20 4 17 4C14 4 12 16 9 16C6 16 4 10 2 10" stroke="#7c3aed" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2269
+ <g transform="translate(13, 11)">
2270
+ <circle cx="5" cy="5" r="4" stroke="#7c3aed" stroke-width="1.5"/>
2271
+ <path d="M8 8L11 11" stroke="#7c3aed" stroke-width="1.5" stroke-linecap="round"/>
2272
+ </g>
2273
+ </svg>
2274
+ <div style="font-size: 16px; font-weight: 700; color: #a5b4fc; text-transform: uppercase; letter-spacing: 1px;">
2275
+ ACOUSTIC ANALYSIS
2276
+ </div>
2277
+ </div>
2278
+ """)
2279
+ acoustic_audio = gr.Audio(
2280
+ type="filepath",
2281
+ label="Audio Input",
2282
+ show_label=False,
2283
+ format="wav"
2284
+ )
2285
+ acoustic_btn = gr.Button("ANALYZE", variant="primary", size="lg", elem_classes="custom-action-btn")
2286
+ acoustic_output = gr.HTML(value=create_acoustic_empty())
2287
+
2288
+ acoustic_btn.click(
2289
+ demo_acoustic_analysis,
2290
+ inputs=[acoustic_audio],
2291
+ outputs=[acoustic_output],
2292
+ api_visibility="private"
2293
+ )
2294
+
2295
+ # ==================== SECOND ROW: 3 MORE DEMO CARDS ====================
2296
+ with gr.Row(equal_height=True, elem_classes="demo-row"):
2297
+ # AUDIO TRANSCRIPTION
2298
+ with gr.Column(scale=1, elem_classes="demo-card-column"):
2299
+ gr.HTML("""
2300
+ <div style="display: flex; align-items: center; gap: 6px; margin-bottom: 8px; padding-left: 18px; padding-top: 10px;">
2301
+ <svg width="20" height="20" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2302
+ <path d="M2 12C2 12 4 5 7 5C10 5 11 19 14 19C15.5 19 16.5 15 16.5 15" stroke="#7c3aed" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
2303
+ <path d="M19 7H22" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2304
+ <path d="M19 12H22" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2305
+ <path d="M19 17H22" stroke="#7c3aed" stroke-width="2" stroke-linecap="round"/>
2306
+ </svg>
2307
+ <div style="font-size: 16px; font-weight: 700; color: #a5b4fc; text-transform: uppercase; letter-spacing: 1px;">
2308
+ AUDIO TRANSCRIPTION
2309
+ </div>
2310
+ </div>
2311
+ """)
2312
+ transcribe_audio_input = gr.Audio(
2313
+ type="filepath",
2314
+ label="Audio Input",
2315
+ show_label=False,
2316
+ format="wav"
2317
+ )
2318
+ transcribe_btn = gr.Button("TRANSCRIBE", variant="primary", size="lg", elem_classes="custom-action-btn")
2319
+ transcribe_output = gr.HTML(value=create_transcription_empty())
2320
+
2321
+ transcribe_btn.click(
2322
+ lambda audio: demo_transcribe_audio(audio, "en"),
2323
+ inputs=[transcribe_audio_input],
2324
+ outputs=[transcribe_output],
2325
+ api_visibility="private"
2326
+ )
2327
+
2328
+ # CLEAN AUDIO EXTRACTION
2329
+ with gr.Column(scale=1, elem_classes="demo-card-column"):
2330
+ gr.HTML("""
2331
+ <div style="display: flex; align-items: center; gap: 6px; margin-bottom: 8px; padding-left: 18px; padding-top: 10px;">
2332
+ <svg width="20" height="20" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
2333
+ <path d="M12 5V19" stroke="#7c3aed" stroke-width="2.5" stroke-linecap="round"/>
2334
+ <path d="M9 8V16" stroke="#7c3aed" stroke-width="2.5" stroke-linecap="round"/>
2335
+ <path d="M15 8V16" stroke="#7c3aed" stroke-width="2.5" stroke-linecap="round"/>
2336
+ <path d="M5 4H3V20H5" stroke="#7c3aed" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/>
2337
+ <path d="M19 4H21V20H19" stroke="#7c3aed" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/>
2338
+ </svg>
2339
+ <div style="font-size: 16px; font-weight: 700; color: #a5b4fc; text-transform: uppercase; letter-spacing: 1px;">
2340
+ CLEAN AUDIO EXTRACTION
2341
+ </div>
2342
+ </div>
2343
+ """)
2344
+ clean_audio_input = gr.Audio(
2345
+ type="filepath",
2346
+ label="Audio with Background",
2347
+ show_label=False,
2348
+ format="wav"
2349
+ )
2350
+ clean_btn = gr.Button("EXTRACT VOICE", variant="primary", size="lg", elem_classes="custom-action-btn")
2351
+ clean_audio_output = gr.Audio(label="Clean Audio", type="filepath", visible=True)
2352
+
2353
+ clean_btn.click(
2354
+ demo_clean_extraction,
2355
+ inputs=[clean_audio_input],
2356
+ outputs=[clean_audio_output],
2357
+ api_visibility="private"
2358
+ )
+
+         # VOICE SIMILARITY
+         with gr.Column(scale=1, elem_classes="demo-card-column"):
+             gr.HTML("""
+             <div style="display: flex; align-items: center; gap: 6px; margin-bottom: 8px; padding-left: 18px; padding-top: 10px;">
+                 <svg width="20" height="20" viewBox="0 0 24 24" fill="none" style="flex-shrink: 0;">
+                     <circle cx="12" cy="12" r="9" stroke="#7c3aed" stroke-width="1" opacity="0.3"/>
+                     <path d="M12 3V21" stroke="#7c3aed" stroke-width="1" opacity="0.3"/>
+                     <path d="M4.2 7.5L19.8 16.5" stroke="#7c3aed" stroke-width="1" opacity="0.3"/>
+                     <path d="M19.8 7.5L4.2 16.5" stroke="#7c3aed" stroke-width="1" opacity="0.3"/>
+                     <path d="M12 5L18 9L16.5 18H7.5L6 9L12 5Z" fill="#7c3aed" fill-opacity="0.4" stroke="#7c3aed" stroke-width="2" stroke-linejoin="round"/>
+                     <circle cx="12" cy="5" r="1.5" fill="#7c3aed"/>
+                     <circle cx="18" cy="9" r="1.5" fill="#7c3aed"/>
+                     <circle cx="16.5" cy="18" r="1.5" fill="#7c3aed"/>
+                     <circle cx="7.5" cy="18" r="1.5" fill="#7c3aed"/>
+                     <circle cx="6" cy="9" r="1.5" fill="#7c3aed"/>
+                 </svg>
+                 <div style="font-size: 16px; font-weight: 700; color: #a5b4fc; text-transform: uppercase; letter-spacing: 1px;">
+                     VOICE SIMILARITY
+                 </div>
+             </div>
+             """)
+             with gr.Row():
+                 similarity_user_audio = gr.Audio(
+                     type="filepath",
+                     label="User Audio",
+                     show_label=False,
+                     format="wav"
+                 )
+                 similarity_ref_audio = gr.Audio(
+                     type="filepath",
+                     label="Reference Audio",
+                     show_label=False,
+                     format="wav"
+                 )
+             similarity_btn = gr.Button("ANALYZE", variant="primary", size="lg", elem_classes="custom-action-btn")
+             similarity_output = gr.HTML(value=create_similarity_empty())
+
+             similarity_btn.click(
+                 demo_voice_similarity,
+                 inputs=[similarity_user_audio, similarity_ref_audio],
+                 outputs=[similarity_output],
+                 api_visibility="private"
+             )
+
+
+     # ==================== MCP TOOL INTERFACES (HIDDEN, API ONLY) ====================
+     with gr.Row(visible=False):
+         # extract_embedding
+         mcp_emb_input = gr.Textbox()
+         mcp_emb_output = gr.Textbox()
+         mcp_emb_btn = gr.Button()
+         mcp_emb_btn.click(extract_embedding, inputs=[mcp_emb_input], outputs=[mcp_emb_output])
+
+         # match_voice
+         mcp_cmp_input1 = gr.Textbox()
+         mcp_cmp_input2 = gr.Textbox()
+         mcp_cmp_output = gr.Textbox()
+         mcp_cmp_btn = gr.Button()
+         mcp_cmp_btn.click(match_voice, inputs=[mcp_cmp_input1, mcp_cmp_input2], outputs=[mcp_cmp_output])
+
+         # analyze_acoustics
+         mcp_ac_input = gr.Textbox()
+         mcp_ac_output = gr.Textbox()
+         mcp_ac_btn = gr.Button()
+         mcp_ac_btn.click(analyze_acoustics, inputs=[mcp_ac_input], outputs=[mcp_ac_output])
+
+         # transcribe_audio
+         mcp_tr_input = gr.Textbox()
+         mcp_tr_lang = gr.Textbox(value="en")
+         mcp_tr_output = gr.Textbox()
+         mcp_tr_btn = gr.Button()
+         mcp_tr_btn.click(transcribe_audio, inputs=[mcp_tr_input, mcp_tr_lang], outputs=[mcp_tr_output])
+
+         # isolate_voice
+         mcp_iso_input = gr.Textbox()
+         mcp_iso_output = gr.Textbox()
+         mcp_iso_btn = gr.Button()
+         mcp_iso_btn.click(isolate_voice, inputs=[mcp_iso_input], outputs=[mcp_iso_output])
+
+         # grade_voice
+         mcp_sim_user = gr.Textbox()
+         mcp_sim_ref = gr.Textbox()
+         mcp_sim_text = gr.Textbox()
+         mcp_sim_cat = gr.Textbox(value="meme")
+         mcp_sim_output = gr.Textbox()
+         mcp_sim_btn = gr.Button()
+         mcp_sim_btn.click(grade_voice, inputs=[mcp_sim_user, mcp_sim_ref, mcp_sim_text, mcp_sim_cat], outputs=[mcp_sim_output])
+
+
+ if __name__ == "__main__":
+     demo.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         mcp_server=True
+     )
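A note on the hidden MCP block above: the six tool registrations repeat the same Textbox/Button/`click` pattern, differing only in the number of inputs and their default values. The following is a minimal sketch of a data-driven alternative, not the author's implementation. `TOOL_SPECS` and `wire_hidden_tools` are hypothetical names introduced here for illustration, and the six tool functions are assumed to be defined elsewhere in `app.py` (they are not shown in this part of the diff). The Gradio module is passed in as an argument so the sketch carries no import-time dependency.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical table: (tool function name, default value per text input).
# None means "no default". The entries mirror the six hidden blocks in app.py.
TOOL_SPECS: List[Tuple[str, List[Any]]] = [
    ("extract_embedding", [None]),
    ("match_voice", [None, None]),
    ("analyze_acoustics", [None]),
    ("transcribe_audio", [None, "en"]),
    ("isolate_voice", [None]),
    ("grade_voice", [None, None, None, "meme"]),
]


def wire_hidden_tools(gr: Any, tools: Dict[str, Callable[..., str]]) -> List[str]:
    """Create one hidden Textbox/Button group per tool and bind its click handler.

    Assumed to be called inside a `with gr.Blocks() as demo:` context.
    `gr` is the imported gradio module; `tools` maps each name in TOOL_SPECS
    to the function that implements it. Returns the names that were wired.
    """
    wired: List[str] = []
    with gr.Row(visible=False):
        for name, defaults in TOOL_SPECS:
            # One input Textbox per parameter, pre-filled where a default exists.
            inputs = [
                gr.Textbox(value=d) if d is not None else gr.Textbox()
                for d in defaults
            ]
            output = gr.Textbox()
            gr.Button().click(tools[name], inputs=inputs, outputs=[output])
            wired.append(name)
    return wired


# Usage sketch (assumes gradio is installed and the tool functions exist):
#
#     import gradio as gr
#     with gr.Blocks() as demo:
#         wire_hidden_tools(gr, {"extract_embedding": extract_embedding, ...})
```

The explicit version in the diff has the advantage that each tool's inputs are visible at a glance; the table form mainly helps if more tools are added later.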
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ gradio[mcp]>=6.0.0
+ modal>=0.63.0
+ requests>=2.31.0