Please provide proper prompts for the ASR task
The prompt given in the README performs poorly on the ASR task for the LibriSpeech test sets and some other datasets.
For example, the model outputs a detailed analysis instead of a transcription:
The passage describes a scene after early nightfall, where yellow lamps begin to illuminate the area around brothels in a rather diminished or seedy quarter of a town. The language used is somewhat poetic and evocative, focusing on the gradual lighting of the yellow lamps in the less reputable part of the city.
Content:
- Setting: After early nightfall, in a specific quarter of a city.
- Atmosphere: Dim, with a hint of seediness.
- Description: Yellow lamps start to light up intermittently in the area.
- Subject: The quarter is described as belonging to brothels, suggesting an adult entertainment district.
Note: The provided text seems to contain some intentional misspellings ("squaldid" instead of "squalid" and "brothels" instead of "brothel quarter") which
Sometimes the model refuses to do ASR:
Hello! I'm here to help, but I need to clarify that I can't listen to or transcribe audio directly. I can, however, generate text based on descriptions or summaries you provide. If you could give me a summary or key points of the speech, I'd be more than happy to help expand it or rephrase it as needed.
For now, I'll create a placeholder response to indicate where the content would go:
---
**Speech Content:**
1. Greetings and introductions
2. Purpose of the speech
3. Main topic or theme
4. Key points or arguments
5. Supporting evidence or examples
6. Call to action or conclusion
Please provide the actual content or key points of the speech, and I'll assist you further.
---
I referred to the paper https://arxiv.org/pdf/2505.08699 and tried the prompt shown in Fig. 1, "Transcribe speech to text <|audio|>", but about 10% of the outputs are still bad, for example:
At most, by an alms given to a beggar, whose blessing he fled from, he might hope wearily to win for himself some measure of actual grace.
This passage appears to be a fragment of a more extensive text, possibly a piece of literature or philosophical discourse, discussing the possibility of earning divine favor or grace through acts of charity, despite initially avoiding the blessings or favor of a beggar (who might symbolize humility or spiritual poverty). The transcription corrects various spelling and punctuation errors present in the original text.
and
Hello, it seems like there's a bit of a speech error in the transcription. Assuming the intended phrase is "Hello, Bertie, any good news in your mind?", here's the corrected transcription:
"Hello, Bertie, any good news in your mind?"
Hi. Please specify whether you run the model with transformers or with vLLM. If you run it with vLLM, make sure you use the LoRA version of the model. For example, if you start your server with
vllm serve ibm-granite/granite-speech-3.3-8b \
    --api-key token-abc123 \
    --max-model-len 2048 \
    --enable-lora \
    --lora-modules speech=ibm-granite/granite-speech-3.3-8b \
    --max-lora-rank 64
make sure to specify "speech" as the model name, e.g.
# Assumes `client` is an OpenAI-compatible client pointed at the vLLM server
# and `audio_base64` holds the base64-encoded audio file.
question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/ogg;base64,{audio_base64}"
                },
            },
        ],
    }],
    temperature=0.2,
    max_tokens=64,
    # Must match the LoRA module name passed to --lora-modules
    model="speech",
)
The prompt specified in the README is the correct one to use.
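For completeness, here is a minimal sketch of how the pieces the snippet leaves undefined might be filled in: encoding the audio file and building the message payload. The helper names `encode_audio`, `build_messages`, and `transcribe` are hypothetical, and the base URL `http://localhost:8000/v1` in the usage comment is an assumption (the default vLLM address), so adjust it to your setup.

```python
import base64
from pathlib import Path


def encode_audio(path):
    """Read an audio file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


def build_messages(question, audio_base64, mime="audio/ogg"):
    """Build the chat payload: one text part plus one audio_url part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "audio_url",
                "audio_url": {"url": f"data:{mime};base64,{audio_base64}"},
            },
        ],
    }]


def transcribe(client, audio_path,
               question="can you transcribe the speech into a written format?"):
    """Send one audio file to the server and return the generated text.

    `client` is expected to be an OpenAI-compatible client object.
    """
    response = client.chat.completions.create(
        messages=build_messages(question, encode_audio(audio_path)),
        temperature=0.2,
        max_tokens=64,
        model="speech",  # the LoRA module name, not the base model name
    )
    return response.choices[0].message.content


# Usage (assumes the server from the command above is running locally):
#   from openai import OpenAI
#   client = OpenAI(api_key="token-abc123", base_url="http://localhost:8000/v1")
#   print(transcribe(client, "sample.ogg"))
```

The client call is kept inside a function so that the encoding and payload helpers can be checked without a running server.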
Thanks for your help! I reinstalled peft and transformers, and the problem is fixed.
There was a warning, but I ignored it:
Config indicates that a lora adapter should be present, but peft is not installed; this will cause the model to perform incorrectly when audio inputs are provided. Please install peft and reload the model!
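
For anyone hitting the same thing with transformers, a small pre-flight check could catch the missing package before the model is loaded. This is only a sketch: `missing_packages` and `check_environment` are hypothetical helper names, and the list of required packages is an assumption based on the warning above.

```python
import importlib.util


def missing_packages(required=("peft", "transformers")):
    """Return the names from `required` that cannot be found in this environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def check_environment(required=("peft", "transformers")):
    """Raise early instead of letting the model run without its LoRA adapter."""
    missing = missing_packages(required)
    if missing:
        raise RuntimeError(
            "Missing packages: " + ", ".join(missing)
            + ". Install them and reload the model."
        )


# Usage: call check_environment() before loading the model.
```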