Speech recognition (using pipeline)
Dataset download 1: PolyAI/minds14 · Datasets at Hugging Face
Dataset download 2: PolyAI/minds14 · Datasets at HF Mirror
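For offline use, the dataset can be fetched ahead of time into the local path the script below expects. A minimal sketch using huggingface_hub; the HF_ENDPOINT setting follows HF Mirror's documented convention and is only needed when downloading through the mirror:

import os

# Assumption: downloading through HF Mirror; per its docs, point
# huggingface_hub at the mirror endpoint *before* importing the library.
# Skip this line to download from huggingface.co directly.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Fetch the MInDS-14 dataset into the directory used by load_dataset below
snapshot_download(
    repo_id="PolyAI/minds14",
    repo_type="dataset",
    local_dir="./models/minds14",
)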
import torch
from transformers import pipeline
from datasets import load_dataset, Audio

# Build an ASR pipeline backed by wav2vec2-base-960h
speech_recognizer = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
    framework="pt",
    trust_remote_code=True,
)

# Load the locally downloaded MInDS-14 dataset
dataset = load_dataset(
    "./models/minds14", name="en-US", split="train", trust_remote_code=True
)

# Resample the audio column to the sampling rate the model expects
dataset = dataset.cast_column(
    "audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)
)

# Transcribe the first four clips
result = speech_recognizer(dataset[:4]["audio"])
print([d["text"] for d in result])
Output:
['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT']
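The same pipeline can be pinned to a GPU and fed the clips in batches; a minimal sketch reusing the dataset loaded above (the device index and batch size are assumptions, not tuned values):

import torch
from transformers import pipeline

# Use the first CUDA device if present, otherwise fall back to CPU (-1)
device = 0 if torch.cuda.is_available() else -1

speech_recognizer = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
    framework="pt",
    device=device,
)

# batch_size groups several clips into one forward pass
result = speech_recognizer(dataset[:4]["audio"], batch_size=2)
print([d["text"] for d in result])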
Speech recognition (using pipeline, default model)
Model download 1: facebook/wav2vec2-base-960h · Hugging Face
Model download 2: facebook/wav2vec2-base-960h · HF Mirror
from transformers import pipeline

# With no model specified, the ASR pipeline falls back to its default
# checkpoint (facebook/wav2vec2-base-960h)
transcriber = pipeline(task="automatic-speech-recognition")
result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)
Output:
{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'}
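Besides URLs, the pipeline also accepts local file paths and raw NumPy arrays; a short sketch assuming the clip has already been saved next to the script (the file name is a placeholder):

# Assumption: mlk.flac was downloaded beforehand, e.g. with requests or wget
result = transcriber("mlk.flac")
print(result["text"])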
Speech recognition (without pipeline)
Model download 1: openai/whisper-small · Hugging Face
Model download 2: openai/whisper-small · HF Mirror
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import requests
import soundfile as sf
import io
import librosa

# Load the pretrained model and processor
model_id = "openai/whisper-small"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Download the audio file
audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
response = requests.get(audio_url)
audio_bytes = io.BytesIO(response.content)
audio_array, sampling_rate = sf.read(audio_bytes)

# Resample to 16000 Hz (the sampling rate Whisper expects)
target_sampling_rate = 16000
if sampling_rate != target_sampling_rate:
    audio_array = librosa.resample(
        y=audio_array, orig_sr=sampling_rate, target_sr=target_sampling_rate
    )
    sampling_rate = target_sampling_rate

# Turn the raw waveform into model input features
inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")

# Generate the transcription with the model
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)

# Decode the generated ids into text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
result = {"text": transcription}
print(result)
Output:
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
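Whisper is multilingual and auto-detects the language by default. To pin the language and task explicitly, the processor can build a decoder prompt that is passed to generate; a hedged sketch reusing the model, processor, and inputs from above (choosing English transcription here is an assumption):

# Build forced decoder ids that fix the language/task tokens
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

with torch.no_grad():
    generated_ids = model.generate(
        inputs.input_features, forced_decoder_ids=forced_ids
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Newer transformers releases also accept language= and task= directly as generate() arguments for Whisper, which does the same thing.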
Speech recognition (large model)
Model download 1: openai/whisper-large-v2 · Hugging Face
Model download 2: openai/whisper-large-v2 · HF Mirror
from transformers import pipeline

# Same pipeline call as before, but with the much larger
# whisper-large-v2 checkpoint (~1.5B parameters)
transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-large-v2"
)
result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)
Output (on a 4 GiB GPU this fails with an out-of-memory error):
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.40 GiB is allocated by PyTorch, and 83.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
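The fp32 weights of whisper-large-v2 alone are roughly 6 GB, so they cannot fit on a 4 GiB card. Loading in half precision roughly halves the footprint and may be enough; failing that, fall back to CPU or a smaller checkpoint. A hedged sketch (whether fp16 actually fits on a given 4 GiB GPU is not guaranteed):

import torch
from transformers import pipeline

# Load the checkpoint in float16 to roughly halve GPU memory use
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device=0,  # assumption: a single CUDA device
)

result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)

Note that the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting suggested by the error message only helps with memory fragmentation; it does not make a model that is simply too large fit.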