Speech recognition (using pipeline)
Dataset download 1: PolyAI/minds14 · Datasets at Hugging Face
Dataset download 2: PolyAI/minds14 · Datasets at HF Mirror
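For offline use, the dataset can be fetched ahead of time into the local path the script below expects. A minimal sketch using huggingface_hub; the HF_ENDPOINT setting follows HF Mirror's documented convention and is only needed when downloading through the mirror:

import os

# Assumption: downloading through HF Mirror; per its docs, point
# huggingface_hub at the mirror endpoint *before* importing the library.
# Skip this line to download from huggingface.co directly.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Fetch the MInDS-14 dataset into the directory used by load_dataset below
snapshot_download(
    repo_id="PolyAI/minds14",
    repo_type="dataset",
    local_dir="./models/minds14",
)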
import torch
from transformers import pipeline
from datasets import load_dataset, Audio

# Build an ASR pipeline backed by wav2vec2-base-960h
speech_recognizer = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
    framework="pt",
    trust_remote_code=True,
)

# Load the locally downloaded MInDS-14 dataset
dataset = load_dataset(
    "./models/minds14", name="en-US", split="train", trust_remote_code=True
)

# Resample the audio column to the sampling rate the model expects
dataset = dataset.cast_column(
    "audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)
)

# Transcribe the first four clips
result = speech_recognizer(dataset[:4]["audio"])
print([d["text"] for d in result])
Output:
['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT']
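The same pipeline can be pinned to a GPU and fed the clips in batches; a minimal sketch reusing the dataset loaded above (the device index and batch size are assumptions, not tuned values):

import torch
from transformers import pipeline

# Use the first CUDA device if present, otherwise fall back to CPU (-1)
device = 0 if torch.cuda.is_available() else -1

speech_recognizer = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
    framework="pt",
    device=device,
)

# batch_size groups several clips into one forward pass
result = speech_recognizer(dataset[:4]["audio"], batch_size=2)
print([d["text"] for d in result])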
Speech recognition (using pipeline, default model)
Model download 1: facebook/wav2vec2-base-960h · Hugging Face
Model download 2: facebook/wav2vec2-base-960h · HF Mirror
from transformers import pipeline

# With no model specified, the ASR pipeline falls back to its default
# checkpoint (facebook/wav2vec2-base-960h)
transcriber = pipeline(task="automatic-speech-recognition")
result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)
Output:
{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'}
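Besides URLs, the pipeline also accepts local file paths and raw NumPy arrays; a short sketch assuming the clip has already been saved next to the script (the file name is a placeholder):

# Assumption: mlk.flac was downloaded beforehand, e.g. with requests or wget
result = transcriber("mlk.flac")
print(result["text"])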
Speech recognition (without pipeline)
Model download 1: openai/whisper-small · Hugging Face
Model download 2: openai/whisper-small · HF Mirror
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import requests
import soundfile as sf
import io
import librosa

# Load the pretrained model and processor
model_id = "openai/whisper-small"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)

# Download the audio file
audio_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
response = requests.get(audio_url)
audio_bytes = io.BytesIO(response.content)
audio_array, sampling_rate = sf.read(audio_bytes)

# Resample to 16000 Hz (the sampling rate Whisper expects)
target_sampling_rate = 16000
if sampling_rate != target_sampling_rate:
    audio_array = librosa.resample(
        y=audio_array, orig_sr=sampling_rate, target_sr=target_sampling_rate
    )
    sampling_rate = target_sampling_rate

# Turn the raw waveform into model input features
inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")

# Generate the transcription with the model
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)

# Decode the generated ids into text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
result = {"text": transcription}
print(result)
Output:
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
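Whisper is multilingual and auto-detects the language by default. To pin the language and task explicitly, the processor can build a decoder prompt that is passed to generate; a hedged sketch reusing the model, processor, and inputs from above (choosing English transcription here is an assumption):

# Build forced decoder ids that fix the language/task tokens
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

with torch.no_grad():
    generated_ids = model.generate(
        inputs.input_features, forced_decoder_ids=forced_ids
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Newer transformers releases also accept language= and task= directly as generate() arguments for Whisper, which does the same thing.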
Speech recognition (large model)
Model download 1: openai/whisper-large-v2 · Hugging Face
Model download 2: openai/whisper-large-v2 · HF Mirror
from transformers import pipeline

# Same pipeline call as before, but with the much larger
# whisper-large-v2 checkpoint (~1.5B parameters)
transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-large-v2"
)
result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)
Output (on a 4 GiB GPU this fails with an out-of-memory error):
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.40 GiB is allocated by PyTorch, and 83.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
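The fp32 weights of whisper-large-v2 alone are roughly 6 GB, so they cannot fit on a 4 GiB card. Loading in half precision roughly halves the footprint and may be enough; failing that, fall back to CPU or a smaller checkpoint. A hedged sketch (whether fp16 actually fits on a given 4 GiB GPU is not guaranteed):

import torch
from transformers import pipeline

# Load the checkpoint in float16 to roughly halve GPU memory use
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device=0,  # assumption: a single CUDA device
)

result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)

Note that the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting suggested by the error message only helps with memory fragmentation; it does not make a model that is simply too large fit.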