Keras 2 : examples : Audio Data – Speaker Recognition (Translation/Commentary)
Translation : ClassCat Co., Ltd. Sales Information
Created : 06/27/2022 (keras 2.9.0)
* This page is a translation of the following Keras documentation, with supplementary explanations added where appropriate:
- Code examples : Audio Data : Speaker Recognition (Author: Fadi Badine)
* The sample code has been verified to run, but it has been modified or extended where necessary.
* Feel free to link to this page, but we would appreciate a note to sales-info@classcat.com.
Keras 2 : examples : Audio Data – Speaker Recognition
Description : Classify speakers using a Fast Fourier Transform (FFT) and a 1D convnet.
Introduction
This example demonstrates how to create a model that classifies speakers from the frequency-domain representation of speech recordings, obtained via the Fast Fourier Transform (FFT).
It shows the following :
- How to use tf.data to load, preprocess and feed audio streams into a model.
- How to create a 1D convolutional network with residual connections for audio classification.
Our process :
- We prepare a dataset of speech samples from different speakers, with the speaker as the label.
- We add background noise to these samples to augment our data.
- We take the FFT of these samples.
- We train a 1D convnet to predict the correct speaker given a noisy speech sample.
Note:
- This example should be run with TensorFlow 2.3 or higher, or tf-nightly.
- The noise samples in the dataset need to be resampled to a sampling rate of 16000 Hz before using the code in this example. In order to do this, you will need to have ffmpeg installed.
Setup
import os
import shutil
import numpy as np
import tensorflow as tf
from tensorflow import keras
from pathlib import Path
from IPython.display import display, Audio
# Get the data from https://www.kaggle.com/kongaevans/speaker-recognition-dataset/download
# and save it to the 'Downloads' folder in your HOME directory
DATASET_ROOT = os.path.join(os.path.expanduser("~"), "Downloads/16000_pcm_speeches")
# The folders in which we will put the audio samples and the noise samples
AUDIO_SUBFOLDER = "audio"
NOISE_SUBFOLDER = "noise"
DATASET_AUDIO_PATH = os.path.join(DATASET_ROOT, AUDIO_SUBFOLDER)
DATASET_NOISE_PATH = os.path.join(DATASET_ROOT, NOISE_SUBFOLDER)
# Percentage of samples to use for validation
VALID_SPLIT = 0.1
# Seed to use when shuffling the dataset and the noise
SHUFFLE_SEED = 43
# The sampling rate to use.
# This is the one used in all of the audio samples.
# We will resample all of the noise to this sampling rate.
# This will also be the output size of the audio wave samples
# (since all samples are of 1 second long)
SAMPLING_RATE = 16000
# The factor to multiply the noise with according to:
# noisy_sample = sample + noise * prop * scale
# where prop = sample_amplitude / noise_amplitude
SCALE = 0.5
BATCH_SIZE = 128
EPOCHS = 100
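(Translator's addition) As a quick illustration of the noise-scaling formula in the comments above, the following simplified NumPy sketch shows how `prop` rescales a noise signal to the amplitude of the speech sample before it is attenuated by `SCALE`. The sine wave and random noise here are stand-ins for illustration only; they are not part of the original example.

# Illustration only: uses np, SAMPLING_RATE and SCALE from the setup cell above.
rng = np.random.default_rng(0)
sample = 0.8 * np.sin(np.linspace(0, 2 * np.pi * 440, SAMPLING_RATE))  # stand-in "speech"
noise = 0.1 * rng.standard_normal(SAMPLING_RATE)                       # stand-in "noise"

prop = sample.max() / noise.max()            # amplitude ratio sample / noise
noisy_sample = sample + noise * prop * SCALE

# The rescaled noise now peaks at roughly SCALE times the sample's peak amplitude.
print(sample.max(), (noise * prop * SCALE).max())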
Data preparation
The dataset consists of 7 folders, divided into 2 groups :
- Speech samples, with 5 folders for 5 different speakers. Each folder contains 1500 audio clips, each 1 second long and sampled at 16000 Hz.
- Background noise samples, with 2 folders and a total of 6 files. These files are longer than 1 second (and were not originally sampled at 16000 Hz, but we will resample them to 16000 Hz). We use these 6 files to create 354 1-second-long noise samples to be used for training.
Let's sort these 2 categories into 2 folders :
- An audio folder, which will contain all the per-speaker speech sample folders.
- A noise folder, which will contain all the noise samples.
Before sorting the audio and noise categories into 2 folders, we have the following structure:
main_directory/
...speaker_a/
...speaker_b/
...speaker_c/
...speaker_d/
...speaker_e/
...other/
..._background_noise_/
After sorting, we end up with the following structure :
main_directory/
...audio/
......speaker_a/
......speaker_b/
......speaker_c/
......speaker_d/
......speaker_e/
...noise/
......other/
......_background_noise_/
# If folder `audio`, does not exist, create it, otherwise do nothing
if os.path.exists(DATASET_AUDIO_PATH) is False:
    os.makedirs(DATASET_AUDIO_PATH)

# If folder `noise`, does not exist, create it, otherwise do nothing
if os.path.exists(DATASET_NOISE_PATH) is False:
    os.makedirs(DATASET_NOISE_PATH)

for folder in os.listdir(DATASET_ROOT):
    if os.path.isdir(os.path.join(DATASET_ROOT, folder)):
        if folder in [AUDIO_SUBFOLDER, NOISE_SUBFOLDER]:
            # If folder is `audio` or `noise`, do nothing
            continue
        elif folder in ["other", "_background_noise_"]:
            # If folder is one of the folders that contains noise samples,
            # move it to the `noise` folder
            shutil.move(
                os.path.join(DATASET_ROOT, folder),
                os.path.join(DATASET_NOISE_PATH, folder),
            )
        else:
            # Otherwise, it should be a speaker folder, then move it to
            # `audio` folder
            shutil.move(
                os.path.join(DATASET_ROOT, folder),
                os.path.join(DATASET_AUDIO_PATH, folder),
            )
Noise preparation
In this section :
- We load all the noise samples (which should have been resampled to 16000 Hz).
- We split those noise samples into chunks of 16000 samples, each corresponding to 1 second.
# Get the list of all noise files
noise_paths = []
for subdir in os.listdir(DATASET_NOISE_PATH):
    subdir_path = Path(DATASET_NOISE_PATH) / subdir
    if os.path.isdir(subdir_path):
        noise_paths += [
            os.path.join(subdir_path, filepath)
            for filepath in os.listdir(subdir_path)
            if filepath.endswith(".wav")
        ]
print(
"Found {} files belonging to {} directories".format(
len(noise_paths), len(os.listdir(DATASET_NOISE_PATH))
)
)
Found 6 files belonging to 2 directories
Resample all noise samples to 16000 Hz.
command = (
"for dir in `ls -1 " + DATASET_NOISE_PATH + "`; do "
"for file in `ls -1 " + DATASET_NOISE_PATH + "/$dir/*.wav`; do "
"sample_rate=`ffprobe -hide_banner -loglevel panic -show_streams "
"$file | grep sample_rate | cut -f2 -d=`; "
"if [ $sample_rate -ne 16000 ]; then "
"ffmpeg -hide_banner -loglevel panic -y "
"-i $file -ar 16000 temp.wav; "
"mv temp.wav $file; "
"fi; done; done"
)
os.system(command)
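(Translator's addition) If you want to confirm that the shell command above actually resampled everything, a small optional sanity check is sketched below. It relies only on tf.audio.decode_wav, which also returns each file's sample rate; it is not part of the original example.

# Optional sanity check: confirm every noise file is now at 16000 Hz.
for path in noise_paths:
    _, sr = tf.audio.decode_wav(tf.io.read_file(path), desired_channels=1)
    if sr != SAMPLING_RATE:
        print("Unexpected sample rate {} for {}".format(int(sr), path))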
# Split noise into chunks of 16000 each
def load_noise_sample(path):
    sample, sampling_rate = tf.audio.decode_wav(
        tf.io.read_file(path), desired_channels=1
    )
    if sampling_rate == SAMPLING_RATE:
        # Number of slices of 16000 each that can be generated from the noise sample
        slices = int(sample.shape[0] / SAMPLING_RATE)
        sample = tf.split(sample[: slices * SAMPLING_RATE], slices)
        return sample
    else:
        print("Sampling rate for {} is incorrect. Ignoring it".format(path))
        return None
noises = []
for path in noise_paths:
    sample = load_noise_sample(path)
    if sample:
        noises.extend(sample)
noises = tf.stack(noises)
print(
"{} noise files were split into {} noise samples where each is {} sec. long".format(
len(noise_paths), noises.shape[0], noises.shape[1] // SAMPLING_RATE
)
)
6 noise files were split into 354 noise samples where each is 1 sec. long
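(Translator's addition) To get a feel for the noise data, you can optionally listen to one of the 1-second chunks in the notebook; the Audio helper was already imported in the setup cell. This snippet is not part of the original example.

# Optional: play back one of the 1-second noise chunks.
display(Audio(noises[0].numpy().squeeze(), rate=SAMPLING_RATE))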
Dataset generation
def paths_and_labels_to_dataset(audio_paths, labels):
    """Constructs a dataset of audios and labels."""
    path_ds = tf.data.Dataset.from_tensor_slices(audio_paths)
    audio_ds = path_ds.map(lambda x: path_to_audio(x))
    label_ds = tf.data.Dataset.from_tensor_slices(labels)
    return tf.data.Dataset.zip((audio_ds, label_ds))


def path_to_audio(path):
    """Reads and decodes an audio file."""
    audio = tf.io.read_file(path)
    audio, _ = tf.audio.decode_wav(audio, 1, SAMPLING_RATE)
    return audio


def add_noise(audio, noises=None, scale=0.5):
    if noises is not None:
        # Create a random tensor of the same size as audio ranging from
        # 0 to the number of noise stream samples that we have.
        tf_rnd = tf.random.uniform(
            (tf.shape(audio)[0],), 0, noises.shape[0], dtype=tf.int32
        )
        noise = tf.gather(noises, tf_rnd, axis=0)

        # Get the amplitude proportion between the audio and the noise
        prop = tf.math.reduce_max(audio, axis=1) / tf.math.reduce_max(noise, axis=1)
        prop = tf.repeat(tf.expand_dims(prop, axis=1), tf.shape(audio)[1], axis=1)

        # Adding the rescaled noise to audio
        audio = audio + noise * prop * scale

    return audio


def audio_to_fft(audio):
    # Since tf.signal.fft applies FFT on the innermost dimension,
    # we need to squeeze the dimensions and then expand them again
    # after FFT
    audio = tf.squeeze(audio, axis=-1)
    fft = tf.signal.fft(
        tf.cast(tf.complex(real=audio, imag=tf.zeros_like(audio)), tf.complex64)
    )
    fft = tf.expand_dims(fft, axis=-1)

    # Return the absolute value of the first half of the FFT
    # which represents the positive frequencies
    return tf.math.abs(fft[:, : (audio.shape[1] // 2), :])
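(Translator's addition) Since the input waveform is real-valued, its spectrum is symmetric, which is why only the first half of the FFT is kept above. An equivalent way to obtain only the non-negative frequency bins is tf.signal.rfft; the sketch below is an alternative not used in the original code, shown only for comparison.

# Equivalent sketch using rfft, which operates on real signals directly
# and returns only the non-negative frequency bins.
def audio_to_fft_rfft(audio):
    audio = tf.squeeze(audio, axis=-1)
    fft = tf.signal.rfft(audio)                       # shape: (batch, n // 2 + 1)
    fft = tf.expand_dims(tf.math.abs(fft), axis=-1)
    return fft[:, : (audio.shape[1] // 2), :]         # keep n // 2 bins, like the original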
# Get the list of audio file paths along with their corresponding labels
class_names = os.listdir(DATASET_AUDIO_PATH)
print("Our class names: {}".format(class_names,))
audio_paths = []
labels = []
for label, name in enumerate(class_names):
    print("Processing speaker {}".format(name,))
    dir_path = Path(DATASET_AUDIO_PATH) / name
    speaker_sample_paths = [
        os.path.join(dir_path, filepath)
        for filepath in os.listdir(dir_path)
        if filepath.endswith(".wav")
    ]
    audio_paths += speaker_sample_paths
    labels += [label] * len(speaker_sample_paths)
print(
"Found {} files belonging to {} classes.".format(len(audio_paths), len(class_names))
)
# Shuffle
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(audio_paths)
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(labels)
# Split into training and validation
num_val_samples = int(VALID_SPLIT * len(audio_paths))
print("Using {} files for training.".format(len(audio_paths) - num_val_samples))
train_audio_paths = audio_paths[:-num_val_samples]
train_labels = labels[:-num_val_samples]
print("Using {} files for validation.".format(num_val_samples))
valid_audio_paths = audio_paths[-num_val_samples:]
valid_labels = labels[-num_val_samples:]
# Create 2 datasets, one for training and the other for validation
train_ds = paths_and_labels_to_dataset(train_audio_paths, train_labels)
train_ds = train_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(
BATCH_SIZE
)
valid_ds = paths_and_labels_to_dataset(valid_audio_paths, valid_labels)
valid_ds = valid_ds.shuffle(buffer_size=32 * 8, seed=SHUFFLE_SEED).batch(32)
# Add noise to the training set
train_ds = train_ds.map(
lambda x, y: (add_noise(x, noises, scale=SCALE), y),
num_parallel_calls=tf.data.AUTOTUNE,
)
# Transform audio wave to the frequency domain using `audio_to_fft`
train_ds = train_ds.map(
lambda x, y: (audio_to_fft(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
valid_ds = valid_ds.map(
lambda x, y: (audio_to_fft(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)
Our class names: ['Julia_Gillard', 'Jens_Stoltenberg', 'Nelson_Mandela', 'Magaret_Tarcher', 'Benjamin_Netanyau']
Processing speaker Julia_Gillard
Processing speaker Jens_Stoltenberg
Processing speaker Nelson_Mandela
Processing speaker Magaret_Tarcher
Processing speaker Benjamin_Netanyau
Found 7501 files belonging to 5 classes.
Using 6751 files for training.
Using 750 files for validation.
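(Translator's addition) At this point you can optionally check the shapes that will be fed to the model. The snippet below is not in the original example; the printed form may vary slightly by TensorFlow version.

# Each training element is a batch of FFT magnitudes with an integer speaker label.
print(train_ds.element_spec)
# Roughly: (TensorSpec(shape=(None, 8000, 1), dtype=tf.float32, ...),
#           TensorSpec(shape=(None,), dtype=tf.int32, ...))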
Model definition
def residual_block(x, filters, conv_num=3, activation="relu"):
    # Shortcut
    s = keras.layers.Conv1D(filters, 1, padding="same")(x)
    for i in range(conv_num - 1):
        x = keras.layers.Conv1D(filters, 3, padding="same")(x)
        x = keras.layers.Activation(activation)(x)
    x = keras.layers.Conv1D(filters, 3, padding="same")(x)
    x = keras.layers.Add()([x, s])
    x = keras.layers.Activation(activation)(x)
    return keras.layers.MaxPool1D(pool_size=2, strides=2)(x)


def build_model(input_shape, num_classes):
    inputs = keras.layers.Input(shape=input_shape, name="input")

    x = residual_block(inputs, 16, 2)
    x = residual_block(x, 32, 2)
    x = residual_block(x, 64, 3)
    x = residual_block(x, 128, 3)
    x = residual_block(x, 128, 3)

    x = keras.layers.AveragePooling1D(pool_size=3, strides=3)(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(256, activation="relu")(x)
    x = keras.layers.Dense(128, activation="relu")(x)

    outputs = keras.layers.Dense(num_classes, activation="softmax", name="output")(x)

    return keras.models.Model(inputs=inputs, outputs=outputs)
model = build_model((SAMPLING_RATE // 2, 1), len(class_names))
model.summary()
# Compile the model using Adam's default learning rate
model.compile(
optimizer="Adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
# Add callbacks:
# 'EarlyStopping' to stop training when the model is not enhancing anymore
# 'ModelCheckPoint' to always keep the model that has the best val_accuracy
model_save_filename = "model.h5"
earlystopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
mdlcheckpoint_cb = keras.callbacks.ModelCheckpoint(
model_save_filename, monitor="val_accuracy", save_best_only=True
)
Model: "model" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input (InputLayer) [(None, 8000, 1)] 0 __________________________________________________________________________________________________ conv1d_1 (Conv1D) (None, 8000, 16) 64 input[0][0] __________________________________________________________________________________________________ activation (Activation) (None, 8000, 16) 0 conv1d_1[0][0] __________________________________________________________________________________________________ conv1d_2 (Conv1D) (None, 8000, 16) 784 activation[0][0] __________________________________________________________________________________________________ conv1d (Conv1D) (None, 8000, 16) 32 input[0][0] __________________________________________________________________________________________________ add (Add) (None, 8000, 16) 0 conv1d_2[0][0] conv1d[0][0] __________________________________________________________________________________________________ activation_1 (Activation) (None, 8000, 16) 0 add[0][0] __________________________________________________________________________________________________ max_pooling1d (MaxPooling1D) (None, 4000, 16) 0 activation_1[0][0] __________________________________________________________________________________________________ conv1d_4 (Conv1D) (None, 4000, 32) 1568 max_pooling1d[0][0] __________________________________________________________________________________________________ activation_2 (Activation) (None, 4000, 32) 0 conv1d_4[0][0] __________________________________________________________________________________________________ conv1d_5 (Conv1D) (None, 4000, 32) 3104 activation_2[0][0] __________________________________________________________________________________________________ conv1d_3 (Conv1D) (None, 4000, 32) 544 max_pooling1d[0][0] __________________________________________________________________________________________________ add_1 (Add) (None, 4000, 32) 0 conv1d_5[0][0] conv1d_3[0][0] __________________________________________________________________________________________________ activation_3 (Activation) (None, 4000, 32) 0 add_1[0][0] __________________________________________________________________________________________________ max_pooling1d_1 (MaxPooling1D) (None, 2000, 32) 0 activation_3[0][0] __________________________________________________________________________________________________ conv1d_7 (Conv1D) (None, 2000, 64) 6208 max_pooling1d_1[0][0] __________________________________________________________________________________________________ activation_4 (Activation) (None, 2000, 64) 0 conv1d_7[0][0] __________________________________________________________________________________________________ conv1d_8 (Conv1D) (None, 2000, 64) 12352 activation_4[0][0] __________________________________________________________________________________________________ activation_5 (Activation) (None, 2000, 64) 0 conv1d_8[0][0] __________________________________________________________________________________________________ conv1d_9 (Conv1D) (None, 2000, 64) 12352 activation_5[0][0] __________________________________________________________________________________________________ conv1d_6 (Conv1D) (None, 2000, 64) 2112 max_pooling1d_1[0][0] 
__________________________________________________________________________________________________ add_2 (Add) (None, 2000, 64) 0 conv1d_9[0][0] conv1d_6[0][0] __________________________________________________________________________________________________ activation_6 (Activation) (None, 2000, 64) 0 add_2[0][0] __________________________________________________________________________________________________ max_pooling1d_2 (MaxPooling1D) (None, 1000, 64) 0 activation_6[0][0] __________________________________________________________________________________________________ conv1d_11 (Conv1D) (None, 1000, 128) 24704 max_pooling1d_2[0][0] __________________________________________________________________________________________________ activation_7 (Activation) (None, 1000, 128) 0 conv1d_11[0][0] __________________________________________________________________________________________________ conv1d_12 (Conv1D) (None, 1000, 128) 49280 activation_7[0][0] __________________________________________________________________________________________________ activation_8 (Activation) (None, 1000, 128) 0 conv1d_12[0][0] __________________________________________________________________________________________________ conv1d_13 (Conv1D) (None, 1000, 128) 49280 activation_8[0][0] __________________________________________________________________________________________________ conv1d_10 (Conv1D) (None, 1000, 128) 8320 max_pooling1d_2[0][0] __________________________________________________________________________________________________ add_3 (Add) (None, 1000, 128) 0 conv1d_13[0][0] conv1d_10[0][0] __________________________________________________________________________________________________ activation_9 (Activation) (None, 1000, 128) 0 add_3[0][0] __________________________________________________________________________________________________ max_pooling1d_3 (MaxPooling1D) (None, 500, 128) 0 activation_9[0][0] __________________________________________________________________________________________________ conv1d_15 (Conv1D) (None, 500, 128) 49280 max_pooling1d_3[0][0] __________________________________________________________________________________________________ activation_10 (Activation) (None, 500, 128) 0 conv1d_15[0][0] __________________________________________________________________________________________________ conv1d_16 (Conv1D) (None, 500, 128) 49280 activation_10[0][0] __________________________________________________________________________________________________ activation_11 (Activation) (None, 500, 128) 0 conv1d_16[0][0] __________________________________________________________________________________________________ conv1d_17 (Conv1D) (None, 500, 128) 49280 activation_11[0][0] __________________________________________________________________________________________________ conv1d_14 (Conv1D) (None, 500, 128) 16512 max_pooling1d_3[0][0] __________________________________________________________________________________________________ add_4 (Add) (None, 500, 128) 0 conv1d_17[0][0] conv1d_14[0][0] __________________________________________________________________________________________________ activation_12 (Activation) (None, 500, 128) 0 add_4[0][0] __________________________________________________________________________________________________ max_pooling1d_4 (MaxPooling1D) (None, 250, 128) 0 activation_12[0][0] __________________________________________________________________________________________________ average_pooling1d (AveragePooli (None, 83, 128) 0 
max_pooling1d_4[0][0] __________________________________________________________________________________________________ flatten (Flatten) (None, 10624) 0 average_pooling1d[0][0] __________________________________________________________________________________________________ dense (Dense) (None, 256) 2720000 flatten[0][0] __________________________________________________________________________________________________ dense_1 (Dense) (None, 128) 32896 dense[0][0] __________________________________________________________________________________________________ output (Dense) (None, 5) 645 dense_1[0][0] ================================================================================================== Total params: 3,088,597 Trainable params: 3,088,597 Non-trainable params: 0 _______________________________________________________________________________________
Training
history = model.fit(
train_ds,
epochs=EPOCHS,
validation_data=valid_ds,
callbacks=[earlystopping_cb, mdlcheckpoint_cb],
)
Epoch 1/100
53/53 [==============================] - 62s 1s/step - loss: 1.0107 - accuracy: 0.6929 - val_loss: 0.3367 - val_accuracy: 0.8640
Epoch 2/100
53/53 [==============================] - 61s 1s/step - loss: 0.2863 - accuracy: 0.8926 - val_loss: 0.2814 - val_accuracy: 0.8813
Epoch 3/100
53/53 [==============================] - 61s 1s/step - loss: 0.2293 - accuracy: 0.9104 - val_loss: 0.2054 - val_accuracy: 0.9160
Epoch 4/100
53/53 [==============================] - 63s 1s/step - loss: 0.1750 - accuracy: 0.9320 - val_loss: 0.1668 - val_accuracy: 0.9320
Epoch 5/100
53/53 [==============================] - 61s 1s/step - loss: 0.2044 - accuracy: 0.9206 - val_loss: 0.1658 - val_accuracy: 0.9347
Epoch 6/100
53/53 [==============================] - 61s 1s/step - loss: 0.1407 - accuracy: 0.9415 - val_loss: 0.0888 - val_accuracy: 0.9720
Epoch 7/100
53/53 [==============================] - 61s 1s/step - loss: 0.1047 - accuracy: 0.9600 - val_loss: 0.1113 - val_accuracy: 0.9587
Epoch 8/100
53/53 [==============================] - 60s 1s/step - loss: 0.1077 - accuracy: 0.9573 - val_loss: 0.0819 - val_accuracy: 0.9693
Epoch 9/100
53/53 [==============================] - 61s 1s/step - loss: 0.0998 - accuracy: 0.9640 - val_loss: 0.1586 - val_accuracy: 0.9427
Epoch 10/100
53/53 [==============================] - 63s 1s/step - loss: 0.1004 - accuracy: 0.9621 - val_loss: 0.1504 - val_accuracy: 0.9333
Epoch 11/100
53/53 [==============================] - 60s 1s/step - loss: 0.0902 - accuracy: 0.9695 - val_loss: 0.1016 - val_accuracy: 0.9600
Epoch 12/100
53/53 [==============================] - 61s 1s/step - loss: 0.0773 - accuracy: 0.9714 - val_loss: 0.0647 - val_accuracy: 0.9800
Epoch 13/100
53/53 [==============================] - 63s 1s/step - loss: 0.0797 - accuracy: 0.9699 - val_loss: 0.0485 - val_accuracy: 0.9853
Epoch 14/100
53/53 [==============================] - 61s 1s/step - loss: 0.0750 - accuracy: 0.9727 - val_loss: 0.0601 - val_accuracy: 0.9787
Epoch 15/100
53/53 [==============================] - 62s 1s/step - loss: 0.0629 - accuracy: 0.9766 - val_loss: 0.0476 - val_accuracy: 0.9787
Epoch 16/100
53/53 [==============================] - 63s 1s/step - loss: 0.0564 - accuracy: 0.9793 - val_loss: 0.0565 - val_accuracy: 0.9813
Epoch 17/100
53/53 [==============================] - 61s 1s/step - loss: 0.0545 - accuracy: 0.9809 - val_loss: 0.0325 - val_accuracy: 0.9893
Epoch 18/100
53/53 [==============================] - 61s 1s/step - loss: 0.0415 - accuracy: 0.9859 - val_loss: 0.0776 - val_accuracy: 0.9693
Epoch 19/100
53/53 [==============================] - 61s 1s/step - loss: 0.0537 - accuracy: 0.9810 - val_loss: 0.0647 - val_accuracy: 0.9853
Epoch 20/100
53/53 [==============================] - 62s 1s/step - loss: 0.0556 - accuracy: 0.9802 - val_loss: 0.0500 - val_accuracy: 0.9880
Epoch 21/100
53/53 [==============================] - 63s 1s/step - loss: 0.0486 - accuracy: 0.9828 - val_loss: 0.0470 - val_accuracy: 0.9827
Epoch 22/100
53/53 [==============================] - 61s 1s/step - loss: 0.0479 - accuracy: 0.9825 - val_loss: 0.0918 - val_accuracy: 0.9693
Epoch 23/100
53/53 [==============================] - 61s 1s/step - loss: 0.0446 - accuracy: 0.9834 - val_loss: 0.0429 - val_accuracy: 0.9867
Epoch 24/100
53/53 [==============================] - 61s 1s/step - loss: 0.0309 - accuracy: 0.9889 - val_loss: 0.0473 - val_accuracy: 0.9867
Epoch 25/100
53/53 [==============================] - 63s 1s/step - loss: 0.0341 - accuracy: 0.9895 - val_loss: 0.0244 - val_accuracy: 0.9907
Epoch 26/100
53/53 [==============================] - 60s 1s/step - loss: 0.0357 - accuracy: 0.9874 - val_loss: 0.0289 - val_accuracy: 0.9893
Epoch 27/100
53/53 [==============================] - 61s 1s/step - loss: 0.0331 - accuracy: 0.9893 - val_loss: 0.0246 - val_accuracy: 0.9920
Epoch 28/100
53/53 [==============================] - 61s 1s/step - loss: 0.0339 - accuracy: 0.9879 - val_loss: 0.0646 - val_accuracy: 0.9787
Epoch 29/100
53/53 [==============================] - 61s 1s/step - loss: 0.0250 - accuracy: 0.9910 - val_loss: 0.0146 - val_accuracy: 0.9947
Epoch 30/100
53/53 [==============================] - 63s 1s/step - loss: 0.0343 - accuracy: 0.9883 - val_loss: 0.0318 - val_accuracy: 0.9893
Epoch 31/100
53/53 [==============================] - 61s 1s/step - loss: 0.0312 - accuracy: 0.9893 - val_loss: 0.0270 - val_accuracy: 0.9880
Epoch 32/100
53/53 [==============================] - 61s 1s/step - loss: 0.0201 - accuracy: 0.9917 - val_loss: 0.0264 - val_accuracy: 0.9893
Epoch 33/100
53/53 [==============================] - 61s 1s/step - loss: 0.0371 - accuracy: 0.9876 - val_loss: 0.0722 - val_accuracy: 0.9773
Epoch 34/100
53/53 [==============================] - 61s 1s/step - loss: 0.0533 - accuracy: 0.9828 - val_loss: 0.0161 - val_accuracy: 0.9947
Epoch 35/100
53/53 [==============================] - 61s 1s/step - loss: 0.0258 - accuracy: 0.9911 - val_loss: 0.0277 - val_accuracy: 0.9867
Epoch 36/100
53/53 [==============================] - 60s 1s/step - loss: 0.0261 - accuracy: 0.9901 - val_loss: 0.0542 - val_accuracy: 0.9787
Epoch 37/100
53/53 [==============================] - 60s 1s/step - loss: 0.0368 - accuracy: 0.9877 - val_loss: 0.0699 - val_accuracy: 0.9813
Epoch 38/100
53/53 [==============================] - 63s 1s/step - loss: 0.0251 - accuracy: 0.9890 - val_loss: 0.0206 - val_accuracy: 0.9907
Epoch 39/100
53/53 [==============================] - 62s 1s/step - loss: 0.0220 - accuracy: 0.9913
Evaluation
print(model.evaluate(valid_ds))
24/24 [==============================] - 6s 244ms/step - loss: 0.0146 - accuracy: 0.9947
[0.014629718847572803, 0.9946666955947876]
We get ~ 98% validation accuracy.
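(Translator's addition) Since ModelCheckpoint above saved the weights with the best val_accuracy to model.h5, you may optionally reload that checkpoint before running the demo below. This step is not in the original example.

# Optional: restore the checkpoint with the best validation accuracy
# (EarlyStopping above already restores the best weights in memory,
# so this mainly matters when resuming in a fresh session).
model = keras.models.load_model(model_save_filename)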
Demonstration
Let's take some samples and :
- Predict the speaker.
- Compare the prediction with the actual speaker.
- Listen to the audio to see that, despite the samples being noisy, the model is still pretty accurate.
SAMPLES_TO_DISPLAY = 10
test_ds = paths_and_labels_to_dataset(valid_audio_paths, valid_labels)
test_ds = test_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(
BATCH_SIZE
)
test_ds = test_ds.map(lambda x, y: (add_noise(x, noises, scale=SCALE), y))
for audios, labels in test_ds.take(1):
    # Get the signal FFT
    ffts = audio_to_fft(audios)
    # Predict
    y_pred = model.predict(ffts)
    # Take random samples
    rnd = np.random.randint(0, BATCH_SIZE, SAMPLES_TO_DISPLAY)
    audios = audios.numpy()[rnd, :, :]
    labels = labels.numpy()[rnd]
    y_pred = np.argmax(y_pred, axis=-1)[rnd]

    for index in range(SAMPLES_TO_DISPLAY):
        # For every sample, print the true and predicted label
        # as well as run the voice with the noise
        print(
            "Speaker: {} - Predicted: {}".format(
                class_names[labels[index]],
                class_names[y_pred[index]],
            )
        )
        display(Audio(audios[index, :, :].squeeze(), rate=SAMPLING_RATE))
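(Translator's addition) As a final usage sketch, the same pipeline can be applied to a single WAV file using the helpers defined earlier. This helper and the example file path are hypothetical and not part of the original example.

def predict_speaker(wav_path):
    """Hedged helper: predict the speaker for one 1-second, 16 kHz WAV file."""
    audio = path_to_audio(wav_path)                  # shape: (16000, 1)
    fft = audio_to_fft(tf.expand_dims(audio, 0))     # shape: (1, 8000, 1)
    probs = model.predict(fft)[0]
    return class_names[int(np.argmax(probs))]

# Example call with a hypothetical file path:
# print(predict_speaker("/path/to/some_1s_16khz_clip.wav"))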
That's all.