SentencePiece 0.1.9 Overview (Translation / Commentary)

Translation: ClassCat Co., Ltd. Sales Information
Created: 12/09/2020 (v0.1.94)

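* This page is a translation of the following SentencePiece document, with supplementary explanations added where appropriate:

README

* The sample code has been verified to run, and has been extended or modified where necessary.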

SentencePiece 0.1.9: Overview

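SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is predetermined prior to neural model training. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo.]), extended with direct training from raw sentences. SentencePiece makes it possible to build a purely end-to-end system that does not depend on language-specific pre/postprocessing.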

※ This is not an official Google product.

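Technical highlights

Purely data driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer / MeCab / KyTea) is not always required.
Language independent: SentencePiece treats sentences simply as sequences of Unicode characters. There is no language-dependent logic.
Multiple subword algorithms: BPE [Sennrich et al.] and the unigram language model [Kudo.] are supported.
Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout, which helps improve the robustness and accuracy of NMT models.
Fast and lightweight: Segmentation speed is around 50k sentences/sec, and the memory footprint is around 6MB.
Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
Direct vocabulary id generation: SentencePiece manages the vocabulary-to-id mapping and can directly generate vocabulary id sequences from raw sentences.
NFKC-based normalization: SentencePiece performs NFKC-based text normalization.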
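
The effect of NFKC normalization can be illustrated with Python's standard unicodedata module. This is only a sketch under the assumption that plain NFKC is close enough for illustration; SentencePiece applies its own normalization rule set, which may differ in details, and the helper name nfkc and the sample strings are made up for this example.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply plain Unicode NFKC normalization (an approximation, not SentencePiece itself)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits are folded to their ASCII forms.
print(nfkc("ＡＢＣ１２３"))  # ABC123

# Half-width katakana KA (U+FF76) plus the half-width voiced sound mark (U+FF9E)
# compose into the single full-width character GA (U+30AC).
print(nfkc("\uff76\uff9e") == "\u30ac")  # True

# The "fi" ligature (U+FB01) is expanded into the two letters "f" and "i".
print(nfkc("\ufb01le"))  # file
```

Because visually equivalent strings collapse to one form before segmentation, the same surface text tends to map to the same pieces regardless of how it was typed.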

Comparison with other implementations

Feature	SentencePiece	subword-nmt	WordPiece
Supported algorithm	BPE, unigram, char, word	BPE	BPE*
OSS?	Yes	Yes	Google internal
Subword regularization	Yes	No	No
Python library (pip)	Yes	No	N/A
C++ library	Yes	No	N/A
Pre-segmentation required?	No	Yes	Yes
Customizable normalization (e.g., NFKC)	Yes	No	N/A
Direct id generation	Yes	No	N/A

※ Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

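Overview

What is SentencePiece?

SentencePiece is a re-implementation of subword units, an effective way to alleviate the open vocabulary problem in neural machine translation. SentencePiece supports two segmentation algorithms: byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo.]. The high-level differences from other implementations are as follows.

The number of unique tokens is predetermined

Neural machine translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which differs from subword-nmt, which uses the number of merge operations. The number of merge operations is a BPE-specific parameter and is not applicable to other segmentation algorithms, including unigram, word, and character.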
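
The relationship between a target vocabulary size and the number of BPE merge operations can be sketched with a toy BPE trainer. This is a minimal illustration, not SentencePiece's or subword-nmt's actual implementation: the function name train_toy_bpe, the tie-breaking rule, and the four-word corpus are all assumptions made for this example. In this sketch the trainer keeps merging until the vocabulary reaches the requested size, so the number of merges is derived from the vocabulary size rather than specified directly.

```python
from collections import Counter


def train_toy_bpe(words, vocab_size):
    """Merge the most frequent adjacent pair until the vocabulary reaches vocab_size."""
    # Each word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    vocab = {symbol for word in corpus for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by the pair itself (arbitrary choice).
        (left, right), _ = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((left, right))
        vocab.add(left + right)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    out.append(left + right)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges


words = ["low", "lower", "newest", "widest"]  # 10 distinct characters
vocab, merges = train_toy_bpe(words, vocab_size=14)
print(len(vocab), len(merges))  # 14 4
```

Here the 10 base characters plus 4 merges give the requested 14 symbols. A merge count only makes sense for BPE, whereas a target vocabulary size can be stated for any of the segmentation algorithms.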

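Trains from raw sentences

Previous subword implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it complicates preprocessing because a language-dependent tokenizer has to be run in advance. The SentencePiece implementation is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese, where no explicit spaces exist between words.

Whitespace is treated as a basic symbol

The first step of natural language processing is text tokenization. For example, a standard English tokenizer would segment the text “Hello World.” into the following three tokens.

[Hello] [World] [.]

One observation is that the original input and the tokenized sequence are not reversibly convertible. For instance, the information that there is no space between “World” and “.” is dropped from the tokenized sequence, since e.g., Tokenize(“World.”) == Tokenize(“World .”).

SentencePiece treats the input text simply as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle whitespace as a basic token explicitly, SentencePiece first escapes whitespace with the meta symbol “▁” (U+2581) as follows.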

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, the text can be detokenized without any ambiguity.

detokenized = ''.join(pieces).replace('▁', ' ')

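This feature makes it possible to perform detokenization without relying on language-specific resources.

Note that the same lossless conversion cannot be applied when sentences are split with standard word segmenters, since they treat whitespace as a special symbol. Tokenized sequences do not preserve the information needed to restore the original sentence.

(en) Hello World. → [Hello] [World] [.] (a space between Hello and World)
(ja) こんにちは世界。 → [こんにちは] [世界] [。] (no space between こんにちは and 世界)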
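
The round trip described in this section can be sketched in a few lines of Python. The helper names escape_whitespace, detokenize, and naive_tokenize are hypothetical and exist only for this illustration. The sketch also skips details of the real library, such as any prefix handling applied before segmentation.

```python
WS = "\u2581"  # the meta symbol "▁" used to escape whitespace


def escape_whitespace(text: str) -> str:
    """Replace each space with the meta symbol so it survives segmentation."""
    return text.replace(" ", WS)


def detokenize(pieces) -> str:
    """Concatenate pieces and turn the meta symbol back into a space."""
    return "".join(pieces).replace(WS, " ")


def naive_tokenize(text: str):
    """Hypothetical whitespace/punctuation tokenizer, used only for contrast."""
    return text.replace(".", " . ").split()


escaped = escape_whitespace("Hello World.")
pieces = ["Hello", WS + "Wor", "ld", "."]  # the example segmentation from the text
assert "".join(pieces) == escaped

print(detokenize(pieces))  # Hello World.
# The naive tokenizer cannot tell these two inputs apart, so it is not reversible.
print(naive_tokenize("World.") == naive_tokenize("World ."))  # True
```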

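Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps improve the accuracy as well as the robustness of NMT models.

To enable subword regularization, you would integrate the SentencePiece library (C++ / Python) into the NMT system so that one segmentation is sampled for each parameter update, which differs from standard offline data preparation. Here is an example using the Python library. You can see that 'New York' is segmented differently on each SampleEncode (C++) or encode with enable_sampling=True (Python) call. The details of the sampling parameters are found in sentencepiece_processor.h.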

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
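
To make the role of alpha more concrete, here is a toy sketch of sampling a segmentation with probability proportional to P(segmentation)^alpha. It is not the library's algorithm: it enumerates every segmentation by brute force, and the piece log-probabilities in TOY_LOGP, as well as the function names, are invented for this illustration.

```python
import math
import random

# Hypothetical piece log-probabilities (not taken from any real model).
TOY_LOGP = {
    "▁New": -3.0, "▁York": -3.5, "▁": -2.0, "New": -5.5, "▁N": -6.0,
    "ew": -6.0, "N": -6.0, "e": -4.0, "w": -5.0, "▁Y": -6.0,
    "Y": -6.5, "ork": -6.0, "o": -4.5, "r": -4.5, "k": -6.0,
}


def segmentations(text):
    """Yield every way to split text into pieces that exist in TOY_LOGP."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in TOY_LOGP:
            for rest in segmentations(text[end:]):
                yield [piece] + rest


def sample_segmentation(text, alpha, rng):
    """Sample one segmentation with weight exp(alpha * log P) = P ** alpha."""
    candidates = list(segmentations(text))
    weights = [math.exp(alpha * sum(TOY_LOGP[p] for p in seg)) for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
for _ in range(5):
    # The input is already whitespace-escaped, with a leading "▁".
    print(sample_segmentation("▁New▁York", alpha=0.1, rng=rng))
```

A small alpha flattens the distribution, so unusual segmentations appear often; a large alpha concentrates the samples on the most probable segmentation.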

Installation

Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

% pip install sentencepiece

For more details, see the Python module.

Build and install the SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

cmake
C++11 compiler
gperftools library (optional; a 10-40% performance improvement can be obtained)

On Ubuntu, the build tools can be installed with apt-get:

% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, you can build and install the command line tools as follows.

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache.

Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Usage instructions

Train a SentencePiece model

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>

--input: one-sentence-per-line raw corpus file. There is no need to run a tokenizer, normalizer, or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
--model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
--vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
--character_coverage: amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set such as Japanese or Chinese, and 1.0 for other languages with a small character set.
--model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentences must be pretokenized when using the word type.
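
One way to picture what --character_coverage means is the sketch below, which keeps the most frequent characters until the requested fraction of all character occurrences is covered. This is an assumption about the general idea, not a description of SentencePiece's internals; the function name select_alphabet and the tiny corpus are hypothetical.

```python
from collections import Counter


def select_alphabet(sentences, character_coverage):
    """Keep the most frequent characters until the coverage target is reached."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    total = sum(counts.values())
    alphabet, covered = [], 0
    for ch, n in counts.most_common():
        if covered / total >= character_coverage:
            break  # the remaining, rarer characters are left out of the alphabet
        alphabet.append(ch)
        covered += n
    return alphabet, covered / total


sentences = ["aaaa", "aaab", "aabz"]  # 'a' x 9, 'b' x 2, 'z' x 1
print(select_alphabet(sentences, 0.9))  # 'z' is dropped: 11 of 12 occurrences are covered
print(select_alphabet(sentences, 1.0))  # every character is kept
```

With a rich character set, a value slightly below 1.0 leaves out very rare characters so that they do not consume vocabulary entries.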

Use the --help flag to display all parameters for training, or see here for an overview.

Encode raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use the --extra_options flag to insert the BOS/EOS markers or to reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> </s>)
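
The effect of these option strings on a piece sequence can be sketched as follows. The function apply_extra_options is hypothetical, and applying the colon-separated options from left to right is an assumption made for this illustration, chosen so that reverse:bos:eos matches the description above (reverse the input, then add <s> and </s>).

```python
def apply_extra_options(pieces, extra_options):
    """Apply colon-separated options (reverse, bos, eos) to a list of pieces, left to right."""
    result = list(pieces)
    for option in extra_options.split(":"):
        if option == "reverse":
            result.reverse()
        elif option == "bos":
            result.insert(0, "<s>")
        elif option == "eos":
            result.append("</s>")
        elif option:
            raise ValueError(f"unknown option: {option}")
    return result


pieces = ["▁I", "▁saw"]
print(apply_extra_options(pieces, "eos"))              # ['▁I', '▁saw', '</s>']
print(apply_extra_options(pieces, "bos:eos"))          # ['<s>', '▁I', '▁saw', '</s>']
print(apply_extra_options(pieces, "reverse:bos:eos"))  # ['<s>', '▁saw', '▁I', '</s>']
```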

SentencePiece supports nbest segmentation and segmentation sampling with the --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output

Decode sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use the --extra_options flag to decode the text in reverse order.

% spm_decode --extra_options=reverse < input > output

End-to-End example

spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000

sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: data/botchan.txt
  input_format: 
  model_prefix: m
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(320) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(175) LOG(INFO) Loading corpus: data/botchan.txt
trainer_interface.cc(376) LOG(INFO) Loaded all 4288 sentences
trainer_interface.cc(391) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(391) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(391) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(396) LOG(INFO) Normalizing sentences...
trainer_interface.cc(457) LOG(INFO) all chars count=274252
trainer_interface.cc(468) LOG(INFO) Done: 99.957% characters are covered.
trainer_interface.cc(478) LOG(INFO) Alphabet size=69
trainer_interface.cc(479) LOG(INFO) Final character coverage=0.99957
trainer_interface.cc(511) LOG(INFO) Done! preprocessed 4288 sentences.
unigram_model_trainer.cc(138) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(142) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(193) LOG(INFO) Initialized 15177 seed sentencepieces
trainer_interface.cc(517) LOG(INFO) Tokenizing input sentences with whitespace: 4288
trainer_interface.cc(527) LOG(INFO) Done! 9165
unigram_model_trainer.cc(488) LOG(INFO) Using 9165 sentences for EM training
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=0 size=5764 obj=10.7299 num_tokens=19301 num_tokens/piece=3.34854
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=1 size=5029 obj=8.88016 num_tokens=19424 num_tokens/piece=3.8624
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=0 size=3770 obj=8.92998 num_tokens=20711 num_tokens/piece=5.49363
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=1 size=3768 obj=8.87748 num_tokens=20710 num_tokens/piece=5.49628
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=0 size=2826 obj=9.17865 num_tokens=23017 num_tokens/piece=8.14473
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=1 size=2826 obj=9.10648 num_tokens=23019 num_tokens/piece=8.14544
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=0 size=2119 obj=9.47145 num_tokens=25644 num_tokens/piece=12.1019
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=1 size=2119 obj=9.39055 num_tokens=25644 num_tokens/piece=12.1019
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=0 size=1589 obj=9.85957 num_tokens=28803 num_tokens/piece=18.1265
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=1 size=1589 obj=9.77704 num_tokens=28820 num_tokens/piece=18.1372
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=0 size=1191 obj=10.3354 num_tokens=32139 num_tokens/piece=26.9849
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=1 size=1191 obj=10.2456 num_tokens=32141 num_tokens/piece=26.9866
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=0 size=1100 obj=10.3918 num_tokens=32966 num_tokens/piece=29.9691
unigram_model_trainer.cc(504) LOG(INFO) EM sub_iter=1 size=1100 obj=10.3701 num_tokens=32966 num_tokens/piece=29.9691
trainer_interface.cc(605) LOG(INFO) Saving model: m.model
trainer_interface.cc(616) LOG(INFO) Saving vocabs: m.vocab

$ ls -l 
-rw-rw-r-- 1 ubuntu ubuntu 253229 Dec  8 16:03 m.model
-rw-rw-r-- 1 ubuntu ubuntu  16385 Dec  8 16:03 m.vocab
-rw-rw-r-- 1 ubuntu ubuntu     83 Dec  8 16:03 train.sh

$ file *
m.model:  data
m.vocab:  UTF-8 Unicode text

$ head -n 20 m.vocab 
<unk>   0
<s>     0
</s>    0
,       -3.41008
▁       -3.46053
.       -3.54376
▁the    -3.56037
s       -3.7084
▁I      -3.85825
▁to     -4.04246
▁a      -4.1268
ed      -4.15948
e       -4.16876
t       -4.26527
▁and    -4.28118
▁of     -4.28557
ing     -4.35696
a       -4.61566
d       -4.66199
▁in     -4.66923

$ tail -n 20 m.vocab 
▁Bachelor       -9.54422
▁Natsume        -9.54422
▁Probably       -9.54422
▁Toyama -9.54422
▁absurd -9.54422
▁beautiful      -9.54422
▁blunder        -9.54422
▁condition      -9.54422
▁discharge      -9.54422
▁distributing   -9.54422
▁instructor     -9.54422
▁occasion       -9.54422
▁occup  -9.54422
▁replied        -9.54422
(       -10.4623
*       -10.9938
z       -10.9939
q       -10.994
j       -10.9941
v       -10.9942

echo "I saw a girl with a telescope." | spm_encode --model=m.model

▁I ▁sa w ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id

8 312 54 10 927 36 10 4 129 82 7 33 20 156 5

$ echo "8 312 54 10 927 36 10 4 129 82 7 33 20 156 5" | spm_decode --model=m.model --input_format=id

I saw a girl with a telescope.

You can see that the original input sentence is restored from the vocabulary id sequence.

Export vocabulary list

% spm_export_vocab --model=<model_file> --output=<output file>

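<output file> stores a list of vocabulary pieces and their emission log probabilities. The vocabulary id corresponds to the line number in this file (counting from 0, so <unk> on the first line has id 0).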
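
The mapping between line numbers and vocabulary ids can be sketched by parsing a few lines in the same shape as m.vocab. The sketch assumes the file is tab-separated (piece, then score); the function name load_vocab is hypothetical, and the sample lines are copied from the head of m.vocab shown in the End-to-End example.

```python
import math

# The first seven lines of m.vocab, assumed to be "piece<TAB>score".
SAMPLE_VOCAB = "\n".join([
    "<unk>\t0",
    "<s>\t0",
    "</s>\t0",
    ",\t-3.41008",
    "▁\t-3.46053",
    ".\t-3.54376",
    "▁the\t-3.56037",
])


def load_vocab(text):
    """Return (piece -> id, list of scores); the id is the 0-based line number."""
    piece_to_id, scores = {}, []
    for line_no, line in enumerate(text.splitlines()):
        piece, score = line.rsplit("\t", 1)
        piece_to_id[piece] = line_no
        scores.append(float(score))
    return piece_to_id, scores


piece_to_id, scores = load_vocab(SAMPLE_VOCAB)
print(piece_to_id["▁the"])                    # 6
print(math.exp(scores[piece_to_id["▁the"]]))  # log probability converted back to a probability
```

These ids agree with the encoded example on this page, where the final "." is emitted as id 5.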

Redefine special meta tokens

By default, SentencePiece uses the Unknown (<unk>), BOS (<s>), and EOS (</s>) tokens, which have the ids 0, 1, and 2 respectively. This mapping can be redefined in the training phase as follows.

% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

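When an id is set to -1, e.g., bos_id=-1, that special token is disabled. Note that the unknown id cannot be disabled. Padding (<pad>) can be defined with --pad_id=3.

If you want to assign other special tokens, see Use custom symbols.

Vocabulary restriction

spm_encode accepts the --vocabulary and --vocabulary_threshold options so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described on the subword-nmt page.

The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model and get the resulting vocabulary for each: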

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option.

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2

Advanced topics

SentencePiece Experiments
SentencePieceProcessor C++ API
Use custom text normalization rules
Use custom symbols
Python Module
TensorFlow Module
[Segmentation and training algorithms in detail]
