🦜️🔗 LangChain : Modules : Retrieval – Document transformers : Text splitters : Split by tokens (translation/commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 09/05/2023

* This page is a translation of the following LangChain document, with supplementary explanations added where appropriate:

Modules : Retrieval – Document transformers : Text splitters : Split by tokens

* The sample code has been checked to run, but it has been extended or modified where necessary.

🦜️🔗 LangChain : Modules : Retrieval – Document transformers : Text splitters : Split by tokens

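Language models have a token limit, and you must not exceed it. When you split text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count the tokens in your text, use the same tokenizer that the language model uses.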
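As a rough illustration of why the choice of tokenizer matters, the sketch below counts the same text with two toy counting functions and checks each count against a limit. The counters `count_words` and `count_chars`, the helper `fits_limit`, and the limit of 8 are all hypothetical stand-ins, not real model tokenizers. With a real model you would plug in that model's own tokenizer as the counting function.

```python
def count_words(text: str) -> int:
    """Toy tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def count_chars(text: str) -> int:
    """Toy tokenizer (assumption): one token per character."""
    return len(text)


def fits_limit(text: str, count_tokens, token_limit: int) -> bool:
    """Return True if `text` stays within `token_limit` under `count_tokens`."""
    return count_tokens(text) <= token_limit


text = "Last year COVID-19 kept us apart."
token_limit = 8  # hypothetical limit, for illustration only

# The same text passes under one counter and fails under the other.
print(count_words(text), fits_limit(text, count_words, token_limit))
print(count_chars(text), fits_limit(text, count_chars, token_limit))
```

The two counters disagree about whether the text fits, so a count made with the wrong tokenizer says little about the limit that the model will actually enforce.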

tiktoken

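tiktoken is a fast BPE tokenizer created by OpenAI.

You can use it to estimate the number of tokens used. It will probably be more accurate for OpenAI models.

How the text is split: by the character passed in.
How the chunk size is measured: by the tiktoken tokenizer.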

#!pip install tiktoken

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter

API Reference :

CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])

    Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  
    
    Last year COVID-19 kept us apart. This year we are finally together again. 
    
    Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 
    
    With a duty to one another to the American people to the Constitution.

Note that when you use CharacterTextSplitter.from_tiktoken_encoder, the text is split only by CharacterTextSplitter, and the tiktoken tokenizer is used only to merge the splits. This means that a split can be larger than the chunk size as measured by the tiktoken tokenizer. To make sure that no split is larger than the token chunk size the language model allows, you can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, which recursively splits any split that is too large.
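The difference between the two behaviors can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: `count_tokens` is a toy counter (one token per whitespace-separated word) standing in for tiktoken, and `split_and_merge` and `recursive_split` are hypothetical helper names.

```python
def count_tokens(text: str) -> int:
    """Toy token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` tokens.

    A single piece that is already over `chunk_size` is emitted as-is.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if current and count_tokens(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Like split_and_merge, but re-split oversized chunks with finer separators."""
    if count_tokens(text) <= chunk_size or not separators:
        return [text]
    result = []
    for chunk in split_and_merge(text, separators[0], chunk_size):
        if count_tokens(chunk) > chunk_size:
            result.extend(recursive_split(chunk, separators[1:], chunk_size))
        else:
            result.append(chunk)
    return result


text = "one two three\n\nfour five six seven eight nine\n\nten"
print([count_tokens(c) for c in split_and_merge(text, "\n\n", 4)])         # [3, 6, 1]
print([count_tokens(c) for c in recursive_split(text, ["\n\n", " "], 4)])  # [3, 4, 2, 1]
```

With a single separator, the six-token paragraph survives as one oversized chunk. The recursive variant falls back to a finer separator and brings every chunk within the limit.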

You can also load a tiktoken splitter directly, which ensures that each split is smaller than the chunk size.

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

API Reference :

TokenTextSplitter
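Splitting directly on tokens can keep every chunk within the chunk size because the text is cut on the token sequence itself rather than on characters. The sketch below shows the idea under stated assumptions: `encode` and `decode` are toy functions (whitespace-separated words stand in for token ids), and `split_by_token_window` is a hypothetical helper, not the TokenTextSplitter implementation.

```python
def encode(text: str) -> list[str]:
    """Toy encoder (assumption): whitespace-separated words stand in for token ids."""
    return text.split()


def decode(tokens: list[str]) -> str:
    """Toy decoder matching `encode`."""
    return " ".join(tokens)


def split_by_token_window(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut the token sequence into windows of at most `chunk_size` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    tokens = encode(text)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


text = "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman."
chunks = split_by_token_window(text, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Every window holds at most `chunk_size` tokens by construction. The trade-off is that a window boundary can fall in the middle of a sentence.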

spaCy

Another alternative to NLTK is to use the spaCy tokenizer.

How the text is split: by the spaCy tokenizer.
How the chunk size is measured: by the number of characters.

#!pip install spacy

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

API Reference :

SpacyTextSplitter

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

    Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
    
    Members of Congress and the Cabinet.
    
    Justices of the Supreme Court.
    
    My fellow Americans.  
    
    
    
    Last year COVID-19 kept us apart.
    
    This year we are finally together again. 
    
    
    
    Tonight, we meet as Democrats Republicans and Independents.
    
    But most importantly as Americans. 
    
    
    
    With a duty to one another to the American people to the Constitution. 
    
    
    
    And with an unwavering resolve that freedom will always triumph over tyranny. 
    
    
    
    Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
    
    But he badly miscalculated. 
    
    
    
    He thought he could roll into Ukraine and the world would roll over.
    
    Instead he met a wall of strength he never imagined. 
    
    
    
    He met the Ukrainian people. 
    
    
    
    From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
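The output above is a run of sentences joined by blank lines, with the chunk size measured in characters. That combination can be sketched as follows. This is only an approximation of the idea: the regex in `split_sentences` is an assumption standing in for a real sentence tokenizer such as spaCy's, and `merge_by_characters` is a hypothetical helper.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Toy sentence splitter: break after '.', '!' or '?' followed by whitespace.

    This regex is an assumption standing in for a real sentence tokenizer.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_by_characters(sentences: list[str], chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily join sentences while the joined length stays within `chunk_size` characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Last year COVID-19 kept us apart. This year we are finally together again. "
    "Tonight, we meet as Democrats Republicans and Independents. "
    "But most importantly as Americans."
)
chunks = merge_by_characters(split_sentences(text), chunk_size=80)
for chunk in chunks:
    print(repr(chunk), len(chunk))
```

Because the size is counted in characters, a chunk that respects `chunk_size` here can still exceed a model's token limit. The token-based splitters in this page address that case.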

SentenceTransformers

SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. By default, it splits the text into chunks that fit the token window of the sentence transformer model you want to use.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

API Reference :

SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)

token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

    tokens in text to split: 514

text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])

    lorem
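The `token_multiplier` line in the example relies on a small piece of arithmetic: repeating a text `limit // n + 1` times, where each copy contributes `n` tokens, is the smallest number of repetitions that pushes the total past `limit`. The sketch below checks this with made-up numbers. The values 3 and 100 and the helper name `repetitions_to_overflow` are hypothetical and are not the values of any real model.

```python
def repetitions_to_overflow(tokens_per_copy: int, max_tokens_per_chunk: int) -> int:
    """Smallest number of copies whose total token count exceeds the limit."""
    return max_tokens_per_chunk // tokens_per_copy + 1


tokens_per_copy = 3         # hypothetical token count of one copy of the text
max_tokens_per_chunk = 100  # hypothetical chunk limit

copies = repetitions_to_overflow(tokens_per_copy, max_tokens_per_chunk)
total = copies * tokens_per_copy

print(copies, total)  # 34 102
# One copy fewer would still fit in a single chunk.
print((copies - 1) * tokens_per_copy <= max_tokens_per_chunk)  # True
```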

NLTK

Rather than just splitting on "\n\n", you can use NLTK to split based on the NLTK tokenizers.

How the text is split: by the NLTK tokenizer.
How the chunk size is measured: by the number of characters.

# pip install nltk

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)

API Reference :

NLTKTextSplitter

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

    Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
    
    Members of Congress and the Cabinet.
    
    Justices of the Supreme Court.
    
    My fellow Americans.
    
    Last year COVID-19 kept us apart.
    
    This year we are finally together again.
    
    Tonight, we meet as Democrats Republicans and Independents.
    
    But most importantly as Americans.
    
    With a duty to one another to the American people to the Constitution.
    
    And with an unwavering resolve that freedom will always triumph over tyranny.
    
    Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
    
    But he badly miscalculated.
    
    He thought he could roll into Ukraine and the world would roll over.
    
    Instead he met a wall of strength he never imagined.
    
    He met the Ukrainian people.
    
    From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
    
    Groups of citizens blocking tanks with their bodies.

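Hugging Face tokenizer

Here we use the Hugging Face tokenizer GPT2TokenizerFast to count the length of the text in tokens.

How the text is split: by the character passed in.
How the chunk size is measured: by the number of tokens calculated by the Hugging Face tokenizer.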

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter

API Reference :

CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

print(texts[0])

    Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  
    
    Last year COVID-19 kept us apart. This year we are finally together again. 
    
    Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 
    
    With a duty to one another to the American people to the Constitution.

End