🦜️🔗 LangChain: Modules: Retrieval – Document Transformers: Text Splitters: Split by Character / Split Code (Translation/Commentary)

Translation: ClassCat Co., Ltd. Sales Information
Created: 09/04/2023

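* This page is a translation of the LangChain documentation on splitting text by character and splitting code, with supplementary notes added where appropriate.

* The sample code has been checked for correct operation, and has been extended or modified where necessary.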

🦜️🔗 LangChain: Modules: Retrieval – Document Transformers: Text Splitters: Split by Character

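This is the simplest method. It splits the text on a separator (by default "\n\n") and measures chunk length by the number of characters.

How the text is split: on a single separator.
How the chunk size is measured: by number of characters.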
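
To make the two points above concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the function name `split_by_separator` is hypothetical, chunk overlap is ignored, and a single piece longer than `chunk_size` is simply kept whole.

```python
def split_by_separator(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily pack pieces into chunks.

    Simplified illustration (assumption, not the library's code):
    no overlap, and an oversized single piece is kept as-is.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        # Length is measured with len(), i.e. in characters.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "first paragraph\n\nsecond paragraph\n\nthird"
chunks = split_by_separator(sample, chunk_size=35)
print(chunks)
```

With `chunk_size=35`, the first two paragraphs fit together in one chunk and the third starts a new one.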

from langchain.text_splitter import CharacterTextSplitter

# This is a long document we can split up.
with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

    page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0

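Here is an example of passing metadata along with the documents. Note that the metadata is split along with the documents, so each resulting chunk carries the metadata of the document it came from.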

metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents([state_of_the_union, state_of_the_union], metadatas=metadatas)
print(documents[0])

    page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0
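
The way metadata follows each chunk can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the `Chunk` class and the `create_chunks` function are hypothetical stand-ins, not LangChain's `Document` or `create_documents`, and the splitting itself is reduced to a bare `str.split`.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def create_chunks(texts, metadatas=None, separator="\n\n"):
    """Split each text and attach a copy of its metadata to every piece."""
    metadatas = metadatas or [{} for _ in texts]
    chunks = []
    for text, meta in zip(texts, metadatas):
        for piece in text.split(separator):
            if piece.strip():
                # Each chunk gets its own copy, so later edits stay local.
                chunks.append(Chunk(page_content=piece, metadata=dict(meta)))
    return chunks


docs = create_chunks(["a\n\nb", "c"], metadatas=[{"document": 1}, {"document": 2}])
for doc in docs:
    print(doc)
```

Both pieces of the first text carry `{"document": 1}`, and the single piece of the second text carries `{"document": 2}`.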

text_splitter.split_text(state_of_the_union)[0]

    'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'

🦜️🔗 LangChain: Modules: Retrieval – Document Transformers: Text Splitters: Split Code

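CodeTextSplitter lets you split code, with multiple languages supported. Import the Language enum and specify the language. In the examples below, this is done through RecursiveCharacterTextSplitter.from_language.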

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# Full list of supported languages
[e.value for e in Language]

    ['cpp',
     'go',
     'java',
     'js',
     'php',
     'proto',
     'python',
     'rst',
     'ruby',
     'rust',
     'scala',
     'swift',
     'markdown',
     'latex',
     'html',
     'sol',]

# You can also see the separators used for a given language
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

    ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
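
The ordering of this list matters: separators that mark larger structures (classes, functions) come first, and finer ones (lines, spaces, single characters) serve as fallbacks. The sketch below illustrates that idea in plain Python. It is a simplification under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, and unlike the library it does not merge small neighbouring pieces or apply overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split with the first separator found in `text`; recurse on pieces
    that are still longer than `chunk_size` using the remaining separators."""
    if len(text) <= chunk_size:
        stripped = text.strip()
        return [stripped] if stripped else []
    for i, sep in enumerate(separators):
        if sep == "":
            # Last resort: fixed-size slices of characters.
            return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
        if sep in text:
            break
    else:
        return [text]  # no separator applies; keep the oversized text whole
    parts = text.split(sep)
    # Keep the separator at the start of every piece after the first.
    pieces = [parts[0]] + [sep + p for p in parts[1:]]
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, separators[i + 1:], chunk_size))
    return chunks


python_separators = ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
code = 'def a():\n    return 1\n\ndef b():\n    return 2\n'
result = recursive_split(code, python_separators, chunk_size=25)
print(result)
```

Here the text is too long for one chunk, `'\nclass '` does not occur, and `'\ndef '` does, so the split lands on the function boundary.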

Python

Here is an example using the PythonTextSplitter:

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

    [Document(page_content='def hello_world():\n    print("Hello, World!")', metadata={}),
     Document(page_content='# Call the function\nhello_world()', metadata={})]

JS

Here is an example using the JS text splitter:

JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs

    [Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}', metadata={}),
     Document(page_content='// Call the function\nhelloWorld();', metadata={})]

Markdown

Here is an example using the Markdown text splitter:

markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
# Hopefully this code block isn't split
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open to contributions.
"""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs

    [Document(page_content='# 🦜️🔗 LangChain', metadata={}),
     Document(page_content='⚡ Building applications with LLMs through composability ⚡', metadata={}),
     Document(page_content='## Quick Install', metadata={}),
     Document(page_content="```bash\n# Hopefully this code block isn't split", metadata={}),
     Document(page_content='pip install langchain', metadata={}),
     Document(page_content='```', metadata={}),
     Document(page_content='As an open source project in a rapidly developing field, we', metadata={}),
     Document(page_content='are extremely open to contributions.', metadata={})]
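
In this output the fenced bash block is spread across three chunks, because the block as a whole is longer than the chunk size of 60 characters. As a rough check, and not something provided by the LangChain API, a hypothetical helper such as `has_unbalanced_fence` below could flag chunks that open or close a fence without its partner. The chunk strings are copied from the output; everything else is an assumption made for illustration.

```python
FENCE = "`" * 3  # the Markdown code-fence marker, built indirectly for readability

# Four consecutive chunks copied from the output above.
chunks = [
    "## Quick Install",
    FENCE + "bash\n# Hopefully this code block isn't split",
    "pip install langchain",
    FENCE,
]


def has_unbalanced_fence(chunk: str) -> bool:
    """Return True when a chunk contains an odd number of fence markers."""
    return chunk.count(FENCE) % 2 == 1


broken = [i for i, chunk in enumerate(chunks) if has_unbalanced_fence(chunk)]
print(broken)  # [1, 3]
```

If keeping fenced blocks intact matters, a larger chunk size is one simple option to try.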

LaTeX

Here is an example on LaTeX text:

latex_text = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

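# Note: this call passes Language.MARKDOWN, and the output below reflects that choice.
# Language.LATEX, listed in the enum above, would select LaTeX-specific separators instead.
latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)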
latex_docs = latex_splitter.create_documents([latex_text])
latex_docs

    [Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle', metadata={}),
     Document(page_content='\\section{Introduction}', metadata={}),
     Document(page_content='Large language models (LLMs) are a type of machine learning', metadata={}),
     Document(page_content='model that can be trained on vast amounts of text data to', metadata={}),
     Document(page_content='generate human-like language. In recent years, LLMs have', metadata={}),
     Document(page_content='made significant advances in a variety of natural language', metadata={}),
     Document(page_content='processing tasks, including language translation, text', metadata={}),
     Document(page_content='generation, and sentiment analysis.', metadata={}),
     Document(page_content='\\subsection{History of LLMs}', metadata={}),
     Document(page_content='The earliest LLMs were developed in the 1980s and 1990s,', metadata={}),
     Document(page_content='but they were limited by the amount of data that could be', metadata={}),
     Document(page_content='processed and the computational power available at the', metadata={}),
     Document(page_content='time. In the past decade, however, advances in hardware and', metadata={}),
     Document(page_content='software have made it possible to train LLMs on massive', metadata={}),
     Document(page_content='datasets, leading to significant improvements in', metadata={}),
     Document(page_content='performance.', metadata={}),
     Document(page_content='\\subsection{Applications of LLMs}', metadata={}),
     Document(page_content='LLMs have many applications in industry, including', metadata={}),
     Document(page_content='chatbots, content creation, and virtual assistants. They', metadata={}),
     Document(page_content='can also be used in academia for research in linguistics,', metadata={}),
     Document(page_content='psychology, and computational linguistics.', metadata={}),
     Document(page_content='\\end{document}', metadata={})]
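
The first chunk in this output contains `\x08egin{document}` rather than `\\begin{document}`. The reason is that `latex_text` is a regular (non-raw) triple-quoted string, so Python reads `\b` as a backspace character; `\d`, `\m`, `\s`, and `\e` are not recognized escape sequences, so those backslashes survive. A minimal illustration, independent of LangChain:

```python
plain = "\begin{document}"   # "\b" is parsed as a backspace character (\x08)
raw = r"\begin{document}"    # a raw string keeps the backslash as written

print(repr(plain))  # '\x08egin{document}'
print(repr(raw))    # '\\begin{document}'
```

Writing the sample as a raw string (`r"""..."""`) is one way to keep every LaTeX backslash intact.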

HTML

Here is an example using the HTML text splitter:

html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs

    [Document(page_content='<!DOCTYPE html>\n<html>', metadata={}),
     Document(page_content='<head>\n        <title>🦜️🔗 LangChain</title>', metadata={}),
     ...]

Solidity

Here is an example using the Solidity text splitter:

SOL_CODE = """
pragma solidity ^0.8.20;
contract HelloWorld {
   function add(uint a, uint b) pure public returns(uint) {
       return a + b;
   }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)
sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs

[
    Document(page_content='pragma solidity ^0.8.20;', metadata={}),
    Document(page_content='contract HelloWorld {\n   function add(uint a, uint b) pure public returns(uint) {\n       return a + b;\n   }\n}', metadata={})
]


 