🦜️🔗LangChain : Modules : Retrieval – Retrievers : Ensemble / Multi-Vector Retriever (translation / commentary)

Translation : ClassCat Co., Ltd., Sales Information
Created : 09/13/2023

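* This page is a translation of LangChain documentation, supplemented with explanations where appropriate.

* The sample code has been tested, and has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a note at sales-info@classcat.com if you do.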

🦜️🔗 LangChain : Modules : Retrieval – Retrievers : Ensemble Retriever

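EnsembleRetriever takes a list of retrievers as input, ensembles the results of their get_relevant_documents() methods, and reranks the results using the Reciprocal Rank Fusion algorithm.

By leveraging the strengths of different algorithms, EnsembleRetriever can achieve better performance than any single algorithm.

The most common pattern is to combine a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), because their strengths are complementary. This is also known as "hybrid search". The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity.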

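from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS
# OpenAIEmbeddings is used below to build the FAISS index
from langchain.embeddings import OpenAIEmbeddings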

doc_list = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# initialize the bm25 retriever and faiss retriever
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(doc_list, embedding)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5])

docs = ensemble_retriever.get_relevant_documents("apples")
docs

    [Document(page_content='I like apples', metadata={}),
     Document(page_content='Apples and oranges are fruits', metadata={})]
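
The fusion step described above can be sketched in plain Python. This is a minimal illustration of weighted Reciprocal Rank Fusion, not the library's actual implementation: the function name `weighted_rrf`, the smoothing constant `c=60` (a commonly used value, assumed here), and the two hit lists are all hypothetical.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document texts into a single ranking.

    Each document earns weight / (rank + c) from every list it appears in,
    where rank starts at 1. The constant c is an assumption of this sketch.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-retriever rankings (best hit first)
bm25_hits = ["I like apples", "Apples and oranges are fruits"]
dense_hits = ["Apples and oranges are fruits", "I like oranges"]

fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.5, 0.5])
print(fused)
```

A document that appears in both lists accumulates score from each, so in this toy input it outranks documents that only one retriever returned.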

🦜️🔗 LangChain : Modules : Retrieval – Retrievers : Multi-Vector Retriever

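It is often beneficial to store multiple vectors per document, and there are several use cases for doing so. LangChain has a base MultiVectorRetriever that makes querying this type of setup easy. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the MultiVectorRetriever.

Methods for creating multiple vectors per document include:

* Smaller chunks: split a document into smaller chunks and embed those (this is what ParentDocumentRetriever does).
* Summary: create a summary for each document and embed it along with (or instead of) the document.
* Hypothetical questions: create hypothetical questions that each document would be well suited to answer, and embed those along with (or instead of) the document.

Note that this also enables another way of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.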

from langchain.retrievers.multi_vector import MultiVectorRetriever

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader

loaders = [
    TextLoader('../../paul_graham_essay.txt'),
    TextLoader('../../state_of_the_union.txt'),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

Smaller chunks

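It is often useful to retrieve larger chunks of information while embedding smaller chunks. This lets the embeddings capture semantic meaning as closely as possible while passing as much context as possible downstream. Note that this is what ParentDocumentRetriever does. Here we show what is going on under the hood.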

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    id_key=id_key,
)
import uuid
doc_ids = [str(uuid.uuid4()) for _ in docs]

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("justice breyer")[0]

    Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '10e9cbc0-4ba5-4d79-a09b-c033d1ba7b01', 'source': '../../state_of_the_union.txt'})

# Retriever returns larger chunks
len(retriever.get_relevant_documents("justice breyer")[0].page_content)
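
The relationship between the small indexed chunks and the larger returned documents can be sketched without any library code. The following is a simplified illustration under stated assumptions: `retrieve_parents` is a hypothetical helper, the docstore is a plain dict, and each search hit is a dict carrying a `doc_id` in its metadata. It is not how MultiVectorRetriever is actually implemented.

```python
def retrieve_parents(child_hits, docstore, id_key="doc_id"):
    """Map ranked child hits to their parent documents.

    Parent ids are collected in first-seen order, so several child chunks
    pointing at the same parent yield that parent only once.
    """
    seen = []
    for hit in child_hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in seen:
            seen.append(parent_id)
    # Skip ids that are missing from the docstore
    return [docstore[pid] for pid in seen if pid in docstore]


# Hypothetical stand-ins for the docstore and the vector search results
docstore = {"a": "full text of parent A", "b": "full text of parent B"}
child_hits = [
    {"page_content": "chunk 2 of A", "metadata": {"doc_id": "a"}},
    {"page_content": "chunk 1 of B", "metadata": {"doc_id": "b"}},
    {"page_content": "chunk 5 of A", "metadata": {"doc_id": "a"}},
]

parents = retrieve_parents(child_hits, docstore)
print(parents)
```

The same lookup applies whether the indexed texts are smaller chunks, summaries, or hypothetical questions: only the text that gets embedded changes, while the id stored in the metadata always points back to the parent.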

Summary

Often a summary can distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries and then embed them.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
import uuid
from langchain.schema.document import Document

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

# The vectorstore to use to index the summaries
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [Document(page_content=s,metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# # We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)

sub_docs = vectorstore.similarity_search("justice breyer")

sub_docs[0]

    Document(page_content="The document is a transcript of a speech given by the President of the United States. The President discusses several important issues and initiatives, including the nomination of a Supreme Court Justice, border security and immigration reform, protecting women's rights, advancing LGBTQ+ equality, bipartisan legislation, addressing the opioid epidemic and mental health, supporting veterans, investigating the health effects of burn pits on military personnel, ending cancer, and the strength and resilience of the American people.", metadata={'doc_id': '79fa2e9f-28d9-4372-8af3-2caf4f1de312'})

retrieved_docs = retriever.get_relevant_documents("justice breyer")

len(retrieved_docs[0].page_content)

Hypothetical queries

An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded.

functions = [
    {
      "name": "hypothetical_questions",
      "description": "Generate hypothetical questions",
      "parameters": {
        "type": "object",
        "properties": {
          "questions": {
            "type": "array",
            "items": {
                "type": "string"
              },
          },
        },
        "required": ["questions"]
      }
    }
  ]

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template("Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{doc}")
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(functions=functions, function_call={"name": "hypothetical_questions"})
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
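
The last step of the chain above pulls the "questions" array out of the function-call result. As a rough sketch of that kind of post-processing, assuming the function-call arguments arrive as a JSON string: `extract_key` and `raw_arguments` are hypothetical names for illustration, not part of LangChain.

```python
import json


def extract_key(function_call_arguments, key_name):
    """Parse a JSON arguments string and return the value stored under key_name."""
    payload = json.loads(function_call_arguments)
    return payload[key_name]


# Hypothetical arguments string shaped like the "hypothetical_questions" schema
raw_arguments = '{"questions": ["What is X?", "Why Y?", "How Z?"]}'

questions = extract_key(raw_arguments, "questions")
print(questions)
```

Because the schema marks "questions" as required and types it as an array of strings, the result is a plain Python list that can be embedded item by item.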

chain.invoke(docs[0])

    ["What was the author's initial impression of philosophy as a field of study, and how did it change when they got to college?",
     'Why did the author decide to switch their focus to Artificial Intelligence (AI)?',
     "What led to the author's disillusionment with the field of AI as it was practiced at the time?"]

hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend([Document(page_content=s,metadata={id_key: doc_ids[i]}) for s in question_list])

retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

sub_docs = vectorstore.similarity_search("justice breyer")

sub_docs

    [Document(page_content="What is the President's stance on immigration reform?", metadata={'doc_id': '505d73e3-8350-46ec-a58e-3af032f04ab3'}),
     Document(page_content="What is the President's stance on immigration reform?", metadata={'doc_id': '1c9618f0-7660-4b4f-a37c-509cbbbf6dba'}),
     Document(page_content="What is the President's stance on immigration reform?", metadata={'doc_id': '82c08209-b904-46a8-9532-edd2380950b7'}),
     Document(page_content='What measures is the President proposing to protect the rights of LGBTQ+ Americans?', metadata={'doc_id': '82c08209-b904-46a8-9532-edd2380950b7'})]

retrieved_docs = retriever.get_relevant_documents("justice breyer")

len(retrieved_docs[0].page_content)

End