OpenThaiGPT + RAG with LangChain

 


OpenThaiGPT is a Thai-language chatbot developed by a team of researchers from AIEAT, AIAT, NECTEC, NSTDA, ThaiSC, and Pantip.com. It was trained on more than 2 trillion tokens, which lets it understand and answer questions in Thai with depth and broad coverage.

Official website: https://openthaigpt.aieat.or.th

What is LangChain?

LangChain is a framework for developing applications powered by language models.

TL;DR: LangChain makes the complicated parts of working and building with AI models easier. It does this in two ways:

  1. Integration - Bring external data, such as your files, other applications, and API data, to your LLMs

  2. Agency - Allow your LLMs to interact with their environment via decision making. Use LLMs to help decide which action to take next

Why LangChain?

  1. Components - LangChain makes it easy to swap out abstractions and components necessary to work with language models.

  2. Customized Chains - LangChain provides out-of-the-box support for using and customizing 'chains', that is, a series of actions strung together.

  3. Speed 🚢 - This team ships insanely fast. You'll be up to date with the latest LLM features.

  4. Community 👥 - Wonderful Discord and community support, meetups, hackathons, etc.

Though LLMs can be straightforward (text-in, text-out), you'll quickly run into friction points once you develop more complicated applications, and LangChain helps with those.

Note: This cookbook will not cover all aspects of LangChain. Its contents have been curated to get you building and making an impact as quickly as possible. For more, please check out the LangChain Conceptual Documentation.

Update Oct '23: This notebook has been expanded from its original form.


Introduction

LangChain is a framework for developing applications powered by language models. It enables applications that:

  • Are context-aware: connect a language model to sources of context (prompt instructions, few-shot examples, content to ground its response in, etc.)
  • Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)

This framework consists of several parts.

  • LangChain Libraries: The Python and JavaScript libraries. Contains interfaces and integrations for a myriad of components, a basic run time for combining these components into chains and agents, and off-the-shelf implementations of chains and agents.
  • LangChain Templates: A collection of easily deployable reference architectures for a wide variety of tasks.
  • LangServe: A library for deploying LangChain chains as a REST API.
  • LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain.


Retrieval

Many LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM when doing the generation step.
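The retrieve-then-generate flow described above can be sketched in plain Python. This is a minimal illustration, not LangChain code: the function names (retrieve, build_prompt, rag_answer, fake_llm) are hypothetical, the relevance score is a toy word-overlap count, and fake_llm only stands in for a real model such as OpenThaiGPT.

```python
from typing import Callable, List


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most words with the question (toy score)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Place the retrieved text in the prompt so the model can ground its answer."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


def rag_answer(
    question: str,
    documents: List[str],
    llm: Callable[[str], str],
    k: int = 2,
) -> str:
    """Retrieval step first, then the generation step on the augmented prompt."""
    context = retrieve(question, documents, k)
    return llm(build_prompt(question, context))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"[model received {len(prompt)} characters]"
```

In a real application the word-overlap score would be replaced by embedding similarity, and fake_llm by an actual model call.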

LangChain provides all the building blocks for RAG applications, from simple to complex. This section covers everything related to the retrieval step, i.e. the fetching of the data. Although this sounds simple, it can be subtly complex, and it encompasses several key modules.



Document loaders

Document loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the space, like Airbyte and Unstructured. These integrations can load all types of documents (HTML, PDF, code) from all types of locations (private S3 buckets, public websites).
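To make the idea of a loader concrete, here is a minimal sketch that reads text files from a folder into records holding the content plus its source. It is plain Python rather than a LangChain loader; the function name load_text_files and the dictionary keys page_content and metadata are choices made for this illustration.

```python
from pathlib import Path
from typing import Dict, List


def load_text_files(directory: str, pattern: str = "*.txt") -> List[Dict]:
    """Read every file matching the pattern and keep track of where it came from."""
    documents = []
    for path in sorted(Path(directory).glob(pattern)):
        documents.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            }
        )
    return documents
```

Keeping the source alongside the text matters later, because a RAG answer can then point back to the file it was grounded in.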

Text Splitting

A key part of retrieval is fetching only the relevant parts of documents. This involves several transformation steps to prepare the documents for retrieval. One of the primary ones is splitting (or chunking) a large document into smaller chunks. LangChain provides several transformation algorithms for doing this, as well as logic optimized for specific document types (code, markdown, etc.).
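The simplest form of chunking is a fixed-size window with some overlap, so that a sentence cut at a boundary still appears whole in a neighbouring chunk. The sketch below is plain Python and character-based; the name split_text and its parameters are assumptions for this example, and LangChain's own splitters are more sophisticated.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> List[str]:
    """Split text into chunks of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

For example, split_text("abcdefghij", chunk_size=4, chunk_overlap=2) yields "abcd", "cdef", "efgh", "ghij". For Thai text, which has no spaces between words, a character window like this is a rough starting point only.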

Text embedding models

Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of text that are similar. LangChain provides integrations with over 25 different embedding providers and methods, from open-source models to proprietary APIs, allowing you to choose the one best suited for your needs. LangChain provides a standard interface, allowing you to easily swap between models.
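The comparison step can be illustrated with cosine similarity between two vectors. In the sketch below, embed is a toy word-count "embedding" used only so the example runs without a model; it does not capture semantic meaning the way a real embedding model does. Both function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

With a real embedding model the vectors are dense lists of floats, but the similarity computation is the same idea.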

Vector stores

With the rise of embeddings, a need has emerged for databases that support efficient storage and searching of these embeddings. LangChain provides integrations with over 50 different vector stores, from open-source local ones to cloud-hosted proprietary ones, allowing you to choose the one best suited for your needs. LangChain exposes a standard interface, allowing you to easily swap between vector stores.
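At its core, a vector store keeps (vector, text) pairs and returns the texts whose vectors are closest to a query vector. The following is a minimal in-memory sketch under that assumption. The class InMemoryVectorStore and its add and search methods are invented for this example and are not a LangChain interface; real stores use approximate indexes rather than a full scan.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Stores (vector, text) pairs and searches them by brute force."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, vector: Sequence[float], text: str) -> None:
        self._items.append((list(vector), text))

    def search(self, query_vector: Sequence[float], k: int = 1) -> List[Tuple[str, float]]:
        """Return the k most similar texts with their similarity scores."""
        scored = [(text, _cosine(query_vector, vector)) for vector, text in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

A full scan like this is fine for a few thousand chunks; beyond that, a dedicated vector database is the usual choice.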

Retrievers

Once the data is in the database, you still need to retrieve it. LangChain supports many different retrieval algorithms, and this is one of the places where it adds the most value. It supports basic methods that are easy to get started with, namely simple semantic search. It also adds a collection of algorithms on top of this to increase performance. These include:

  • Parent Document Retriever: This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.
  • Self Query Retriever: User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the semantic part of a query from other metadata filters present in the query.
  • Ensemble Retriever: Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.
  • And more!
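To illustrate the ensemble idea from the list above, one common way to merge ranked results from several retrievers is reciprocal rank fusion: each document earns 1 / (c + rank) from every list it appears in, and the totals decide the final order. The sketch below shows that technique in plain Python. The function name and the constant c = 60 are assumptions for this example, not a description of LangChain's internals.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list receive a larger contribution.
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For instance, fusing a keyword ranking ["a", "b", "c"] with a semantic ranking ["b", "c", "a"] places "b" first, because it sits near the top of both lists.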

Indexing

The LangChain Indexing API syncs your data from any source into a vector store, helping you:

  • Avoid writing duplicated content into the vector store
  • Avoid re-writing unchanged content
  • Avoid re-computing embeddings over unchanged content

All of which should save you time and money, as well as improve your vector search results.
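The underlying idea of skipping duplicated and unchanged content can be sketched with content hashes: hash each chunk, remember the hashes already written, and only process chunks whose hash is new. This is a simplified illustration, not the LangChain Indexing API. The class HashIndex and its sync method are hypothetical names.

```python
import hashlib
from typing import Dict, Iterable, List


class HashIndex:
    """Remembers which texts were already written, keyed by content hash."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # content hash -> text

    def sync(self, texts: Iterable[str]) -> List[str]:
        """Return only the texts that are new; duplicates and unchanged texts are skipped."""
        newly_written = []
        for text in texts:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key in self._seen:
                continue  # already indexed: no re-write, no new embedding needed
            self._seen[key] = text
            newly_written.append(text)
        return newly_written
```

Only the texts returned by sync would need to be embedded and written to the vector store, which is where the time and cost savings come from.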





Adun Nantakaew อดุลย์ นันทะแก้ว
LINE : adunnan
