Split by tokens

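Language models have a token limit, and a request must not exceed it. When you split text into chunks, it is therefore a good idea to measure chunk size in tokens. There are many tokenizers; when counting the tokens in a text, use the same tokenizer that the language model uses.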
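As a rough illustration of the check described above, the sketch below tests whether a text fits within a token limit. The `TokenCounter` type, the `fitsTokenLimit` helper, and the whitespace-based `naiveCount` are hypothetical names introduced here for illustration. `naiveCount` is only a stand-in and does not match any real model's tokenizer; in practice you would pass a counter backed by the tokenizer your model actually uses.

```typescript
// Hypothetical type: any function that returns the token count of a text.
type TokenCounter = (text: string) => number;

// Stand-in counter (assumption): counts whitespace-separated words.
// A real counter should wrap the same tokenizer the target model uses.
const naiveCount: TokenCounter = (text) =>
  text.trim() === "" ? 0 : text.trim().split(/\s+/).length;

// Returns true when the text stays within the given token limit.
function fitsTokenLimit(text: string, limit: number, count: TokenCounter): boolean {
  return count(text) <= limit;
}

console.log(fitsTokenLimit("one two three", 5, naiveCount)); // true
console.log(fitsTokenLimit("one two three", 2, naiveCount)); // false
```

Keeping the counter as a parameter means the same check works unchanged once the stand-in is replaced by a real tokenizer.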

js-tiktoken

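js-tiktoken is a JavaScript version of the BPE tokenizer created by OpenAI.

You can use @[TokenTextSplitter] with tiktoken to estimate the number of tokens used. The estimate will probably be more accurate for OpenAI models.

How the text is split: by the character passed in.
How the chunk size is measured: by the tiktoken tokenizer.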
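The distinction between how text is split and how chunk size is measured can be sketched as follows. This is a simplified illustration, not the library's implementation: `splitAndMerge`, `TokenCounter`, and `naiveCount` are hypothetical names, and the whitespace word counter is a stand-in for a real BPE tokenizer. The text is split on a separator, and the pieces are then merged for as long as the measured token count stays within `chunkSize`.

```typescript
// Hypothetical type: any function that returns the token count of a text.
type TokenCounter = (text: string) => number;

// Stand-in counter (assumption): whitespace-separated words, not real BPE tokens.
const naiveCount: TokenCounter = (text) =>
  text.trim() === "" ? 0 : text.trim().split(/\s+/).length;

// Split on `separator`, then merge pieces into chunks whose size is
// measured by `count` rather than by character length.
function splitAndMerge(
  text: string,
  separator: string,
  chunkSize: number,
  count: TokenCounter
): string[] {
  const pieces = text.split(separator).filter((p) => p.trim() !== "");
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current !== "" && count(candidate) > chunkSize) {
      // Adding this piece would exceed the budget: close the current chunk.
      chunks.push(current);
      current = piece;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample = "a b c\n\nd e\n\nf g h i j k\n\nl";
console.log(splitAndMerge(sample, "\n\n", 5, naiveCount));
// Three chunks: "a b c\n\nd e", "f g h i j k", "l"
```

In this sketch a piece is never cut in the middle, so the second chunk holds six counted tokens even though `chunkSize` is 5. Any splitter that keeps separator-delimited pieces whole can overshoot in the same way.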

npm install @langchain/textsplitters

import { TokenTextSplitter } from "@langchain/textsplitters";
import { readFileSync } from "fs";

// Example: read a long document
const stateOfTheUnion = readFileSync("state_of_the_union.txt", "utf8");

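To split with @[TokenTextSplitter] and then merge chunks with tiktoken, pass an encodingName (for example, cl100k_base) when initializing @[TokenTextSplitter]. Note that splits produced this way can be larger than the chunk size as measured by the tiktoken tokenizer.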

import { TokenTextSplitter } from "@langchain/textsplitters";

// Example: use cl100k_base encoding
const splitter = new TokenTextSplitter({ encodingName: "cl100k_base", chunkSize: 10, chunkOverlap: 0 });

// splitText is async and resolves to an array of chunk strings.
const texts = await splitter.splitText(stateOfTheUnion);
console.log(texts[0]);

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.
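To make the roles of chunkSize and chunkOverlap more concrete, here is a simplified sketch of token-window chunking: encode the text to token ids, slide a window of `chunkSize` ids forward by `chunkSize - chunkOverlap`, and decode each window. It is an illustration under stated assumptions rather than the library's actual code. The `Tokenizer` interface, `charTokenizer`, and `windowByTokens` are hypothetical names, and the one-token-per-character tokenizer merely stands in for a real BPE encoding such as cl100k_base.

```typescript
// Hypothetical tokenizer interface with encode/decode operations.
interface Tokenizer {
  encode(text: string): number[];
  decode(ids: number[]): string;
}

// Stand-in tokenizer (assumption): one token per UTF-16 code unit.
const charTokenizer: Tokenizer = {
  encode: (text) => text.split("").map((ch) => ch.charCodeAt(0)),
  decode: (ids) => String.fromCharCode(...ids),
};

// Slide a window of `chunkSize` tokens over the text, with consecutive
// windows sharing `chunkOverlap` tokens.
function windowByTokens(
  text: string,
  tokenizer: Tokenizer,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const ids = tokenizer.encode(text);
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < ids.length; start += step) {
    chunks.push(tokenizer.decode(ids.slice(start, start + chunkSize)));
    if (start + chunkSize >= ids.length) break; // last window reached the end
  }
  return chunks;
}

console.log(windowByTokens("abcdefghij", charTokenizer, 4, 1));
// ["abcd", "defg", "ghij"]
```

With an overlap of 1, each chunk begins with the last token of the previous one, which helps preserve context across chunk boundaries at the cost of some duplicated tokens.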
