Recursive splitting

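This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (then sentences, then words) together for as long as possible, since those are generally the most strongly semantically related pieces of text.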

How the text is split: by a list of characters
How the chunk size is measured: by number of characters
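
The splitting strategy described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: the function name `recursive_split` is hypothetical, chunk size is measured with `len`, and overlap and regex separators are left out.

```python
def recursive_split(text, separators, chunk_size):
    """Hypothetical sketch: split on the first separator, merge small pieces
    back together, and recurse with the remaining separators on any piece
    that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut the text into fixed-size windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep it with its neighbours.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: try the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


default_separators = ["\n\n", "\n", " ", ""]
example = recursive_split("aaa bbb ccc\n\nddd eee fff ggg", default_separators, 11)
print(example)
```

The first paragraph fits within the limit and stays whole, while the second is too long and falls through to the " " separator, so it is split between words rather than mid-word.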

Below is an example of usage.

pip install -qU langchain-text-splitters

To obtain the string content directly, use .split_text. To create LangChain Document objects (for example, for use in downstream tasks), use .create_documents.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'

print(text_splitter.split_text(state_of_the_union)[:2])

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']

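Let's go through the parameters set above for RecursiveCharacterTextSplitter:

chunk_size: The maximum size of a chunk, where size is determined by length_function.
chunk_overlap: The target overlap between chunks. Overlapping chunks helps mitigate the loss of information when context is divided between chunks.
length_function: The function that determines the chunk size.
is_separator_regex: Whether the separator list (default: ["\n\n", "\n", " ", ""]) should be interpreted as regular expressions.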
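
To make the interaction between chunk_size, chunk_overlap, and length_function more concrete, here is a minimal plain-Python sketch. It is an illustration only and not the library's code: the helper name `merge_with_overlap` and its `sep` argument are hypothetical, and it assumes the text has already been cut into small pieces such as words.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap, length_function=len, sep=" "):
    """Hypothetical helper: greedily pack pieces into chunks of at most
    chunk_size, carrying up to chunk_overlap of trailing content forward."""
    chunks, window = [], []
    for piece in pieces:
        candidate = sep.join(window + [piece])
        if window and length_function(candidate) > chunk_size:
            chunks.append(sep.join(window))
            # Drop leading pieces until the carried-over tail fits the overlap
            # budget and leaves room for the new piece.
            while window and (
                length_function(sep.join(window)) > chunk_overlap
                or length_function(sep.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        # A single piece longer than chunk_size is kept as-is in this sketch.
        window.append(piece)
    if window:
        chunks.append(sep.join(window))
    return chunks


words = "Members of Congress and the Cabinet".split()

# Size measured in characters (length_function=len).
by_chars = merge_with_overlap(words, chunk_size=23, chunk_overlap=15)

# Size measured in words instead, via a custom length function.
count_words = lambda s: len(s.split())
by_words = merge_with_overlap(["a", "b", "c", "d", "e"], 3, 1, length_function=count_words)

print(by_chars)
print(by_words)
```

Each chunk after the first starts with the tail of the previous one, which is the same effect seen in the example output above, where "of Congress and" appears at the end of one chunk and the start of the next.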

Splitting text from languages without word boundaries

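Some writing systems, such as Chinese, Japanese, and Thai, do not have word boundaries. Splitting text with the default separator list ["\n\n", "\n", " ", ""] can cause words to be split between chunks. To keep words together, you can override the separator list to include additional punctuation: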

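Add the ASCII full stop “.”, the Unicode fullwidth full stop “．” (used in Chinese text), and the ideographic full stop “。” (used in Japanese and Chinese)
Add the zero-width space used in Thai, Myanmar, Khmer, and Japanese
Add the ASCII comma “,”, the Unicode fullwidth comma “，”, and the Unicode ideographic comma “、”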

text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
    # Existing args
)
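
As a small illustration of why the extra separators matter, the plain-Python snippet below compares splitting on a space with splitting on the ideographic full stop. The sample sentence is made up for this example, and the snippet uses only `str.split` rather than the splitter itself.

```python
# Made-up Japanese sample text: three sentences with no spaces between words.
text = "これは最初の文です。これは二番目の文です。これは三番目の文です。"

# Splitting on a space finds nothing to split on, so the text stays in one piece.
by_space = text.split(" ")

# Splitting on the ideographic full stop (U+3002) recovers the sentences.
by_full_stop = [s for s in text.split("\u3002") if s]

print(len(by_space))
print(len(by_full_stop))
```

With only the default separators, a text like this would fall through to the final "" separator and be cut at arbitrary character positions, which is what the additional punctuation is meant to avoid.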
