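Splitting HTML

Splitting HTML documents into manageable chunks is essential for many text processing tasks, such as natural language processing and search indexing. This guide covers three LangChain text splitters that can be used to split HTML content effectively: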

HTMLHeaderTextSplitter
HTMLSectionSplitter
HTMLSemanticPreservingSplitter

Each splitter has its own capabilities and use cases. This guide explains how the splitters differ, why you might choose one over another, and how to use each of them effectively.

pip install -qU langchain-text-splitters

Overview of the splitters

HTMLHeaderTextSplitter

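Useful when you want to preserve the hierarchical structure of a document based on its headings.

Description: Splits HTML text based on header tags (e.g., <h1>, <h2>, <h3>) and adds metadata for each header relevant to a given chunk.

Capabilities:

Splits text at the HTML element level.
Preserves context-rich information encoded in document structures.
Can return chunks element by element or combine elements that share the same metadata.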
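
To make the idea of header metadata concrete, here is a minimal standard-library sketch of grouping text under the headers that precede it. It is not LangChain's implementation: the class `HeaderChunker` and the function `split_by_headers` are hypothetical names introduced for illustration, and the sketch assumes well-formed HTML whose header tags are `h1` to `h6`.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Hypothetical illustration: attach the active headers to each text chunk."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}.
        self.header_names = dict(headers_to_split_on)
        self.active = {}        # tag name -> text of the header currently in force
        self.in_header = None   # header tag being read right now, if any
        self.header_parts = []  # text pieces of the header being read
        self.buffer = []        # text collected for the current chunk
        self.chunks = []        # list of (metadata, text) pairs

    def flush(self):
        """Emit the buffered text as one chunk with the active headers as metadata."""
        text = " ".join(self.buffer).strip()
        if text:
            metadata = {self.header_names[t]: v for t, v in sorted(self.active.items())}
            self.chunks.append((metadata, text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.flush()
            # A new header replaces headers at the same or a deeper level.
            # Comparing "h1" < "h2" as strings works for h1..h6.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.in_header = tag
            self.header_parts = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.active[tag] = " ".join(self.header_parts).strip()
            self.in_header = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if self.in_header:
            self.header_parts.append(text)
        else:
            self.buffer.append(text)


def split_by_headers(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.chunks


sample = "<h1>Main</h1><p>Intro.</p><h2>Part A</h2><p>Alpha.</p><h2>Part B</h2><p>Beta.</p>"
chunks = split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")])
for metadata, text in chunks:
    print(metadata, text)
```

Each chunk carries every header that is "above" it in the document, which is the property the real splitter aims to preserve.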

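HTMLSectionSplitter

Useful when you want to split HTML documents into larger sections, such as <section>, <div>, or custom-defined sections.

Description: Similar to HTMLHeaderTextSplitter, but focuses on splitting HTML into sections based on specified tags.

Capabilities:

Uses XSLT transformations to detect and split sections.
Internally uses RecursiveCharacterTextSplitter for large sections.
Considers font sizes to determine sections.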

HTMLSemanticPreservingSplitter

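Ideal when you need to ensure that structured elements are not split across chunks, so that contextual relevance is preserved.

Description: Splits HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components.

Capabilities:

Preserves tables, lists, and other specified HTML elements.
Allows custom handlers for specific HTML tags.
Ensures that the semantic meaning of the document is maintained.
Includes built-in normalization and stopword removal.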

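Choosing the right splitter

Use HTMLHeaderTextSplitter when: you need to split an HTML document based on its header hierarchy and keep metadata about the headers.
Use HTMLSectionSplitter when: you need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
Use HTMLSemanticPreservingSplitter when: you need to split the document into chunks while preserving semantic elements such as tables and lists, so that they are not split and their context is maintained.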

Feature	HTMLHeaderTextSplitter	HTMLSectionSplitter	HTMLSemanticPreservingSplitter
Splits based on headers	Yes	Yes	Yes
Preserves semantic elements (tables, lists)	No	No	Yes
Adds metadata for headers	Yes	Yes	Yes
Custom handlers for HTML tags	No	No	Yes
Preserves media (images, videos)	No	No	Yes
Considers font sizes	No	Yes	No
Uses XSLT transformations	No	Yes	No
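
The comparison table can also be read as a lookup. The sketch below encodes it as plain Python data and filters splitters by required features. The feature keys (such as `preserves_media`) and the helper `splitters_supporting` are hypothetical names chosen for this illustration and are not part of any LangChain API.

```python
# Feature sets copied from the comparison table; the key names are illustrative.
FEATURES = {
    "HTMLHeaderTextSplitter": {
        "splits_on_headers",
        "adds_header_metadata",
    },
    "HTMLSectionSplitter": {
        "splits_on_headers",
        "adds_header_metadata",
        "considers_font_sizes",
        "uses_xslt",
    },
    "HTMLSemanticPreservingSplitter": {
        "splits_on_headers",
        "adds_header_metadata",
        "preserves_semantic_elements",
        "custom_handlers",
        "preserves_media",
    },
}


def splitters_supporting(*required):
    """Return the splitter names whose feature set covers every required feature."""
    needed = set(required)
    return [name for name, features in FEATURES.items() if needed <= features]


print(splitters_supporting("preserves_semantic_elements", "custom_handlers"))
print(splitters_supporting("considers_font_sizes"))
```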

Example HTML document

Let's use the following HTML document as an example:

html_string = """
<!DOCTYPE html>
  <html lang='en'>
  <head>
    <meta charset='UTF-8'>
    <meta name='viewport' content='width=device-width, initial-scale=1.0'>
    <title>Fancy Example HTML Page</title>
  </head>
  <body>
    <h1>Main Title</h1>
    <p>This is an introductory paragraph with some basic content.</p>

    <h2>Section 1: Introduction</h2>
    <p>This section introduces the topic. Below is a list:</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
      <li>Third item with <strong>bold text</strong> and <a href='#'>a link</a></li>
    </ul>

    <h3>Subsection 1.1: Details</h3>
    <p>This subsection provides additional details. Here's a table:</p>
    <table border='1'>
      <thead>
        <tr>
          <th>Header 1</th>
          <th>Header 2</th>
          <th>Header 3</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Row 1, Cell 1</td>
          <td>Row 1, Cell 2</td>
          <td>Row 1, Cell 3</td>
        </tr>
        <tr>
          <td>Row 2, Cell 1</td>
          <td>Row 2, Cell 2</td>
          <td>Row 2, Cell 3</td>
        </tr>
      </tbody>
    </table>

    <h2>Section 2: Media Content</h2>
    <p>This section contains an image and a video:</p>
      <img src='example_image_link.mp4' alt='Example Image'>
      <video controls width='250' src='example_video_link.mp4' type='video/mp4'>
      Your browser does not support the video tag.
    </video>

    <h2>Section 3: Code Example</h2>
    <p>This section contains a code block:</p>
    <pre><code data-lang="html">
    &lt;div&gt;
      &lt;p&gt;This is a paragraph inside a div.&lt;/p&gt;
    &lt;/div&gt;
    </code></pre>

    <h2>Conclusion</h2>
    <p>This is the conclusion of the document.</p>
  </body>
  </html>
"""

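Using HTMLHeaderTextSplitter

HTMLHeaderTextSplitter is a "structure-aware" text splitter that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline. It is analogous to the MarkdownHeaderTextSplitter for markdown files. To specify which headers to split on, pass headers_to_split_on when instantiating HTMLHeaderTextSplitter, as shown below.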

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a list:  \nFirst item Second item Third item with bold text and a link'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction', 'Header 3': 'Subsection 1.1: Details'}, page_content="This subsection provides additional details. Here's a table:"),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block:'),
 Document(metadata={'Header 1': 'Main Title', 'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]

To return each element together with its associated headers, specify return_each_element=True when instantiating HTMLHeaderTextSplitter:

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on,
    return_each_element=True,
)
html_header_splits_elements = html_splitter.split_text(html_string)

Compare this with the earlier result, where elements are aggregated by their headers:

for element in html_header_splits[:2]:
    print(element)

page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:
First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}

Now each element is returned as a distinct Document:

for element in html_header_splits_elements[:3]:
    print(element)

page_content='This is an introductory paragraph with some basic content.' metadata={'Header 1': 'Main Title'}
page_content='This section introduces the topic. Below is a list:' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
page_content='First item Second item Third item with bold text and a link' metadata={'Header 1': 'Main Title', 'Header 2': 'Section 1: Introduction'}
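
The relationship between the two outputs can be sketched without the library: combining consecutive elements that share identical metadata turns the per-element result into the aggregated one. This is only an illustration under stated assumptions. The function `combine_by_metadata` is hypothetical, documents are modeled as plain `(metadata, text)` pairs, and the joiner `"  \n"` was chosen to mimic the aggregated output shown earlier rather than taken from the library's source.

```python
from itertools import groupby


def combine_by_metadata(elements, joiner="  \n"):
    """Merge consecutive (metadata, text) pairs whose metadata dicts are equal."""
    combined = []
    for metadata, group in groupby(elements, key=lambda pair: pair[0]):
        combined.append((metadata, joiner.join(text for _, text in group)))
    return combined


elements = [
    (
        {"Header 1": "Main Title"},
        "This is an introductory paragraph with some basic content.",
    ),
    (
        {"Header 1": "Main Title", "Header 2": "Section 1: Introduction"},
        "This section introduces the topic. Below is a list:",
    ),
    (
        {"Header 1": "Main Title", "Header 2": "Section 1: Introduction"},
        "First item Second item Third item with bold text and a link",
    ),
]

combined = combine_by_metadata(elements)
for metadata, text in combined:
    print(metadata, repr(text))
```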

How to split from a URL or HTML file:

To read directly from a URL, pass the URL string to the split_text_from_url method. Similarly, a local HTML file can be passed to the split_text_from_file method.

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# For a local file, use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

How to constrain chunk size:

HTMLHeaderTextSplitter, which splits based on HTML headers, can be composed with another splitter that constrains splits based on character length, such as RecursiveCharacterTextSplitter. This is done with the .split_documents method of the second splitter:

from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]

[Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel's Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox ("This sentence is false") and Berry's paradox ("The least number not defined by an expression consisting of just fourteen English words"). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel's Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel's Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='This account of Gödel's discovery was told to Hao Wang very much after the fact; but in Gödel's contemporary correspondence with Bernays and Zermelo, essentially the same description of his path to the theorems is given. (See Gödel 2003a and Gödel 2003b respectively.) From those accounts we see that the undefinability of truth in arithmetic, a result credited to Tarski, was likely obtained in some form by Gödel by 1931. But he neither publicized nor published the result; the biases logicians'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel's Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}, page_content='result; the biases logicians had expressed at the time concerning the notion of truth, biases which came vehemently to the fore when Tarski announced his results on the undefinability of truth in formal systems 1935, may have served as a deterrent to Gödel's publication of that theorem.'),
 Document(metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel's Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.2 The proof of the First Incompleteness Theorem'}, page_content='We now describe the proof of the two theorems, formulating Gödel's results in Peano arithmetic. Gödel himself used a system related to that defined in Principia Mathematica, but containing Peano arithmetic. In our presentation of the First and Second Incompleteness Theorems we refer to Peano arithmetic as P, following Gödel's notation.')]
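
Note that every size-limited chunk in the output above keeps the header metadata of the document it came from. The sketch below shows that pattern in isolation. It uses a naive fixed-width character window, not the recursive separator-based algorithm of RecursiveCharacterTextSplitter, and the names `window_split` and `split_pairs_by_size` are hypothetical.

```python
def window_split(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return pieces


def split_pairs_by_size(docs, chunk_size, chunk_overlap):
    """Split each (metadata, text) pair and copy its metadata onto every piece."""
    out = []
    for metadata, text in docs:
        for piece in window_split(text, chunk_size, chunk_overlap):
            out.append((dict(metadata), piece))
    return out


docs = [({"Header 1": "Title"}, "abcdefghij")]
pieces = split_pairs_by_size(docs, chunk_size=4, chunk_overlap=1)
for metadata, text in pieces:
    print(metadata, text)
```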

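Limitations

There can be quite a bit of structural variation from one HTML document to another, and while HTMLHeaderTextSplitter attempts to attach all "relevant" headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes "above" the associated text, i.e. prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the document is structured such that the text of the top-level headline, while tagged "h1", is in a distinct subtree from the text elements that we would expect it to be "above". As a result, the "h1" element and its associated text do not show up in the chunk metadata (but, where applicable, we do see "h2" and its associated text):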

url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
print(html_header_splits[1].page_content[:500])

No two El Niño winters are the same, but many have temperature and precipitation trends in common.
Average conditions during an El Niño winter across the continental US.
One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA.
Because the jet stream is essentially a river of air that storms flow through, they c

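Using HTMLSectionSplitter

Similar in concept to HTMLHeaderTextSplitter, HTMLSectionSplitter is a "structure-aware" text splitter that splits text at the element level and adds metadata for each header "relevant" to any given chunk. It lets you split HTML by section. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. Use xslt_path to provide an absolute path to an XSLT file that transforms the HTML so that sections can be detected based on the provided tags. By default, the converting_to_header.xslt file in the data_connection/document_transformers directory is used. Its purpose is to convert the HTML into a format or layout in which sections are easier to detect. For example, spans can be converted to header tags based on their font size so that they are detected as sections.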
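
To illustrate the idea of promoting large-font spans to headers, here is a small regex-based sketch. It is not the XSLT transformation that HTMLSectionSplitter ships with. The function `promote_large_spans`, the 20px threshold, and the choice of `h1` as the target tag are all assumptions made for this example, and the pattern only handles simple, non-nested spans with a pixel font size in a double-quoted style attribute.

```python
import re

# Matches <span style="... font-size: NNpx ...">text</span> (simple cases only).
SPAN = re.compile(
    r'<span\s+style="[^"]*font-size:\s*(\d+)px[^"]*"\s*>(.*?)</span>',
    flags=re.S | re.I,
)


def promote_large_spans(html, threshold_px=20, header_tag="h1"):
    """Rewrite spans whose font size meets the threshold as header elements."""

    def repl(match):
        size = int(match.group(1))
        if size >= threshold_px:
            return f"<{header_tag}>{match.group(2)}</{header_tag}>"
        return match.group(0)  # leave smaller spans untouched

    return SPAN.sub(repl, html)


styled = (
    '<span style="font-size: 24px">Big Title</span>'
    "<p>Body</p>"
    '<span style="font-size: 12px">small</span>'
)
converted = promote_large_spans(styled)
print(converted)
```

After a conversion like this, a header-based splitter can treat the promoted text as a section boundary.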

How to split an HTML string:

from langchain_text_splitters import HTMLSectionSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content="Section 1: Introduction \n This section introduces the topic. Below is a list: \n \n First item \n Second item \n Third item with  bold text  and  a link \n \n \n Subsection 1.1: Details \n This subsection provides additional details. Here's a table: \n \n \n \n Header 1 \n Header 2 \n Header 3 \n \n \n \n \n Row 1, Cell 1 \n Row 1, Cell 2 \n Row 1, Cell 3 \n \n \n Row 2, Cell 1 \n Row 2, Cell 2 \n Row 2, Cell 3"),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content \n This section contains an image and a video: \n \n \n      Your browser does not support the video tag.'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example \n This section contains a code block: \n \n    <div>\n      <p>This is a paragraph inside a div.</p>\n    </div>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion \n This is the conclusion of the document.')]

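How to constrain chunk size:

HTMLSectionSplitter can be used with other text splitters as part of a chunking pipeline. Internally, it uses RecursiveCharacterTextSplitter when a section is larger than the chunk size. It also considers the font size of the text, compared against a determined font size threshold, to decide whether the text is a section.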

from langchain_text_splitters import RecursiveCharacterTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLSectionSplitter(headers_to_split_on)

html_header_splits = html_splitter.split_text(html_string)

chunk_size = 50
chunk_overlap = 5
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits

[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title'),
 Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some'),
 Document(metadata={'Header 1': 'Main Title'}, page_content='some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Section 1: Introduction'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic. Below is a'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='is a list:'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='First item \n Second item'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='Third item with  bold text  and  a link'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Subsection 1.1: Details'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='This subsection provides additional details.'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content="Here's a table:"),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Header 1 \n Header 2 \n Header 3'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 1 \n Row 1, Cell 2'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 1, Cell 3 \n \n \n Row 2, Cell 1'),
 Document(metadata={'Header 3': 'Subsection 1.1: Details'}, page_content='Row 2, Cell 2 \n Row 2, Cell 3'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Section 2: Media Content'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video:'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='Your browser does not support the video'),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='tag.'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='Section 3: Code Example'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: \n \n    <div>'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='<p>This is a paragraph inside a div.</p>'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='</div>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='Conclusion'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]

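Using HTMLSemanticPreservingSplitter

HTMLSemanticPreservingSplitter is designed to split HTML content into manageable chunks while preserving the semantic structure of important elements such as tables, lists, and other HTML components. This ensures that such elements are not split across chunks, which would lose contextual information such as table headers and list headers.

The splitter is designed at its heart to create contextually relevant chunks. General recursive splitting with HTMLHeaderTextSplitter can split tables, lists, and other structured elements in the middle, losing significant context and producing poor chunks.

HTMLSemanticPreservingSplitter is essential for splitting HTML content that includes structured elements like tables and lists, especially when it is critical to keep these elements intact. Its support for custom handlers for specific HTML tags also makes it a versatile tool for processing complex HTML documents.

IMPORTANT: max_chunk_size is not a definite maximum size of a chunk. The maximum size is calculated while the preserved content is not part of the chunk, which ensures that the preserved content is not split. When the preserved data is added back into the chunk, the chunk size may exceed max_chunk_size. This is crucial for maintaining the structure of the original document.

Notes:

We define a custom handler to re-format the contents of code blocks.
We define a deny list for specific HTML elements, so that they and their contents are removed during pre-processing.
We intentionally set a small chunk size to demonstrate that elements are not split.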

# BeautifulSoup is required to use the custom handlers
from bs4 import Tag
from langchain_text_splitters import HTMLSemanticPreservingSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]


def code_handler(element: Tag) -> str:
    data_lang = element.get("data-lang")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"

    return code_format


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=50,
    preserve_images=True,
    preserve_videos=True,
    elements_to_preserve=["table", "ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
    custom_handlers={"code": code_handler},
)

documents = splitter.split_text(html_string)
documents

[Document(metadata={'Header 1': 'Main Title'}, page_content='This is an introductory paragraph with some basic content.'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='This section introduces the topic'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content='. Below is a list: First item Second item Third item with bold text and a link Subsection 1.1: Details This subsection provides additional details'),
 Document(metadata={'Header 2': 'Section 1: Introduction'}, page_content=". Here's a table: Header 1 Header 2 Header 3 Row 1, Cell 1 Row 1, Cell 2 Row 1, Cell 3 Row 2, Cell 1 Row 2, Cell 2 Row 2, Cell 3"),
 Document(metadata={'Header 2': 'Section 2: Media Content'}, page_content='This section contains an image and a video: ![image:example_image_link.mp4](example_image_link.mp4) ![video:example_video_link.mp4](example_video_link.mp4)'),
 Document(metadata={'Header 2': 'Section 3: Code Example'}, page_content='This section contains a code block: <code:html> <div> <p>This is a paragraph inside a div.</p> </div> </code>'),
 Document(metadata={'Header 2': 'Conclusion'}, page_content='This is the conclusion of the document.')]

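Preserving tables and lists

In this example, we demonstrate how HTMLSemanticPreservingSplitter can preserve a table and a large list within an HTML document. The chunk size is set to 50 characters to show how the splitter keeps these elements from being split even when they exceed the defined maximum chunk size.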

from langchain_text_splitters import HTMLSemanticPreservingSplitter

html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section 1</h1>
            <p>This section contains an important table and list that should not be split across chunks.</p>
            <table>
                <tr>
                    <th>Item</th>
                    <th>Quantity</th>
                    <th>Price</th>
                </tr>
                <tr>
                    <td>Apples</td>
                    <td>10</td>
                    <td>$1.00</td>
                </tr>
                <tr>
                    <td>Oranges</td>
                    <td>5</td>
                    <td>$0.50</td>
                </tr>
                <tr>
                    <td>Bananas</td>
                    <td>50</td>
                    <td>$1.50</td>
                </tr>
            </table>
            <h2>Subsection 1.1</h2>
            <p>Additional text in subsection 1.1 that is separated from the table and list.</p>
            <p>Here is a detailed list:</p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    elements_to_preserve=["table", "ul"],
)

documents = splitter.split_text(html_string)
print(documents)

[Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'), Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'), Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'), Document(metadata={'Header 2': 'Subsection 1.1'}, page_content="detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]

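Explanation

In this example, HTMLSemanticPreservingSplitter ensures that the entire table and the unordered list (<ul>) are each kept within a single chunk. Even though the chunk size is set to 50 characters, the splitter recognizes that these elements should not be split and keeps them intact. This is particularly important when dealing with data tables or lists, where splitting the content could lead to loss of context or confusion. The resulting Document objects keep each of these elements whole, so the contextual relevance of the information is maintained.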
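
The reason a chunk can exceed max_chunk_size can be sketched with placeholders: preserved elements are swapped out for short tokens, sizes are measured while the tokens stand in for them, and the full content is restored afterwards. The code below is a simplified standard-library illustration of that idea, not the library's implementation. The function `split_preserving`, the `PRESERVED_n` token format, the regex-based tag handling (which does not support nested elements of the same tag), and the word-by-word packing are all assumptions made for this example.

```python
import re


def split_preserving(html, max_chunk_size, preserve=("table", "ul")):
    """Split HTML text into chunks while keeping preserved elements whole."""
    preserved = {}

    def stash(match):
        key = f"PRESERVED_{len(preserved)}"
        # Keep only the element's text, collapsed to single spaces.
        text = re.sub(r"<[^>]+>", " ", match.group(0))
        preserved[key] = " ".join(text.split())
        return f" {key} "

    # Swap each preserved element for a short placeholder token.
    for tag in preserve:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", stash, html, flags=re.S | re.I)

    # Pack words into chunks; sizes are measured with placeholders in place.
    words = re.sub(r"<[^>]+>", " ", html).split()
    chunks, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Restore the preserved content; chunks may now exceed max_chunk_size.
    restored = []
    for chunk in chunks:
        for key in reversed(list(preserved)):  # PRESERVED_10 before PRESERVED_1
            chunk = chunk.replace(key, preserved[key])
        restored.append(chunk)
    return restored


sample = (
    "<p>Short intro text here.</p>"
    "<ul><li>Item one is long and detailed.</li><li>Item two is also long.</li></ul>"
    "<p>Tail.</p>"
)
result = split_preserving(sample, max_chunk_size=30)
for chunk in result:
    print(len(chunk), chunk)
```

In this run the list ends up in one chunk that is longer than the 30-character limit, because the limit was applied to the placeholder rather than to the list itself.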

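Using a custom handler

HTMLSemanticPreservingSplitter lets you define custom handlers for specific HTML elements. Some platforms have custom HTML tags that BeautifulSoup does not parse natively, and in that case a custom handler makes it easy to add formatting logic. This can be particularly useful for elements that require special processing, such as <iframe> tags or specific 'data-' elements. In this example, we create a custom handler that converts iframe tags into Markdown-style links.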

def custom_iframe_extractor(iframe_tag):
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    separators=["\n\n", "\n", ". "],
    elements_to_preserve=["table", "ul", "ol"],
    custom_handlers={"iframe": custom_iframe_extractor},
)

html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section with Iframe</h1>
            <iframe src="https://example.com/embed"></iframe>
            <p>Some text after the iframe.</p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""

documents = splitter.split_text(html_string)
print(documents)

[Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'), Document(metadata={'Header 1': 'Section with Iframe'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]

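Explanation

In this example, we defined a custom handler that converts iframe tags into Markdown-style links. When the splitter processes the HTML content, it uses this handler to transform iframe tags while preserving other elements such as tables and lists. The resulting Document objects show how the iframe is handled according to the custom logic you provided.

Important: When preserving items such as links, take care not to include a bare . in your separators, and do not leave separators blank. RecursiveCharacterTextSplitter splits on the full stop, which cuts links in half. Provide a separator list that contains ". " (a period followed by a space) instead.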
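
The effect of the separator choice can be checked with plain string splitting. This sketch uses only `str.split`, not the splitter itself, and the helper `split_on` is a hypothetical name. It shows that a bare period cuts the URL inside a Markdown-style link, while a period followed by a space leaves the link intact.

```python
link = "[iframe:https://example.com/embed](https://example.com/embed)"
text = f"See {link} for details. More text follows."


def split_on(text, separator):
    """Split on a separator and drop empty pieces."""
    return [piece for piece in text.split(separator) if piece]


on_period = split_on(text, ".")
on_period_space = split_on(text, ". ")

print(on_period)        # the link is broken across several pieces
print(on_period_space)  # the link survives inside the first piece
```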

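Using a custom handler to analyze an image with an LLM

With custom handlers, we can also override the default processing for any element. A good example is inserting a semantic analysis of an image in a document directly into the chunking flow. Because our function is called whenever the tag is encountered, we can override the <img> tag and turn off preserve_images to insert any content we want to embed in our chunks.

"""This example assumes you have a helper method `load_image_from_url` and an LLM agent `llm` that can process image data."""

from langchain.agents import AgentExecutor

# Replace this with your own agent
llm = AgentExecutor(...)


# This method is a placeholder for loading image data from a URL and is not implemented here
def load_image_from_url(image_url: str) -> bytes:
    # Assume this method fetches the image data from the URL
    return b"image_data"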


html_string = """
<!DOCTYPE html>
<html>
    <body>
        <div>
            <h1>Section with Image and Link</h1>
            <p>
                <img src="https://example.com/image.jpg" alt="An example image" />
                Some text after the image.
            </p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </div>
    </body>
</html>
"""


def custom_image_handler(img_tag) -> str:
    img_src = img_tag.get("src", "")
    img_alt = img_tag.get("alt", "No alt text provided")

    image_data = load_image_from_url(img_src)
    semantic_meaning = llm.invoke(image_data)

    markdown_text = f"[Image Alt Text: {img_alt} | Image Source: {img_src} | Image Semantic Meaning: {semantic_meaning}]"

    return markdown_text


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    separators=["\n\n", "\n", ". "],
    elements_to_preserve=["ul"],
    preserve_images=False,
    custom_handlers={"img": custom_image_handler},
)

documents = splitter.split_text(html_string)

print(documents)

[Document(metadata={'Header 1': 'Section with Image and Link'}, page_content='[Image Alt Text: An example image | Image Source: https://example.com/image.jpg | Image Semantic Meaning: semantic-meaning] Some text after the image'),
Document(metadata={'Header 1': 'Section with Image and Link'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")]

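Explanation:

With a custom handler written to extract specific fields from the <img> element in the HTML, we can further process the data with our agent and insert the result directly into our chunk. It is important to make sure that preserve_images is set to False, because otherwise the default processing of <img> elements takes place.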
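
The dispatch pattern behind custom handlers can be sketched with the standard library alone: while walking the HTML, look up each tag in a dictionary of handlers and, if one exists, insert its return value into the output text. This is an illustration only and not how the splitter is implemented. The class `HandlerRenderer`, the functions `render`, `describe_image`, and `image_handler` are hypothetical, handlers here receive a plain dict of attributes rather than a BeautifulSoup Tag, and `describe_image` is a stub standing in for a real LLM call.

```python
from html.parser import HTMLParser


class HandlerRenderer(HTMLParser):
    """Hypothetical illustration: call a custom handler when its tag is found."""

    def __init__(self, custom_handlers):
        super().__init__()
        self.custom_handlers = custom_handlers
        self.parts = []

    def handle_starttag(self, tag, attrs):
        handler = self.custom_handlers.get(tag)
        if handler:
            # Pass the attributes as a dict, which supports .get("src", "").
            self.parts.append(handler(dict(attrs)))

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.parts.append(text)


def render(html, custom_handlers):
    renderer = HandlerRenderer(custom_handlers)
    renderer.feed(html)
    renderer.close()
    return " ".join(renderer.parts)


def describe_image(src):
    # Stub standing in for an LLM call; returns a fixed description.
    return "stub description"


def image_handler(attrs):
    src = attrs.get("src", "")
    alt = attrs.get("alt", "No alt text provided")
    return (
        f"[Image Alt Text: {alt} | Image Source: {src} | "
        f"Image Semantic Meaning: {describe_image(src)}]"
    )


html = (
    '<p><img src="https://example.com/image.jpg" alt="An example image" />'
    " Some text after the image.</p>"
)
rendered = render(html, {"img": image_handler})
print(rendered)
```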
