Oracle AI Vector Search: Document Processing

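Oracle AI Vector Search is designed for artificial intelligence (AI) workloads and lets you query data based on semantics rather than keywords. One of its biggest benefits is that semantic search on unstructured data can be combined with relational search on business data in a single system. This is not only powerful but also more effective: you do not need to add a specialized vector database, which eliminates data fragmentation across multiple systems. In addition, your vectors can take advantage of Oracle Database's most powerful features.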

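This guide shows how to use the document processing capabilities in Oracle AI Vector Search to load and chunk documents with OracleDocLoader and OracleTextSplitter, respectively. If you are just getting started with Oracle Database, consider exploring the free Oracle 23 AI, which provides a good introduction to setting up your database environment. When working with the database, it is advisable not to use the system user by default; instead, create your own user for better security and customization. For detailed steps on user creation, refer to the end-to-end guide, which also shows how to set up a user in Oracle. Understanding user privileges is also important for managing database security effectively. You can learn more about this topic in the official Oracle guide on administering user accounts and security.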

Prerequisites

To use LangChain with Oracle AI Vector Search, install the Oracle Python client driver.

# pip install oracledb

Connect to Oracle Database

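The following sample code shows how to connect to Oracle Database. By default, python-oracledb runs in 'Thin' mode, which connects directly to Oracle Database and does not need Oracle Client libraries. However, some additional functionality is available when python-oracledb uses them. python-oracledb is said to be in 'Thick' mode when Oracle Client libraries are used. Both modes have comprehensive functionality supporting the Python Database API v2.0 Specification. See the python-oracledb guide that describes the features supported in each mode. If you are unable to use Thin mode, you can switch to Thick mode.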

import sys

import oracledb

# please update with your username, password, hostname and service_name
username = "<username>"
password = "<password>"
dsn = "<hostname>/<service_name>"

try:
    conn = oracledb.connect(user=username, password=password, dsn=dsn)
    print("Connection successful!")
except Exception as e:
    # include the driver error so the cause of the failure is visible
    print(f"Connection failed: {e}")
    sys.exit(1)
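
The note above about switching to Thick mode can be sketched as follows. This is a minimal sketch, assuming Oracle Client libraries are already installed on the machine; `enable_thick_mode` is a hypothetical wrapper name, while `oracledb.init_oracle_client()` and `oracledb.is_thin_mode()` are the driver calls it relies on.

```python
def enable_thick_mode(lib_dir=None):
    """Switch python-oracledb to Thick mode and report whether it took effect.

    Hypothetical helper: lib_dir is an optional path to the Oracle Client
    libraries; when omitted, the driver searches its default locations.
    """
    # Imported lazily so this helper can be defined even where the driver
    # has not been installed yet.
    import oracledb

    if lib_dir is None:
        oracledb.init_oracle_client()
    else:
        oracledb.init_oracle_client(lib_dir=lib_dir)

    # is_thin_mode() returns False once Thick mode has been enabled.
    return not oracledb.is_thin_mode()
```

Call a helper like this before the first `oracledb.connect(...)`; the driver does not allow the mode to change after a Thin mode connection has been created.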

Now let's create a table and insert some sample documents to test.

try:
    cursor = conn.cursor()

    drop_table_sql = """drop table if exists demo_tab"""
    cursor.execute(drop_table_sql)

    create_table_sql = """create table demo_tab (id number, data clob)"""
    cursor.execute(create_table_sql)

    insert_row_sql = """insert into demo_tab values (:1, :2)"""
    rows_to_insert = [
        (
            1,
            "If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.",
        ),
        (
            2,
            "A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.",
        ),
        (
            3,
            "The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.",
        ),
    ]
    cursor.executemany(insert_row_sql, rows_to_insert)

    conn.commit()

    print("Table created and populated.")
    cursor.close()
except Exception as e:
    print(f"Table creation failed: {e}")
    # closing the connection also releases any cursor opened above
    conn.close()
    sys.exit(1)
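
Before moving on, it can be useful to confirm that the rows actually landed in the table. The helper below is a minimal sketch, not part of the original guide: `count_rows` is a hypothetical name, and it uses only the Python Database API v2.0 cursor methods (`cursor()`, `execute()`, `fetchone()`, `close()`), so it should work with the `conn` object created earlier.

```python
def count_rows(conn, table_name):
    """Return the number of rows in table_name using plain DB API 2.0 calls."""
    # Table names cannot be passed as bind variables, so accept only
    # simple identifiers before interpolating the name into the SQL text.
    if not table_name.isidentifier():
        raise ValueError(f"Unexpected table name: {table_name!r}")

    cursor = conn.cursor()
    try:
        cursor.execute(f"select count(*) from {table_name}")
        (count,) = cursor.fetchone()
        return count
    finally:
        # Always release the cursor, even if the query fails.
        cursor.close()
```

Calling `count_rows(conn, "demo_tab")` right after the insert above should return 3, one for each sample document.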

Load Documents

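Users can flexibly load documents from Oracle Database, a file system, or both by configuring the loader parameters appropriately. For comprehensive details on these parameters, consult the Oracle AI Vector Search guide. A significant advantage of OracleDocLoader is that it can process more than 150 different file formats, which removes the need for separate loaders for different document types. For the complete list of supported formats, see the Oracle Text supported document formats. The sample code below shows how to use OracleDocLoader.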

from langchain_community.document_loaders.oracleai import OracleDocLoader
from langchain_core.documents import Document

"""
# loading a local file
loader_params = {}
loader_params["file"] = "<file>"

# loading from a local directory
loader_params = {}
loader_params["dir"] = "<directory>"
"""

# loading from Oracle Database table
loader_params = {
    "owner": "<owner>",
    "tablename": "demo_tab",
    "colname": "data",
}

# load the docs
loader = OracleDocLoader(conn=conn, params=loader_params)
docs = loader.load()

# verify
print(f"Number of docs loaded: {len(docs)}")
# print(f"Document-0: {docs[0].page_content}") # content
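
The three parameter shapes shown above (a single file, a directory, or a database table) can be wrapped in a small helper so that a mistyped or incomplete configuration fails early. This is a minimal sketch under stated assumptions: `build_loader_params` is a hypothetical function, it uses only the keys that appear in the sample (`file`, `dir`, `owner`, `tablename`, `colname`), and for simplicity it builds parameters for one source at a time.

```python
def build_loader_params(file=None, directory=None, owner=None,
                        tablename=None, colname=None):
    """Build a loader_params dict for exactly one source.

    Hypothetical helper: pass either file, or directory, or all three of
    owner / tablename / colname.
    """
    table_parts = (owner, tablename, colname)
    wants_table = any(part is not None for part in table_parts)

    # Count how many distinct sources the caller asked for.
    chosen = sum([file is not None, directory is not None, wants_table])
    if chosen != 1:
        raise ValueError("Specify exactly one of: file, directory, or table")

    if file is not None:
        return {"file": file}
    if directory is not None:
        return {"dir": directory}

    # A table source is only usable when all three parts are present.
    if not all(part is not None for part in table_parts):
        raise ValueError("A table source needs owner, tablename, and colname")
    return {"owner": owner, "tablename": tablename, "colname": colname}
```

The returned dictionary can then be passed as `params` to `OracleDocLoader` in the same way as `loader_params` in the sample.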

Split Documents

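Documents can vary in size from small to very large. Users often prefer to chunk their documents into smaller sections to make generating embeddings easier. A wide range of customization options is available for this splitting process. For comprehensive details on these parameters, consult the Oracle AI Vector Search guide. The sample code below shows how to implement this: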

from langchain_community.document_loaders.oracleai import OracleTextSplitter
from langchain_core.documents import Document

"""
# Some examples
# split by chars, max 500 chars
splitter_params = {"split": "chars", "max": 500, "normalize": "all"}

# split by words, max 100 words
splitter_params = {"split": "words", "max": 100, "normalize": "all"}

# split by sentence, max 20 sentences
splitter_params = {"split": "sentence", "max": 20, "normalize": "all"}
"""

# split by default parameters
splitter_params = {"normalize": "all"}

# get the splitter instance
splitter = OracleTextSplitter(conn=conn, params=splitter_params)

list_chunks = []
for doc in docs:
    chunks = splitter.split_text(doc.page_content)
    list_chunks.extend(chunks)

# verify
print(f"Number of Chunks: {len(list_chunks)}")
# print(f"Chunk-0: {list_chunks[0]}") # content
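
The loop above flattens every chunk into one list, which loses track of which document a chunk came from. If that link matters later (for example, when storing embeddings), a variant like the following may help. It is a minimal sketch, not part of the original guide: `split_with_provenance` and the record keys are hypothetical, and it relies only on the `split_text` method and `page_content` attribute used in the sample.

```python
def split_with_provenance(docs, splitter):
    """Split each document and record which document every chunk came from."""
    records = []
    for doc_index, doc in enumerate(docs):
        # split_text returns the chunks for a single document's text
        chunks = splitter.split_text(doc.page_content)
        for chunk_index, chunk in enumerate(chunks):
            records.append(
                {
                    "doc_index": doc_index,      # position of the source document
                    "chunk_index": chunk_index,  # position of the chunk within it
                    "text": chunk,
                }
            )
    return records
```

With the objects from the sample, `split_with_provenance(docs, splitter)` yields one record per chunk, so `len(...)` of the result matches `len(list_chunks)`.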

End-to-End Demo

For a complete demo that builds an end-to-end RAG pipeline with the help of Oracle AI Vector Search, see the Oracle AI Vector Search End-to-End Demo Guide.
