The Unstructured document loader loads files of many types. Unstructured currently supports text files, PowerPoint presentations, HTML, PDFs, images, and more.
Please see this guide for more instructions on setting up Unstructured locally, including setting up required system dependencies.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| UnstructuredLoader | langchain-unstructured | ✅ | ❌ | ✅ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| UnstructuredLoader | ✅ | ❌ |
Setup
Credentials
By default, langchain-unstructured installs a smaller footprint that offloads the partitioning logic to the Unstructured API, which requires an API key. If you use the local installation, you do not need an API key. To get an API key, head over to this site, then set the key in the cell below:
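A minimal sketch of setting the key, assuming it is kept in an environment variable named UNSTRUCTURED_API_KEY (the variable name and the helper name are assumptions made for illustration). The prompt function is a parameter so the helper can be exercised without an interactive terminal.

```python
import getpass
import os
from typing import Callable


def ensure_unstructured_api_key(prompt: Callable[[str], str] = getpass.getpass) -> str:
    """Return the API key from the environment, prompting once if it is absent."""
    # UNSTRUCTURED_API_KEY is an assumed variable name; adjust it to your setup.
    if not os.environ.get("UNSTRUCTURED_API_KEY"):
        os.environ["UNSTRUCTURED_API_KEY"] = prompt("Enter your Unstructured API key: ")
    return os.environ["UNSTRUCTURED_API_KEY"]
```

Call ensure_unstructured_api_key() once before building a loader that partitions via the API. With the local installation this step can be skipped.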
Installation
Normal Installation
The following packages are required to run the rest of this notebook.
Installation for Local
If you would like to run the partitioning logic locally, you will need to install a combination of system dependencies, as outlined in the Unstructured documentation here. For example, on Macs you can install the required dependencies with:
You can install the pip dependencies needed for local partitioning with:
Initialization
The UnstructuredLoader allows loading from a variety of different file types. To read all about the unstructured package, please refer to their documentation. In this example, we show loading from both a text file and a PDF file.
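A minimal sketch of initialization, assuming langchain-unstructured is installed and that the two example paths (hypothetical here) exist on disk. The helper name build_loader is illustrative, and the import is deferred so the package is only needed when a loader is actually built.

```python
from typing import List

# Hypothetical example files: one PDF and one plain-text file.
FILE_PATHS: List[str] = [
    "./example_data/layout-parser-paper.pdf",
    "./example_data/state_of_the_union.txt",
]


def build_loader(file_paths: List[str]):
    """Create an UnstructuredLoader for one or more local files."""
    # Deferred import: langchain-unstructured is required only when this runs.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(file_paths)
```

Calling build_loader(FILE_PATHS) returns a loader covering both files. No file is read until the loader is asked to load.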
Load
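Calling load() on the loader returns a list of documents, each with page_content and metadata. The sketch below is one way to get a quick overview of the result. It assumes each document's metadata carries a "category" entry naming the element type, which may not hold for every configuration, and the helper name is illustrative.

```python
from collections import Counter
from typing import Any, Iterable


def count_by_category(docs: Iterable[Any]) -> Counter:
    """Count documents by the "category" entry of their metadata."""
    # "category" is an assumed metadata key; documents without it count as "unknown".
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


# Usage, given a loader built as in the Initialization section:
#     docs = loader.load()
#     print(count_by_category(docs))
```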
Lazy Load
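Calling lazy_load() yields documents one at a time instead of building the full list in memory. A small sketch of consuming such an iterator in fixed-size batches follows. The in_batches helper is illustrative and not part of the library, and it works with any iterable.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, pulling from the iterable lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage, given a loader built as in the Initialization section:
#     for batch in in_batches(loader.lazy_load(), 10):
#         ...  # for example, index the batch, then let it be garbage collected
```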
Post Processing
If you need to post-process the unstructured elements after extraction, you can pass a list of
str -> str functions to the post_processors kwarg when you instantiate the UnstructuredLoader. This applies to other Unstructured loaders as well. Below is an example.
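A minimal sketch with a hand-written post-processor, assuming langchain-unstructured is installed. The names collapse_whitespace and build_loader_with_post_processors are illustrative. Any function that takes a string and returns a string fits the same slot.

```python
import re
from typing import Callable, List


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def build_loader_with_post_processors(
    file_path: str, post_processors: List[Callable[[str], str]]
):
    """Create a loader that applies each str -> str function to the extracted text."""
    # Deferred import: langchain-unstructured is required only when this runs.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(file_path, post_processors=post_processors)
```

For example, build_loader_with_post_processors("./example_data/layout-parser-paper.pdf", [collapse_whitespace]) would clean each element's text during loading (the path is hypothetical). The unstructured package also ships ready-made cleaners such as clean_extra_whitespace in unstructured.cleaners.core.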
Unstructured API
If you want to get up and running with smaller packages and get the most up-to-date partitioning, you can
pip install unstructured-client and pip install langchain-unstructured. For
more information about the UnstructuredLoader, refer to the
Unstructured provider page.
The loader will process your document using the hosted Unstructured serverless API when you pass in
your api_key and set partition_via_api=True. You can generate a free
Unstructured API key here.
Check out the instructions here
if you’d like to self-host the Unstructured API or run it locally.
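A minimal sketch of routing partitioning through the hosted API, assuming langchain-unstructured and unstructured-client are installed and that the key is stored in an environment variable named UNSTRUCTURED_API_KEY (an assumed name). The helper names are illustrative.

```python
import os
from typing import Any, Dict


def api_loader_kwargs(file_path: str) -> Dict[str, Any]:
    """Keyword arguments that send partitioning to the hosted Unstructured API."""
    return {
        "file_path": file_path,
        # Assumed environment variable name; None if it is not set.
        "api_key": os.getenv("UNSTRUCTURED_API_KEY"),
        "partition_via_api": True,
    }


def build_api_loader(file_path: str):
    """Create a loader that partitions the file via the API rather than locally."""
    # Deferred import: langchain-unstructured is required only when this runs.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(**api_loader_kwargs(file_path))
```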
Unstructured SDK Client
Partitioning with the Unstructured API relies on the Unstructured SDK Client. If you want to customize the client, you will have to pass an UnstructuredClient instance to the UnstructuredLoader. Below is an example showing how you can customize features of the client, such as using your own requests.Session(), passing an alternative server_url, and customizing the RetryConfig object. For more information about customizing the client or the additional parameters the SDK client accepts, refer to the Unstructured Python SDK docs and the client section of the API Parameters docs. Note that all API Parameters should be passed to the UnstructuredLoader.
Warning: The example below may not use the latest version of the UnstructuredClient and there could be breaking changes in future releases. For the latest examples, refer to the Unstructured Python SDK docs.
Chunking
The UnstructuredLoader does not support mode as a parameter for grouping text like the older
loader UnstructuredFileLoader and others did. It instead supports “chunking”. Chunking in
unstructured differs from other chunking mechanisms you may be familiar with that form chunks based
on plain-text features—character sequences like “\n\n” or “\n” that might indicate a paragraph
boundary or list-item boundary. Instead, all documents are split using specific knowledge about each
document format to partition the document into semantic units (document elements) and we only need to
resort to text-splitting when a single element exceeds the desired maximum chunk size. In general,
chunking combines consecutive elements to form chunks as large as possible without exceeding the
maximum chunk size. Chunking produces a sequence of CompositeElement, Table, or TableChunk elements.
Each “chunk” is an instance of one of these three types.
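The combining rule described above can be illustrated with a toy sketch over plain strings. This is not the unstructured implementation: real chunking works on typed document elements and splits oversized text more carefully. The function name and the separator are assumptions made for illustration.

```python
from typing import List


def combine_elements(
    elements: List[str], max_characters: int, separator: str = "\n\n"
) -> List[str]:
    """Greedily merge consecutive texts into chunks of at most max_characters."""
    chunks: List[str] = []
    current = ""
    for text in elements:
        if not text:
            continue  # nothing to add for an empty element
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_characters:
            current = candidate  # still fits, so keep growing the chunk
            continue
        if current:
            chunks.append(current)  # close the chunk that is already full
            current = ""
        if len(text) > max_characters:
            # A single oversized element is the only case that is text-split.
            chunks.extend(
                text[i : i + max_characters]
                for i in range(0, len(text), max_characters)
            )
        else:
            current = text
    if current:
        chunks.append(current)
    return chunks
```

With a limit of 25 characters, a short title and a short paragraph end up in one chunk, while a 30-character element is cut into pieces of 25 and 5 characters.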
See this page for more
details about chunking options, but to reproduce the same behavior as mode="single", you can set
chunking_strategy="basic", max_characters=<some-really-big-number>, and include_orig_elements=False.
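A minimal sketch of those settings, assuming langchain-unstructured is installed. The default of 1,000,000 characters is an arbitrary stand-in for "some really big number", and the helper names are illustrative.

```python
from typing import Any, Dict


def single_mode_kwargs(max_characters: int = 1_000_000) -> Dict[str, Any]:
    """Chunking options that approximate the older mode="single" behavior."""
    return {
        "chunking_strategy": "basic",
        # Arbitrary large limit so chunks are effectively never split.
        "max_characters": max_characters,
        "include_orig_elements": False,
    }


def build_single_chunk_loader(file_path: str):
    """Create a loader whose chunking options approximate mode="single"."""
    # Deferred import: langchain-unstructured is required only when this runs.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(file_path, **single_mode_kwargs())
```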
Loading web pages
UnstructuredLoader accepts a web_url kwarg when run locally that populates the url parameter of the underlying Unstructured partition. This allows for the parsing of remotely hosted documents, such as HTML web pages.
Example usage:
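A minimal sketch, assuming langchain-unstructured is installed with the local partitioning dependencies. The helper name and the up-front URL check are additions made for illustration. The loader itself only needs the web_url kwarg.

```python
from urllib.parse import urlparse


def build_web_loader(web_url: str):
    """Create a loader for a remotely hosted document such as an HTML page."""
    parsed = urlparse(web_url)
    # Fail early on anything that is not an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Expected an http(s) URL, got: {web_url!r}")
    # Deferred import: langchain-unstructured is required only when this runs.
    from langchain_unstructured import UnstructuredLoader

    return UnstructuredLoader(web_url=web_url)


# Usage:
#     loader = build_web_loader("https://www.example.com")
#     docs = loader.load()
```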
API reference
For detailed documentation of all UnstructuredLoader features and configurations, head to the API reference: python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html