This page shows how to use Dedoc in combination with LangChain as a DocumentLoader.
Overview
Dedoc is an open-source library/service that extracts text, tables, attached files, and document structure (e.g., titles and list items) from files of various formats. Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images, and more.
The full list of supported formats can be found in the Dedoc documentation.
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| DedocFileLoader | langchain_community | ❌ | beta | ❌ |
| DedocPDFLoader | langchain_community | ❌ | beta | ❌ |
| DedocAPIFileLoader | langchain_community | ❌ | beta | ❌ |
Loader features
Methods for lazy loading and async loading are available, but document loading is in fact executed synchronously.
| Source | Document Lazy Loading | Async Support |
|---|---|---|
| DedocFileLoader | ❌ | ❌ |
| DedocPDFLoader | ❌ | ❌ |
| DedocAPIFileLoader | ❌ | ❌ |
Setup
- To access the DedocFileLoader and DedocPDFLoader document loaders, you'll need to install the dedoc integration package.
- To access DedocAPIFileLoader, you'll need to run the Dedoc service, e.g. in a Docker container (please see the documentation for more details).
Dedoc installation instructions are given in the Dedoc documentation.
Instantiation
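A minimal sketch of creating a loader, assuming the langchain-community and dedoc packages are installed. The helper name make_dedoc_loader is illustrative and not part of either library; the import is deferred so that the function can be defined before the packages are available.

```python
from typing import Any


def make_dedoc_loader(file_path: str, **loader_options: Any) -> Any:
    """Create a DedocFileLoader for a single file.

    `make_dedoc_loader` is an illustrative helper, not a library function.
    """
    # Deferred import: requires langchain-community and dedoc to be installed.
    from langchain_community.document_loaders import DedocFileLoader

    # Extra keyword arguments (e.g. split="page") are passed on unchanged.
    return DedocFileLoader(file_path, **loader_options)
```

For example, make_dedoc_loader("./example_data/state_of_the_union.txt") would create a loader for a text file at that (hypothetical) path.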
Load
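Calling load() parses the file and returns all resulting Document objects as a list. The sketch below assumes both packages are installed; load_all and preview are hypothetical helper names introduced only for illustration.

```python
from typing import Any, List


def load_all(file_path: str) -> List[Any]:
    """Parse one file and return every resulting Document at once."""
    # Deferred import: requires langchain-community and dedoc to be installed.
    from langchain_community.document_loaders import DedocFileLoader

    loader = DedocFileLoader(file_path)
    # load() returns a list of langchain Document objects.
    return loader.load()


def preview(docs: List[Any], limit: int = 100) -> str:
    """Return the first `limit` characters of the first document's text."""
    return docs[0].page_content[:limit] if docs else ""
```

A typical call would be preview(load_all("./example_data/state_of_the_union.txt")), where the path is a placeholder for a real file.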
Lazy Load
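The lazy_load() method returns an iterator over Document objects instead of a list. As noted under Loader features, the parsing itself still runs synchronously, so this mainly changes how the results are consumed. The helper names iter_documents and take are assumptions made for this sketch.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def iter_documents(file_path: str) -> Iterator[Any]:
    """Yield Document objects one at a time via lazy_load()."""
    # Deferred import: requires langchain-community and dedoc to be installed.
    from langchain_community.document_loaders import DedocFileLoader

    loader = DedocFileLoader(file_path)
    yield from loader.lazy_load()


def take(docs: Iterable[Any], n: int) -> List[Any]:
    """Collect at most `n` items from an iterable of documents."""
    return list(islice(docs, n))
```

For instance, take(iter_documents("./example_data/layout-parser-paper.pdf"), 5) would collect at most five documents from a hypothetical PDF at that path.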
API reference
For detailed information on configuring and calling Dedoc loaders, please see the API references:
- python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.dedoc.DedocFileLoader.html
- python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.DedocPDFLoader.html
- python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.dedoc.DedocAPIFileLoader.html
Loading any file
For automatic handling of any file in a supported format, DedocFileLoader can be useful.
The file loader automatically detects the file type, provided the file has a correct extension.
The parsing process can be configured through dedoc_kwargs during DedocFileLoader initialization.
Basic examples of some options are given here;
please see the documentation of DedocFileLoader and the
Dedoc documentation
for more details about configuration parameters.
Basic example
Modes of split
DedocFileLoader supports several ways of splitting a document into parts (each part is returned separately).
The split parameter controls this and accepts the following options:
- document (default value): document text is returned as a single langchain Document object (no splitting);
- page: split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP);
- node: split document text into Dedoc tree nodes (title nodes, list item nodes, raw text nodes);
- line: split document text into textual lines.
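The split options listed above can be passed at initialization. The following sketch validates the value before creating the loader; SPLIT_MODES and make_split_loader are hypothetical names, and the check is an addition of this example rather than library behavior.

```python
from typing import Any

# The four split modes described above.
SPLIT_MODES = ("document", "page", "node", "line")


def make_split_loader(file_path: str, split: str = "document") -> Any:
    """Create a DedocFileLoader that splits the document as requested."""
    if split not in SPLIT_MODES:
        raise ValueError(f"split must be one of {SPLIT_MODES}, got {split!r}")

    # Deferred import: requires langchain-community and dedoc to be installed.
    from langchain_community.document_loaders import DedocFileLoader

    return DedocFileLoader(file_path, split=split)
```

With split="page", each returned Document would hold the text of one page, provided the file is in a format that supports page splitting.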
Handling tables
DedocFileLoader handles tables when the with_tables parameter is
set to True during loader initialization (with_tables=True by default).
Tables are not split: each table corresponds to one langchain Document object.
For tables, the Document object has additional metadata fields: type="table"
and text_as_html, which holds the table's HTML representation.
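Because table documents are marked in their metadata, they can be separated from the rest after loading. This is a sketch under the assumption that every returned object exposes a metadata dictionary, as langchain Document objects do; load_with_tables and separate_tables are illustrative names.

```python
from typing import Any, Iterable, List, Tuple


def load_with_tables(file_path: str) -> List[Any]:
    """Load a file with table extraction enabled (the default, made explicit)."""
    # Deferred import: requires langchain-community and dedoc to be installed.
    from langchain_community.document_loaders import DedocFileLoader

    return DedocFileLoader(file_path, with_tables=True).load()


def separate_tables(docs: Iterable[Any]) -> Tuple[List[Any], List[Any]]:
    """Split documents into (non-table documents, table documents)."""
    others, tables = [], []
    for doc in docs:
        # Table documents carry type="table" in their metadata.
        if doc.metadata.get("type") == "table":
            tables.append(doc)
        else:
            others.append(doc)
    return others, tables
```

The HTML form of each table is then available as table.metadata["text_as_html"] for every item in the second list.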
Handling attached files
DedocFileLoader handles attached files when with_attachments is set
to True during loader initialization (with_attachments=False by default).
Attachments are split according to the split parameter.
For attachments, the langchain Document object has an additional metadata
field type="attachment".
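A short sketch of enabling attachment extraction and picking out the attachment documents afterwards. The helper names are hypothetical, and the example assumes the input is a container format such as an email with attached files.

```python
from typing import Any, Iterable, List


def load_with_attachments(file_path: str) -> List[Any]:
    """Load a file and its attachments (attachments are off by default)."""
    # Deferred import: requires langchain-community and dedoc to be installed.
    from langchain_community.document_loaders import DedocFileLoader

    return DedocFileLoader(file_path, with_attachments=True).load()


def only_attachments(docs: Iterable[Any]) -> List[Any]:
    """Keep only documents whose metadata marks them as attachments."""
    return [doc for doc in docs if doc.metadata.get("type") == "attachment"]
```

Since attachments follow the same split setting as the main document, one attached file may produce several Document objects in the filtered list.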
Loading PDF file
If you want to handle only PDF documents, you can use DedocPDFLoader, which supports only PDF.
The loader supports the same parameters for document splitting and for table and attachment extraction.
Dedoc can extract text from PDFs with or without a textual layer,
and can automatically detect the layer's presence and correctness.
Several PDF handlers are available; use the pdf_with_text_layer
parameter to choose one of them.
Please see the parameters description
for more details.
For PDFs without a textual layer, Tesseract OCR and its language packages should be installed.
In this case, the installation instructions can be useful.
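The sketch below shows one way to choose a pdf_with_text_layer value from what is known about a PDF. It assumes that "true", "false" and "auto" are accepted values; the parameters description remains the authoritative list, and text_layer_option and load_pdf are hypothetical helpers.

```python
from typing import Any, List, Optional


def text_layer_option(has_text_layer: Optional[bool]) -> str:
    """Map knowledge about a PDF to a pdf_with_text_layer value.

    Assumption: "true", "false" and "auto" are valid values for this parameter.
    """
    if has_text_layer is None:
        return "auto"  # unknown: let Dedoc detect the textual layer
    return "true" if has_text_layer else "false"


def load_pdf(file_path: str, has_text_layer: Optional[bool] = None) -> List[Any]:
    """Load a PDF with DedocPDFLoader using the chosen handler."""
    # Deferred import: requires langchain-community and dedoc to be installed.
    from langchain_community.document_loaders import DedocPDFLoader

    loader = DedocPDFLoader(
        file_path, pdf_with_text_layer=text_layer_option(has_text_layer)
    )
    return loader.load()
```

Passing has_text_layer=False corresponds to scanned PDFs, which is the case where Tesseract OCR must be installed.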
Dedoc API
If you want to get up and running with less setup, you can use Dedoc as a service.
DedocAPIFileLoader can be used without installing the dedoc library.
The loader supports the same parameters as DedocFileLoader and
also automatically detects input file types.
To use DedocAPIFileLoader, you should run the Dedoc service, e.g. in a Docker container (please see the documentation
for more details).
Please do not use the demo URL https://dedoc-readme.hf.space in your code.
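A minimal sketch of pointing DedocAPIFileLoader at a self-hosted service. The address http://localhost:1231 is an assumption about a local deployment and should be replaced with the address of your own Dedoc service; make_api_loader and DEDOC_SERVICE_URL are illustrative names.

```python
from typing import Any

# Assumption: a Dedoc service running locally and listening on port 1231.
DEDOC_SERVICE_URL = "http://localhost:1231"


def make_api_loader(
    file_path: str, url: str = DEDOC_SERVICE_URL, **loader_options: Any
) -> Any:
    """Create a DedocAPIFileLoader that sends the file to a Dedoc service."""
    # Deferred import: requires langchain-community; the dedoc library
    # itself is not needed for the API loader.
    from langchain_community.document_loaders import DedocAPIFileLoader

    # Options such as split or with_tables are passed on unchanged.
    return DedocAPIFileLoader(file_path, url=url, **loader_options)
```

Calling load() on the returned loader would then send the file to the service and return the parsed documents.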