RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| RecursiveUrlLoader | langchain-community | ✅ | ❌ | ✅ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| RecursiveUrlLoader | ✅ | ❌ |
Setup
Credentials
No credentials are required to use theRecursiveUrlLoader.
Installation
TheRecursiveUrlLoader lives in the langchain-community package. There’s no other required packages, though you will get richer default Document metadata if you have “beautifulsoup4` installed as well.
Instantiation
Now we can instantiate our document loader object and load Documents:Load
Use.load() to synchronously load into memory all Documents, with one
Document per visited URL. Starting from the initial URL, we recurse through
all linked URLs up to the specified max_depth.
Let’s run through a basic example of how to use the RecursiveUrlLoader on the Python 3.9 Documentation.
Lazy loading
If we’re loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:Adding an Extractor
By default the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a more human/LLM-friendly format you can pass in a customextractor method:
metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
API reference
These examples show just a few of the ways in which you can modify the defaultRecursiveUrlLoader, but there are many more modifications that can be made to best fit your use case. Using the parameters link_regex and exclude_dirs can help you filter out unwanted URLs, aload() and alazy_load() can be used for aynchronous loading, and more.
For detailed information on configuring and calling the RecursiveUrlLoader, please see the API reference: python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html.
Connect these docs programmatically to Claude, VSCode, and more via MCP for real-time answers.