Overview
Integration details
This example goes over how to load data from webpages using Cheerio. One document will be created for each webpage. Cheerio is a fast and lightweight library that allows you to parse and traverse HTML documents using a jQuery-like syntax. You can use Cheerio to extract data from web pages without having to render them in a browser. However, Cheerio does not simulate a web browser, so it cannot execute JavaScript code on the page. This means that it cannot extract data from dynamic web pages that require JavaScript to render. To do that, you can use the PlaywrightWebBaseLoader or PuppeteerWebBaseLoader instead.
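The limitation described above can be illustrated with Cheerio directly. This is a minimal sketch, assuming the cheerio package is installed; the HTML string is a made-up page, not one taken from this guide.

```typescript
import * as cheerio from "cheerio";

// Hypothetical page: the <p> is static, while the #app element is only
// filled in by the inline script when a real browser runs it.
const html = `
  <html>
    <body>
      <p>Static text</p>
      <div id="app"></div>
      <script>
        document.getElementById("app").textContent = "Rendered by JavaScript";
      </script>
    </body>
  </html>
`;

// Cheerio parses the markup but never executes the script.
const $ = cheerio.load(html);

// Text that is present in the raw HTML can be extracted.
const staticText = $("p").text();
// The script never ran, so this element is still empty.
const dynamicText = $("#app").text();

console.log({ staticText, dynamicText });
```

Content that only appears after scripts run, like the text of the `#app` element here, is the case where a browser-based loader is the better fit.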
| Class | Package | Local | Serializable | PY support |
|---|---|---|---|---|
| CheerioWebBaseLoader | @langchain/community | ✅ | ✅ | ❌ |
Loader features
| Source | Web Support | Node Support |
|---|---|---|
| CheerioWebBaseLoader | ✅ | ✅ |
Setup
To access the CheerioWebBaseLoader document loader, you’ll need to install the @langchain/community integration package, along with the cheerio peer dependency.
Credentials
If you want to get automated tracing of your model calls, you can also set your LangSmith API key through the LANGSMITH_API_KEY environment variable and enable tracing with LANGSMITH_TRACING="true".
Installation
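The LangChain CheerioWebBaseLoader integration lives in the @langchain/community package. Install it together with the cheerio peer dependency, for example by running `npm install @langchain/community cheerio`.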
Instantiation
Now we can instantiate the loader object and load documents.
Load
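A minimal sketch of instantiating the loader and loading a page, assuming @langchain/community and cheerio are installed. The URL is a placeholder and `loadPage` is a helper name introduced here for illustration.

```typescript
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

// Placeholder URL for illustration; replace it with the page you want to load.
const url = "https://example.com";

// Instantiate the loader for a single web page.
const loader = new CheerioWebBaseLoader(url);

// Fetch the page, parse it with Cheerio, and return the resulting documents.
async function loadPage() {
  const docs = await loader.load();
  // One document is created per web page.
  console.log(docs[0].metadata);
  console.log(docs[0].pageContent.slice(0, 200));
  return docs;
}
```

The network request only happens when `loadPage()` is called, for example with `await loadPage()` from an ES module or another async function.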
Additional configurations
CheerioWebBaseLoader supports additional configuration when instantiating the loader. Here is an example of how to use it with the selector field passed, making it load only the content matched by the provided CSS selector:
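A minimal sketch of passing the selector field, again with a placeholder URL. The `"p"` selector and the `loadParagraphText` helper are assumptions chosen for illustration.

```typescript
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

// Placeholder URL; the "p" selector restricts extraction to <p> elements.
const loaderWithSelector = new CheerioWebBaseLoader("https://example.com", {
  selector: "p",
});

// Load the page and return only the text matched by the selector.
async function loadParagraphText(): Promise<string> {
  const docs = await loaderWithSelector.load();
  return docs[0].pageContent;
}
```

Narrowing the selector is a simple way to skip navigation bars, footers, and other markup that would otherwise end up in the document text.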
API reference
For detailed documentation of all CheerioWebBaseLoader features and configurations head to the API reference.