TensorFlow Datasets is a collection of datasets ready to use with TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as tf.data.Datasets, enabling easy-to-use and high-performance input pipelines. To get started, see the guide and the list of datasets.

This notebook shows how to load TensorFlow Datasets into a Document format that we can use downstream.
Installation
You need to install the tensorflow and tensorflow-datasets Python packages, e.g. with `pip install tensorflow tensorflow-datasets`.
Example
As an example, we use the mlqa/en dataset.
MLQA (Multilingual Question Answering Dataset) is a benchmark dataset for evaluating multilingual question answering performance. The dataset consists of 7 languages: Arabic, German, Spanish, English, Hindi, Vietnamese, and Chinese.
- Homepage: github.com/facebookresearch/MLQA
- Source code: tfds.datasets.mlqa.Builder
- Download size: 72.21 MiB
In this example, we use the context field as the Document.page_content and place the other fields in the Document.metadata.
TensorflowDatasetLoader has these parameters:
- dataset_name: the name of the dataset to load
- split_name: the name of the split to load. Defaults to "train".
- load_max_docs: a limit to the number of loaded documents. Defaults to 100.
- sample_to_document_function: a function that converts a dataset sample to a Document
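To illustrate the conversion that sample_to_document_function performs, here is a minimal sketch of mapping an mlqa/en sample to a Document. It is a sketch only: the Document dataclass below is a stand-in for the real Document class, the sample dict uses plain strings (real tf.data samples hold tensors that must be decoded first), and the field names assume the mlqa/en schema.

```python
from dataclasses import dataclass, field

# Minimal stand-in for the real Document class, used here only to
# illustrate the mapping shape (page_content plus a metadata dict).
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def mlqa_sample_to_document(sample: dict) -> Document:
    """Map an mlqa/en sample to a Document: the `context` field becomes
    page_content, every other field goes into metadata.

    Real tfds samples hold tf.Tensor values; decode them first,
    e.g. sample["context"].numpy().decode("utf-8").
    """
    return Document(
        page_content=sample["context"],
        metadata={k: v for k, v in sample.items() if k != "context"},
    )

# Toy sample with plain strings standing in for decoded tensor values.
sample = {
    "id": "0",
    "title": "Example",
    "context": "TensorFlow Datasets exposes datasets as tf.data.Datasets.",
    "question": "What does TFDS expose datasets as?",
}
doc = mlqa_sample_to_document(sample)
print(doc.page_content)
print(sorted(doc.metadata))
```

A function with this shape can be passed as sample_to_document_function so the loader knows which field to treat as the page content and which fields to carry along as metadata.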