6. How to manage datasets
Need to figure out which dataset IDs exist and how to manage their caches? This recipe walks you through discovery, metadata inspection, and cache management with CLI and Python examples side by side. If ModSSC is not installed yet, start with Installation.
6.1 Problem statement
You want to list, inspect, and download the datasets that ModSSC can load, and understand where they are cached. [1][2][3] Once you pick a dataset key, continue with the sampling and preprocess steps to shape the data for a run.
6.2 When to use
Use these steps when you are starting a new experiment, switching modalities, or pre-downloading datasets before a large run. [4][5]
Use the providers command to see which backends are wired up, and the list command to get the curated dataset keys used in configs and CLI commands. Use the info command to check the required_extra field before downloading a dataset. [2][6]
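The required_extra check can be scripted before a download is attempted. The following is a minimal sketch, not ModSSC's own logic: the helper name missing_extra_hint, the dictionary layout, and the example_provider and example_extra values are all illustrative assumptions. In a real script the dictionary would presumably come from dataset_info(key).as_dict().

```python
from typing import Mapping, Optional


def missing_extra_hint(spec: Mapping[str, object]) -> Optional[str]:
    """Return a short hint if the spec names an optional extra, else None."""
    extra = spec.get("required_extra")
    if not extra:
        return None
    return f"this dataset needs the optional extra '{extra}'; install it before downloading"


# Stand-in spec dictionaries with hypothetical values (not real ModSSC output).
needs_extra = {
    "modality": "text",
    "provider": "example_provider",
    "required_extra": "example_extra",
}
no_extra = {"modality": "tabular", "provider": "example_provider", "required_extra": None}

print(missing_extra_hint(needs_extra))
print(missing_extra_hint(no_extra))
```

Returning None when no extra is required keeps the check easy to use in an if statement ahead of the download call.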
6.3 Steps
1) List providers and dataset keys. [2]
2) Inspect a dataset spec (modality, provider, required extra). [6][2]
3) Download a dataset into the local cache. [1][2]
Use --all when you want an offline cache for a full modality, and --dataset when you only need a single dataset key. [2]
4) Inspect or clean the cache index. [3][2]
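The four steps above can be laid out as a sequence of CLI invocations. The sketch below only builds and prints the argument lists, so it runs without ModSSC installed. The helper name download_command is hypothetical, and --all is passed on its own because this recipe does not cover any additional filter flags for it.

```python
from typing import List, Optional


def download_command(dataset: Optional[str] = None) -> List[str]:
    """Build the argv for `modssc datasets download`.

    With a dataset key, target that single dataset via --dataset;
    without one, fall back to --all.
    """
    base = ["modssc", "datasets", "download"]
    if dataset is None:
        return base + ["--all"]
    return base + ["--dataset", dataset]


# One argv per step: list, inspect, download, then check the cache.
steps = [
    ["modssc", "datasets", "providers"],
    ["modssc", "datasets", "list", "--modalities", "text"],
    ["modssc", "datasets", "info", "--dataset", "ag_news"],
    download_command("ag_news"),
    ["modssc", "datasets", "cache", "ls"],
]

# Dry run: print each command instead of executing it.
for argv in steps:
    print(" ".join(argv))
```

To execute the commands rather than print them, each list can be passed to subprocess.run(argv, check=True) once modssc is on the PATH.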
6.4 Copy-paste example
Use the CLI for quick inspection in the terminal (modssc datasets in src/modssc/cli/datasets.py), and use Python when you want dataset access inside a script (helpers in src/modssc/data_loader/api.py). [2][1]
CLI:
modssc datasets providers
modssc datasets list --modalities text
modssc datasets info --dataset ag_news
modssc datasets download --dataset ag_news
modssc datasets cache ls
Python:
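from modssc.data_loader import available_datasets, dataset_info, download_dataset
# List the datasets that ModSSC knows about.
print(available_datasets())
# Inspect the spec for the "toy" dataset key.
print(dataset_info("toy").as_dict())
# Download "toy" into the local cache; the return value is not needed here.
_ = download_dataset("toy")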
6.5 Pitfalls
Warning
If a dataset requires optional dependencies that are not installed, the download fails with an actionable error message. Install the extra that the message suggests; extras are defined in pyproject.toml. [7][8]
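When pre-downloading several datasets, it can help to collect these failures rather than stop at the first one. The sketch below is an illustration under assumptions: prefetch and fake_download are hypothetical names, the exception type raised by ModSSC is not specified in this recipe (so the sketch catches Exception broadly), and the error text is invented. In practice download_dataset would be passed in place of the stand-in.

```python
from typing import Callable, Dict, Iterable


def prefetch(keys: Iterable[str], download: Callable[[str], object]) -> Dict[str, str]:
    """Try to download each key; return {key: error message} for the failures."""
    failures: Dict[str, str] = {}
    for key in keys:
        try:
            download(key)
        except Exception as exc:  # exact exception type is an unknown here
            failures[key] = str(exc)
    return failures


def fake_download(key: str) -> None:
    """Stand-in for download_dataset with made-up failure behavior."""
    if key == "needs_extra":
        raise RuntimeError("missing optional dependency; install the suggested extra")


failures = prefetch(["toy", "needs_extra"], fake_download)
for key, message in failures.items():
    print(f"{key}: {message}")
```

Passing the downloader in as an argument keeps the loop testable without network access or optional dependencies.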
Tip
Override the dataset cache directory with MODSSC_CACHE_DIR if you want to store datasets outside the repo or the default user cache. [3]
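From Python, the override can be applied by setting the environment variable early in the script. This is a minimal sketch: the temporary directory is only a placeholder for a real storage location, and setting the variable before importing modssc is a precaution based on the assumption that the library may read it at import time.

```python
import os
import tempfile
from pathlib import Path

# Placeholder location; in practice this would be a persistent directory.
cache_dir = Path(tempfile.mkdtemp(prefix="modssc-cache-"))

# Set the override before importing modssc, in case it is read at import time (assumption).
os.environ["MODSSC_CACHE_DIR"] = str(cache_dir)

print(os.environ["MODSSC_CACHE_DIR"])
```

The same effect can be had from a shell by exporting MODSSC_CACHE_DIR before running the modssc commands.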