This folder contains the dataset loading scripts used to train OpenAssistant. The current list of datasets can be found here.
To add a new dataset to OpenAssistant, follow these steps:
- **Create an issue**: Create a new issue and describe your proposal for the new dataset.
- **Create a dataset on HuggingFace**: Create a dataset on HuggingFace. See below for more details.
- **Make a pull request**: Add a new dataset loading script to this folder and link the issue in the pull request description. For more information, see below.
To create a new dataset on HuggingFace, follow these steps:
1. Convert your dataset file(s) to the Parquet format using the pandas library:

   ```python
   import pandas as pd

   # Create a pandas dataframe from your dataset file(s)
   df = pd.read_json(...)  # or any other way

   # Save the file in the Parquet format
   df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
   ```
2. Install the `huggingface_hub` library, which provides the `huggingface-cli` tool:

   ```
   pip install huggingface_hub
   ```
3. Use your access token to log in:

   - via terminal:

     ```
     huggingface-cli login
     ```

   - in a Jupyter notebook:

     ```python
     from huggingface_hub import notebook_login

     notebook_login()
     ```
4. Push the dataset to HuggingFace:

   ```python
   from datasets import Dataset

   ds = Dataset.from_parquet("dataset.parquet")
   ds.push_to_hub("your_huggingface_name/dataset_name")
   ```
5. Update the `README.md` file of your dataset by visiting this link (substitute your HuggingFace name and dataset name):

   https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
To make a pull request, follow these steps:

- Create a folder with the name of your dataset.
- Add a loading script that loads your dataset from HuggingFace, for example:

  ```python
  from datasets import load_dataset

  if __name__ == "__main__":
      ds = load_dataset("your_huggingface_name/dataset_name")
      print(ds["train"])
  ```

- Optionally, add any other files that describe your dataset and its creation, such as a README, notebooks, scrapers, etc.
- Format your code by running:

  ```
  pre-commit run
  ```
- Submit a pull request and include a link to the issue it resolves in the description, for example: `Resolves #123`
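Putting the pull-request steps above together, the resulting folder might look roughly like this. This is only a sketch: the `my_dataset` name and file names are hypothetical, and only the loading script is strictly required:

```
datasets/
└── my_dataset/
    ├── my_dataset.py   # loading script calling load_dataset(...)
    └── README.md       # optional description of how the dataset was created
```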