Datasets

This folder contains the dataset loading scripts used to train OpenAssistant. The current list of datasets can be found here.

Adding a New Dataset

To add a new dataset to OpenAssistant, follow these steps:

Create an issue: Open a new issue that describes your proposal for the new dataset.
Create a dataset on HuggingFace: Upload your data to HuggingFace as described under "Creating a Dataset on HuggingFace" below.
Make a pull request: Add a new dataset loading script to this folder and link the issue in the pull request description, as described under "Making a Pull Request" below.

Creating a Dataset on HuggingFace

To create a new dataset on HuggingFace, follow these steps:

1. Convert your dataset file(s) to the Parquet format using the pandas library:

import pandas as pd

# Create a pandas dataframe from your dataset file(s)
df = pd.read_json(...) # or any other way

# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")

2. Install the HuggingFace CLI

The huggingface-cli command is provided by the huggingface_hub package:

pip install huggingface_hub

3. Log in to HuggingFace

Use your access token to log in.

In a terminal:

huggingface-cli login

In a Jupyter notebook:

from huggingface_hub import notebook_login
notebook_login()

4. Push the Parquet file to HuggingFace using the following code:

from datasets import Dataset
ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")

5. Update the `README.md` file

Edit the README.md file (the dataset card) of your dataset at https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md, replacing your_huggingface_name and dataset_name with your HuggingFace name and dataset name.

Making a Pull Request

1. Fork this repository

2. Create a new branch in your fork

3. Add your dataset to the repository

Create a folder with the name of your dataset.

Add a loading script that loads your dataset from HuggingFace, for example:

from datasets import load_dataset

if __name__ == "__main__":
    ds = load_dataset("your_huggingface_name/dataset_name")
    print(ds["train"])
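If you want the loading script to be reusable for more than one split, one possible variation is sketched below. The `--repo-id` and `--split` options and the `parse_args`/`main` helpers are illustrative assumptions, not a convention required by this repository. `load_dataset` is imported inside `main` so that argument parsing can be exercised without the `datasets` package or network access.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command-line options of the loading script."""
    parser = argparse.ArgumentParser(description="Load a dataset from HuggingFace.")
    parser.add_argument(
        "--repo-id",
        default="your_huggingface_name/dataset_name",
        help="HuggingFace dataset repository to load",
    )
    parser.add_argument("--split", default="train", help="dataset split to load")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported lazily so parse_args works without datasets installed.
    from datasets import load_dataset

    ds = load_dataset(args.repo_id, split=args.split)
    print(ds)
```

To run it as a script, call `main()` under the usual `if __name__ == "__main__":` guard, as in the example above.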

Optionally, add any other files that describe your dataset and its creation, such as a README, notebooks, scrapers, etc.

4. Stage your changes and run the pre-commit hook

pre-commit run

5. Submit a pull request

Submit a pull request and include a link to the issue it resolves in the description, for example: Resolves #123
