README.md
__init__.py
dialogue_collator.py
extra_rm_datasets.py
formatting.py
instruction.py
oasst_dataset.py
pretrain_datasets.py
prompt_dialogue.py
qa_datasets.py
rank_datasets.py
ranking_collator.py
summarization.py
toxic_conversation.py
translation.py
utils.py

README.md

Dataset collections overview:

Currently, the datasets can be divided into 3 classes:

1. language knowledge
   - summarization
   - translation
2. dialogue: don't let the user know you are a robot
3. STEM: knowledge about the world
   - code
   - world knowledge <= ideally we want to handle this via prefix context
   - qa

Issues and TODO:

As the number of datasets grows, how can we update this section less often?
Ideally, we would only update the config YAML, and new datasets would be downloaded from the hub.
- One possible idea is to upload the transformed format of these datasets to the OA hub.
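
A minimal sketch of the config-driven idea, not the repository's actual implementation. It assumes a config shaped like what a YAML parser would return, with hypothetical keys `name`, `hub_id`, and `category`. The names `DatasetSpec`, `parse_specs`, `load_all`, and `fake_fetch`, and the `example-org/...` hub ids, are placeholders invented for illustration. In practice the `fetch` callable might wrap a hub client such as Hugging Face's `datasets.load_dataset`, but that is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical config, shaped like the result of parsing a YAML file.
# The hub ids are placeholders, not real repositories.
CONFIG = {
    "datasets": [
        {"name": "my_summaries", "hub_id": "example-org/my_summaries",
         "category": "language_knowledge"},
        {"name": "my_qa", "hub_id": "example-org/my_qa", "category": "stem"},
    ]
}


@dataclass(frozen=True)
class DatasetSpec:
    """One config entry: a local name, a hub identifier, and a class label."""
    name: str
    hub_id: str
    category: str


def parse_specs(config: dict) -> List[DatasetSpec]:
    """Turn the raw config mapping into typed specs (empty if no entries)."""
    return [DatasetSpec(**entry) for entry in config.get("datasets", [])]


def load_all(specs: List[DatasetSpec],
             fetch: Callable[[str], list]) -> Dict[str, list]:
    """Load every configured dataset through a single fetch callable."""
    return {spec.name: fetch(spec.hub_id) for spec in specs}


def fake_fetch(hub_id: str) -> list:
    """Stand-in for a real hub download, so the sketch runs offline."""
    return [{"source": hub_id, "text": "placeholder"}]


specs = parse_specs(CONFIG)
loaded = load_all(specs, fake_fetch)
```

With this shape, adding a dataset only requires a new config entry, provided its data is already stored on the hub in the transformed format, so no new loader code is needed.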