Currently, the datasets can be divided into 3 classes:

- language knowledge
  - summarization
  - translation
- dialogue: don't let the user know you are a robot
- STEM: knowledge about the world
  - code
  - world knowledge <= ideally we want to handle this via prefix context
  - qa

Issues and TODO:
- as the number of datasets grows, how can we avoid updating this section so often?
  - ideally we only update the config yaml, and new datasets are downloaded from the hub
  - one possible idea is to upload the transformed format of these datasets to the OA hub
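
The config-driven idea in the TODO above could look roughly like the following. This is only a sketch under assumptions: the `datasets` key, the `name`/`split` fields, the default split, and the dataset names are all hypothetical and not part of the current config format. The default loader assumes the Hugging Face `datasets` library; the loader is injectable so the logic can be tried without network access.

```python
from typing import Any, Callable, Dict

# Hypothetical config shape, e.g. what yaml.safe_load would return for:
#   datasets:
#     - name: example_org/dataset_a
#       split: validation
#     - name: example_org/dataset_b
EXAMPLE_CONFIG: Dict[str, Any] = {
    "datasets": [
        {"name": "example_org/dataset_a", "split": "validation"},
        {"name": "example_org/dataset_b"},
    ]
}


def hub_loader(name: str, split: str) -> Any:
    """Fetch one dataset split from the hub (needs `datasets` and network)."""
    # Imported lazily so the rest of the sketch runs without the library.
    from datasets import load_dataset

    return load_dataset(name, split=split)


def load_configured_datasets(
    config: Dict[str, Any],
    loader: Callable[[str, str], Any] = hub_loader,
) -> Dict[str, Any]:
    """Load every dataset listed under the (assumed) `datasets` key."""
    loaded: Dict[str, Any] = {}
    for entry in config.get("datasets", []):
        name = entry["name"]
        split = entry.get("split", "train")  # assumed default split
        loaded[name] = loader(name, split)
    return loaded
```

With something like this, adding a dataset would mean adding one entry to the config rather than editing code, provided the dataset is already uploaded in the transformed format.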