bart_searchgpt_wiki_nlp_augment

Dataset: Retrieval-based grounded model generated Q-A pairs #2004

Related to Issue #2004

How does it work?

  1. Base data: the Hugging Face wikipedia dataset
  2. Clean the data to shorten the articles
  3. Generate queries (the Q side of each pair) using doc2query
  4. Generate responses (the A side of each pair) using BART or SearchGPT
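
As an illustration of step 2, here is a minimal sketch of one way to shorten an article: keep whole leading paragraphs until a word budget is reached. The function name `shorten_article` and the `max_words` parameter are hypothetical and are not taken from 1_clean_wikitext.py, which may clean the text differently.

```python
import re


def shorten_article(text: str, max_words: int = 300) -> str:
    """Keep leading paragraphs of `text` until `max_words` is reached.

    Hypothetical helper for illustration only; the real cleaning script
    may use a different rule.
    """
    # Split on blank lines and collapse internal whitespace in each paragraph.
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in re.split(r"\n\s*\n", text)]
    paragraphs = [p for p in paragraphs if p]

    kept = []
    used = 0
    for paragraph in paragraphs:
        words = paragraph.split()
        if not kept and len(words) > max_words:
            # The first paragraph alone exceeds the budget: truncate it.
            kept.append(" ".join(words[:max_words]))
            break
        if kept and used + len(words) > max_words:
            # Adding this paragraph would exceed the budget: stop here.
            break
        kept.append(paragraph)
        used += len(words)
    return "\n\n".join(kept)


if __name__ == "__main__":
    sample = "First paragraph here.\n\nSecond paragraph is here too.\n\nThird."
    print(shorten_article(sample, max_words=5))
```

Cutting at paragraph boundaries keeps each shortened article self-contained, which is likely to help when the text is later used as grounding for query and response generation.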

Output data

Synthetic data based on BART

wiki_augment_bart

Synthetic data based on SearchGPT

wiki_augment_searchgpt

Code

  1. Install dependencies: pip install -r requirements.txt (using Python 3.10.8)
  2. Clean the data: 1_clean_wikitext.py
  3. Get queries with doc2query: 2_wikitext_doc2query.ipynb (I ran this using Colab and a local PC)
  4. Get responses with BART: 3_10k_bart_trial.py or 3_10k_bart_trial.ipynb
  5. Convert to OA format: 4_convert_to_oa_format.py
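
The final conversion step could look roughly like the sketch below. It assumes that "OA format" means rows with INSTRUCTION, RESPONSE, SOURCE and METADATA fields, with METADATA stored as a JSON string. That layout is an assumption here, and the helper names `to_oa_row` and `convert_pairs` are hypothetical; 4_convert_to_oa_format.py may do this differently.

```python
import json


def to_oa_row(query, response, source="wikipedia", metadata=None):
    """Build one row in the assumed OA layout (field names are an assumption)."""
    return {
        "INSTRUCTION": query.strip(),
        "RESPONSE": response.strip(),
        "SOURCE": source,
        # METADATA is serialized to a JSON string; empty dict if none is given.
        "METADATA": json.dumps(metadata or {}, ensure_ascii=False),
    }


def convert_pairs(pairs, source="wikipedia"):
    """Convert (query, response) tuples to rows, skipping empty entries."""
    rows = []
    for query, response in pairs:
        if not query.strip() or not response.strip():
            continue
        rows.append(to_oa_row(query, response, source))
    return rows


if __name__ == "__main__":
    example_pairs = [
        (" What is a synthetic dataset? ", "Data generated by a model."),
        ("", "A response without a query is dropped."),
    ]
    for row in convert_pairs(example_pairs):
        print(row)
```

The resulting list of dictionaries can then be written out with whatever tabular library the project uses.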