Related to Issue #2004
- Base data: the wikipedia dataset on Hugging Face
- Clean the data to shorten the articles
- Generate queries (the Q side of each Q-A pair) using doc2query
- Generate responses (the A side of each Q-A pair) using BART or SearchGPT
- raw data (BART-based): https://huggingface.co/datasets/michaelthwan/wiki_qa_bart_10000row
- OA format data (BART-based): https://huggingface.co/datasets/michaelthwan/oa_wiki_qa_bart_10000row
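As an illustration of the article-shortening step in the outline above, here is a minimal sketch. It is not the logic of 1_clean_wikitext.py. The function name `shorten_article`, the `max_words` budget, and the assumption that paragraphs are separated by blank lines are all hypothetical choices made for this example.

```python
import re


def shorten_article(text: str, max_words: int = 200) -> str:
    """Keep whole leading paragraphs until a word budget is reached.

    Hypothetical helper: the real cleaning script may use different rules.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    kept, used = [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Always keep the first paragraph; stop once the budget would be exceeded.
        if kept and used + n_words > max_words:
            break
        kept.append(para)
        used += n_words

    # If the first paragraph alone is over budget, hard-truncate it by words.
    if used > max_words:
        kept = [" ".join(kept[0].split()[:max_words])]

    return "\n\n".join(kept)
```

Cutting at paragraph boundaries keeps each shortened article readable on its own, which matters because the later steps generate a query and a response from that text.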
pip install -r requirements.txt
(using Python 3.10.8)
- Clean data:
1_clean_wikitext.py
- Get queries by doc2query
2_wikitext_doc2query.ipynb
(I ran this using Colab and a local PC)
- Get responses by BART
3_10k_bart_trial.py
or 3_10k_bart_trial.ipynb
- Convert to OA format
4_convert_to_oa_format.py
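
The final conversion step can be sketched roughly as follows. This is not the content of 4_convert_to_oa_format.py. It assumes the OA format is a table with INSTRUCTION, RESPONSE, SOURCE and METADATA columns (METADATA as a JSON string), and the input keys `question`, `answer` and `title`, along with the function name `to_oa_rows`, are hypothetical.

```python
import json


def to_oa_rows(qa_pairs, source="wikipedia"):
    """Map raw Q-A records to OA-style rows.

    Assumed columns: INSTRUCTION, RESPONSE, SOURCE, METADATA (JSON string).
    The input keys 'question', 'answer' and 'title' are hypothetical.
    """
    rows = []
    for pair in qa_pairs:
        question = pair["question"].strip()
        answer = pair["answer"].strip()
        # Skip records where generation produced an empty query or response.
        if not question or not answer:
            continue
        rows.append(
            {
                "INSTRUCTION": question,
                "RESPONSE": answer,
                "SOURCE": source,
                "METADATA": json.dumps({"title": pair.get("title", "")}),
            }
        )
    return rows


if __name__ == "__main__":
    sample = [{"question": "What is BART?", "answer": "A seq2seq model.", "title": "BART"}]
    for row in to_oa_rows(sample):
        print(row)
```

The resulting list of dicts can be written out with whatever table library the project already uses, for example as a Parquet file.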