Dataset Summary

This dataset contains 4803 question/answer pairs extracted from the BioStars website. The site focuses on bioinformatics, computational genomics, and biological data analysis.

Dataset Location and Details

https://huggingface.co/datasets/cannin/biostars_qa

Source Data as a single JSON file

The dataset was generated by downloading individual posts, and it includes only limited metadata. The Zenodo record below contains the full content of the downloaded posts as a single JSON file.

https://zenodo.org/record/7813785

Code Details

Executing the script performs the entire process end-to-end.
get_biostars_dataset(): This function downloads the content from the Biostars API; each post is saved as an individual JSON file.
extract_accepted_data(): This function loads the individual files into Pandas, then extracts question/answer pairs. A pair was included if the answer was accepted and the question had at least 1 vote. The content is then written as an Apache Parquet dataset with columns: INSTRUCTION, RESPONSE, SOURCE, METADATA
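
The filtering rule described above can be illustrated with a minimal sketch. It is not the code in get_biostars_dataset.py: it uses plain Python dictionaries instead of Pandas so that it runs without dependencies, and it stops before the Parquet step. The function name `extract_accepted_pairs`, the post fields (`id`, `parent_id`, `type`, `has_accepted`, `vote_count`, `content`, `url`), the SOURCE value, and the METADATA layout are all assumptions made for illustration and may not match the Biostars API or the published dataset.

```python
import json


def extract_accepted_pairs(posts):
    """Pair each accepted answer with its question.

    Keeps a pair only if the answer is marked accepted and the
    question has at least 1 vote. All field names are hypothetical.
    """
    # Index questions by id so each answer can look up its parent.
    questions = {p["id"]: p for p in posts if p["type"] == "Question"}

    rows = []
    for post in posts:
        # Skip anything that is not an accepted answer.
        if post["type"] != "Answer" or not post.get("has_accepted"):
            continue
        question = questions.get(post["parent_id"])
        # Skip answers whose question is missing or has no votes.
        if question is None or question["vote_count"] < 1:
            continue
        rows.append({
            "INSTRUCTION": question["content"],
            "RESPONSE": post["content"],
            "SOURCE": "biostars",  # assumed label, not confirmed by the README
            "METADATA": json.dumps({
                "question_url": question["url"],
                "answer_url": post["url"],
            }),
        })
    return rows


# Made-up posts, for illustration only.
sample_posts = [
    {"id": 1, "parent_id": 1, "type": "Question", "vote_count": 3,
     "content": "How do I index a BAM file?", "url": "q/1"},
    {"id": 2, "parent_id": 1, "type": "Answer", "has_accepted": True,
     "vote_count": 5, "content": "Use samtools index.", "url": "a/2"},
    {"id": 3, "parent_id": 1, "type": "Answer", "has_accepted": False,
     "vote_count": 1, "content": "Try another tool.", "url": "a/3"},
    {"id": 4, "parent_id": 4, "type": "Question", "vote_count": 0,
     "content": "Unvoted question", "url": "q/4"},
    {"id": 5, "parent_id": 4, "type": "Answer", "has_accepted": True,
     "vote_count": 2, "content": "Accepted, but question has no votes.", "url": "a/5"},
]

rows = extract_accepted_pairs(sample_posts)
```

In this sample only the first question and its accepted answer survive the filter: the second answer is not accepted, and the second question has no votes. Rows in this shape could then be loaded into a Pandas DataFrame and written to Parquet, as the README describes.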
