This program converts data obtained from the subreddit r/changemyview into a cleaner format for further data processing. The output is still not clean enough to be fed directly into a model, so additional preprocessing is required.
The cleaned data is stored in an Apache Parquet file with the following columns:
Column Name | Description | Data Type |
---|---|---|
INSTRUCTION | Post title + body text | String |
RESPONSE | Body text of comments attempting to change OP's mind about INSTRUCTION | List<String> |
SOURCE | Permalink to the reddit post | String |
METADATA | Metadata related to RESPONSE | Dict<Variant> |
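To make the schema concrete, here is a minimal Python sketch of what a single row might look like as a plain dictionary, together with a small type check. The example values, the `validate_row` helper, and the `EXPECTED_TYPES` mapping are hypothetical and not part of the notebook; the placeholder permalink and the way title and body are joined are assumptions for illustration only.

```python
from typing import Any

# Hypothetical example row; every value here is made up for illustration.
example_row: dict[str, Any] = {
    "INSTRUCTION": "CMV: Example title\n\nExample body text.",
    "RESPONSE": ["First counter-argument.", "Second counter-argument."],
    "SOURCE": "https://www.reddit.com/r/changemyview/comments/<post_id>/",
    "METADATA": {"detoxify_labels": {}},
}

# Python types assumed to correspond to the column types in the table above.
EXPECTED_TYPES = {
    "INSTRUCTION": str,
    "RESPONSE": list,
    "SOURCE": str,
    "METADATA": dict,
}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of schema problems for one row (empty list means OK)."""
    problems = []
    for column, expected in EXPECTED_TYPES.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected):
            problems.append(
                f"{column}: expected {expected.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    # RESPONSE is List<String>, so every item must itself be a string.
    responses = row.get("RESPONSE")
    if isinstance(responses, list) and not all(isinstance(r, str) for r in responses):
        problems.append("RESPONSE: every item must be a string")
    return problems
```

A check like this is meant for plain Python dictionaries, for example before rows are written out. Rows loaded back from Parquet through a dataframe library may use array types instead of `list`, so the type mapping would need adjusting in that case.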
Currently, the metadata contains only one category:
detoxify_labels
- A dictionary of scores output by the Unitary AI Detoxify model, computed for every comment under a given post.
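One possible use of these labels in later preprocessing is dropping comments that score too high on a given label. The sketch below assumes that `detoxify_labels` maps each label name to a list of scores, one per comment, in the same order as RESPONSE. That layout, the `toxicity` label name, the threshold of 0.5, and the `filter_responses` helper are all assumptions for illustration, so check them against the actual notebook output before relying on this.

```python
from typing import Any


def filter_responses(
    row: dict[str, Any], label: str = "toxicity", threshold: float = 0.5
) -> list[str]:
    """Keep only the responses whose score for `label` is below `threshold`.

    Assumes row["METADATA"]["detoxify_labels"][label] is a list of floats
    aligned one-to-one with row["RESPONSE"] (an assumption, not confirmed).
    """
    scores = row["METADATA"]["detoxify_labels"][label]
    if len(scores) != len(row["RESPONSE"]):
        raise ValueError("scores and responses are not aligned")
    return [
        text
        for text, score in zip(row["RESPONSE"], scores)
        if score < threshold
    ]


# Hypothetical row with made-up scores.
sample_row = {
    "RESPONSE": ["A polite rebuttal.", "A hostile reply.", "Another calm reply."],
    "METADATA": {"detoxify_labels": {"toxicity": [0.01, 0.93, 0.07]}},
}

kept = filter_responses(sample_row)
```

With the made-up scores above, `kept` contains the first and third comments, since only the second exceeds the threshold.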
To use the program, follow these instructions:
- Clone the repository -
git clone https://github.com/LAION-AI/Open-Assistant.git
- Navigate to the project directory -
cd Open-Assistant/notebooks/data-augmentation/changemyview-builder
- Open the Jupyter Notebook -
jupyter notebook data_processor.ipynb
- Run the program - Go through the notebook and run the cells
If you would like to contribute to this project, please fork the repository and submit a pull request with your changes.
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.