
Commit ee9271e

Add StackExchange dataset. Resolves #191 (LAION-AI#2848)
Here's an initial implementation of stackexchange data processing. The dataset can be found here: https://huggingface.co/datasets/donfu/oa-stackexchange

All 180+ stackexchange sites were taken into account. There's a variety of very interesting topics included beyond coding questions (see stats in readme). The following filtering was done:

- Only questions with an accepted answer were taken into account
- Only one answer per question (the accepted one) was added
- Only Q/A pairs for which both the Q and A are shorter than 1,000 characters (tbd)
- HTML was converted to markdown, http links were removed

---------

Co-authored-by: donfu <>
1 parent 0ba118c commit ee9271e

File tree

8 files changed (+664 lines added, 0 removed)


data/datasets/__init__.py (+1):

```diff
@@ -23,6 +23,7 @@
     "LogicInference_OA": "KK04/LogicInference_OA",
     "oa_dolly_15k": "OllieStanley/oa_dolly_15k",
     "poetry_instruction": "checkai/instruction-poems",
+    "oa_stackexchange": "donfu/oa-stackexchange",
 }
 
 SAFETY_DATASETS = {
```
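The new entry maps the `oa_stackexchange` key to the Hub id `donfu/oa-stackexchange`. A hypothetical sketch of how such a registry entry is typically consumed; this is not code from the commit, and it assumes the enclosing dict is importable as `INSTRUCTION_DATASETS` with the repo root on the path:

```python
# Hypothetical usage sketch -- assumes the dict edited above is named INSTRUCTION_DATASETS.
from datasets import load_dataset

from data.datasets import INSTRUCTION_DATASETS

ds = load_dataset(INSTRUCTION_DATASETS["oa_stackexchange"], split="train")
print(len(ds), "rows")
```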
Dataset card README (new file, +279):

---
dataset_info:
  features:
    - name: INSTRUCTION
      dtype: string
    - name: RESPONSE
      dtype: string
    - name: SOURCE
      dtype: string
    - name: METADATA
      struct:
        - name: answer_score
          dtype: int64
        - name: question_score
          dtype: int64
        - name: tags
          dtype: string
  splits:
    - name: train
      num_bytes: 6549838664
      num_examples: 6331083
  download_size: 3755782987
  dataset_size: 6549838664
license: cc-by-sa-4.0
language:
  - en
  - uk
  - ru
  - de
  - fr
  - it
  - es
pretty_name: Open-Assistant StackExchange Instruction
---

# Stackexchange Instructions for OpenAssistant

This dataset is taken from https://archive.org/details/stackexchange.

There's a single parquet file combining all stackexchange sites. The threads
have been filtered as follows: only threads with an accepted answer, for which
both the question and the response are shorter than 1,000 characters, have been
chosen. Other answers, questions without accepted answers, and long entries
have been dropped.

Each row consists of

- INSTRUCTION
- RESPONSE
- SOURCE (e.g. "stackexchange-ai")
- METADATA (tags, question_score, answer_score)

Original extraction code by https://github.com/b-mc2
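
For a quick look at the published data, the dataset can be loaded straight from the Hub; a minimal sketch, assuming the `datasets` library is installed:

```python
# Minimal sketch: load the published dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("donfu/oa-stackexchange", split="train")
row = ds[0]
print(row["INSTRUCTION"])  # the question, converted from HTML to markdown
print(row["RESPONSE"])     # the accepted answer
print(row["SOURCE"])       # e.g. "stackexchange-ai"
print(row["METADATA"])     # {'answer_score': ..., 'question_score': ..., 'tags': ...}
```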
## How to Reproduce this Dataset

1. Download all XML files from the stackexchange archive into the xml/ folder

```
./download.py
```

2. Process the XML, filter conversations and convert to OA format into the parquet/ folder (a hypothetical sketch of this step is shown after this list)

```
./process.py
```

3. Run stats on all files in the parquet/ folder

```
./stats.py
```

4. Combine all parquet files into one large stackexchange.parquet file

```
./combine.py
```

5. Upload to the Hugging Face Hub; you'll first need to log in with huggingface-cli login

```
./upload.py
```
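process.py itself is among the eight changed files but is not shown in this excerpt. Here is a minimal, hypothetical sketch of the filtering described above (accepted answer only, both sides shorter than 1,000 characters, HTML converted to markdown, links removed). The column names follow the Posts.xml schema of the StackExchange data dump; the use of pandas and markdownify is an assumption, not necessarily the author's implementation:

```python
# Hypothetical sketch of process.py's filtering -- not the actual script from this commit.
import re

import pandas as pd
from markdownify import markdownify  # assumption: any HTML->markdown converter would do

MAX_LENGTH = 1000  # both question and accepted answer must be shorter than this


def to_markdown(html: str) -> str:
    """Convert an HTML post body to markdown and strip http(s) links."""
    return re.sub(r"https?://\S+", "", markdownify(html))


def filter_qa_pairs(questions: pd.DataFrame, answers: pd.DataFrame, site: str) -> pd.DataFrame:
    """Keep only questions with an accepted answer, paired with exactly that answer."""
    q = questions.dropna(subset=["AcceptedAnswerId"])[["AcceptedAnswerId", "Body", "Score", "Tags"]]
    a = answers[["Id", "Body", "Score"]]
    pairs = q.merge(a, left_on="AcceptedAnswerId", right_on="Id", suffixes=("_q", "_a"))

    result = pd.DataFrame()
    result["INSTRUCTION"] = pairs["Body_q"].map(to_markdown)
    result["RESPONSE"] = pairs["Body_a"].map(to_markdown)
    result["SOURCE"] = f"stackexchange-{site}"
    result["METADATA"] = pairs.apply(
        lambda row: {
            "tags": row["Tags"],
            "question_score": int(row["Score_q"]),
            "answer_score": int(row["Score_a"]),
        },
        axis=1,
    )
    keep = (result["INSTRUCTION"].str.len() < MAX_LENGTH) & (result["RESPONSE"].str.len() < MAX_LENGTH)
    return result[keep]
```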
## Statistics

- 3dprinting: 1,006
- academia: 6,956
- ai: 1,169
- android: 11,591
- anime: 3,688
- apple: 32,603
- arduino: 3,725
- askubuntu: 78,472
- astronomy: 2,425
- aviation: 4,945
- avp: 1,949
- beer: 387
- bicycles: 4,835
- bioacoustics: 70
- bioinformatics: 903
- biology: 5,344
- bitcoin: 7,456
- blender: 25,527
- boardgames: 4,538
- bricks: 1,457
- buddhism: 911
- cardano: 670
- chemistry: 7,430
- chess: 2,185
- chinese: 4,897
- christianity: 1,248
- civicrm: 3,221
- codegolf: 943
- codereview: 2,171
- coffee: 350
- cogsci: 645
- computergraphics: 540
- conlang: 101
- cooking: 7,951
- craftcms: 4,533
- crafts: 438
- crypto: 4,425
- cs: 9,478
- cseducators: 71
- cstheory: 2,196
- datascience: 5,045
- dba: 16,850
- devops: 961
- diy: 14,400
- drones: 190
- drupal: 24,090
- dsp: 4,470
- earthscience: 922
- ebooks: 323
- economics: 2,120
- electronics: 41,717
- elementaryos: 1,769
- ell: 30,428
- emacs: 7,140
- engineering: 2,314
- english: 42,415
- eosio: 626
- es_stackoverflow: 21,475
- esperanto: 617
- ethereum: 9,603
- expatriates: 973
- expressionengine: 3,638
- fitness: 1,833
- freelancing: 338
- french: 5,193
- gamedev: 9,678
- gaming: 44,899
- gardening: 4,492
- genealogy: 487
- german: 6,715
- gis: 30,249
- graphicdesign: 10,563
- ham: 790
- hardwarerecs: 647
- health: 804
- hermeneutics: 782
- hinduism: 1,036
- history: 1,776
- homebrew: 2,357
- hsm: 484
- interpersonal: 199
- iot: 331
- iota: 292
- islam: 1,496
- italian: 1,356
- ja_stackoverflow: 9,734
- japanese: 13,862
- joomla: 1,875
- judaism: 6,156
- korean: 754
- languagelearning: 135
- latin: 1,387
- law: 3,475
- lifehacks: 934
- linguistics: 1,507
- literature: 582
- magento: 20,537
- martialarts: 364
- materials: 338
- math: 501,019
- matheducators: 316
- mathematica: 19,529
- mathoverflow_net_7z: 23,803
- mechanics: 4,735
- meta: 34,161
- meta_askubuntu: 2,076
- meta_mathoverflow_net_7z: 333
- meta_serverfault: 823
- meta_stackoverflow: 12,641
- meta_superuser: 1,748
- moderators: 39
- monero: 1,443
- money: 7,996
- movies: 6,789
- music: 5,740
- musicfans: 781
- mythology: 271
- networkengineering: 4,637
- opendata: 1,117
- opensource: 805
- or: 586
- outdoors: 1,503
- parenting: 815
- patents: 582
- pets: 1,081
- philosophy: 1,505
- photo: 6,386
- physics: 35,386
- pm: 982
- poker: 431
- politics: 1,903
- portuguese: 658
- proofassistants: 87
- pt_stackoverflow: 27,650
- puzzling: 11,959
- quant: 3,303
- quantumcomputing: 1,604
- raspberrypi: 6,794
- retrocomputing: 1,016
- reverseengineering: 1,606
- robotics: 1,020
- rpg: 9,517
- ru_stackoverflow: 106,714
- rus: 8,210
- russian: 1,960
- salesforce: 27,962
- scicomp: 1,403
- scifi: 15,174
- security: 11,733
- serverfault: 81,229
- sharepoint: 24,934
- sitecore: 2,691
- skeptics: 1,043
- softwareengineering: 10,526
- softwarerecs: 3,032
- solana: 602
- sound: 2,031
- space: 3,145
- spanish: 3,049
- sports: 1,715
- sqa: 1,944
- stackapps: 702
- stackoverflow: 4,269,779
- stats: 23,102
- stellar: 373
- substrate: 812
- superuser: 128,488
- sustainability: 240
- tex: 42,808
- tezos: 635
- tor: 887
- travel: 9,957
- tridion: 1,769
- ukrainian: 577
- unix: 54,338
- ux: 7,403
- vegetarianism: 151
- vi: 4,360
- webapps: 10,159
- webmasters: 9,413
- windowsphone: 1,110
- woodworking: 677
- wordpress: 24,270
- workplace: 4,104
- worldbuilding: 2,766
- writers: 1,957

## License

cc-by-sa-4.0 (see https://archive.org/details/stackexchange for details)
combine.py (new file, +27):

```python
#!/usr/bin/env python3
# Combine (and shorten) parquet files into a single file

import glob

import pandas as pd
from merge_parquets import merge_parquet_dir

MAX_LENGTH = 1000  # max length of question or answer

for file in glob.glob("full/*.parquet"):
    df = pd.read_parquet(file)
    before = len(df)
    # Drop Q/A pairs where either side is too long
    df = df[df["INSTRUCTION"].str.len() < MAX_LENGTH]
    df = df[df["RESPONSE"].str.len() < MAX_LENGTH]
    # Normalize METADATA so the scores are always ints (missing or empty scores become 0)
    df["METADATA"] = df["METADATA"].apply(
        lambda meta: {
            "tags": meta["tags"],
            "answer_score": int(meta["answer_score"]) if "answer_score" in meta and meta["answer_score"] else 0,
            "question_score": int(meta["question_score"]) if "question_score" in meta and meta["question_score"] else 0,
        }
    )
    df.to_parquet(file)  # overwrite the per-site file with the shortened version
    after = len(df)
    print(f"Shortened {file} from {before} to {after} rows ({100 * after / before:.2f}%)")

merge_parquet_dir("full", "stackexchange.parquet")
```
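
combine.py imports merge_parquet_dir from merge_parquets.py, which is among the changed files but not shown in this excerpt. A minimal sketch of what it plausibly does, assuming it simply concatenates every parquet file in a directory (the function name and signature are taken from the import above; the body is an assumption):

```python
# Hypothetical sketch of merge_parquets.merge_parquet_dir -- the real module is not shown here.
import glob
import os

import pandas as pd


def merge_parquet_dir(directory: str, output_file: str) -> None:
    """Concatenate all parquet files in `directory` into one parquet file."""
    files = sorted(glob.glob(os.path.join(directory, "*.parquet")))
    merged = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
    merged.to_parquet(output_file)
    print(f"Merged {len(files)} files into {output_file} ({len(merged)} rows)")
```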
download.py (new file, +74):

```python
#!/usr/bin/env python3
#
# Simple script to download StackExchange archive XML files with posts (threaded version)
#
# Note: you probably want to download stackoverflow.com-Posts.7z manually, as it is 18GB
# and takes days to download otherwise. You can try using the torrent:
#
#   webtorrent https://archive.org/download/stackexchange/stackexchange_archive.torrent --select 658
#

import concurrent.futures
import os
import re

import requests
from bs4 import BeautifulSoup as bs

# view_archive.php serves a single file (Posts.xml) from inside each site's 7z archive,
# so the archives don't have to be downloaded and unpacked in full
BASE_URL = "https://ia600107.us.archive.org/view_archive.php?archive=/27/items/stackexchange/{0}&file=Posts.xml"
DOWNLOAD_DIR = "xml/"
NUM_PARALLEL = 20
RE_IGNORE = r"_meta|stackoverflow\.com\-"


def get_all_filenames():
    """
    Retrieve all URLs from the stackexchange archive.
    This needs quite some mangling because of special cases (i.e. stackoverflow is not in one 7z archive).
    Ignores meta files.
    """
    response = requests.get("https://archive.org/download/stackexchange")
    response.raise_for_status()  # fail loudly instead of silently returning None
    soup = bs(response.content, "html.parser")
    table = soup.find("table")
    link_tags = table.find_all("a")
    # stackoverflow itself is split into separate archives; add its Posts archive explicitly
    urls = {"stackoverflow": "https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z"}
    for link in link_tags:
        url = link["href"]
        name = url.split(".stackexchange")[0].replace(".", "_").replace("-", "_")
        name = name.replace("_com_7z", "")
        if url.endswith("7z") and not re.search(RE_IGNORE, url):
            urls[name] = BASE_URL.format(url)
    return urls


def download_url(dataset_name: str, url: str):
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    cache_path = os.path.join(DOWNLOAD_DIR, dataset_name + ".xml")
    if os.path.exists(cache_path):
        print("Using cached: ", cache_path)
        return cache_path
    print("Downloading xml: ", dataset_name)
    response = requests.get(url)
    print("Finished downloading: ", dataset_name)
    with open(cache_path, "wb") as f:
        f.write(response.content)
    return cache_path


def download_all():
    urls = get_all_filenames()
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_PARALLEL) as executor:
        futures = [executor.submit(download_url, dataset, url) for dataset, url in urls.items()]
        # Wait for all downloads to complete
        concurrent.futures.wait(futures)
    print("All downloads complete, except for the large stackoverflow XML file")
    print("Use torrent to download this one much quicker, then uncompress the 7z file")
    print("and move the extracted stackoverflow.com-Posts.xml to xml/stackoverflow.xml")
    print("webtorrent https://archive.org/download/stackexchange/stackexchange_archive.torrent --select 658")


if __name__ == "__main__":
    download_all()
```
