
Commit 482f33f

chenbjin authored and Yibing Liu committed
Cherry pick PaddleNLP emotion_detection and ernie from release1.6 to develop (#3815)
* update PaddleNLP emotion_detection and ernie for Release/1.6 (#3608)
* emotion-detection => 1.6
* ERNIE => 1.6
* [PaddleNLP] update emotion_detection readme
* [PaddleNLP] emotion_detection add download.py (#3649)
* emotion-detection => 1.6
* ERNIE => 1.6
* [PaddleNLP] update emotion_detection readme
* [PaddleNLP] emotion_detection add download.py for windows user
* [PaddleNLP] fix emotion_detection open problem, add paddlehub version (#3706)
* update emotion-detection readme and fix open problem
* fix ernie
1 parent bafd5b9 commit 482f33f

11 files changed (+233 / -120 lines)

PaddleNLP/emotion_detection/README.md

+16 -11
@@ -25,15 +25,15 @@
| BERT | 93.6% | 92.3% | 78.6% |
| ERNIE | 94.4% | 94.0% | 80.6% |

-We also recommend that users check out the [IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/122291)
+We also recommend that users check out the [IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/122291)

## Quick Start

### Installation

1. Install PaddlePaddle

-This project depends on PaddlePaddle Fluid 1.3.2 or later; please follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it
+This project depends on PaddlePaddle Fluid 1.6 or later; please follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it

2. Install the code

@@ -46,7 +46,7 @@

3. Environment dependencies

-Please refer to the PaddlePaddle [installation notes](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/install/index_cn.html) for details
+Python 2 requires version 2.7.15+ and Python 3 requires 3.5.1+/3.6/3.7; for other environment requirements, please refer to the PaddlePaddle [installation notes](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/beginners_guide/install/index_cn.html)

### Code Structure

@@ -56,7 +56,8 @@
.
├── config.json          # configuration file
├── config.py            # configuration loading interface
-├── inference_model.py   # script for saving the inference model
+├── download.py          # script for downloading the dataset and pretrained models
+├── inference_model.py   # script for saving the inference model
├── reader.py            # data reading interface
├── run_classifier.py    # main program entry point: training, prediction, evaluation
├── run.sh               # shell script for training, prediction, evaluation
@@ -86,15 +87,15 @@ python tokenizer.py --test_data_dir ./test.txt.utf8 --batch_size 1 > test.txt.ut

#### Public Dataset

-Here we provide a labeled chatbot conversation dataset that has already been pre-processed with word segmentation. Simply run the data download script ```sh download_data.sh```; once it completes, a ```data``` directory is generated with the following structure:
+Here we provide a labeled chatbot conversation dataset that has already been pre-processed with word segmentation. Simply run the data download script ```sh download_data.sh``` or ```python download.py dataset```; once it completes, a ```data``` directory is generated with the following structure:

```text
.
-├── train.tsv   # training set
-├── dev.tsv     # validation set
-├── test.tsv    # test set
-├── infer.tsv   # data to be predicted
-├── vocab.txt   # vocabulary
+├── train.tsv   # training set
+├── dev.tsv     # validation set
+├── test.tsv    # test set
+├── infer.tsv   # data to be predicted
+├── vocab.txt   # vocabulary
```

### Single-machine Training
@@ -181,6 +182,8 @@ tar xvf emotion_detection_ernie_finetune-1.0.0.tar.gz

```shell
sh download_model.sh
+# or
+python download.py model
```

Both of the above approaches save the pretrained TextCNN and ERNIE models under the ```pretrain_models``` directory; you can then point the ```init_checkpoint``` parameter in the ```run.sh``` script at them for evaluation and prediction.
@@ -302,7 +305,7 @@ Final test result:

We also provide the option of loading the ERNIE model with PaddleHub. PaddleHub is PaddlePaddle's pre-trained model management tool: it can load a pre-trained model with a single line of code, simplifying the use of pre-trained models and transfer learning. For more details, see [PaddleHub](https://github.com/PaddlePaddle/PaddleHub)

-Note: this option requires PaddleHub to be installed first; the installation command is
+Note: this option requires PaddleHub >= 1.2.0 to be installed first; the installation command is
```shell
pip install paddlehub
```
@@ -333,6 +336,8 @@ sh run_ernie.sh infer

## Changelog

+2019/10/21 Adapted to PaddlePaddle 1.6 and added the download.py script.
+
2019/08/26 Standardized the use of configuration, refactored the data-processing code inside the module, and updated the README structure to improve usability.

2019/06/13 Added the option of calling ERNIE through PaddleHub.
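
The PaddleHub route above is only touched by this commit through the new `>= 1.2.0` requirement. As a rough illustration of what the "one line of code" loading step looks like, the sketch below follows the PaddleHub 1.x fine-tuning API; the module name, the `context()` call and the `max_seq_len` value are assumptions drawn from PaddleHub's documentation, not from this diff.

```python
# Hedged sketch, assuming PaddleHub >= 1.2.0 is installed (pip install paddlehub).
# Not part of this commit: module name and context() usage follow PaddleHub 1.x docs.
import paddlehub as hub

module = hub.Module(name="ernie")          # one line loads the pre-trained ERNIE model
inputs, outputs, program = module.context(
    trainable=True, max_seq_len=128)       # placeholder sequence length

# "pooled_output" is the sentence-level feature a classifier head would consume.
print(outputs["pooled_output"].name)
```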

PaddleNLP/emotion_detection/config.py

+2 -1
@@ -19,6 +19,7 @@
from __future__ import division
from __future__ import print_function

+import io
import os
import six
import json
@@ -122,7 +123,7 @@ def load_json(self, file_path):
            return

        try:
-            with open(file_path, "r") as fin:
+            with io.open(file_path, "r") as fin:
                self.json_config = json.load(fin)
        except Exception as e:
            raise IOError("Error in parsing json config file '%s'" % file_path)
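
The `open` to `io.open` switch is what keeps `load_json` behaving identically on Python 2.7 and Python 3, matching the interpreter versions the updated README lists. A minimal sketch of the same pattern, with an explicit UTF-8 encoding added for illustration (the commit itself passes only the mode):

```python
# Minimal sketch, not part of the commit. io.open exists on Python 2 and 3 and
# accepts an explicit encoding, so a UTF-8 JSON config parses the same way under
# either interpreter. "config.json" is just a placeholder path.
import io
import json

def load_json_config(file_path):
    # On Python 2, io.open behaves like Python 3's built-in open().
    with io.open(file_path, "r", encoding="utf-8") as fin:
        return json.load(fin)

print(sorted(load_json_config("config.json").keys()))
```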

PaddleNLP/emotion_detection/download.py

+153
@@ -0,0 +1,153 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Download script, download dataset and pretrain models.
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import io
+import os
+import sys
+import time
+import hashlib
+import tarfile
+import requests
+
+
+def usage():
+    desc = ("\nDownload datasets and pretrained models for EmotionDetection task.\n"
+            "Usage:\n"
+            " 1. python download.py dataset\n"
+            " 2. python download.py model\n")
+    print(desc)
+
+
+def md5file(fname):
+    hash_md5 = hashlib.md5()
+    with io.open(fname, "rb") as fin:
+        for chunk in iter(lambda: fin.read(4096), b""):
+            hash_md5.update(chunk)
+    return hash_md5.hexdigest()
+
+
+def extract(fname, dir_path):
+    """
+    Extract tar.gz file
+    """
+    try:
+        tar = tarfile.open(fname, "r:gz")
+        file_names = tar.getnames()
+        for file_name in file_names:
+            tar.extract(file_name, dir_path)
+            print(file_name)
+        tar.close()
+    except Exception as e:
+        raise e
+
+
+def download(url, filename, md5sum):
+    """
+    Download file and check md5
+    """
+    retry = 0
+    retry_limit = 3
+    chunk_size = 4096
+    while not (os.path.exists(filename) and md5file(filename) == md5sum):
+        if retry < retry_limit:
+            retry += 1
+        else:
+            raise RuntimeError("Cannot download dataset ({0}) with retry {1} times.".
+                               format(url, retry_limit))
+        try:
+            start = time.time()
+            size = 0
+            res = requests.get(url, stream=True)
+            filesize = int(res.headers['content-length'])
+            if res.status_code == 200:
+                print("[Filesize]: %0.2f MB" % (filesize / 1024 / 1024))
+                # save by chunk
+                with io.open(filename, "wb") as fout:
+                    for chunk in res.iter_content(chunk_size=chunk_size):
+                        if chunk:
+                            fout.write(chunk)
+                            size += len(chunk)
+                            pr = '>' * int(size * 50 / filesize)
+                            print('\r[Process ]: %s%.2f%%' % (pr, float(size / filesize*100)), end='')
+            end = time.time()
+            print("\n[CostTime]: %.2f s" % (end - start))
+        except Exception as e:
+            print(e)
+
+
+def download_dataset(dir_path):
+    BASE_URL = "https://baidu-nlp.bj.bcebos.com/"
+    DATASET_NAME = "emotion_detection-dataset-1.0.0.tar.gz"
+    DATASET_MD5 = "512d256add5f9ebae2c101b74ab053e9"
+    file_path = os.path.join(dir_path, DATASET_NAME)
+    url = BASE_URL + DATASET_NAME
+
+    if not os.path.exists(dir_path):
+        os.makedirs(dir_path)
+    # download dataset
+    print("Downloading dataset: %s" % url)
+    download(url, file_path, DATASET_MD5)
+    # extract dataset
+    print("Extracting dataset: %s" % file_path)
+    extract(file_path, dir_path)
+    os.remove(file_path)
+
+
+def download_model(dir_path):
+    MODELS = {}
+    BASE_URL = "https://baidu-nlp.bj.bcebos.com/"
+    CNN_NAME = "emotion_detection_textcnn-1.0.0.tar.gz"
+    CNN_MD5 = "b7ee648fcd108835c880a5f5fce0d8ab"
+    ERNIE_NAME = "emotion_detection_ernie_finetune-1.0.0.tar.gz"
+    ERNIE_MD5 = "dfeb68ddbbc87f466d3bb93e7d11c03a"
+    MODELS[CNN_NAME] = CNN_MD5
+    MODELS[ERNIE_NAME] = ERNIE_MD5
+
+    if not os.path.exists(dir_path):
+        os.makedirs(dir_path)
+
+    for model in MODELS:
+        url = BASE_URL + model
+        model_path = os.path.join(dir_path, model)
+        print("Downloading model: %s" % url)
+        # download model
+        download(url, model_path, MODELS[model])
+        # extract model.tar.gz
+        print("Extracting model: %s" % model_path)
+        extract(model_path, dir_path)
+        os.remove(model_path)
+
+
+if __name__ == '__main__':
+    if len(sys.argv) != 2:
+        usage()
+        sys.exit(1)
+
+    if sys.argv[1] == "dataset":
+        pwd = os.path.join(os.path.dirname(__file__), './')
+        download_dataset(pwd)
+    elif sys.argv[1] == "model":
+        pwd = os.path.join(os.path.dirname(__file__), './pretrain_models')
+        download_model(pwd)
+    else:
+        usage()
+
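
The new helpers can also be called directly instead of through the command line. The sketch below reuses `download`, `md5file` and `extract` with the dataset URL and checksum hard-coded in `download_dataset()` above; it assumes it runs next to download.py, and the target directory name is made up.

```python
# Hedged sketch: programmatic reuse of the helpers defined in download.py above.
# URL and MD5 are the real dataset constants from download_dataset(); the output
# directory "./data_manual" is an arbitrary placeholder.
from download import download, extract, md5file

ARCHIVE = "emotion_detection-dataset-1.0.0.tar.gz"
URL = "https://baidu-nlp.bj.bcebos.com/" + ARCHIVE
MD5 = "512d256add5f9ebae2c101b74ab053e9"

download(URL, ARCHIVE, MD5)           # retries up to 3 times until the checksum matches
assert md5file(ARCHIVE) == MD5        # archive verified on disk
extract(ARCHIVE, "./data_manual")     # unpack the tar.gz into the target directory
```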

PaddleNLP/emotion_detection/download_model.sh

+2 -2
@@ -1,7 +1,7 @@
#!/bin/bash

-mkdir -p models
-cd models
+mkdir -p pretrain_models
+cd pretrain_models

# download pretrain model file to ./models/
MODEL_CNN=https://baidu-nlp.bj.bcebos.com/emotion_detection_textcnn-1.0.0.tar.gz

PaddleNLP/emotion_detection/inference_model.py

+4 -19
@@ -1,4 +1,4 @@
-# -*- encoding: utf8 -*-
+# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -44,9 +44,8 @@ def do_save_inference_model(args):

    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
-            infer_pyreader, probs, feed_target_names = create_model(
+            infer_loader, probs, feed_target_names = create_model(
                args,
-                pyreader_name='infer_reader',
                num_labels=args.num_labels,
                is_prediction=True)

@@ -79,20 +78,7 @@ def test_inference_model(args, texts):
    dev_count = int(os.environ.get('CPU_NUM', 1))
    place = fluid.CPUPlace()

-    test_prog = fluid.default_main_program()
-    startup_prog = fluid.default_startup_program()
-
-    with fluid.program_guard(test_prog, startup_prog):
-        with fluid.unique_name.guard():
-            infer_pyreader, probs, feed_target_names = create_model(
-                args,
-                pyreader_name='infer_reader',
-                num_labels=args.num_labels,
-                is_prediction=True)
-
-    test_prog = test_prog.clone(for_test=True)
    exe = fluid.Executor(place)
-    exe.run(startup_prog)

    assert (args.inference_model_dir)
    infer_program, feed_names, fetch_targets = fluid.io.load_inference_model(
@@ -107,9 +93,8 @@ def test_inference_model(args, texts):
        wids, seq_len = utils.pad_wid(wids)
        data.append(wids)
        seq_lens.append(seq_len)
-    batch_size = len(data)
-    data = np.array(data).reshape((batch_size, 128, 1))
-    seq_lens = np.array(seq_lens).reshape((batch_size, 1))
+    data = np.array(data)
+    seq_lens = np.array(seq_lens)

    pred = exe.run(infer_program,
                   feed={
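
After this change `test_inference_model` relies entirely on the program deserialized by `fluid.io.load_inference_model` instead of rebuilding the network first. A standalone sketch of that pattern is below; the model directory, feed order, shapes and dtypes are placeholders rather than values taken from this file.

```python
# Hedged sketch, assuming PaddlePaddle Fluid 1.6 and a model previously written by
# fluid.io.save_inference_model under ./inference_model (placeholder path).
import numpy as np
import paddle.fluid as fluid

exe = fluid.Executor(fluid.CPUPlace())
infer_program, feed_names, fetch_targets = fluid.io.load_inference_model(
    "./inference_model", exe)

# Toy batch: two padded word-id sequences plus their true lengths (made-up shapes).
wids = np.zeros((2, 128), dtype="int64")
seq_lens = np.array([5, 7], dtype="int64")

probs = exe.run(infer_program,
                feed={feed_names[0]: wids, feed_names[1]: seq_lens},
                fetch_list=fetch_targets)
print(probs[0].shape)
```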

PaddleNLP/emotion_detection/reader.py

+4 -4
@@ -96,16 +96,16 @@ def data_generator(self, batch_size, phase='train', epoch=1):
        Generate data for train, dev or test
        """
        if phase == "train":
-            return paddle.batch(
+            return fluid.io.batch(
                self.get_train_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
        elif phase == "dev":
-            return paddle.batch(
+            return fluid.io.batch(
                self.get_dev_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
        elif phase == "test":
-            return paddle.batch(
+            return fluid.io.batch(
                self.get_test_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
        elif phase == "infer":
-            return paddle.batch(
+            return fluid.io.batch(
                self.get_infer_examples(self.data_dir, epoch, self.max_seq_len), batch_size)
        else:
            raise ValueError(
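
These four call sites track the move of `paddle.batch` under `fluid.io.batch` in PaddlePaddle 1.6; the wrapper's contract is unchanged. A toy sketch of that contract (the sample reader and its data are made up):

```python
# Minimal sketch, assuming PaddlePaddle 1.6: fluid.io.batch turns a sample-level
# reader into a batch-level reader, just like the older paddle.batch.
import paddle.fluid as fluid

def toy_reader():
    # Yields (word_ids, label) samples; purely illustrative data.
    for i in range(10):
        yield [i, i + 1], i % 2

batched_reader = fluid.io.batch(toy_reader, batch_size=4)
for batch in batched_reader():
    print(len(batch), batch)   # lists of up to 4 (word_ids, label) tuples
```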
