Project 09: Natural Language Processing — Analyzing the IMDb Movie Review Dataset

Text sentiment analysis, also known as opinion mining or polarity analysis, is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional tone. Nasukawa and Yi were the first to propose the concept: subjective, emotionally colored text is processed with automated or semi-automated methods to determine its sentiment and polarity. Built on traditional data mining, natural language processing, and computational linguistics, sentiment analysis adds a degree of text-understanding capability, extending data mining and machine learning to problems they could not previously solve. For example, a news site can judge from reader comments whether a story is trending positively or negatively and manage public opinion accordingly.

1. The IMDb Database

The Internet Movie Database (IMDb) is an online database of information about films, actors, television shows, TV stars, and film production, similar to Douban in China. IMDb was founded on October 17, 1990, has been an Amazon subsidiary since 1998, and celebrated its 20th anniversary in 2010. Its records include detailed information about each film: cast, running time, plot summary, rating, reviews, and more, and the IMDb score is currently the most widely used movie rating. The official site is http://www.imdb.com/ . You can search for any film you like; taking A Beautiful Mind as an example, open https://www.imdb.com/title/tt0268978/reviews and you will see many user reviews at the bottom of the page.

[Figure: user reviews on the IMDb page for A Beautiful Mind]

2. Natural Language Processing

2.1 The Keras NLP Workflow

  1. Read the dataset
  2. Build the token (word dictionary)
  3. Use the token to convert the review text into integer sequences
  4. Truncate or pad the sequences so that they all have the same fixed length
  5. Use an Embedding layer to turn each integer sequence into a sequence of vectors
  6. Feed the vector sequences into a deep learning model for training

2.2 Building the Token

Deep learning models can only accept numbers, so we must convert the text into integer sequences. How? Just as translating from one language to another requires a dictionary, mapping words to numbers also requires one. Keras provides the Tokenizer module for exactly this purpose:

* When building the token, specify the dictionary size, for example 2,000 words.
* The Tokenizer then reads the 25,000 reviews in the training set and ranks every English word by how often it appears; the 2,000 most frequent words enter the dictionary.
* The result is a dictionary of the most common words.

2.3 Conversion

After the token is built, every word is mapped to an integer, for example:

{'the': 1, 'and': 2, 'a': 3, 'of': 4, ...}

If we then convert a sentence such as "the chases were very good" into numbers:

21,187,371,44,26

This gives us the integer sequence for the review.

2.4 Truncating and Padding

Reviews vary in length: one review may have 200 words, another 490, so the integer sequences also vary in length and must be normalized. We fix every sequence to a common length (this project uses 100 first, then 400): longer sequences are truncated and shorter ones are padded with zeros, as sketched below.

2.5 From Integer Sequences to Vector Sequences

Besides the integer representation, we also need to capture word semantics, so each integer is converted into a vector in a continuous space; the closer two word vectors are in direction, the closer the meanings of the corresponding words. Readers interested in this topic can look up "Word2Vec"; here we stick to introductory sentiment analysis.

3. Building the Project

3.1 Creating the Project Directory

Create a directory for this project at a location of your choice. On Linux or macOS you can use the mkdir command; on Windows, simply right-click in the file explorer and create a new folder. Here the project directory is named project09:

    (dlwork) jingyudeMacBook-Pro:~ jingyuyan$ mkdir project09

Enter the project folder and create a dataset folder inside it:

    (dlwork) jingyudeMacBook-Pro:~ jingyuyan$ cd project09

    (dlwork) jingyudeMacBook-Pro:project09 jingyuyan$ mkdir dataset

3.2 Downloading the IMDb Dataset

Download the dataset from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz into dataset. If the download is too slow, use the mirror provided in the appendix of this book and place the file into the dataset folder afterwards.

import urllib.request
import os
import tarfile
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
filepath = './dataset/aclImdb_v1.tar.gz'
if not os.path.isfile(filepath):
    result = urllib.request.urlretrieve(url,filepath)
    print('downloaded:',result)

3.3 Extracting the IMDb Dataset

Extract the downloaded archive:

if not os.path.exists('./dataset/aclImdb'):
    tfile = tarfile.open('./dataset/aclImdb_v1.tar.gz','r:gz')
    result = tfile.extractall('./dataset/')
aclImdbpath = './dataset/aclImdb/'

4. IMDb Dataset Preprocessing

The IMDb dataset contains 50,000 text reviews, split into 25,000 training items and 25,000 test items; every item is labeled as either a "positive review" or a "negative review".

4.1 Reading the Data

First, we read the data.

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
Using TensorFlow backend.

Because the dataset was collected from the web, the review text contains HTML tags that need to be cleaned up. We import the imdb_simple_util module and use its read_files function to walk the directory and read the data in a normalized form.

from imdb_simple_util import read_files
# Pass 'train' to read the training data
y_train,train_text = read_files('train', aclImdbpath)
read train files: 25000
# Pass 'test' to read the test data
y_test,test_text = read_files('test', aclImdbpath)
read test files: 25000

After loading the data, let's inspect an item. To make the output easier to read, we first define a dictionary that formats the labels, then look at one review together with its label.

format_dict={1:'positive review',0:'negative review'}
# View the item at index 100
train_text[100], format_dict[y_train[100]]
("Sure, it was cheesy and nonsensical and at times corny, but at least the filmmakers didn't try. While most TV movies border on the brink of mediocrity, this film actually has some redeeming qualities to it. The cinematography was pretty good for a TV film, and Viggo Mortensen displays shades of Aragorn in a film about a man who played by his own rules. Most of the flashback sequences were kind of cheesy, but the scene with the mountain lion was intense. I was kind of annoyed by Jason Priestly's role in the film as a rebellious shock-jock, but then again, it's a TV MOVIE! Despite all of the good things, the soundtrack was atrocious. However, it was nice to see Tucson, Arizona prominently featured in the film.",
 'positive review')

4.2 Building the Token

Use Tokenizer to build the token. The num_words argument sets the dictionary size; here we choose 2,000 words. fit_on_texts then reads all reviews in the training set, ranks every word by how often it appears, and keeps the 2,000 most frequent words in the dictionary.

# Build a 2,000-word dictionary,
# ranking every English word by how often it appears in the reviews
token = Tokenizer(num_words=2000)
token.fit_on_texts(train_text)

Check how many documents were read:

token.document_count
25000

Show the words at the top of the frequency ranking:

for k, v in token.word_index.items():
    print(v, k)
    if v == 9:
        break
1 the
2 and
3 a
4 of
5 to
6 is
7 in
8 it
9 i

As expected, words such as the, and, and a appear most often in the reviews. Next, use token.texts_to_sequences to convert the training and test reviews into integer sequences.

# Convert the training and test reviews into integer sequences with token.texts_to_sequences
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)

Compare a review (index 100) with the integer sequence it was converted to:

train_text[100]
"Sure, it was cheesy and nonsensical and at times corny, but at least the filmmakers didn't try. While most TV movies border on the brink of mediocrity, this film actually has some redeeming qualities to it. The cinematography was pretty good for a TV film, and Viggo Mortensen displays shades of Aragorn in a film about a man who played by his own rules. Most of the flashback sequences were kind of cheesy, but the scene with the mountain lion was intense. I was kind of annoyed by Jason Priestly's role in the film as a rebellious shock-jock, but then again, it's a TV MOVIE! Despite all of the good things, the soundtrack was atrocious. However, it was nice to see Tucson, Arizona prominently featured in the film."
print(x_train_seq[100])
[248, 8, 12, 950, 2, 2, 29, 207, 17, 29, 218, 1, 1054, 157, 349, 133, 87, 244, 98, 19, 1, 4, 10, 18, 161, 44, 45, 1650, 5, 8, 1, 623, 12, 180, 48, 14, 3, 244, 18, 2, 4, 7, 3, 18, 40, 3, 128, 33, 252, 30, 23, 202, 87, 4, 1, 841, 67, 239, 4, 950, 17, 1, 132, 15, 1, 12, 1590, 9, 12, 239, 4, 30, 1651, 213, 7, 1, 18, 13, 3, 1461, 17, 91, 170, 41, 3, 244, 16, 463, 28, 4, 1, 48, 179, 1, 811, 12, 186, 8, 12, 323, 5, 63, 7, 1, 18]

Show the length of the integer sequence:

len(x_train_seq[100])
105

4.3 Formatting the Data

Apply the truncate-or-pad operation, fixing every sequence to a total length of 100:

### Truncate or pad every sequence to a total length of 100
x_train = sequence.pad_sequences(x_train_seq,maxlen=100)
x_test = sequence.pad_sequences(x_test_seq,maxlen=100)

Take the review at index 40, whose sequence length is 78: after processing, its length is 100, with zeros padded at the front to make up the difference.

# Show the original integer sequence
print('len:',len(x_train_seq[40]))
print(x_train_seq[40])
len: 78
[2, 10, 303, 18, 6, 1, 146, 205, 1214, 546, 7, 1059, 1852, 7, 1305, 8, 404, 48, 4, 1, 4, 699, 11, 199, 69, 26, 4, 31, 46, 6, 53, 821, 7, 57, 326, 1, 872, 6, 2, 71, 22, 1, 1, 51, 4, 1942, 11, 2, 44, 676, 174, 79, 138, 36, 19, 90, 92, 5, 1, 2, 780, 15, 8, 6, 175, 31, 824, 581, 29, 218, 633, 111, 57, 2, 1, 168, 92, 170]
# Show the processed sequence: the missing positions are filled with leading zeros up to length 100
print('len:',len(x_train[40]))
print(x_train[40])
len: 100
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    2   10  303   18    6    1
  146  205 1214  546    7 1059 1852    7 1305    8  404   48    4    1
    4  699   11  199   69   26    4   31   46    6   53  821    7   57
  326    1  872    6    2   71   22    1    1   51    4 1942   11    2
   44  676  174   79  138   36   19   90   92    5    1    2  780   15
    8    6  175   31  824  581   29  218  633  111   57    2    1  168
   92  170]

Now take the review at index 30, whose sequence length is 385: after processing, its length is 100, with the excess truncated.

# Show the original integer sequence
print('len:',len(x_train_seq[30]))
print(x_train_seq[30])
len: 385
[10, 422, 24, 73, 124, 1, 562, 19, 1948, 422, 24, 73, 995, 7, 97, 81, 92, 823, 1541, 1, 114, 235, 4, 23, 607, 181, 31, 915, 391, 1, 105, 14, 1, 61, 2, 719, 404, 27, 4, 23, 114, 105, 350, 7, 3, 192, 54, 10, 18, 6, 444, 19, 3, 663, 30, 2, 742, 10, 6, 3, 332, 488, 61, 59, 688, 1, 103, 14, 1, 551, 1948, 65, 53, 332, 13, 3, 502, 34, 304, 558, 86, 7, 258, 3, 128, 439, 570, 305, 7, 818, 1614, 1, 4, 12, 60, 5, 26, 1533, 121, 7, 1232, 305, 7, 1, 966, 146, 44, 3, 318, 2, 103, 472, 2, 6, 265, 5, 397, 23, 472, 17, 29, 1, 168, 54, 986, 141, 23, 1406, 1, 683, 5, 790, 76, 2, 23, 1949, 15, 3, 83, 1250, 30, 624, 528, 1, 411, 262, 42, 17, 242, 51, 1, 61, 7, 3, 303, 953, 525, 494, 454, 30, 742, 15, 3, 461, 19, 1, 1559, 1946, 1, 61, 6, 385, 599, 7, 60, 50, 1666, 17, 1, 87, 670, 1754, 1421, 7, 1, 18, 6, 303, 1376, 10, 6, 45, 4, 1, 114, 623, 203, 106, 2, 9, 102, 3, 172, 4, 104, 1, 132, 50, 1948, 2, 484, 22, 7, 1, 516, 6, 175, 1155, 1, 320, 513, 29, 1, 1004, 4, 1, 516, 1948, 139, 1, 8, 183, 5, 1, 495, 4, 1, 516, 5, 63, 484, 2, 42, 23, 495, 13, 8, 1994, 8, 2, 141, 336, 183, 1, 516, 2, 3, 1483, 15, 3, 83, 173, 2, 1047, 1, 128, 3, 476, 18, 142, 659, 40, 1279, 33, 89, 23, 787, 18, 7, 15, 295, 930, 1195, 86, 994, 155, 295, 930, 948, 13, 3, 293, 163, 14, 1, 697, 747, 17, 868, 11, 25, 469, 5, 843, 19, 657, 11, 46, 12, 160, 158, 14, 86, 7, 747, 15, 60, 103, 104, 1279, 44, 305, 7, 57, 270, 13, 27, 4, 1, 83, 904, 7, 3, 1025, 4, 40, 1780, 673, 1, 18, 3, 508, 11, 44, 90, 1928, 2, 2, 1, 279, 752, 1, 1376, 112, 2, 61, 6, 142, 127, 1060, 14, 4, 158, 778, 17, 14, 146, 142, 1310, 7, 657, 11, 10, 6, 1, 114, 18, 4, 1, 287, 1065, 41, 1580, 3, 562]
# Show the processed sequence: only the last 100 numbers are kept; the earlier ones are cut off
print('len:',len(x_train[30]))
print(x_train[30])
len: 100
[ 155  295  930  948   13    3  293  163   14    1  697  747   17  868
   11   25  469    5  843   19  657   11   46   12  160  158   14   86
    7  747   15   60  103  104 1279   44  305    7   57  270   13   27
    4    1   83  904    7    3 1025    4   40 1780  673    1   18    3
  508   11   44   90 1928    2    2    1  279  752    1 1376  112    2
   61    6  142  127 1060   14    4  158  778   17   14  146  142 1310
    7  657   11   10    6    1  114   18    4    1  287 1065   41 1580
    3  562]

5. Building the Models

With the data prepared in the previous section, we will now build several models and run the training and testing experiments in several parts.

5.1 Attempt 1: A Multilayer Perceptron

In this experiment we build and train a multilayer perceptron with an embedding layer added in front.

5.1.1 Preparing the Dataset

For convenience, the data-preparation code from the previous section is repeated here; later subsections will come back to modify and re-run it.

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from imdb_simple_util import read_files
import os
import tarfile
if not os.path.exists('./dataset/aclImdb'):
    tfile = tarfile.open('./dataset/aclImdb_v1.tar.gz','r:gz')
    result = tfile.extractall('./dataset/')

NUM_WORDS = 2000
MAXLEN = 100
aclImdbpath = './dataset/aclImdb/'
# Pass 'train' to read the training data
y_train,train_text = read_files('train', aclImdbpath)
# Pass 'test' to read the test data
y_test,test_text = read_files('test', aclImdbpath)
# Build a dictionary of NUM_WORDS (2,000) words,
# ranked by how often each English word appears in the training reviews
token = Tokenizer(num_words=NUM_WORDS)
token.fit_on_texts(train_text)
# Convert the training and test reviews into integer sequences
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
### Truncate or pad every sequence to a total length of MAXLEN (100)
x_train = sequence.pad_sequences(x_train_seq,maxlen=MAXLEN)
x_test = sequence.pad_sequences(x_test_seq,maxlen=MAXLEN)
read train files: 25000
read test files: 25000

5.1.2 Building the Model

Build the model. What differs from earlier projects is the embedding layer, which converts the integer sequences into sequences of vectors.

from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.embeddings import  Embedding
# Build the model
model = Sequential()
# Add an embedding layer: output dimension 32, input dimension 2000 (the 2,000-word dictionary),
# input length 100 (the length of each integer sequence).
# Dropout reduces overfitting; 20% of the neurons are randomly dropped in each training iteration
model.add(Embedding(output_dim=32,
                   input_dim=NUM_WORDS,
                   input_length=MAXLEN))
model.add(Dropout(0.2))

# Add a flatten layer: each sequence has 100 numbers and each becomes a 32-dimensional vector,
# so the flatten layer has 3,200 neurons
model.add(Flatten())
# Add a hidden layer
model.add(Dense(units=256,activation='relu'))
model.add(Dropout(0.35))
# Add the output layer
model.add(Dense(units=1,activation='sigmoid'))
# Show the model summary
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 32)           64000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               819456    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
=================================================================
Total params: 883,713
Trainable params: 883,713
Non-trainable params: 0
_________________________________________________________________

The structure of the model is shown below.

[Figure: structure of the multilayer perceptron model]

5.1.3 Training

# Fraction of the training data held out for validation
VALIDATION_SPLIT = 0.2
# Number of training epochs
EPOCHS = 10
# Batch size
BATCH_SIZE = 100
# Verbosity of the training log
VERBOSE = 2
# Dictionary for formatting the labels
format_dict={1:'positive review',0:'negative review'}
# Define the training configuration
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# Start training
train_history = model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 10s - loss: 0.4637 - acc: 0.7692 - val_loss: 0.6435 - val_acc: 0.6996
Epoch 2/10
 - 8s - loss: 0.2495 - acc: 0.8980 - val_loss: 0.5881 - val_acc: 0.7476
Epoch 3/10
 - 7s - loss: 0.1137 - acc: 0.9622 - val_loss: 0.7675 - val_acc: 0.7444
Epoch 4/10
 - 8s - loss: 0.0276 - acc: 0.9942 - val_loss: 1.0454 - val_acc: 0.7442
Epoch 5/10
 - 10s - loss: 0.0054 - acc: 0.9999 - val_loss: 1.2506 - val_acc: 0.7386
Epoch 6/10
 - 12s - loss: 0.0018 - acc: 1.0000 - val_loss: 1.3122 - val_acc: 0.7458
Epoch 7/10
 - 12s - loss: 9.4674e-04 - acc: 1.0000 - val_loss: 1.3667 - val_acc: 0.7460
Epoch 8/10
 - 12s - loss: 6.0775e-04 - acc: 1.0000 - val_loss: 1.4104 - val_acc: 0.7476
Epoch 9/10
 - 9s - loss: 4.4319e-04 - acc: 1.0000 - val_loss: 1.4795 - val_acc: 0.7444
Epoch 10/10
 - 10s - loss: 3.3365e-04 - acc: 1.0000 - val_loss: 1.5059 - val_acc: 0.7456
# Evaluate the model's accuracy on the test set
scores = model.evaluate(x_test,y_test)
print('loss=',scores[0])
print('accuracy=',scores[1])
25000/25000 [==============================] - 2s 89us/step
loss= 0.9991980175465346
accuracy= 0.8156
# Plot the training history
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train history')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train','validation'],loc = 'upper left')
    plt.show()
show_train_history(train_history,'acc','val_acc')

[Figure: training vs. validation accuracy]

show_train_history(train_history,'loss','val_loss')

[Figure: training vs. validation loss]

5.1.4 Prediction

We now feed the test set into the trained model and make predictions.

# Make predictions
predict = model.predict_classes(x_test)
# Show the first ten predictions
predict[:10].reshape(-1)
array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1], dtype=int32)
# Flatten the predictions into a one-dimensional array
predict_classes = predict.reshape(-1)
predict_classes[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1], dtype=int32)
# Define show_text_and_label to display a review with its true label and prediction
def show_text_and_label(i):
    print(test_text[i])
    print('true label:', format_dict[y_test[i]],
         '    prediction:', format_dict[predict_classes[i]])
# Show the prediction for index 50
show_text_and_label(50)
Latcho Drom, or Safe Journey, is the second film in Tony Gatlif's trilogy of the Romany people. The film is a visual depiction and historical record of Romany life in European and Middle Eastern countries. Even though the scenes are mostly planned, rehearsed, and staged there is not a conventional story line and the dialog does not explain activities from scene to scene. Instead, the film allows the viewer to have sometimes a glimpse, sometimes a more in-depth view of these people during different eras and in different countries, ranging from India, Egypt, Romania, Hungary, Slovakia, France, and Spain.  The importance of music in Romany culture is clearly expressed throughout the film. It is a vital part of every event and an important means of communication. Everything they do is expressed with music. Dance is another important activity. Like Romany music, it is specialized and deeply personal, something they alone know how to do correctly. We are provided glimpses into their everyday activities, but the film is not a detailed study of their lives. Rather, it is a testament to their culture, focusing on the music and dance they have created and which have made them unique.  Mr. Gatlif portrays the nomadic groups in a positive way. However, we also witness the rejection, distrust, and alienation they receive from the non-Romany population. It seems that the culture they have developed over countless generations, and inspired from diverse countries, will fade into oblivion because conventional society has no place for nomadic ways.  The other films in the trilogy are Les Princes (1983) and Gadjo Dilo (1998).
true label: positive review     prediction: positive review

5.2 Attempt 2: Scaling Up the Text Processing

In Attempt 1 we built a 2,000-word dictionary; now we enlarge it to 4,000 words and increase the truncate/pad length to 400.

5.2.1 Regenerating the Dataset with New Preprocessing Parameters
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from imdb_simple_util import read_files
import os
import tarfile
import numpy as np
if not os.path.exists('./dataset/aclImdb'):
    tfile = tarfile.open('./dataset/aclImdb_v1.tar.gz','r:gz')
    result = tfile.extractall('./dataset/')

NUM_WORDS = 4000
MAXLEN = 400
aclImdbpath = './dataset/aclImdb/'
# Pass 'train' to read the training data
y_train,train_text = read_files('train', aclImdbpath)
# Pass 'test' to read the test data
y_test,test_text = read_files('test', aclImdbpath)
# Build a dictionary of NUM_WORDS (4,000) words,
# ranked by how often each English word appears in the training reviews
token = Tokenizer(num_words=NUM_WORDS)
token.fit_on_texts(train_text)
# Convert the training and test reviews into integer sequences
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
### Truncate or pad every sequence to a total length of MAXLEN (400)
x_train = sequence.pad_sequences(x_train_seq,maxlen=MAXLEN)
x_test = sequence.pad_sequences(x_test_seq,maxlen=MAXLEN)
read train files: 25000
read test files: 25000
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.embeddings import  Embedding
# Build the model
model = Sequential()
# Add an embedding layer: output dimension 32, input dimension 4000 (the 4,000-word dictionary),
# input length 400 (the length of each integer sequence).
# Dropout reduces overfitting; 20% of the neurons are randomly dropped in each training iteration
model.add(Embedding(output_dim=32,
                   input_dim=NUM_WORDS,
                   input_length=MAXLEN))
model.add(Dropout(0.2))

# Add a flatten layer: each sequence has 400 numbers and each becomes a 32-dimensional vector,
# so the flatten layer has 12,800 neurons
model.add(Flatten())
# Add a hidden layer
model.add(Dense(units=256,activation='relu'))
model.add(Dropout(0.35))
# Add the output layer
model.add(Dense(units=1,activation='sigmoid'))
# Show the model summary
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 400, 32)           128000    
_________________________________________________________________
dropout_3 (Dropout)          (None, 400, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 12800)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               3277056   
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 257       
=================================================================
Total params: 3,405,313
Trainable params: 3,405,313
Non-trainable params: 0
_________________________________________________________________

The structure of the model is shown below.

[Figure: structure of the enlarged multilayer perceptron model]

# Fraction of the training data held out for validation
VALIDATION_SPLIT = 0.2
# Number of training epochs
EPOCHS = 10
# Batch size
BATCH_SIZE = 100
# Verbosity of the training log
VERBOSE = 1
# Dictionary for formatting the labels
format_dict={1:'positive review',0:'negative review'}
# Define the training configuration
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# Start training
train_history = model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 34s 2ms/step - loss: 0.4736 - acc: 0.7511 - val_loss: 0.4240 - val_acc: 0.8098
Epoch 2/10
20000/20000 [==============================] - 26s 1ms/step - loss: 0.1729 - acc: 0.9345 - val_loss: 0.5732 - val_acc: 0.7740
Epoch 3/10
20000/20000 [==============================] - 27s 1ms/step - loss: 0.0493 - acc: 0.9862 - val_loss: 0.4824 - val_acc: 0.8440
Epoch 4/10
20000/20000 [==============================] - 27s 1ms/step - loss: 0.0089 - acc: 0.9990 - val_loss: 0.9248 - val_acc: 0.7710
Epoch 5/10
20000/20000 [==============================] - 28s 1ms/step - loss: 0.0022 - acc: 1.0000 - val_loss: 0.8898 - val_acc: 0.7992
Epoch 6/10
20000/20000 [==============================] - 27s 1ms/step - loss: 9.1428e-04 - acc: 1.0000 - val_loss: 0.9884 - val_acc: 0.7916
Epoch 7/10
20000/20000 [==============================] - 27s 1ms/step - loss: 5.4301e-04 - acc: 1.0000 - val_loss: 0.9820 - val_acc: 0.7996
Epoch 8/10
20000/20000 [==============================] - 27s 1ms/step - loss: 3.7675e-04 - acc: 1.0000 - val_loss: 1.0406 - val_acc: 0.7968
Epoch 9/10
20000/20000 [==============================] - 27s 1ms/step - loss: 2.7092e-04 - acc: 1.0000 - val_loss: 1.0613 - val_acc: 0.7980
Epoch 10/10
20000/20000 [==============================] - 27s 1ms/step - loss: 2.0728e-04 - acc: 1.0000 - val_loss: 1.0881 - val_acc: 0.7972
# Evaluate the model's accuracy on the test set
scores = model.evaluate(x_test,y_test)
print('loss=',scores[0])
print('accuracy=',scores[1])
25000/25000 [==============================] - 4s 169us/step
loss= 0.7162510861606896
accuracy= 0.85664
# Make predictions
predict = model.predict_classes(x_test)
# Plot the training history
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train history')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train','validation'],loc = 'upper left')
    plt.show()

show_train_history(train_history,'acc','val_acc')
show_train_history(train_history,'loss','val_loss')

[Figure: training vs. validation accuracy]

[Figure: training vs. validation loss]

With the dictionary enlarged to 4,000 words and the sequence length extended to 400, training takes longer, but test accuracy improves from roughly 0.81 in the first attempt to roughly 0.85.

5.3 Attempt 3: Building and Predicting with an RNN Model

In this section we modify the model defined in the previous section and use a recurrent neural network (RNN). RNNs are a powerful technique that has been applied to speech recognition, language translation, stock prediction, and more; they are even used together with image recognition to describe the content of a picture.

5.3.1 An Introduction to RNNs

The idea behind an RNN is to exploit sequential information. A traditional neural network assumes that all inputs (and outputs) are independent of one another, which is a poor assumption for many tasks: if you want to predict the next word in a sequence, you had better know which words came before it. An RNN is called recurrent because it performs the same operation on every element of the sequence, and each operation depends on the result of the previous one. Another way to think about it is that the RNN remembers what has been computed so far. In theory an RNN can use information from arbitrarily long sequences, but in practice it can only look back a few steps. A typical RNN structure is shown below:

[Figure: an RNN cell and the same network unrolled over time]

The figure shows the RNN unrolled into a full network, meaning the network is written out for the entire sequence. For example, if we care about a sentence of five words, the network is unrolled into five layers, one per word.

5.3.2 Building the RNN Model

We continue to use the data from the previous section as this experiment's dataset.

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from imdb_simple_util import read_files
import os
import tarfile
import numpy as np
if not os.path.exists('./dataset/aclImdb'):
    tfile = tarfile.open('./dataset/aclImdb_v1.tar.gz','r:gz')
    result = tfile.extractall('./dataset/')
aclImdbpath = './dataset/aclImdb/'
# Pass 'train' to read the training data
y_train,train_text = read_files('train', aclImdbpath)
# Pass 'test' to read the test data
y_test,test_text = read_files('test', aclImdbpath)
# Build a 4,000-word dictionary,
# ranked by how often each English word appears in the training reviews
token = Tokenizer(num_words=4000)
token.fit_on_texts(train_text)
# Convert the training and test reviews into integer sequences
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
### Truncate or pad every sequence to a total length of 400
x_train = sequence.pad_sequences(x_train_seq,maxlen=400)
x_test = sequence.pad_sequences(x_test_seq,maxlen=400)
Using TensorFlow backend.


read train files: 25000
read test files: 25000
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.embeddings import  Embedding
from keras.layers.recurrent import SimpleRNN
# Build the model
model = Sequential()
# Add an embedding layer: output dimension 32, input dimension 4000 (the 4,000-word dictionary),
# input length 400 (the length of each integer sequence).
# Dropout reduces overfitting; 35% of the neurons are randomly dropped in each training iteration
model.add(Embedding(output_dim=32,
                   input_dim=4000,
                   input_length=400))
model.add(Dropout(0.35))

# Add a SimpleRNN layer with 16 units
model.add(SimpleRNN(units=16))
# Add a hidden layer
model.add(Dense(units=256,activation='relu'))
model.add(Dropout(0.35))
# Add the output layer
model.add(Dense(units=1,activation='sigmoid'))
# Show the model summary
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_6 (Embedding)      (None, 400, 32)           128000    
_________________________________________________________________
dropout_11 (Dropout)         (None, 400, 32)           0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 16)                784       
_________________________________________________________________
dense_11 (Dense)             (None, 256)               4352      
_________________________________________________________________
dropout_12 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 257       
=================================================================
Total params: 133,393
Trainable params: 133,393
Non-trainable params: 0
_________________________________________________________________

The structure of the model is shown below.

[Figure: structure of the RNN model]

# Fraction of the training data held out for validation
VALIDATION_SPLIT = 0.2
# Number of training epochs
EPOCHS = 10
# Batch size
BATCH_SIZE = 100
# Verbosity of the training log
VERBOSE = 1
# Dictionary for formatting the labels
format_dict={1:'positive review',0:'negative review'}

# Define the training configuration
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# Start training
train_history = model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 65s 3ms/step - loss: 0.5016 - acc: 0.7487 - val_loss: 0.5265 - val_acc: 0.7658
Epoch 2/10
20000/20000 [==============================] - 60s 3ms/step - loss: 0.3322 - acc: 0.8646 - val_loss: 0.5412 - val_acc: 0.7652
Epoch 3/10
20000/20000 [==============================] - 61s 3ms/step - loss: 0.2743 - acc: 0.8917 - val_loss: 0.5810 - val_acc: 0.7660
Epoch 4/10
20000/20000 [==============================] - 60s 3ms/step - loss: 0.2420 - acc: 0.9053 - val_loss: 0.4979 - val_acc: 0.8010
Epoch 5/10
20000/20000 [==============================] - 60s 3ms/step - loss: 0.2199 - acc: 0.9159 - val_loss: 0.8347 - val_acc: 0.6836
Epoch 6/10
20000/20000 [==============================] - 60s 3ms/step - loss: 0.1971 - acc: 0.9236 - val_loss: 0.6687 - val_acc: 0.7688
Epoch 7/10
20000/20000 [==============================] - 61s 3ms/step - loss: 0.1661 - acc: 0.9377 - val_loss: 0.4780 - val_acc: 0.8248
Epoch 8/10
20000/20000 [==============================] - 61s 3ms/step - loss: 0.1677 - acc: 0.9360 - val_loss: 0.7464 - val_acc: 0.7804
Epoch 9/10
20000/20000 [==============================] - 63s 3ms/step - loss: 0.1170 - acc: 0.9568 - val_loss: 0.6739 - val_acc: 0.7964
Epoch 10/10
20000/20000 [==============================] - 62s 3ms/step - loss: 0.1065 - acc: 0.9601 - val_loss: 0.8532 - val_acc: 0.7836
scores = model.evaluate(x_test,y_test)
print('loss=',scores[0])
print('accuracy=',scores[1])
25000/25000 [==============================] - 37s 1ms/step
loss= 0.5778125722301006
accuracy= 0.8438

The RNN model reaches an accuracy of roughly 0.84, and its test loss is somewhat lower than in the previous two experiments.

5.4 Attempt 4: Building and Predicting with an LSTM Model

Long Short-Term Memory (LSTM) networks were designed specifically to address the long-term dependency problem. All RNNs take the form of a chain of repeating neural-network modules; in a standard RNN, this repeating module has a very simple structure.

The plain RNN introduced above suffers from the long-term dependency problem: during training it runs into vanishing or exploding gradients, so after enough forward and backpropagation steps the gradients either blow up or shrink to zero. Sometimes only recent information is needed for the current task. Consider a language model trying to predict the next word from the previous ones: to predict the last word of "the clouds are in the sky", no further context is needed, since the next word is obviously "sky". In such cases, where the gap between the relevant information and the place it is needed is small, an RNN can learn to use the past. But when that gap grows, the long-term dependency problem makes the RNN lose this ability, which is why an LSTM handles these situations better.

5.4.1 Building the LSTM Model

from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from imdb_simple_util import read_files
import os
import tarfile
import numpy as np
if not os.path.exists('./dataset/aclImdb'):
    tfile = tarfile.open('./dataset/aclImdb_v1.tar.gz','r:gz')
    result = tfile.extractall('./dataset/')
NUM_WORDS = 4000
MAXLEN = 400
aclImdbpath = './dataset/aclImdb/'
# Pass 'train' to read the training data
y_train,train_text = read_files('train', aclImdbpath)
# Pass 'test' to read the test data
y_test,test_text = read_files('test', aclImdbpath)
# Build a dictionary of NUM_WORDS (4,000) words,
# ranked by how often each English word appears in the training reviews
token = Tokenizer(num_words=NUM_WORDS)
token.fit_on_texts(train_text)
# Convert the training and test reviews into integer sequences
x_train_seq = token.texts_to_sequences(train_text)
x_test_seq = token.texts_to_sequences(test_text)
### Truncate or pad every sequence to a total length of MAXLEN (400)
x_train = sequence.pad_sequences(x_train_seq,maxlen=MAXLEN)
x_test = sequence.pad_sequences(x_test_seq,maxlen=MAXLEN)
Using TensorFlow backend.


read train files: 25000
read test files: 25000
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.embeddings import  Embedding
from keras.layers.recurrent import LSTM
# Build the model
model = Sequential()
# Add an embedding layer: output dimension 32, input dimension 4000 (the 4,000-word dictionary),
# input length 400 (the length of each integer sequence).
# Dropout reduces overfitting; 20% of the neurons are randomly dropped in each training iteration
model.add(Embedding(output_dim=32,
                   input_dim=4000,
                   input_length=400))
model.add(Dropout(0.2))

# Add an LSTM layer with 32 units
model.add(LSTM(32))
# Add a hidden layer
model.add(Dense(units=256,activation='relu'))
model.add(Dropout(0.2))
# Add the output layer
model.add(Dense(units=1,activation='sigmoid'))
# Show the model summary
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 400, 32)           128000    
_________________________________________________________________
dropout_3 (Dropout)          (None, 400, 32)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_3 (Dense)              (None, 256)               8448      
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 257       
=================================================================
Total params: 145,025
Trainable params: 145,025
Non-trainable params: 0
_________________________________________________________________
# Fraction of the training data held out for validation
VALIDATION_SPLIT = 0.2
# Number of training epochs
EPOCHS = 10
# Batch size
BATCH_SIZE = 100
# Verbosity of the training log
VERBOSE = 1
# Dictionary for formatting the labels
format_dict={1:'positive review',0:'negative review'}

# Define the training configuration
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# Start training
train_history = model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
W0107 17:58:51.004291 4418139584 deprecation_wrapper.py:119] From /Users/jingyuyan/anaconda3/envs/dlwork/lib/python3.6/site-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0107 17:58:51.044086 4418139584 deprecation_wrapper.py:119] From /Users/jingyuyan/anaconda3/envs/dlwork/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3376: The name tf.log is deprecated. Please use tf.math.log instead.

W0107 17:58:51.054200 4418139584 deprecation.py:323] From /Users/jingyuyan/anaconda3/envs/dlwork/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 290s 14ms/step - loss: 0.4723 - acc: 0.7614 - val_loss: 0.5991 - val_acc: 0.7300
Epoch 2/10
20000/20000 [==============================] - 268s 13ms/step - loss: 0.2604 - acc: 0.8936 - val_loss: 0.7511 - val_acc: 0.6710
Epoch 3/10
20000/20000 [==============================] - 304s 15ms/step - loss: 0.2313 - acc: 0.9110 - val_loss: 0.4142 - val_acc: 0.8198
Epoch 4/10
20000/20000 [==============================] - 239s 12ms/step - loss: 0.1929 - acc: 0.9264 - val_loss: 0.6525 - val_acc: 0.7644
Epoch 5/10
20000/20000 [==============================] - 312s 16ms/step - loss: 0.1767 - acc: 0.9324 - val_loss: 0.4321 - val_acc: 0.8326
Epoch 6/10
20000/20000 [==============================] - 264s 13ms/step - loss: 0.1485 - acc: 0.9448 - val_loss: 0.4275 - val_acc: 0.8234
Epoch 7/10
20000/20000 [==============================] - 290s 14ms/step - loss: 0.1418 - acc: 0.9464 - val_loss: 0.4818 - val_acc: 0.8426
Epoch 8/10
20000/20000 [==============================] - 247s 12ms/step - loss: 0.1115 - acc: 0.9597 - val_loss: 0.5610 - val_acc: 0.8140
Epoch 9/10
20000/20000 [==============================] - 243s 12ms/step - loss: 0.1135 - acc: 0.9584 - val_loss: 0.6222 - val_acc: 0.8334
Epoch 10/10
20000/20000 [==============================] - 338s 17ms/step - loss: 0.1037 - acc: 0.9622 - val_loss: 0.6642 - val_acc: 0.8306

Training takes relatively long; if you have access to a GPU, training on it will save time. Once training finishes, evaluate the model.

scores = model.evaluate(x_test,y_test)
print('loss=',scores[0])
print('accuracy=',scores[1])
25000/25000 [==============================] - 105s 4ms/step
loss= 0.495558518127203
accuracy= 0.8608

The model's accuracy has reached roughly 0.86, so let's save the model.

# Save the complete model (architecture plus weights) so it can be reloaded later with load_model
model.save('model.h5')

6. Predicting Arbitrary Reviews

With the model trained, we can test it on reviews of our own choosing. Open the IMDb website and pick any film you like; here we use reviews of two films, A Beautiful Mind and the recent hit Avengers: Endgame.

[Figure: IMDb user reviews for A Beautiful Mind]

[Figure: IMDb user reviews for Avengers: Endgame]

First, pick one review of Avengers: Endgame to test:

 Cheap dialogs, non sense violence. The only thing the movie has to offer is explosions. 
 Is it really that hard to make quality superhero movies? 
 Look at Christopher Nolan's series of Batman movies, 
 those are quality films. This movie is flat, it can be consumed as fast food, 
 nothing to offer really.

This is clearly a rather negative review.

Then pick a review of A Beautiful Mind:

Although this film was slow paced, it was kept a float with Russell Crowe's best performance.
The screenplay was excellent, as was of course Ron Howard's directing.
The writing was great and I found the story kept my attention throughout the entire film.
The performances by Ed Harris, Jennifer Connelly and Christopher Plummer where excellent.
Certainly worth seeing.

And this one is clearly a positive review.

text_1 = '''Cheap dialogs, non sense violence. The only thing the movie has to offer is explosions. 
            Is it really that hard to make quality superhero movies? 
            Look at Christopher Nolan's series of Batman movies, 
            those are quality films. This movie is flat, it can be consumed as fast food, 
            nothing to offer really.'''
text_2 = '''Although this film was slow paced, it was kept a float with Russell Crowe's best performance.
            The screenplay was excellent, as was of course Ron Howard's directing.
            The writing was great and I found the story kept my attention throughout the entire film.
            The performances by Ed Harris, Jennifer Connelly and Christopher Plummer where excellent.
            Certainly worth seeing.'''
# Define a helper that turns review text into the model's input format
def text2data(text, maxlen=400):
    # Convert the review text into an integer sequence
    text_seq = token.texts_to_sequences([text])
    # Show the integer sequence
    print('seq:', text_seq[0])
    # Show its length
    print('len:', len(text_seq[0]))
    # Truncate or pad to the fixed length
    pad_input_seq = sequence.pad_sequences(text_seq,maxlen=maxlen)
    return pad_input_seq
# Convert the Avengers: Endgame review
text_1_data = text2data(text_1)
seq: [701, 3242, 696, 277, 563, 1, 60, 150, 1, 16, 44, 5, 1465, 6, 3977, 6, 8, 62, 11, 250, 5, 93, 485, 3787, 98, 164, 29, 1363, 197, 4, 1353, 98, 144, 22, 485, 104, 10, 16, 6, 1031, 8, 66, 26, 13, 698, 1646, 160, 5, 1465, 62]
len: 50
# Convert the A Beautiful Mind review
text_2_data = text2data(text_2)
seq: [258, 10, 18, 12, 546, 1781, 8, 12, 825, 3, 15, 2609, 114, 235, 1, 877, 12, 317, 13, 12, 4, 260, 2708, 936, 1, 483, 12, 83, 2, 9, 254, 1, 61, 825, 57, 687, 465, 1, 432, 18, 1, 350, 30, 1660, 2161, 2097, 2, 1363, 116, 317, 430, 286, 315]
len: 53
# Load the saved model
from keras.models import load_model
model = load_model('model.h5')
# Stack the two inputs into one batch
concat = np.concatenate([text_1_data, text_2_data], axis=0)
# Predict
res = model.predict_classes(concat)
res
array([[0],
       [1]], dtype=int32)
format_dict[res[0][0]], format_dict[res[1][0]]
('negative review', 'positive review')

The predictions match our expectations. Feel free to try more reviews on your own.

Conclusion

In this chapter we showed how to use Keras to perform sentiment analysis on movie reviews. To go further, you might look into how machine translation works, or how sentiment analysis can be done for Chinese text.

