RError.com
Accepted
MaxU - stop genocide of UA
Asked: 2020-03-03 18:15:51 +0000 UTC

Comparing the performance of classifiers for sentiment (tone) evaluation of comments/tweets


While answering this question, I wanted to compare the effectiveness of different classifiers for evaluating the sentiment (tone) of comments/tweets.

I'm not sure this question fully complies with the StackOverflow rules, but I believe it will be useful to anyone interested in natural-language text classification.

P.S. I have attempted to solve the problem in the form of an answer.

python
  • 1 Answer
  • 10 Views

1 Answer

  Best Answer
    MaxU - stop genocide of UA
    Answered: 2020-03-03T18:16:03Z

    In this answer I compare the performance (prediction accuracy) of the following classifiers:

    • SGDClassifier (Stochastic Gradient Descent)
    • MultinomialNB (Multinomial Naive Bayes)
    • KNeighborsClassifier (K-Nearest Neighbors vote)
    • SVC(kernel='linear') (Support Vector Classification; linear kernel)
    • SVC(kernel='rbf') (Support Vector Classification; Radial Basis Function kernel)
    • MLPClassifier (Multi-Layer Perceptron)

    I also tested:

    • LogisticRegression (a more advanced relative of SGDClassifier)
    • DecisionTreeClassifier
    • RandomForestClassifier

    DecisionTreeClassifier and RandomForestClassifier gave poor results (prediction accuracy) both on the corpus and on new (self-written) comments, so I decided to drop them.

    All models were trained on just 5% of the whole corpus, i.e. the classifiers never saw the remaining 95% of the corpus before prediction.
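    The 5% / 95% split described above can be sketched as follows. This is a minimal illustration with a dummy DataFrame; the `random_state` is an assumption added here for reproducibility and is not present in the original `get_train_data` below:

```python
import pandas as pd

# Dummy stand-in for the tweet corpus (ttext = text, ttype = sentiment label)
df = pd.DataFrame({'ttext': ['tweet {}'.format(i) for i in range(1000)],
                   'ttype': [1, -1] * 500})

train = df.sample(frac=0.05, random_state=0)   # 5% used for training
test = df.drop(train.index)                    # remaining 95% stays unseen

print(len(train), len(test))  # 50 950
```

    The indices of the two frames are disjoint, so no training row can leak into the evaluation set.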

    Prediction results on the whole (100%) corpus:

    In [11]: df.drop('ttext',1)
    Out[11]:
            ttype  pred_SGD  pred_MultinomialNB  pred_SVC_linear  pred_SVC_rbf  pred_MLP_NN
    30221       1         1                   1                1             1            1
    88858       1         1                   1                1             1            1
    220076     -1        -1                  -1               -1            -1           -1
    201195     -1        -1                  -1               -1            -1           -1
    78267       1         1                   1                1             1            1
    71817       1         1                   1                1             1            1
    207275     -1        -1                  -1               -1            -1           -1
    226007     -1        -1                  -1               -1            -1           -1
    140091     -1        -1                  -1               -1            -1           -1
    2433        1         1                   1                1             1            1
    ...       ...       ...                 ...              ...           ...          ...
    199205     -1        -1                  -1               -1            -1           -1
    178062     -1        -1                  -1               -1            -1           -1
    54428       1         1                   1                1             1            1
    176046     -1        -1                  -1               -1            -1           -1
    171906     -1        -1                  -1               -1            -1           -1
    53821       1         1                   1                1             1            1
    113037      1         1                   1                1             1            1
    87279       1         1                   1                1             1            1
    6561        1         1                   1                1             1            1
    30793       1         1                   1                1             1            1
    
    [226834 rows x 6 columns]
    

    Accuracy:

    r = (df.filter(regex='pred_')
           .rename(columns=lambda c: c.replace('pred_', ''))
           .eq(df['ttype'], axis=0).mean()
           .to_frame('Accuracy'))
    
    In [55]: r
    Out[55]:
                   Accuracy
    SGD            0.998554
    MultinomialNB  0.991165
    SVC_linear     0.998611
    SVC_rbf        0.958441
    MLP_NN         0.998492
    

    Plot:

    import matplotlib.pyplot as plt

    ax = r.plot.barh(alpha=0.55, title='Classifier Comparison', figsize=(12,8))
    plt.tight_layout()
    
    for rect in ax.patches:
        width = rect.get_width()
        ax.text(0.5, rect.get_bbox().get_points()[:, 1].mean(),
                '{:.2%}'.format(width), ha='center', va='center')
    

    (bar chart: classifier accuracy comparison)

    Checking the models against our own hand-written comments:

    test = get_test_data()
    
    In [85]: test
    Out[85]:
                                                                          ttext  ttype
    0             Погода сегодня полная фигня, но настроение все равно отличное      1
    1  Ну сходил я на этот фильм. Отзывы были нормальные, а оказалось - отстой!     -1
    2                                                       StackOverflow рулит      1
    3                                                           все очень плохо     -1
    4                                                          бывало и получше     -1
    5                                                           да вы задолбали     -1
    6                                                         ненавижу вас :)))      1
    7                                                              ненавижу вас     -1
    
    
    test = test_unseen_dataset(grid, test, 'ttext')
    
    In [112]: test.drop('ttext',1)
    Out[112]:
       ttype  pred_SGD  pred_MultinomialNB  pred_KNN  pred_SVC_linear  pred_SVC_rbf  pred_MLP_NN
    0      1         1                   1         1                1             1            1
    1     -1        -1                  -1        -1               -1            -1           -1
    2      1         1                   1        -1                1            -1            1
    3     -1         1                  -1        -1                1            -1           -1
    4     -1        -1                  -1        -1               -1            -1           -1
    5     -1         1                  -1        -1                1            -1           -1
    6      1         1                   1        -1                1             1            1
    7     -1        -1                  -1        -1               -1            -1           -1
    

    Computing the accuracy:

    r2 = (test.filter(regex='pred_')
              .rename(columns=lambda c: c.replace('pred_', ''))
              .eq(test['ttype'], axis=0).mean()
              .to_frame('Accuracy'))
    
    In [114]: r2
    Out[114]:
                   Accuracy
    SGD               0.750
    MultinomialNB     1.000
    KNN               0.750
    SVC_linear        0.750
    SVC_rbf           0.875
    MLP_NN            1.000
    

    (bar chart: classifier accuracy on the hand-written test set)

    Program code for training the models:

    # (с) https://ru.stackoverflow.com/users/211923/maxu?tab=profile
    
    # Corpus download: http://study.mokoron.com/
    # Corpus (c)
    # positive: https://www.dropbox.com/s/fnpq3z4bcnoktiv/positive.csv?dl=0
    # negative: https://www.dropbox.com/s/r6u59ljhhjdg6j0/negative.csv?dl=0
    # join them together: type positive.csv negative.csv > pos_neg.csv
    
    #cols = 'id tdate tmane ttext ttype trep tfav tstcount tfol tfrien listcount'.split()
    
    try:
        from pathlib import Path
    except ImportError:             # Python 2
        from pathlib2 import Path
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import SGDClassifier, LogisticRegression
    from sklearn.naive_bayes import MultinomialNB, GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.pipeline import Pipeline, make_pipeline
    try:
        import joblib
    except ImportError:         # older scikit-learn versions bundle joblib
        from sklearn.externals import joblib
    
    
    def get_train_data(path, frac=0.15, **kwargs):
        df = pd.read_csv(path, sep=';', header=None,
                         names=['id','ttext','ttype'],
                         usecols=[0,3,4], **kwargs)
        # Speed up: randomly select 15% of data
        # comment it out for better prediction performance
        return df.sample(frac=frac)
    
    
    def get_test_data(path=None, **kwargs):
        if path:
            return pd.read_csv(path, **kwargs)
        else:   # generate a dummy DF
            test = pd.DataFrame({
                    'ttext':['Погода сегодня полная фигня, но настроение все равно отличное',
                    'Ну сходил я на этот фильм. Отзывы были нормальные, а оказалось - отстой!',
                    'StackOverflow рулит', 'все очень плохо', 'бывало и получше', 'да вы задолбали',
                    'ненавижу вас :)))', 'ненавижу вас']
            })
            test['ttype'] = [1, -1, 1, -1, -1, -1, 1, -1]
            return test
    
    def fit_all_classifiers_grid(X, y, classifiers, **common_grid_kwargs):
        grids = {}
        for clf in classifiers:
            print('{:-^70}'.format(' [' + clf['name'] + '] '))
            pipe = Pipeline([
                        ("vect", CountVectorizer()),
                        (clf['name'], clf['clf'])])
            grids[clf['name']] = (GridSearchCV(pipe,
                                               param_grid=clf['parm_grid'],
                                               **common_grid_kwargs)
                                      .fit(X, y))
            # saving single trained model ...
            joblib.dump(grids[clf['name']], './pickle/{}.pkl'.format(clf['name']))
        return grids
    
    classifiers = [
        {   'name':     'SGD',
            'clf':      SGDClassifier(),
            'title':    "SGDClassifier",
            'parm_grid':  {
                            'vect__min_df':         [1, 2, 3],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'SGD__alpha':           [0.0001, 0.001, 0.01, 0.1],
                            'SGD__max_iter':        [200]
            } 
        },
        #{   'name':     'LogRegr',
        #    'clf':      LogisticRegression(),
        #    'title':    "LogisticRegression",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 2, 3],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'LogRegr__C':           [5, 10],
        #                    'LogRegr__max_iter':    [100, 200]
        #    } 
        #},
        {   'name':     'MultinomialNB',
            'clf':      MultinomialNB(),
            'title':    "MultinomialNB",
            'parm_grid':  {
                            'vect__min_df':         [1, 2, 5, 7],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'MultinomialNB__alpha': [0.0001, 0.001, 0.01, 0.1]
            } 
        },
        {   'name':     'KNN',
            'clf':      KNeighborsClassifier(),
            'title':    "K-Neighbors",
            'parm_grid':  {
                            'vect__min_df':         [1, 3, 5, 7],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'KNN__n_neighbors':     [3, 4, 5]
            } 
        },
        {   'name':     'SVC_linear',
            'clf':      SVC(),
            'title':    "SVC (linear)",
            'parm_grid':  {
                            'vect__min_df':         [1, 3, 5],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'SVC_linear__kernel':   ['linear'],
                            'SVC_linear__C':        [0.025, 0.1, 0.5],
            } 
        },
        {   'name':     'SVC_rbf',
            'clf':      SVC(),
            'title':    "SVC (rbf)",
            'parm_grid':  {
                            'vect__min_df':         [1, 3, 5],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'SVC_rbf__kernel':      ['rbf'],
                            'SVC_rbf__gamma':       ['auto'],
                            'SVC_rbf__C':           [0.5, 1, 2],
            } 
        },
        #{   'name':     'DecisionTree',
        #    'clf':      DecisionTreeClassifier(),
        #    'title':    "DecisionTree",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 3, 5],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'DecisionTree__max_depth':  [3, 5],
        #    } 
        #},
        #{   'name':     'RandomForest',
        #    'clf':      RandomForestClassifier(),
        #    'title':    "RandomForest",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 3, 5],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'RandomForest__max_depth':      [3, 5],
        #                    'RandomForest__n_estimators':   [10],
        #                    'RandomForest__max_features':   [1],
        #    } 
        #},
        {   'name':     'MLP_NN',                  # NOTE: very slow, might give poor accuracy on small data sets
            'clf':      MLPClassifier(),
            'title':    "MLP NN",
            'parm_grid':  {
                            'vect__min_df':         [3, 5, 7],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'MLP_NN__activation':   ['relu'],
                            'MLP_NN__alpha':        [0.0001, 0.001, 0.01, 0.1],
            } 
        },
        #{   'name':     'AdaBoost',                # NOTE: poor accuracy
        #    'clf':      AdaBoostClassifier(),
        #    'title':    "AdaBoost",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 3, 5, 7],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'AdaBoost__n_estimators':   [25, 50, 75, 150],
        #    } 
        #},
    ]
    
    
    def print_grid_results(grids):
        for name, clf in grids.items():
            print('{:-^70}'.format(' [' + name + '] '))
            print('Score:\t\t{:.2%}'.format(clf.best_score_))
            print('Parameters:\t{}'.format(clf.best_params_))
            print('*' * 70)
    
    
    def print_best_features(grids, clf_name, n=20):
        clf = grids[clf_name]
        if not hasattr(clf.best_estimator_.named_steps[clf_name], 'coef_'):
            print('*' * 70)
            print('Attribute [coef_] not available for [{}]'.format(clf_name))
            print('*' * 70)
            return
        features = clf.best_estimator_.named_steps['vect'].get_feature_names()
        coefs = pd.Series(clf.best_estimator_.named_steps[clf_name].coef_.ravel(), features)
        print('*' * 70)
        print('Top {} POSITIVE features:'.format(n))
        print('*' * 70)
        print(coefs.nlargest(n))
        print('-' * 70)
        print('Top {} NEGATIVE features:'.format(n))
        print('*' * 70)
        print(coefs.nsmallest(n))
        print('-' * 70)
        print('*' * 70)
    
    def test_unseen_dataset(grid, test_df, X_col='ttext'):
        for name, clf in grid.items():
            test_df['pred_{}'.format(name)] = clf.predict(test_df[X_col])
        return test_df
    
    
    def main(path):    
        p =  Path('.')
        pkl_dir = p / 'pickle'
        print(pkl_dir)
        pkl_dir.mkdir(parents=True, exist_ok=True)
    
        # read data set into DF. Only the following columns: ['id','tdate','ttext','ttype']
        df = get_train_data(path, frac=0.05)
    
        test = get_test_data()
    
        # tune up hyperparameters for ALL classifiers
        print('Tuning up hyperparameters for ALL classifiers ...')
        print('NOTE: !!! this might take hours !!!')
        grid = fit_all_classifiers_grid(df['ttext'], df['ttype'],
                                        classifiers, cv=2,
                                        verbose=2, n_jobs=-1)
    
        # persist trained models
        fn = str(pkl_dir / 'ALL_grids.pkl')
        print('Saving tuned [grid] to [{}]'.format(fn))
        joblib.dump(grid, fn)
    
        # print best scores and best parameters for ALL classifiers
        print_grid_results(grid)
    
        pd.options.display.expand_frame_repr = False
        test = test_unseen_dataset(grid, test, 'ttext')
        test.to_excel('./test.xlsx', index=False)
        #print(test)
        print(test.iloc[:, 2:].eq(test['ttype'], axis=0).mean())
    
    if __name__ == "__main__":
        p =  Path(__file__).parent.resolve()
        main(str(p / 'pos_neg.csv.gz'))
    

    P.S. The answer uses the corpus prepared by Yulia Rubtsova.

