RError.com
Accepted
MaxU - stop genocide of UA
Asked: 2020-03-03 18:15:51 +0000 UTC

Comparing the performance of classifiers for sentiment (tone) evaluation of comments/tweets


While answering this question, I wanted to compare the effectiveness of different classifiers for evaluating the sentiment (tone) of comments/tweets.

I'm not sure this question fully complies with the StackOverflow rules, but I believe it will be useful to anyone interested in natural-language text classification.

P.S. I have attempted to solve the problem in the form of an answer.

python
  • 1 Answer
  • 10 Views

1 Answer

  Best Answer
    MaxU - stop genocide of UA
    Answered: 2020-03-03T18:16:03Z

    In this answer I compare the performance (prediction accuracy) of the following classifiers:

    • SGDClassifier (Stochastic Gradient Descent)
    • MultinomialNB (Multinomial Naive Bayes)
    • KNeighborsClassifier (K-Nearest Neighbors vote)
    • SVC(kernel='linear') (Support Vector Classification; linear kernel)
    • SVC(kernel='rbf') (Support Vector Classification; Radial Basis Function kernel)
    • MLPClassifier (Multi-Layer Perceptron)

    I also tested:

    • LogisticRegression (a more advanced relative of SGDClassifier)
    • DecisionTreeClassifier
    • RandomForestClassifier

    DecisionTreeClassifier and RandomForestClassifier gave poor results (prediction accuracy) both on the corpus and on new (self-written) comments, so I decided to drop them.

    All models were trained on just 5% of the whole corpus, i.e. the classifiers never saw the remaining 95% of the corpus before prediction.
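    The 5% / 95% split described above can be sketched as follows. This is a minimal illustration with a dummy DataFrame; the `random_state` is an assumption added here for reproducibility and is not present in the original `get_train_data` below:

```python
import pandas as pd

# Dummy stand-in for the tweet corpus (ttext = text, ttype = sentiment label)
df = pd.DataFrame({'ttext': ['tweet {}'.format(i) for i in range(1000)],
                   'ttype': [1, -1] * 500})

train = df.sample(frac=0.05, random_state=0)   # 5% used for training
test = df.drop(train.index)                    # remaining 95% stays unseen

print(len(train), len(test))  # 50 950
```

    The indices of the two frames are disjoint, so no training row can leak into the evaluation set.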

    Prediction results on the whole (100%) corpus:

    In [11]: df.drop('ttext',1)
    Out[11]:
            ttype  pred_SGD  pred_MultinomialNB  pred_SVC_linear  pred_SVC_rbf  pred_MLP_NN
    30221       1         1                   1                1             1            1
    88858       1         1                   1                1             1            1
    220076     -1        -1                  -1               -1            -1           -1
    201195     -1        -1                  -1               -1            -1           -1
    78267       1         1                   1                1             1            1
    71817       1         1                   1                1             1            1
    207275     -1        -1                  -1               -1            -1           -1
    226007     -1        -1                  -1               -1            -1           -1
    140091     -1        -1                  -1               -1            -1           -1
    2433        1         1                   1                1             1            1
    ...       ...       ...                 ...              ...           ...          ...
    199205     -1        -1                  -1               -1            -1           -1
    178062     -1        -1                  -1               -1            -1           -1
    54428       1         1                   1                1             1            1
    176046     -1        -1                  -1               -1            -1           -1
    171906     -1        -1                  -1               -1            -1           -1
    53821       1         1                   1                1             1            1
    113037      1         1                   1                1             1            1
    87279       1         1                   1                1             1            1
    6561        1         1                   1                1             1            1
    30793       1         1                   1                1             1            1
    
    [226834 rows x 6 columns]
    

    Accuracy:

    r = (df.filter(regex='pred_')
           .rename(columns=lambda c: c.replace('pred_', ''))
           .eq(df['ttype'], axis=0).mean()
           .to_frame('Accuracy'))
    
    In [55]: r
    Out[55]:
                   Accuracy
    SGD            0.998554
    MultinomialNB  0.991165
    SVC_linear     0.998611
    SVC_rbf        0.958441
    MLP_NN         0.998492
    

    Plot:

    import matplotlib.pyplot as plt

    ax = r.plot.barh(alpha=0.55, title='Classifier Comparison', figsize=(12,8))
    plt.tight_layout()
    
    for rect in ax.patches:
        width = rect.get_width()
        ax.text(0.5, rect.get_bbox().get_points()[:, 1].mean(),
                '{:.2%}'.format(width), ha='center', va='center')
    

    (bar chart: classifier accuracy comparison)

    Checking the models against our own hand-written comments:

    test = get_test_data()
    
    In [85]: test
    Out[85]:
                                                                          ttext  ttype
    0             Погода сегодня полная фигня, но настроение все равно отличное      1
    1  Ну сходил я на этот фильм. Отзывы были нормальные, а оказалось - отстой!     -1
    2                                                       StackOverflow рулит      1
    3                                                           все очень плохо     -1
    4                                                          бывало и получше     -1
    5                                                           да вы задолбали     -1
    6                                                         ненавижу вас :)))      1
    7                                                              ненавижу вас     -1
    
    
    test = test_unseen_dataset(grid, test, 'ttext')
    
    In [112]: test.drop('ttext',1)
    Out[112]:
       ttype  pred_SGD  pred_MultinomialNB  pred_KNN  pred_SVC_linear  pred_SVC_rbf  pred_MLP_NN
    0      1         1                   1         1                1             1            1
    1     -1        -1                  -1        -1               -1            -1           -1
    2      1         1                   1        -1                1            -1            1
    3     -1         1                  -1        -1                1            -1           -1
    4     -1        -1                  -1        -1               -1            -1           -1
    5     -1         1                  -1        -1                1            -1           -1
    6      1         1                   1        -1                1             1            1
    7     -1        -1                  -1        -1               -1            -1           -1
    

    Computing the accuracy:

    r2 = (test.filter(regex='pred_')
              .rename(columns=lambda c: c.replace('pred_', ''))
              .eq(test['ttype'], axis=0).mean()
              .to_frame('Accuracy'))
    
    In [114]: r2
    Out[114]:
                   Accuracy
    SGD               0.750
    MultinomialNB     1.000
    KNN               0.750
    SVC_linear        0.750
    SVC_rbf           0.875
    MLP_NN            1.000
    

    (bar chart: classifier accuracy on the hand-written test set)

    Program code for training the models:

    # (с) https://ru.stackoverflow.com/users/211923/maxu?tab=profile
    
    # Corpus download: http://study.mokoron.com/
    # Corpus (c)
    # positive: https://www.dropbox.com/s/fnpq3z4bcnoktiv/positive.csv?dl=0
    # negative: https://www.dropbox.com/s/r6u59ljhhjdg6j0/negative.csv?dl=0
    # join them together: type positive.csv negative.csv > pos_neg.csv
    
    #cols = 'id tdate tmane ttext ttype trep tfav tstcount tfol tfrien listcount'.split()
    
    try:
        from pathlib import Path
    except ImportError:             # Python 2
        from pathlib2 import Path
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import SGDClassifier, LogisticRegression
    from sklearn.naive_bayes import MultinomialNB, GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.pipeline import Pipeline, make_pipeline
    try:
        import joblib
    except ImportError:         # older scikit-learn versions bundle joblib
        from sklearn.externals import joblib
    
    
    def get_train_data(path, frac=0.15, **kwargs):
        df = pd.read_csv(path, sep=';', header=None,
                         names=['id','ttext','ttype'],
                         usecols=[0,3,4], **kwargs)
        # Speed up: randomly select 15% of data
        # comment it out for better prediction performance
        return df.sample(frac=frac)
    
    
    def get_test_data(path=None, **kwargs):
        if path:
            return pd.read_csv(path, **kwargs)
        else:   # generate a dummy DF
            test = pd.DataFrame({
                    'ttext':['Погода сегодня полная фигня, но настроение все равно отличное',
                    'Ну сходил я на этот фильм. Отзывы были нормальные, а оказалось - отстой!',
                    'StackOverflow рулит', 'все очень плохо', 'бывало и получше', 'да вы задолбали',
                    'ненавижу вас :)))', 'ненавижу вас']
            })
            test['ttype'] = [1, -1, 1, -1, -1, -1, 1, -1]
            return test
    
    def fit_all_classifiers_grid(X, y, classifiers, **common_grid_kwargs):
        grids = {}
        for clf in classifiers:
            print('{:-^70}'.format(' [' + clf['name'] + '] '))
            pipe = Pipeline([
                        ("vect", CountVectorizer()),
                        (clf['name'], clf['clf'])])
            grids[clf['name']] = (GridSearchCV(pipe,
                                               param_grid=clf['parm_grid'],
                                               **common_grid_kwargs)
                                      .fit(X, y))
            # saving single trained model ...
            joblib.dump(grids[clf['name']], './pickle/{}.pkl'.format(clf['name']))
        return grids
    
    classifiers = [
        {   'name':     'SGD',
            'clf':      SGDClassifier(),
            'title':    "SGDClassifier",
            'parm_grid':  {
                            'vect__min_df':         [1, 2, 3],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'SGD__alpha':           [0.0001, 0.001, 0.01, 0.1],
                            'SGD__max_iter':        [200]
            } 
        },
        #{   'name':     'LogRegr',
        #    'clf':      LogisticRegression(),
        #    'title':    "LogisticRegression",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 2, 3],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'LogRegr__C':           [5, 10],
        #                    'LogRegr__max_iter':    [100, 200]
        #    } 
        #},
        {   'name':     'MultinomialNB',
            'clf':      MultinomialNB(),
            'title':    "MultinomialNB",
            'parm_grid':  {
                            'vect__min_df':         [1, 2, 5, 7],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'MultinomialNB__alpha': [0.0001, 0.001, 0.01, 0.1]
            } 
        },
        {   'name':     'KNN',
            'clf':      KNeighborsClassifier(),
            'title':    "K-Neighbors",
            'parm_grid':  {
                            'vect__min_df':         [1, 3, 5, 7],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'KNN__n_neighbors':     [3, 4, 5]
            } 
        },
        {   'name':     'SVC_linear',
            'clf':      SVC(),
            'title':    "SVC (linear)",
            'parm_grid':  {
                            'vect__min_df':         [1, 3, 5],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'SVC_linear__kernel':   ['linear'],
                            'SVC_linear__C':        [0.025, 0.1, 0.5],
            } 
        },
        {   'name':     'SVC_rbf',
            'clf':      SVC(),
            'title':    "SVC (rbf)",
            'parm_grid':  {
                            'vect__min_df':         [1, 3, 5],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'SVC_rbf__kernel':      ['rbf'],
                            'SVC_rbf__gamma':       ['auto'],
                            'SVC_rbf__C':           [0.5, 1, 2],
            } 
        },
        #{   'name':     'DecisionTree',
        #    'clf':      DecisionTreeClassifier(),
        #    'title':    "DecisionTree",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 3, 5],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'DecisionTree__max_depth':  [3, 5],
        #    } 
        #},
        #{   'name':     'RandomForest',
        #    'clf':      RandomForestClassifier(),
        #    'title':    "RandomForest",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 3, 5],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'RandomForest__max_depth':      [3, 5],
        #                    'RandomForest__n_estimators':   [10],
        #                    'RandomForest__max_features':   [1],
        #    } 
        #},
        {   'name':     'MLP_NN',                  # NOTE: very slow, might give poor accuracy on small data sets
            'clf':      MLPClassifier(),
            'title':    "MLP NN",
            'parm_grid':  {
                            'vect__min_df':         [3, 5, 7],
                            'vect__ngram_range':    [(2,5)],
                            'vect__analyzer':       ['char_wb'],
                            'MLP_NN__activation':   ['relu'],
                            'MLP_NN__alpha':        [0.0001, 0.001, 0.01, 0.1],
            } 
        },
        #{   'name':     'AdaBoost',                # NOTE: poor accuracy
        #    'clf':      AdaBoostClassifier(),
        #    'title':    "AdaBoost",
        #    'parm_grid':  {
        #                    'vect__min_df':         [1, 3, 5, 7],
        #                    'vect__ngram_range':    [(2,5)],
        #                    'vect__analyzer':       ['char_wb'],
        #                    'AdaBoost__n_estimators':   [25, 50, 75, 150],
        #    } 
        #},
    ]
    
    
    def print_grid_results(grids):
        for name, clf in grids.items():
            print('{:-^70}'.format(' [' + name + '] '))
            print('Score:\t\t{:.2%}'.format(clf.best_score_))
            print('Parameters:\t{}'.format(clf.best_params_))
            print('*' * 70)
    
    
    def print_best_features(grids, clf_name, n=20):
        clf = grids[clf_name]
        if not hasattr(clf.best_estimator_.named_steps[clf_name], 'coef_'):
            print('*' * 70)
            print('Attribute [coef_] not available for [{}]'.format(clf_name))
            print('*' * 70)
            return
        features = clf.best_estimator_.named_steps['vect'].get_feature_names()
        coefs = pd.Series(clf.best_estimator_.named_steps[clf_name].coef_.ravel(), features)
        print('*' * 70)
        print('Top {} POSITIVE features:'.format(n))
        print('*' * 70)
        print(coefs.nlargest(n))
        print('-' * 70)
        print('Top {} NEGATIVE features:'.format(n))
        print('*' * 70)
        print(coefs.nsmallest(n))
        print('-' * 70)
        print('*' * 70)
    
    def test_unseen_dataset(grid, test_df, X_col='ttext'):
        for name, clf in grid.items():
            test_df['pred_{}'.format(name)] = clf.predict(test_df[X_col])
        return test_df
    
    
    def main(path):    
        p =  Path('.')
        pkl_dir = p / 'pickle'
        print(pkl_dir)
        pkl_dir.mkdir(parents=True, exist_ok=True)
    
        # read data set into DF. Only the following columns: ['id','tdate','ttext','ttype']
        df = get_train_data(path, frac=0.05)
    
        test = get_test_data()
    
        # tune up hyperparameters for ALL classifiers
        print('Tuning up hyperparameters for ALL classifiers ...')
        print('NOTE: !!! this might take hours !!!')
        grid = fit_all_classifiers_grid(df['ttext'], df['ttype'],
                                        classifiers, cv=2,
                                        verbose=2, n_jobs=-1)
    
        # persist trained models
        fn = str(pkl_dir / 'ALL_grids.pkl')
        print('Saving tuned [grid] to [{}]'.format(fn))
        joblib.dump(grid, fn)
    
        # print best scores and best parameters for ALL classifiers
        print_grid_results(grid)
    
        pd.options.display.expand_frame_repr = False
        test = test_unseen_dataset(grid, test, 'ttext')
        test.to_excel('./test.xlsx', index=False)
        #print(test)
        print(test.iloc[:, 2:].eq(test['ttype'], axis=0).mean())
    
    if __name__ == "__main__":
        p =  Path(__file__).parent.resolve()
        main(str(p / 'pos_neg.csv.gz'))
    

    P.S. The answer uses the corpus prepared by Yulia Rubtsova.

