2019-02-26
How to analyze text with NLP in Python (2)

Step 2: Text cleaning or preprocessing

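  • Remove punctuation and digits: punctuation marks and digits contribute little to processing the text. If they are kept, they only enlarge the bag of words that is built in the final step and make the algorithm less efficient.
  • Stemming: reduce each word to its root (for example, "loved" becomes "love").
  • Convert every word to lower case: keeping the same word in different cases (for example 'good' and 'GOOD') is useless, because the two forms would be treated as different words.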
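To make the effect of removing punctuation and digits and of lowercasing concrete, here is a minimal sketch that uses only the standard library. The sample sentence is made up for illustration and does not come from the dataset; stemming and stopword removal are left out here because they need NLTK.

```python
import re

# Hypothetical sample sentence, not taken from the dataset
text = "Good food, GOOD service... 10/10, would visit again!"

# Naive tokenisation: split on whitespace only
raw_tokens = text.split()

# Cleaning: replace every non-letter with a space, lowercase, then split
letters_only = re.sub('[^a-zA-Z]', ' ', text)
clean_tokens = letters_only.lower().split()

print(raw_tokens)
print(clean_tokens)

# Number of distinct tokens before and after cleaning
print(len(set(raw_tokens)), len(set(clean_tokens)))  # prints: 8 6
```

Even on one short sentence the vocabulary shrinks, since 'Good' and 'GOOD' collapse into one token and '10/10,' disappears altogether.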

# regular expressions, used to strip non-letter characters
import re

# Natural Language Toolkit
import nltk
nltk.download('stopwords')

# English stopword list
from nltk.corpus import stopwords

# Porter stemmer, used to reduce words to their stems
from nltk.stem.porter import PorterStemmer

# build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# empty list that collects the cleaned reviews
corpus = []

# 1000 (reviews) rows to clean
# (dataset is assumed to be the table of reviews loaded in the previous step)
for i in range(0, 1000):
    # column "Review", i-th row: replace every non-letter with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # convert everything to lower case
    review = review.lower()
    # split into a list of words (default delimiter is whitespace)
    review = review.split()
    # drop stopwords and stem each remaining word
    review = [ps.stem(word) for word in review if word not in stop_words]
    # rejoin the words into a single string
    review = ' '.join(review)
    # append the cleaned string to the corpus
    corpus.append(review)
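The same per-review steps can also be packaged as a small function, which makes them easier to try on a single string. This is only a sketch: `clean_text`, `TINY_STOP_WORDS`, and the `stem` parameter are hypothetical names introduced here, and the tiny stopword set is a stand-in so that the example runs without NLTK or any download.

```python
import re

# Tiny stand-in stopword set (an assumption for this sketch);
# the article itself uses NLTK's English stopword list
TINY_STOP_WORDS = {"this", "the", "is", "a", "and", "was"}

def clean_text(text, stop_words=TINY_STOP_WORDS, stem=lambda word: word):
    """Keep letters only, lowercase, drop stopwords, stem, and rejoin."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(stem(word) for word in words if word not in stop_words)

# The default stem function leaves words unchanged
print(clean_text("Wow... This place was GREAT!"))  # prints: wow place great
```

With NLTK installed, passing `stop_words=set(stopwords.words('english'))` and `stem=PorterStemmer().stem` should reproduce the processing done inside the loop, and the corpus could then be built with a list comprehension over `dataset['Review']`.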
