第2步:文本清理或预处理
- 删除标点符号,数字:标点符号,数字对给定文本的处理过程没有多大帮助,如果包含它们,它们只会增加我们将作为最后一步创建的单词包的大小并降低算法的效率。
- 词干:扎根这个词

将每个单词转换为小写:例如,在不同的情况下使用相同的单词是无用的(例如'good'和'GOOD')。
# library to clean data
import re
# Natural Language Tool Kit
import nltk
nltk.download('stopwords')
# to remove stopword
from nltk.corpus import stopwords
# for Stemming propose
from nltk.stem.porter import PorterStemmer
# Initialize empty array
# to append clean text
corpus = []
# 1000 (reviews) rows to clean
for i in range(0, 1000):
# column : "Review", row ith
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
# convert all cases to lower cases
review = review.lower()
# split to array(default delimiter is " ")
review = review.split()
# creating PorterStemmer object to
# take main stem of each word
ps = PorterStemmer()
# loop for stemming each word
# in string array at ith row
review = [ps.stem(word) for word in review
if not word in set(stopwords.words('english'))]
# rejoin all string array elements
# to create back into a string
review = ' '.join(review)
# append each string to create
# array of clean text
corpus.append(review)








暂无数据