How do you remove stop words before POS tagging?

詹惠儿

2019-07-03 Views: 619

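Text may contain stop words such as "the", "is", and "are". These can be filtered out of the text before it is processed. NLP research has no universal stop word list, but the nltk module ships with one.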

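You can add your own stop words. Go to your NLTK download directory, then corpora > stopwords, and edit the stop word file for the language you are using. Here we use English (stopwords.words('english')).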
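The list can also be extended in code instead of by editing the corpus file. The following is only a minimal sketch: `base_stop_words` is a small hand-written stand-in for set(stopwords.words('english')) so that the snippet runs without any NLTK data, and `custom_stop_words` is a hypothetical set of extra words.

```python
# Stand-in for set(stopwords.words('english')); the real NLTK list is much longer.
base_stop_words = {"the", "is", "are", "a", "and", "my"}

# Hypothetical extra words we also want to drop.
custom_stop_words = {"getting", "next"}

# Set union leaves the base list untouched and returns a combined set.
stop_words = base_stop_words | custom_stop_words

tokens = ["Sukanya", "is", "getting", "married", "next", "year"]

# Same case-sensitive membership test the article uses.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)
```

With the real NLTK list, the first assignment would be replaced by the stopwords.words('english') call; the union and the filter stay the same.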

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# The stop word list, the Punkt sentence tokenizer model, and the tagger model
# must be fetched once with nltk.download(); resource names differ between
# NLTK versions.
stop_words = set(stopwords.words('english'))

# Dummy text. Parentheses join the string literals; each sentence ends with a
# space so that the sentence tokenizer can find the boundaries.
txt = ("Sukanya, Rajib and Naba are my good friends. "
       "Sukanya is getting married next year. "
       "Marriage is a big step in one's life. "
       "It is both exciting and frightening. "
       "But friendship is a sacred bond between people. "
       "It is a special kind of love between us. "
       "Many of you must have tried searching for a friend "
       "but never found the right one.")

# sent_tokenize splits the text into sentences using a pre-trained
# PunktSentenceTokenizer from the nltk.tokenize.punkt module.
tokenized = sent_tokenize(txt)

for sentence in tokenized:
    # The word tokenizer splits a sentence into words and punctuation.
    words_list = word_tokenize(sentence)

    # Remove stop words from words_list. The test is case-sensitive.
    words_list = [w for w in words_list if w not in stop_words]

    # Tag the remaining tokens with the part-of-speech (POS) tagger.
    tagged = nltk.pos_tag(words_list)

    print(tagged)
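One thing to watch for in the filter above: the NLTK English stop words are all lowercase, so sentence-initial tokens such as "It" or "But" pass a case-sensitive membership test. The sketch below compares the lowercase form of each token and, as an optional design choice, also drops tokens made up only of punctuation. The stop word set is a small hand-written stand-in, the token list is typed by hand rather than produced by word_tokenize, and `filter_tokens` is a hypothetical helper name.

```python
import string

# Small stand-in for set(stopwords.words('english')).
stop_words = {"it", "is", "both", "and", "but", "a", "between"}

def filter_tokens(tokens, stop_words):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    kept = []
    for tok in tokens:
        # Compare the lowercase form so "It" matches "it".
        if tok.lower() in stop_words:
            continue
        # Skip tokens that consist solely of punctuation characters.
        if all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept

# Hand-written tokens for one sentence of the dummy text.
tokens = ["It", "is", "both", "exciting", "and", "frightening", "."]
result = filter_tokens(tokens, stop_words)
print(result)
```

The original casing of the kept tokens is preserved, which matters because the tagger may use capitalization as a cue for proper nouns.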
