文字可能包含“the”,“is”,“are”等停用词。可以从要处理的文本中过滤停用词。在nlp研究中没有通用的停用词列表,但是nltk模块包含一个停用词列表。
您可以添加自己的停用词。转到您的NLTK下载目录路径- >语料库- >停用词- >更新停用词文件取决于您使用的语言。这里我们使用英语(stopwords.words('english'))。
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
// Dummy text
txt = "Sukanya, Rajib and Naba are my good friends. " \
"Sukanya is getting married next year. " \
"Marriage is a big step in one’s life." \
"It is both exciting and frightening. " \
"But friendship is a sacred bond between people." \
"It is a special kind of love between us. " \
"Many of you must have tried searching for a friend "\
"but never found the right one."
# sent_tokenize is one of instances of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)
for i in tokenized:
# Word tokenizers is used to find the words
# and punctuation in a string
wordsList = nltk.word_tokenize(i)
# removing stop words from wordList
wordsList = [w for w in wordsList if not w in stop_words]
# Using a Tagger. Which is part-of-speech
# tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)
print(tagged)
Output:
[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'),
('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]
基本上,POS标记器的目标是将语言(主要是语法)信息分配给子句子单元。这些单位称为令牌,并且大多数时候对应于单词和符号(例如标点符号)。
三个资料Q群下载不了也转发不了,先放这里Fine_tuning.zipLangChain.zipdata_clear.rar