2019-04-15
How to Tokenize Text in Python

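To run the Python program below, the Natural Language Toolkit (NLTK) must be installed on your system. Tokenization essentially means splitting a body of text into sentences and words. Note that sent_tokenize and word_tokenize also depend on NLTK's Punkt tokenizer data, which may need to be downloaded once with nltk.download('punkt') (newer NLTK releases name this resource 'punkt_tab').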

# Import NLTK's sentence and word tokenizers
from nltk.tokenize import sent_tokenize, word_tokenize

# Adjacent string literals inside parentheses are concatenated,
# so no backslash line continuations are needed.
text = (
    "Natural language processing (NLP) is a field "
    "of computer science, artificial intelligence "
    "and computational linguistics concerned with "
    "the interactions between computers and human "
    "(natural) languages, and, in particular, "
    "concerned with programming computers to "
    "fruitfully process large natural language "
    "corpora. Challenges in natural language "
    "processing frequently involve natural "
    "language understanding, natural language "
    "generation (frequently from formal, machine"
    "-readable logical forms), connecting language "
    "and machine perception, managing human-"
    "computer dialog systems, or some combination "
    "thereof."
)

# Split the text into a list of sentences
print(sent_tokenize(text))

# Split the text into a list of word and punctuation tokens
print(word_tokenize(text))

Output

['Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.', 'Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.']

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', '(', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.']
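
For intuition about what sentence and word tokenization involve, here is a rough sketch that uses only Python's standard re module. It is not how NLTK works internally: the helper names simple_sent_tokenize and simple_word_tokenize are hypothetical, and the regular expressions are simplifying assumptions. They will, for example, wrongly split after abbreviations such as "Dr. Smith", a case NLTK's trained Punkt model is designed to handle.

```python
import re


def simple_sent_tokenize(text):
    """Naive sentence splitter (illustrative assumption, not NLTK's Punkt).

    Splits on whitespace that follows '.', '!' or '?' and precedes
    an uppercase ASCII letter.
    """
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def simple_word_tokenize(text):
    """Naive word splitter (illustrative assumption, not NLTK's tokenizer).

    Returns runs of word characters, allowing internal hyphens or
    apostrophes, and emits every other non-space character on its own.
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


sample = "NLP is fun. Machine-readable forms (mostly) help!"

print(simple_sent_tokenize(sample))
# ['NLP is fun.', 'Machine-readable forms (mostly) help!']

print(simple_word_tokenize(sample))
# ['NLP', 'is', 'fun', '.', 'Machine-readable', 'forms', '(', 'mostly', ')', 'help', '!']
```

As in the NLTK output above, punctuation marks become separate tokens while hyphenated words such as "Machine-readable" stay whole. For real text, the NLTK tokenizers are likely the better choice.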
