2018-12-14
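How do you tokenize text in Python?

To run the Python program below, the Natural Language Toolkit (NLTK) must be installed on your system.
NLTK is a large toolkit intended to help you with the entire natural language processing (NLP) methodology.
To install NLTK, run the following commands in your terminal.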

  • sudo pip install nltk
  • Then type python in the terminal to start the Python shell
  • Type import nltk
  • nltk.download('all')

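Because of the large number of tokenizers, chunkers, other algorithms, and all the corpora to be downloaded, the installation above will take quite a long time.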
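The full download is not strictly required for the example in this article. As a lighter alternative, the sketch below requests only the Punkt sentence tokenizer data that sent_tokenize and word_tokenize rely on. The helper name download_tokenizer_data and the package list are assumptions made for illustration; newer NLTK releases look for punkt_tab rather than punkt, so both ids are requested.

```python
# Package ids on the NLTK data server (assumption: which one is needed
# depends on the installed NLTK version, so both are requested).
TOKENIZER_PACKAGES = ("punkt", "punkt_tab")


def download_tokenizer_data(packages=TOKENIZER_PACKAGES):
    """Download only the listed NLTK data packages and return {id: success}."""
    # Imported inside the function so this file loads even before NLTK is installed.
    import nltk

    # nltk.download returns True when the package was fetched or is already up to date.
    return {name: nltk.download(name, quiet=True) for name in packages}
```

Calling download_tokenizer_data() once from the Python shell should be enough; it needs network access the first time.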
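Some frequently used terms are:

  • Corpus - A body of text, singular. Corpora is the plural.
  • Lexicon - Words and their meanings.
  • Token - Each "entity" that is a part of whatever was split up based on rules. For example, when a sentence is "tokenized" into words, each word is a token. If you tokenize a paragraph into sentences, each sentence can also be a token.

So, basically, tokenizing involves splitting sentences and words out of a body of text.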
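To make "split up based on rules" concrete, here is a deliberately naive sketch that uses only Python's re module. It is not how NLTK works internally; the function names simple_sentences and simple_words and both regular expressions are assumptions chosen for illustration.

```python
import re


def simple_sentences(text):
    """Rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def simple_words(text):
    """Rule: a token is a (possibly hyphenated) word or one punctuation mark."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


print(simple_sentences("Hello world. How are you?"))
# ['Hello world.', 'How are you?']
print(simple_words("Hello, world!"))
# ['Hello', ',', 'world', '!']
```

Rules this simple break quickly: simple_sentences("Dr. Smith arrived.") splits after "Dr.", which is one reason to prefer a trained tokenizer such as the one NLTK provides.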

# Import NLTK's sentence and word tokenizers.
from nltk.tokenize import sent_tokenize, word_tokenize

# Adjacent string literals inside parentheses are joined automatically,
# so no "+" or line-continuation backslashes are needed.
text = (
    "Natural language processing (NLP) is a field "
    "of computer science, artificial intelligence "
    "and computational linguistics concerned with "
    "the interactions between computers and human "
    "(natural) languages, and, in particular, "
    "concerned with programming computers to "
    "fruitfully process large natural language "
    "corpora. Challenges in natural language "
    "processing frequently involve natural "
    "language understanding, natural language "
    "generation (frequently from formal, machine"
    "-readable logical forms), connecting language "
    "and machine perception, managing human-"
    "computer dialog systems, or some combination "
    "thereof."
)

# Split the text into sentences.
print(sent_tokenize(text))

# Split the text into word and punctuation tokens.
print(word_tokenize(text))

OUTPUT

['Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.', 'Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, managing human-computer dialog systems, or some combination thereof.']

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'computer', 'science', ',', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'and', ',', 'in', 'particular', ',', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', ',', 'natural', 'language', 'generation', '(', 'frequently', 'from', 'formal', ',', 'machine-readable', 'logical', 'forms', ')', ',', 'connecting', 'language', 'and', 'machine', 'perception', ',', 'managing', 'human-computer', 'dialog', 'systems', ',', 'or', 'some', 'combination', 'thereof', '.']

So we have created tokens: first sentences, then words.
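The two steps can also be combined so that the words stay grouped by sentence. The sketch below is one possible way to do that; tokenize_by_sentence is a hypothetical helper, and it takes the two tokenizers as arguments so it can be tried with stand-in functions before the NLTK data is installed.

```python
from typing import Callable, List


def tokenize_by_sentence(
    text: str,
    split_sentences: Callable[[str], List[str]],
    split_words: Callable[[str], List[str]],
) -> List[List[str]]:
    """Return one list of word tokens per sentence."""
    return [split_words(sentence) for sentence in split_sentences(text)]


# Stand-in tokenizers (assumptions for the demo only, not NLTK's behaviour):
# sentences end at ".", words are separated by whitespace.
demo = tokenize_by_sentence(
    "One two. Three four.",
    split_sentences=lambda t: [s.strip() + "." for s in t.split(".") if s.strip()],
    split_words=str.split,
)
print(demo)
# [['One', 'two.'], ['Three', 'four.']]
```

With NLTK installed, the same helper can be called as tokenize_by_sentence(text, sent_tokenize, word_tokenize), which also separates the punctuation marks into their own tokens.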
