How to Analyze Text Data with Python

詹惠儿

2019-02-22


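Patterns in written text are not the same across all authors or languages. This lets linguists study the language of origin or the likely authorship of texts where these characteristics are not directly known, such as the Federalist Papers from the era of the American Revolution.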

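Goal: in this case study, we will examine the properties of individual books in a collection of books by different authors and in various languages. More specifically, we will look at book length, the number of unique words, and how these properties vary by language or author.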
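The two properties named above, book length and number of unique words, both follow from a table of word counts like the one built in the rest of this article. The following is a minimal sketch on a toy string, not the author's code: the helper name `word_stats` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def word_stats(word_counts):  # hypothetical helper name, not from the original
    """Return (number of unique words, total number of words)."""
    num_unique = len(word_counts)        # one key per distinct word
    total = sum(word_counts.values())    # length of the text in words
    return num_unique, total

# Toy input, assumed for illustration only
toy_counts = Counter("the cat saw the other cat".split())
num_unique, total = word_stats(toy_counts)
```

Here `num_unique` is 4 and `total` is 6. Comparing these two numbers across books is one simple way to compare authors or languages.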

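We will build a function that counts word frequencies in a text. We start with a sample test text and will later replace it with the text files of the books we just downloaded. Because we are counting word frequencies, uppercase and lowercase letters should be treated as the same, so we convert the whole text to lowercase and save it.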

from collections import Counter

def count_words(text):
    """Count word frequencies with a plain dictionary."""
    text = text.lower()  # treat uppercase and lowercase as the same word
    skips = [".", ",", ":", ";", "'", '"']
    for ch in skips:
        text = text.replace(ch, "")  # strip punctuation
    word_counts = {}
    for word in text.split(" "):
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

# >>> count_words(text)  # you can check the function

def count_words_fast(text):
    """Count word frequencies using Counter from collections."""
    text = text.lower()
    skips = [".", ",", ":", ";", "'", '"']
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

# >>> count_words_fast(text)  # you can check the function

Output:

{'were': 1, 'is': 1, 'manageable': 1, 'to': 1, 'things': 1, 'keeping': 1, 'my': 1, 'test': 1, 'text': 2, 'keep': 1, 'short': 1, 'this': 2}
Counter({'text': 2, 'this': 2, 'were': 1, 'is': 1, 'manageable': 1, 'to': 1, 'things': 1, 'keeping': 1, 'my': 1, 'test': 1, 'keep': 1, 'short': 1})
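Splitting on a single space works for the one-line sample, but real book files contain newlines and repeated spaces, which would produce fused or empty tokens. The following is a minimal sketch of a variant that splits on any whitespace instead. The name `count_words_robust` is hypothetical, and the sample text is an assumption reconstructed from the output above, since the original does not show it.

```python
from collections import Counter

# Assumed sample text, reconstructed from the output above (not given in the original)
text = "This is my test text. We're keeping this text short to keep things manageable."

def count_words_robust(text):  # hypothetical name, not from the original
    """Count word frequencies, tolerating newlines and repeated spaces."""
    text = text.lower()
    for ch in [".", ",", ":", ";", "'", '"']:
        text = text.replace(ch, "")
    # split() with no argument splits on any run of whitespace and drops empty strings
    return Counter(text.split())

counts = count_words_robust(text)

# The same sentence with a double space and line breaks gives identical counts
multiline = count_words_robust(
    "This  is my test text.\nWe're keeping this text short\nto keep things manageable."
)
```

Under these assumptions `counts` has the same 12 distinct words as the output above, with 'text' and 'this' each counted twice, and `multiline == counts` holds.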
