Think Trump writes all of his own tweets? The data says otherwise! - CDA Data Analyst official website



2017-02-17

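So is that really the case?

An American Twitter user noticed that Trump tweets from two clients: one is Android, the other is an iPhone.

This observant user also noticed that the more strident tweets all come from Android, while the more ordinary-sounding ones come from the iPhone.

The finding also caught the attention of data analyst David Robinson. David noticed that when Trump posts congratulations, the tweet comes from the iPhone, and when he attacks a campaign rival, it comes from Android. The two clients also tend to tweet at different times of day.

In the spirit of scientific rigor, David decided to let the data speak. He wrote a program to fetch and analyze Trump's tweets and found some patterns. After working through the statistics and charts, he was fairly confident that Trump's tweets are not all written by one person.

The data suggest that the Android and iPhone tweets are written by two different people. They differ in when they are posted and in how they use hashtags, links, and retweets. The Android tweets are also angrier and more negative.

If, as Trump has said in interviews, the phone he uses is a Samsung Galaxy, we can be fairly confident that the Android tweets come from Trump himself and the iPhone tweets probably come from his staff.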

丨Comparing tweet times
First, load the data on when Trump tweeted, using the userTimeline function from the twitteR package:

library(dplyr)
library(purrr)
library(twitteR)

# You'd need to set global options with an authenticated app
setup_twitter_oauth(getOption("twitter_consumer_key"),
                    getOption("twitter_consumer_secret"),
                    getOption("twitter_access_token"),
                    getOption("twitter_access_token_secret"))

# We can request only 3200 tweets at a time; it will return fewer
# depending on the API
trump_tweets <- userTimeline("realDonaldTrump", n = 3200)
trump_tweets_df <- tbl_df(map_df(trump_tweets, as.data.frame))

# if you want to follow along without setting up Twitter authentication,
# just use my dataset:
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))

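Next, clean the data a little and extract the source application. (We only analyze tweets sent from the iPhone and Android apps, and drop the small number sent from the web client or an iPad.)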

library(tidyr)

tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))

The data analyzed include 628 tweets from the iPhone and 762 tweets from Android.
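To see what the `extract()` step does, here is a minimal base-R sketch that applies the same regular expression to three made-up `statusSource` strings. The strings and the variable names (`status_source`, `client`) are illustrative assumptions, not rows from the real dataset.

```r
# Made-up statusSource values in the HTML shape the regex expects
status_source <- c(
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
)

# Capture whatever follows "Twitter for " up to the next "<"
pattern <- "Twitter for (.*?)<"
matches <- regmatches(status_source, regexec(pattern, status_source, perl = TRUE))

# Strings that do not match become NA, which is also what tidyr::extract() returns
client <- vapply(
  matches,
  function(m) if (length(m) == 2) m[2] else NA_character_,
  character(1)
)

# Keep only the two clients the article compares
keep <- client %in% c("iPhone", "Android")
print(client)
print(keep)
```

The web-client string has no "Twitter for" prefix, so it ends up as `NA` and is filtered out, which is how the web and iPad tweets disappear from the analysis.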

The first thing to look at is what time of day the tweets are posted, and here a difference shows up:

library(lubridate)
library(scales)
library(ggplot2)

tweets %>%
  count(source, hour = hour(with_tz(created, "EST"))) %>%
  # group by source so each percentage is a share of that client's tweets
  group_by(source) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(hour, percent, color = source)) +
  geom_line() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Hour of day (EST)",
       y = "% of tweets",
       color = "")

Trump tends to tweet in the morning, while his staff tweet mostly in the afternoon or evening.
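As a rough illustration of the hour-of-day step, the sketch below converts a few made-up UTC timestamps to EST and computes each client's share of tweets per hour. It uses base R instead of lubridate, assumes the `created` column is stored in UTC, and every value in the toy data frame is invented.

```r
# Toy data: invented timestamps, assumed to be in UTC
tweets <- data.frame(
  client = c("Android", "Android", "Android", "iPhone", "iPhone"),
  created = as.POSIXct(
    c("2016-08-01 12:30:00", "2016-08-01 12:45:00", "2016-08-02 13:05:00",
      "2016-08-01 23:15:00", "2016-08-02 21:40:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Base-R counterpart of hour(with_tz(created, "EST")):
# format the timestamp in the EST zone and keep only the hour
tweets$hour <- as.integer(format(tweets$created, "%H", tz = "EST"))

# Share of each client's tweets that falls in each hour (rows sum to 1)
share <- prop.table(table(tweets$client, tweets$hour), margin = 1)
print(share)
```

Note that "EST" is a fixed UTC-5 zone, so the hours are not adjusted for daylight saving time.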

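丨Comparing tweeting habits
When Trump's Android phone "retweets" someone, it usually copies the whole tweet and wraps it in double quotation marks.

iPhone tweets generally do not use quotation marks this way.

Android: more than 500 tweets without quotation marks, more than 200 with them

iPhone: almost none with quotation marks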
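The quote check can be sketched on a handful of invented tweets: a tweet counts as "quoted" when its text starts with a double quotation mark. The data frame below is made up for illustration, and base R's `grepl()` stands in for stringr's `str_detect()`.

```r
# Invented example tweets; not taken from the real dataset
tweets <- data.frame(
  client = c("Android", "Android", "Android", "iPhone", "iPhone"),
  text = c(
    '"@someuser: great speech tonight!" Thank you.',
    "The media is so dishonest.",
    '"@anotheruser: so true"',
    "Join us tonight! https://t.co/abc123",
    "Thank you Ohio! #MAGA"
  ),
  stringsAsFactors = FALSE
)

# A tweet is treated as a manual "retweet" if it begins with a double quote
tweets$quoted <- ifelse(grepl('^"', tweets$text), "Quoted", "Not quoted")

# Cross-tabulate client against quoted / not quoted
quoted_counts <- table(tweets$client, tweets$quoted)
print(quoted_counts)
```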

The Android and iPhone tweets also differ sharply in how often they share links and pictures.

library(stringr)

tweet_picture_counts <- tweets %>%
  filter(!str_detect(text, '^"')) %>%   # ignore quoted "retweets"
  count(source,
        picture = ifelse(str_detect(text, "t.co"),
                         "Picture/link", "No picture/link"))

ggplot(tweet_picture_counts, aes(source, n, fill = picture)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "", y = "Number of tweets", fill = "")

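The data show that many of the iPhone tweets come with a picture or a link, and their content is mostly promotional.

For example, the one below:

Trump's Android tweets, by contrast, mostly carry no pictures or links and are plain text, for example:

丨Comparing word usage
To compare the words used on Android and iPhone, David used the tidytext package, which he developed with Julia Silge.

Use the unnest_tokens function to split each tweet into individual words, then drop common stop words: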

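library(tidytext)
library(stringr)

# Split on anything that is not a letter, digit, #, @ or a word-internal apostrophe
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>%
  # strip t.co links and the HTML entity &amp; before tokenizing
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

tweet_words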

## # A tibble: 8,753 x 4
##    id                 source created             word
##    <chr>              <chr>  <time>              <chr>
##  1 676494179216805888 iPhone 2015-12-14 20:09:15 record
##  2 676494179216805888 iPhone 2015-12-14 20:09:15 health
##  3 676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
##  4 676494179216805888 iPhone 2015-12-14 20:09:15 #trump2016
##  5 676509769562251264 iPhone 2015-12-14 21:11:12 accolade
##  6 676509769562251264 iPhone 2015-12-14 21:11:12 @trumpgolf
##  7 676509769562251264 iPhone 2015-12-14 21:11:12 highly
##  8 676509769562251264 iPhone 2015-12-14 21:11:12 respected
##  9 676509769562251264 iPhone 2015-12-14 21:11:12 golf
## 10 676509769562251264 iPhone 2015-12-14 21:11:12 odyssey
## # ... with 8,743 more rows
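To make the tokenizing pattern less opaque, here is a small base-R sketch that applies the same regular expression to one invented tweet. It uses `strsplit()` in place of `unnest_tokens()`, and the two-word `stop_list` is a hypothetical stand-in for tidytext's much longer `stop_words` table.

```r
# Same delimiter pattern as in the article: split on anything that is not a
# letter, digit, #, @ or an apostrophe followed by one of those characters
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Invented tweet for illustration
tweet <- "Thank you Ohio! It's great. #MakeAmericaGreatAgain @TrumpGolf https://t.co/abc123"

# Drop t.co links, then lowercase (unnest_tokens lowercases by default)
cleaned <- tolower(gsub("https://t.co/[A-Za-z\\d]+", "", tweet, perl = TRUE))

# perl = TRUE is required because the pattern uses a lookahead
pieces <- strsplit(cleaned, reg, perl = TRUE)[[1]]
pieces <- pieces[nzchar(pieces)]   # discard empty strings between delimiters

# Hypothetical mini stop-word list, then keep tokens containing a letter
stop_list <- c("you", "it's")
words <- pieces[!pieces %in% stop_list & grepl("[a-z]", pieces)]
print(words)
```

Hashtags and @-mentions survive as single tokens because `#` and `@` are excluded from the delimiter class, and "it's" stays whole because its apostrophe is followed by a letter.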

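Overall, which words are most common in Trump's tweets?

With that as a baseline, let's look at how the common words differ between Android and iPhone.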

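android_iphone_ratios <- tweet_words %>%
  count(word, source) %>%
  group_by(word) %>%            # keep words used at least 5 times in total
  filter(sum(n) >= 5) %>%
  ungroup() %>%
  spread(source, n, fill = 0) %>%
  # add-one smoothing, then turn each column into a share of that client's words
  mutate(Android = (Android + 1) / sum(Android + 1),
         iPhone = (iPhone + 1) / sum(iPhone + 1)) %>%
  mutate(logratio = log2(Android / iPhone)) %>%
  arrange(desc(logratio))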
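The log ratio is easier to follow on a tiny example. The sketch below works through the same arithmetic (add one to every count, convert to within-client shares, take the base-2 log of Android over iPhone) on three words with made-up counts; none of these numbers come from the real data.

```r
# Made-up word counts per client, for illustration only
counts <- data.frame(
  word = c("badly", "join", "crooked"),
  Android = c(12, 1, 20),
  iPhone = c(0, 15, 4),
  stringsAsFactors = FALSE
)

# Add-one smoothing avoids dividing by zero for words one client never uses
android_share <- (counts$Android + 1) / sum(counts$Android + 1)
iphone_share <- (counts$iPhone + 1) / sum(counts$iPhone + 1)

# Positive values lean Android, negative values lean iPhone;
# a value of 1 means twice as common on Android
counts$logratio <- log2(android_share / iphone_share)

# Most Android-leaning words first
counts <- counts[order(-counts$logratio), ]
print(counts)
```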

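丨Sentiment analysis
The Android and iPhone tweets also differ a great deal in sentiment, so let's quantify it. We use the NRC Word-Emotion Association lexicon, available through tidytext, which associates words with ten sentiments: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.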

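# Written for an older tidytext release; in newer releases the NRC lexicon
# is loaded with get_sentiments("nrc") instead
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  dplyr::select(word, sentiment)

nrc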

## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 abacus      trust
##  2 abandon     fear
##  3 abandon     negative
##  4 abandon     sadness
##  5 abandoned   anger
##  6 abandoned   fear
##  7 abandoned   negative
##  8 abandoned   sadness
##  9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows

To measure the sentiment of the Android and iPhone tweets separately, we count the words in each sentiment category for each source.

sources <- tweet_words %>%
  group_by(source) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, source, total_words)

by_source_sentiment <- tweet_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(sources, by = "id") %>%
  group_by(source, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  ungroup()

head(by_source_sentiment)

## # A tibble: 6 x 4
##   source  sentiment    total_words words
##   <chr>   <chr>        <int>       <dbl>
## 1 Android anger        4901        321
## 2 Android anticipation 4901        256
## 3 Android disgust      4901        207
## 4 Android fear         4901        268
## 5 Android joy          4901        199
## 6 Android negative     4901        560

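(For example, we can see that 321 of the 4,901 words in the Android tweets are associated with the sentiment "anger".)

A Poisson test can then be used to check, for each sentiment, whether the Android tweets are more likely than the iPhone tweets to use emotionally charged words.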

library(broom)

sentiment_differences <- by_source_sentiment %>%
  group_by(sentiment) %>%
  do(tidy(poisson.test(.$words, .$total_words)))

sentiment_differences

## Source: local data frame [10 x 9]
## Groups: sentiment [10]
##
##    sentiment    estimate statistic p.value      parameter conf.low
##    (chr)        (dbl)    (dbl)     (dbl)        (dbl)     (dbl)
##  1 anger        1.492863 321       2.193242e-05 274.3619  1.2353162
##  2 anticipation 1.169804 256       1.191668e-01 239.6467  0.9604950
##  3 disgust      1.677259 207       1.777434e-05 170.2164  1.3116238
##  4 fear         1.560280 268       1.886129e-05 225.6487  1.2640494
##  5 joy          1.002605 199       1.000000e+00 198.7724  0.8089357
##  6 negative     1.692841 560       7.094486e-13 459.1363  1.4586926
##  7 positive     1.058760 555       3.820571e-01 541.4449  0.9303732
##  8 sadness      1.620044 303       1.150493e-06 251.9650  1.3260252
##  9 surprise     1.167925 159       2.174483e-01 148.9393  0.9083517
## 10 trust        1.128482 369       1.471929e-01 350.5114  0.9597478
## Variables not shown: conf.high (dbl), method (fctr), alternative (fctr)
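For readers unfamiliar with `poisson.test()`, the sketch below runs a single two-sample comparison by hand. The Android figures (321 "anger" words out of 4,901) are the ones quoted earlier in this article; the iPhone figures are made up purely for illustration, so the result will not match the table.

```r
# Android: 321 "anger" words out of 4,901 total words (from the article).
# iPhone: 150 out of 4,000 is an invented pair of numbers, not real data.
anger_words <- c(321, 150)
total_words <- c(4901, 4000)

# Two-sample comparison of Poisson rates: Android rate vs. iPhone rate
res <- poisson.test(anger_words, total_words)

rate_ratio <- unname(res$estimate)  # Android rate divided by iPhone rate
ci <- as.numeric(res$conf.int)      # 95% confidence interval for that ratio
p_value <- res$p.value

print(rate_ratio)
print(ci)
print(p_value)
```

In the `tidy()` output, `estimate` is this rate ratio, `statistic` is the first group's count (Android, since it sorts first), and `parameter` is the count expected for that group if both clients used such words at the same rate. A confidence interval that lies entirely above 1 indicates the Android tweets use that sentiment more often.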

We can use the 95% confidence intervals to make the difference between the two clear: