1.准备数据
[plain] view plain copy
> install.packages("tree")
> library(tree)
> library(ISLR)
> attach(Carseats)
> High=ifelse(Sales<=8,"No","Yes") //set high values by sales data to calssify
> Carseats=data.frame(Carseats,High) //include the high data into the data source
> fix(Carseats)
2.生成决策树
[plain] view plain copy
> tree.carseats=tree(High~.-Sales,Carseats)
> summary(tree.carseats)
[plain] view plain copy
//output training error is 9%
Classification tree:
tree(formula = High ~ . - Sales, data = Carseats)
Variables actually used in tree construction:
[1] "ShelveLoc" "Price" "Income" "CompPrice" "Population"
[6] "Advertising" "Age" "US"
Number of terminal nodes: 27
Residual mean deviance: 0.4575 = 170.7 / 373
Misclassification error rate: 0.09 = 36 / 400
3. 显示决策树
[plain] view plain copy
> plot(tree . carseats )
> text(tree .carseats ,pretty =0)
4.Test Error
[plain] view plain copy
//prepare train data and test data
//We begin by using the sample() function to split the set of observations sample() into two halves, by selecting a random subset of 200 observations out of the original 400 observations.
> set . seed (1)
> train=sample(1:nrow(Carseats),200)
> Carseats.test=Carseats[-train,]
> High.test=High[-train]
//get the tree model with train data
> tree. carseats =tree (High~.-Sales , Carseats , subset =train )
//get the test error with tree model, train data and predict method
//predict is a generic function for predictions from the results of various model fitting functions.
> tree.pred = predict ( tree.carseats , Carseats .test ,type =" class ")
> table ( tree.pred ,High. test)
High. test
tree. pred No Yes
No 86 27
Yes 30 57
> (86+57) /200
[1] 0.715
5.决策树剪枝
[plain] view plain copy
/**
Next, we consider whether pruning the tree might lead to improved results. The function cv.tree() performs cross-validation in order to cv.tree() determine the optimal level of tree complexity; cost complexity pruning is used in order to select a sequence of trees for consideration.
For regression trees, only the default, deviance, is accepted. For classification trees, the default is deviance and the alternative is misclass (number of misclassifications or total loss).
We use the argument FUN=prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance.
If the tree is regression tree,
> plot(cv. boston$size ,cv. boston$dev ,type=’b ’)
*/
> set . seed (3)
> cv. carseats =cv. tree(tree .carseats ,FUN = prune . misclass ,K=10)
//The cv.tree() function reports the number of terminal nodes of each tree considered (size) as well as the corresponding error rate(dev) and the value of the cost-complexity parameter used (k, which corresponds to α.
> names (cv. carseats )
[1] " size" "dev " "k" " method "
> cv. carseats
$size //the number of terminal nodes of each tree considered
[1] 19 17 14 13 9 7 3 2 1
$dev //the corresponding error rate
[1] 55 55 53 52 50 56 69 65 80
$k // the value of the cost-complexity parameter used
[1] -Inf 0.0000000 0.6666667 1.0000000 1.7500000
2.0000000 4.2500000
[8] 5.0000000 23.0000000
$method //miscalss for classification tree
[1] " misclass "
attr (," class ")
[1] " prune " "tree. sequence "
[plain] view plain copy
//plot the error rate with tree node size to see whcih node size is best
> plot(cv. carseats$size ,cv. carseats$dev ,type=’b ’)
/**
Note that, despite the name, dev corresponds to the cross-validation error rate in this instance. The tree with 9 terminal nodes results in the lowest cross-validation error rate, with 50 cross-validation errors. We plot the error rate as a function of both size and k.
*/
> prune . carseats = prune . misclass ( tree. carseats , best =9)
> plot( prune . carseats )
> text( prune .carseats , pretty =0)
//get test error again to see whether the this pruned tree perform on the test data set
> tree.pred = predict ( prune . carseats , Carseats .test , type =" class ")
> table ( tree.pred ,High. test)
High. test
tree. pred No Yes
No 94 24
Yes 22 60
> (94+60) /200
[1] 0.77
数据分析咨询请扫描二维码
CDA数据分析师在中国航信高科技产业园进行了面向测试度量的数据分析培训课程,培训人数近2 ...
2024-05-01CDA数据分析师走进深圳迈瑞生物医疗电子股份有限公司,在迈瑞总部展开了为期两天的培训,本次课程参训人员线上及线下近百人, ...
2024-05-01CDA数据分析师在合肥市对合肥阳光新能源科技有限公司开展了为期8天的企业内训。 合肥阳光新能源科技 ...
2024-05-01CDA数据分析师走进海尔大学,进行了《数据治理与数据中台建设的道与术》专题培训,培训现场爆满,近百人参加了此次培训。 ...
2024-05-01在中国银行苏州分行培训中心开始数据分析师培训,此次培训课程共10天内容,包括Excel、MySQL、概率论与数理统计、SPSS等内容, ...
2024-05-01从实际的业务需求出发,结合行业的典型应用特点,围绕实际的商业问题,探讨数据挖掘、机器学习模型在金融领域的应用,包括获客、信用评分、细分画像、交叉销售、反欺诈、违规识别、时序预测、运筹优化、流程挖掘九个方面,形成 ...
2024-05-01本次培训课程为线上+线下的模式,由于学员编程能力不一、部分学员没有编程基础,故提供统计学、python基 ...
2024-05-01华夏银行信用卡中心-机器学习培训 1、课程亮点 取材于业界一流企业和顶级咨询公司的行业实践;已经被证明是人人 ...
2024-05-01主 题:数据中台建设及数据分析应用主题分享 1. 数据中台市场洞察 2. 主流数据中台产品比较 3. 某企业数据中 ...
2024-05-01围绕“数据驱动”战略,全力打造我行 300 人数字化人才梯队,着力培养数字化管理人才、大数据专业团队 ...
2024-05-01在当今数据驱动的商业环境中,数据分析成为了企业决策的重要依据。通过对大量数据的收集、处理和分析,企业能够更好地理解市场 ...
2024-04-29在人工智能(AI)的世界里,提示词(Prompt)是一种强大的工具,它能够引导AI按照用户的需求产生特定的输出。本文将深入探讨AI ...
2024-04-29CDA立足未来职场,拓展前沿视野——对外经贸大学保险学院举办“三全育人大讲堂”分享行业最新动态。 ...
2024-04-294月2日,CDA数据分析师创始发起人兼协会理事长赵坚毅博士受邀在浙江万里学院举办了一场以“数字化能力在职场中的作用” ...
2024-04-29随机森林(Random Forests)现在机器学习中比较火的一个算法,是一种基于Bagging的集成学习方法,能够很好地处理分类和回归的问 ...
2022-12-23方差分析是数据分析中常用的一种统计分析方法,接下来让我们简单了解一下方差分析的基本思想和原理吧。 方差分析(Analysis ...
2022-12-23来源:关于数据分析与可视化 关于streamlit-aggrid 数据排序 表格样式的调整 数据 ...
2022-08-03作者:麦叔 定义 「把上面晦涩的概念汇成一句话就是:」 ❝ 回调函数就是一个被作为参 ...
2022-08-03现今,高学历人群日益增多,物以稀为贵的高学历光环淡去。无论本科生还是研究生,甚至博士生,求职竞争力都大不如前,就业压力越来越大。
2022-06-01某家企业10个人面试,有9个本科生……如何脱颖而出,除得体的举止和良好的沟通力外,证书成重要筹码,这也是很多人考证的关键所在。
2022-04-14