Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms
Published: 2019-06-28


I would like to thank  and  for the inspiration for this comparison. I have had a chance to use Boruta and FSelectorRcpp in action. GLMnet is here only to enrich the Venn diagram.

RTCGA data

Data used for this comparison come from RTCGA and contain gene expression (RNASeq) values from sequenced human genomes. The RNASeq datasets are available via a data package and were originally provided by . It’s a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might influence various aspects of human survival. Let’s use the data for breast cancer (Breast invasive carcinoma / BRCA) and try to find valuable genes that have an impact on the dependent variable, which denotes whether a sample of the collected readings came from tumor or normal, healthy tissue.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <- substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)

The dependent variable, bcr_patient_barcode, is the barcode from which we learn whether a sample of the collected readings came from tumor or normal, healthy tissue (the 14th character of the code).
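The barcode step can be tried on its own with a minimal sketch. The two barcodes below are hypothetical strings written in the TCGA layout, not rows from BRCA.rnaseq, and the mapping of "0" to tumor and "1" to normal is an assumption based on the TCGA sample type codes (01 to 09 for tumor, 10 to 19 for normal).

```r
# Hypothetical barcodes in the TCGA layout (not taken from BRCA.rnaseq)
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",
              "TCGA-BH-A0B3-11A-23R-A089-07")

# The 14th character is the first digit of the two-digit sample type code
sample_code <- substr(barcodes, 14, 14)

# Assumed mapping: "0" = tumor sample, "1" = normal tissue sample
tissue <- factor(sample_code, levels = c("0", "1"),
                 labels = c("tumor", "normal"))
table(tissue)
```

After the same substr() call on the real data, table(BRCA.rnaseq$bcr_patient_barcode) is a quick way to see how unbalanced the two classes are before fitting a classifier.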


GLMnet

Logistic regression, a member of the generalized linear model (GLM) family and a common first-attempt model for class prediction, can be extended with elastic-net regularization to provide prediction and variable selection at the same time. We can assume that features with no value will have coefficients equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the . Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores = 6)
library(glmnet)

# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit

# extract names of features that have
# a non-zero coefficient
names(which(
  coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features # first name is intercept

The function coef extracts the coefficients of the fitted model. The argument s specifies the regularization parameter for which we would like to extract them: lambda.min is the value for which the misclassification error is minimal. You may also try lambda.1se.
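To see how the choice of s changes the selected set, here is a small sketch on simulated data rather than the BRCA matrix. The data, the names cvfit_sim and selected(), and the feature names gene1 to gene50 are all made up for illustration; only cv.glmnet and coef are used the same way as in the code above.

```r
library(glmnet)

set.seed(1)
n <- 200
p <- 50
# Simulated expression-like matrix; feature names are hypothetical
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
# Only the first three simulated features drive the class label
y <- factor(rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3])))

cvfit_sim <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# Hypothetical helper: names of features with a non-zero coefficient at s
selected <- function(fit, s) {
  b <- coef(fit, s = s)[, 1]
  setdiff(names(b)[b != 0], "(Intercept)")
}

sel_min <- selected(cvfit_sim, "lambda.min")
sel_1se <- selected(cvfit_sim, "lambda.1se")

# lambda.1se is never smaller than lambda.min, so it penalizes at least as hard
c(lambda.min = cvfit_sim$lambda.min, lambda.1se = cvfit_sim$lambda.1se)
length(sel_min)
length(sel_1se)
```

Dropping the intercept by name instead of by position is a design choice of this sketch; it gives the same result as the [-1] indexing used above whenever the intercept is non-zero.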

plot(cvfit)


I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and remain an active field of research.


Reposted from: https://www.cnblogs.com/payton/p/5604104.html
