A large suite of exploratory and predictive analyses to better understand the features of long non-coding RNA (lncRNA).
Find the features of lncRNA that are associated with biological function and predict the set of lncRNA most likely to directly impact human physiology. Answering this question is important: it will provide insight into what makes parts of the genome biologically important, and it will characterize the background level of noise responsible for the formation of spurious transcripts.
pageRank white paper
After collecting a list of over 100 lncRNA with demonstrated function, we wanted to compare their RNA expression, or abundance in different cell types, to that of lncRNA without known function. One hypothesis was that functional lncRNA are tightly clustered in expression space, with the remainder generally dispersed. To measure this closeness between lncRNA, we decided to use the PageRank algorithm, using Euclidean distance between points in expression space instead of incoming page links. Ideally, if functional lncRNA are tightly grouped, they will have a much higher PageRank value.
Here's an example comparing the PageRank of closely grouped versus dispersed points in a distribution.

From the analysis, we realized that non-functional lncRNA are generally clustered around the origin of expression space, consistent with their low expression, and functional lncRNA are much more dispersed.
PageRank importance values for functional lncRNA

To implement the PageRank algorithm, I decided to extend my R code base with C++ after realizing that calculating eigenvectors for the dataset was too slow in base R. The Rcpp package allowed me to add a C++ function as an inline string (see srcCalcEigenCppD), wrap the function, and then use it as a regular R function.
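
As a rough illustration of the distance-based PageRank idea in this section, here is a minimal Python sketch (it is not the R/Rcpp code from this repo). It assumes a Gaussian kernel to turn Euclidean distances into link weights, and it uses power iteration to approximate the dominant eigenvector. The function name `distance_pagerank` and the `bandwidth` and `damping` parameters are hypothetical choices for this example.

```python
import numpy as np


def distance_pagerank(points, bandwidth=1.0, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank on a fully connected graph whose link weights decay with distance."""
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise Euclidean distances in expression space.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Assumption: a Gaussian kernel maps small distances to strong links.
    weights = np.exp(-((dist / bandwidth) ** 2))
    np.fill_diagonal(weights, 0.0)  # no self-links

    # Column-normalise so every point hands out exactly one unit of rank.
    col_sums = weights.sum(axis=0)
    dangling = col_sums == 0  # points with no usable links spread rank uniformly
    weights[:, dangling] = 1.0 / n
    col_sums[dangling] = 1.0
    transition = weights / col_sums

    # Power iteration towards the stationary rank vector.
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * (transition @ rank) + (1.0 - damping) / n
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy data: 10 tightly grouped points and 10 points spread on a wide ring.
angles = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
tight, spread = 0.1 * ring, 5.0 * ring

ranks = distance_pagerank(np.vstack([tight, spread]), bandwidth=5.0)
print("mean rank, tight group: ", ranks[:10].mean())
print("mean rank, spread group:", ranks[10:].mean())
```

In this toy setup the tightly grouped points receive the larger scores, because each of them has many close neighbours. The result depends on `bandwidth`: when it is much smaller than the typical spacing between points, each point passes nearly all of its rank to its nearest neighbours and the scores become much less informative about density.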
Implementing logistic regression with custom decision boundaries
With N-dimensional data, logistic regression can be used to fit a circular decision boundary. By increasing the polynomial degree of the objective function, the decision boundary can be made more complicated. To implement logistic regression, the first and second derivatives of the objective function are used to optimize a weight matrix, which is then applied to the test set. Although many packages exist for this purpose, implementing the logistic regression algorithm yourself gives greater control and is a great learning exercise.
These two plots should be the same, which means that the formula used in the implemented logistic regression is equivalent to applying a polynomial of degree 2 to the data and then passing the result into a regular logistic regression.
ellipse implemented into the algorithm
polynomial of degree=2 applied to data, then passed to log reg
gradient function using optimization to find the logistic regression parameters
(Note: the optimization doesn't use the second derivative, as it works better this way.)
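
The "polynomial of degree 2, then regular logistic regression" route described in this section can be sketched in Python as follows. This is a minimal example and not the repo's R implementation: the helper names (`quadratic_features`, `fit_logistic`, `predict_proba`), the learning rate, and the simulated circular data are all assumptions made for illustration. It uses plain gradient descent, so only the first derivative of the log-loss is needed.

```python
import numpy as np


def quadratic_features(X):
    """Expand each row into a bias term, linear terms, and all degree-2 terms."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    cols = [np.ones(n)]
    cols.extend(X[:, i] for i in range(d))
    cols.extend(X[:, i] * X[:, j] for i in range(d) for j in range(i, d))
    return np.column_stack(cols)


def sigmoid(z):
    # Clipping avoids overflow warnings for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))


def fit_logistic(F, y, lr=0.5, n_iter=5000):
    """Gradient descent on the mean log-loss (first derivative only)."""
    w = np.zeros(F.shape[1])
    for _ in range(n_iter):
        grad = F.T @ (sigmoid(F @ w) - y) / len(y)
        w -= lr * grad
    return w


def predict_proba(X, w):
    return sigmoid(quadratic_features(X) @ w)


# Simulated data (assumption): points inside the unit circle are positive.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(float)

w = fit_logistic(quadratic_features(X), y)
accuracy = ((predict_proba(X, w) > 0.5) == (y == 1)).mean()
print("training accuracy:", accuracy)
```

Because the squared and cross terms are ordinary columns in the feature matrix, the fitted boundary `w · features = 0` is a conic section (a circle or ellipse for data like this) in the original coordinates, even though the model is still linear in its weights.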
Extending the learning problem to account for unlabeled data
In most training sets, we have some set of features, each labeled as 'positive' or 'negative'. These two labels are then used to train an objective function. However, there may be examples that are not classified as either positive or negative. This is exactly the case with predicting functional lncRNA. We have published reports of positive examples, but no one publishes papers with negative results. (This should change, ask me about the problems with academic publishing...)
##### Solution: Elkan and Noto

In a brilliant paper, Elkan and Noto determined a way to adjust the probability output of any algorithm to account for unlabeled examples. The probability of an example being labeled is divided by Elkan's constant c, the probability that a positive example is labeled, to get the probability of the example being positive.
code to estimate c.
plots when applied to ridge regression
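
To make the Elkan and Noto adjustment concrete, here is a minimal Python sketch on simulated data (it is not the repo's R code, and it does not use real lncRNA data). Write `s = 1` for a labeled example and `y = 1` for a truly positive one. A classifier `g(x)` is trained to predict `p(s = 1 | x)`, the constant `c = p(s = 1 | y = 1)` is estimated as the average of `g(x)` over held-out labeled examples, and `p(y = 1 | x)` is then `g(x) / c`. The function names, the logistic-regression classifier, and the simulated labeling rate `true_c = 0.4` are assumptions for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(F, s, lr=0.1, n_iter=5000):
    """Gradient descent on the mean log-loss for labeled (1) vs. unlabeled (0)."""
    w = np.zeros(F.shape[1])
    for _ in range(n_iter):
        w -= lr * F.T @ (sigmoid(F @ w) - s) / len(s)
    return w


def estimate_c(g_holdout, s_holdout):
    """Average g(x) over held-out examples that carry a label."""
    return g_holdout[s_holdout == 1].mean()


def adjust(g, c):
    """Convert p(labeled | x) into p(positive | x), capped at 1."""
    return np.clip(g / c, 0.0, 1.0)


# Simulated data (assumption): two Gaussian classes; only some positives are labeled.
rng = np.random.default_rng(0)
n = 1000
X = np.vstack([rng.normal(2.0, 1.0, (n, 2)), rng.normal(-2.0, 1.0, (n, 2))])
y = np.r_[np.ones(n), np.zeros(n)]        # true classes, unknown in practice
true_c = 0.4
s = y * (rng.random(2 * n) < true_c)      # labeled only if positive AND selected

F = np.column_stack([np.ones(2 * n), X])  # bias column + features
holdout = rng.random(2 * n) < 0.25        # held-out set used to estimate c

w = fit_logistic(F[~holdout], s[~holdout])
g = sigmoid(F @ w)                        # estimated p(s = 1 | x)
c_hat = estimate_c(g[holdout], s[holdout])
p_positive = adjust(g, c_hat)             # estimated p(y = 1 | x)
print("estimated c:", c_hat, "(simulated with", true_c, ")")
```

The estimate relies on the assumption that labeled positives are a random sample of all positives. A plain logistic model can only approximate `p(s = 1 | x)`, since that probability levels off at `c` rather than 1, so the estimate of `c` should be read as approximate.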