May 18 2014

Data Mining

Making sense of data[^1]

3 Data Preparation

Preparing the data is one of the most time-consuming parts of any data analysis/data mining project.

3.1 DATA SOURCES

Surveys or polls
Experiments
Observational and other studies
Operational databases (CRM, etc.)
Data warehouses
Historical databases
Purchased data

3.2 DATA UNDERSTANDING

Data Tables
Continuous and Discrete Variables
Scales of Measurement (Nominal/Ordinal/Interval/Ratio)
Roles in Analysis (Labels/Descriptors/Response)
Frequency Distribution

3.3 DATA PREPARATION

Normalization
- Min-max: $$Value' = \frac{Value - OriginalMin}{OriginalMax - OriginalMin} \times (NewMax - NewMin) + NewMin$$
- z-score: $$Value' = \frac{Value - \bar{x}}{s}$$
- Decimal scaling: $$Value' = \frac{Value}{10^n}$$
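
The three normalizations above can be sketched in a few lines of Python. This is only an illustration, not code from the book: the function names are made up here, `s` is assumed to be the sample standard deviation, and `n` in decimal scaling is assumed to be the smallest integer that brings every scaled value strictly below 1 in absolute value.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [OriginalMin, OriginalMax] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the standard deviation s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / s for v in values]


def decimal_scaling(values):
    """Divide by 10^n, with n chosen so that max(|Value'|) < 1 (assumed convention)."""
    largest = max(abs(v) for v in values)
    n = 0
    while largest / 10 ** n >= 1:
        n += 1
    return [v / 10 ** n for v in values]


if __name__ == "__main__":
    print(min_max([10, 20, 30]))
    print(z_score([1, 2, 3]))
    print(decimal_scaling([120, -45, 987]))
```

Min-max keeps the shape of the distribution but is sensitive to outliers, since a single extreme value sets `OriginalMin` or `OriginalMax`; the z-score is usually the safer default when the range is not known in advance.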

May 17 2014

Bioinformatics

Circos for Comparative genomics

Synteny and Comparative genomics

See my GitHub issues for details.

First, use Lastz for synteny block alignment.

Then use SVG for synteny block drawing.

We then found Circos more powerful for handling this.


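Note: most of the content below is reposted from online resources. My respects to the original authors, and thanks for their original work! For more detail, please follow the links to the original articles.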

Circos Installation

For OSX: you can refer to this os-x-installation-guide and this.

Read this first

Circos tutorial series 1: Installation

Circos tutorial series 2: Chromosome ideograms

Circos tutorial series 3: Highlights

Circos tutorial series 4: Links

Tutorial

CIRCOS tutorial translation 1.1: helloworld

CIRCOS tutorial translation 1.2: ticks

CIRCOS tutorial translation 1.3: changing the chromosomes

CIRCOS tutorial translation 1.4: links and rules

May 14 2014

Data Visualization

Designing Data Visualizations

Data Visualization

The terms data visualization and information visualization (casually, data viz and infoviz) are useful for referring to any visual representation of data that is:

algorithmically drawn (may have custom touches but is largely rendered with the help of computerized methods);
easy to regenerate with different data (the same form may be repurposed to represent different datasets with similar dimensions or characteristics);
often aesthetically barren (data is not decorated); and
relatively data-rich (large volumes of data are welcome and viable, in contrast to infographics).

Encoding for your data type

May 12 2014

Data Mining

Machine Learning: An Algorithmic Perspective (2009)

The Multi-Layer Perceptron (MLP)

Training the MLP consists of two parts:

working out what the outputs are for the given inputs and the current weights;
and then updating the weights according to the error, which is a function of the difference between the outputs and the targets.

These are generally known as going forwards and backwards through the network.

Going Backwards: Back-Propagation of Error

The name makes it clear that the errors are sent backwards through the network. It is a form of gradient descent.
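
To make the two passes concrete, here is a minimal sketch of a tiny MLP in plain Python. It is not the book's code: the sigmoid activation, the squared error, the 2-2-1 layout, the learning rate and the toy OR data set are all assumptions chosen only to keep the example small.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Going forwards: compute the hidden activations and the output."""
    xb = x + [1.0]  # append a constant bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return hidden, output


def backward(x, target, hidden, output, w_hidden, w_out, lr):
    """Going backwards: propagate the error and update the weights in place."""
    xb = x + [1.0]
    hb = hidden + [1.0]
    # Output delta for the squared error E = 0.5 * (output - target)^2
    delta_out = (output - target) * output * (1.0 - output)
    # Hidden deltas use the output weights before they are updated
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    # Gradient descent step on both layers
    for j in range(len(w_out)):
        w_out[j] -= lr * delta_out * hb[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= lr * delta_hidden[j] * xb[i]


def total_error(data, w_hidden, w_out):
    return sum(0.5 * (forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data)


def train(data, n_hidden=2, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    before = total_error(data, w_hidden, w_out)
    for _ in range(epochs):
        for x, t in data:
            hidden, output = forward(x, w_hidden, w_out)
            backward(x, t, hidden, output, w_hidden, w_out, lr)
    after = total_error(data, w_hidden, w_out)
    return w_hidden, w_out, before, after


# Toy data (assumption): the logical OR function
OR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

if __name__ == "__main__":
    _, _, before, after = train(OR_DATA)
    print("error before:", before, "error after:", after)
```

Each weight moves against its own error gradient, which is why back-propagation counts as gradient descent; the only extra work is passing `delta_out` back through `w_out` to get the hidden-layer deltas.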

May 12 2014

Data Mining

Short Introduction to Machine learning

Learning

When it comes to learning, what are we talking about?

For machines, it means learning from data, since data is what they have.
For human beings, especially in behavioural terms, it means learning from experience.

Machine learning is about making computers modify or adapt their actions (whether these actions are making predictions or controlling a robot) so that these actions get more accurate, where accuracy is measured by how well the chosen actions reflect the correct ones.

Types of Machine Learning

Supervised learning.
A training set of examples with the correct responses (targets) is provided and, based on this training set, the algorithm generalizes to respond correctly to all possible inputs. This is called learning from examples.
Unsupervised learning.
Correct responses (targets) are not provided; instead, the algorithm tries to identify similarities between the inputs so that inputs that have something in common are categorized together. The statistical approach to unsupervised learning is known as density estimation.
Reinforcement learning.
Somewhere between supervised learning and unsupervised learning. The algorithm gets told when the answer is wrong but does not get told how to correct it. It has to explore and try different possibilities until it works out how to get the answer right. It is sometimes called learning with a critic because of this monitor that scores the answer but does not suggest improvements.
Evolutionary learning.
Biological evolution can be seen as a learning process: biological organisms adapt to improve their survival rates and chances of having offspring in their environment. We’ll look at how we can model this in a computer by using an idea of fitness, which corresponds to a score for how good the current solution is.

By the way, from ISL (An Introduction to Statistical Learning) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani:

Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.
With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.

The most common type of learning is supervised learning.

Supervised

We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).

Regression
Classification
GAM
boosting
support vector machines
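
As a small illustration of the supervised setting, the sketch below fits a straight line by ordinary least squares and then predicts the response for a new observation. The function names and the toy data are made up for this example and are not taken from ISL.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept from (predictor, response) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the response for a future observation x."""
    return slope * x + intercept


if __name__ == "__main__":
    # Toy training set (assumption): responses generated by y = 2x + 1
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    slope, intercept = fit_line(xs, ys)
    print(slope, intercept, predict(4.0, slope, intercept))
```

The fitted `slope` serves inference (how the response changes with the predictor), while `predict` serves prediction for observations outside the training set.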

Unsupervised

There is no response variable to predict; we observe a vector of measurements $x_i$ but no associated response $y_i$.

clustering
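
A minimal sketch of clustering without any response variable: one-dimensional k-means in plain Python. The deterministic initialisation (evenly spaced picks from the sorted points) and the toy data are assumptions made to keep the example reproducible; real implementations usually start from random centres.

```python
def kmeans_1d(points, k, max_iter=100):
    """Group 1-D points into k clusters; returns (centers, labels)."""
    ordered = sorted(points)
    n = len(ordered)
    if k < 1 or k > n:
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: deterministic start, evenly spaced picks from the sorted data
    if k == 1:
        centers = [ordered[0]]
    else:
        centers = [ordered[i * (n - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center
        labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


if __name__ == "__main__":
    centers, labels = kmeans_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], k=2)
    print(centers, labels)
```

Note that the algorithm only ever sees the measurements $x_i$; the labels it returns are arbitrary group indices, not predictions of some known $y_i$.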

Reference:
[1]: Machine Learning: The Complete Guide

May 11 2014

Python

Four powerful built-ins in Python

Python's four power tools

filter(), map(), reduce() and lambda

`filter(function, iterable)`

filter(function, iterable) is equivalent to [item for item in iterable if function(item)] if function is not None and [item for item in iterable if item] if function is None.

see https://docs.python.org/2/library/functions.html?highlight=filter#filter

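The items of the iterable for which function(item) is true are collected into a new sequence. In Python 2 the result is a string or a tuple if the iterable is one of those types; otherwise it is a list.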

>>> def f(x): return x % 2 != 0 and x % 3 !=0
>>> filter(f,xrange(1,100)) 
[1, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, 37, 41, 43, 47, 49, 53, 55, 59, 61, 65, 67, 71, 73, 77, 79, 83, 85, 89, 91, 95, 97]
# or
>>> filter(lambda x: x % 2 != 0 and x % 3 !=0, xrange(1,100))
[1, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, 37, 41, 43, 47, 49, 53, 55, 59, 61, 65, 67, 71, 73, 77, 79, 83, 85, 89, 91, 95, 97]
# or
>>> [ x for x in xrange(1,100) if x % 2 != 0 and x % 3 !=0]
[1, 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, 35, 37, 41, 43, 47, 49, 53, 55, 59, 61, 65, 67, 71, 73, 77, 79, 83, 85, 89, 91, 95, 97]

`map(function, iterable, ...)`

Apply function to every item of iterable and return a list of the results.

>>> def cube(x): return x*x*x 
>>> map(cube, xrange(0, 10))
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
# or
>>> map(lambda x: x*x*x, xrange(0, 10))
# or
>>> [x*x*x for x in xrange(0, 10)]
# for multiple iterable
>>> map(lambda x,y: x+ y, xrange(0, 10), xrange(10, 20))
[10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

`reduce(function, iterable[, initializer])`

Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
For example, reduce(lambda x, y: x+y, [1,2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5).

>>> def add(x,y): return x + y
>>> reduce(add, xrange(1, 11))
55
>>> reduce(add, xrange(1, 11),20)
75
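
What reduce does can be spelled out as an explicit loop. The sketch below is a hypothetical `my_reduce` written for illustration (in Python 3 syntax, where the real function lives in `functools`); it is not how the built-in is implemented.

```python
from functools import reduce

_MISSING = object()  # sentinel: distinguishes "no initializer" from initializer=None


def my_reduce(function, iterable, initializer=_MISSING):
    """Fold the items of iterable from left to right into a single value."""
    it = iter(iterable)
    if initializer is _MISSING:
        try:
            acc = next(it)  # the first item seeds the accumulator
        except StopIteration:
            raise TypeError("my_reduce() of empty sequence with no initial value")
    else:
        acc = initializer
    for item in it:
        acc = function(acc, item)
    return acc


def add(x, y):
    return x + y


if __name__ == "__main__":
    print(my_reduce(add, range(1, 11)))      # same call as reduce(add, range(1, 11))
    print(my_reduce(add, range(1, 11), 20))  # the initializer is folded in first
```

The initializer, when given, is placed before the items of the iterable, which is why the second call adds 20 to the sum.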

`lambda`

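lambda is a very interesting piece of Python syntax. It lets you quickly define a minimal one-line function, somewhat like a macro in C. These functions, called lambdas, are borrowed from LISP and can be used anywhere a function is required; they are a staple of functional programming.
In fact, we have already been using them above without noticing. A must-have for the lazy.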

>>> g = lambda x: x*x*x
>>> g(3)
27
>>> (lambda x: x*x*x)(3)
27

`Putting them together`

We can also combine filter, map, reduce and lambda, so that a whole function can be written in a single line.

>>> filter(lambda x: x % 3 != 0 and x % 4 !=0, map(lambda x,y: x+ y, xrange(0, 100), xrange(100, 200)))
[106, 110, 118, 122, 130, 134, 142, 146, 154, 158, 166, 170, 178, 182, 190, 194, 202, 206, 214, 218, 226, 230, 238, 242, 250, 254, 262, 266, 274, 278, 286, 290, 298]
# now try summing the result
>>> reduce(lambda a,b:a+b, filter(lambda x: x % 3 != 0 and x % 4 !=0, map(lambda x,y: x+ y, xrange(0, 100), xrange(100,200))))
6634
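
The sessions above are Python 2. For anyone on Python 3, here is a sketch of the same calls with the changes I am aware of: `xrange` is gone (use `range`), `filter` and `map` return lazy iterators (wrap them in `list()` to see the items), and `reduce` has moved to `functools`. The variable names are made up for this example.

```python
from functools import reduce

# filter: numbers below 100 divisible by neither 2 nor 3
odd_not_div3 = list(filter(lambda x: x % 2 != 0 and x % 3 != 0, range(1, 100)))

# map: cubes, and element-wise sums of two ranges
cubes = list(map(lambda x: x * x * x, range(10)))
pair_sums = list(map(lambda x, y: x + y, range(10), range(10, 20)))

# reduce: sum of 1..10
total = reduce(lambda a, b: a + b, range(1, 11))

# all together: no list() is needed in between because reduce consumes the iterator
combined = reduce(
    lambda a, b: a + b,
    filter(lambda x: x % 3 != 0 and x % 4 != 0,
           map(lambda x, y: x + y, range(100), range(100, 200))),
)

if __name__ == "__main__":
    print(odd_not_div3[:5], cubes, pair_sums, total, combined)
```

In modern Python a list comprehension or generator expression is often preferred over `filter`/`map` with a lambda, but the functional forms still work as shown.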

May 3 2014

About Me

About Me !

I’m a bioinformatics engineer and amateur programmer. Here is a short introduction to me and this blog. Thanks for reading!

I’m engaged in genomics research, including genome assembly/annotation/evolution and comparative genomics.

As a mathematics graduate, I’m also interested in interdisciplinary work, specifically data mining and visualization. And I hope we can exchange our ideas and experiences about Data Mining through this blog.

I mostly work in Python/Perl/R on OS X and Linux, and I also dabble in C/C++/Shell. I am now trying to learn D3.js and Processing for my data visualization hobby.

Thanks for following me: @Github and @Weibo.

I live in Wuhan, a city near the Yangtze River in China. It is beautiful despite its bad weather (long hot summers and cold winters). I spent my college years there, and I will always love my friends there.

Besides programming, I exercise and read books regularly. To be a stronger and better version of myself!

About this blog !

This blog is proudly powered by Hexo, which takes great advantage of Node.js.

It is only for my scattered ideas about reading and programming; there is no commercial intent.

Any suggestions about this blog and the topics discussed here are welcome.

As TED says: ideas are worth spreading and sharing! Many thanks!

Time axis !

| Time | Department/Work | Company/University |
| --- | --- | --- |
| *2012-Now* | You Guess! | You Guess! |
| *2010-2012* | Bioinformatics Engineer/Analyst | *BGI* |
| *2006-2010* | Student in College of Mathematics And Statistics | *HUST* |

May 2 2014

Bioinformatics

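Assembly of heterozygous genomes (De novo assembly of highly heterozygous genomes)

Assembly of heterozygous genomes (De novo assembly of highly heterozygous genomes)

I recently came across a paper,
Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads.
It seems that the assembly of large, highly heterozygous genomes has made some encouraging progress. The work by the Japanese group is solid, and parts of it are worth borrowing and referring to.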

Abstract

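With the improvement and development of sequencing platforms, sequencing throughput is no longer a problem, and sequencing keeps getting cheaper. For non-model organisms and wild species, determining their genome sequences is increasingly valuable for research. In most cases, however, these species are also highly heterozygous or polyploid, which is very troublesome for de novo projects, where short-read assembly is the mainstream approach. A mature and practical solution is still lacking (except for well-funded labs).

In general, there are two feasible approaches to heterozygous genomes, but both are widely regarded as costly in time, labor, money and mental effort: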

Fosmid-based (or BAC-based) hierarchical sequencing;

Inbred lines (doubled-monoploid clone).

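Fosmid- or BAC-based hierarchical assembly requires constructing a large number of long-insert libraries, and both the lab work and the assembly itself are delicate jobs. Classic cases such as the oyster (Zhang et al. 2012), the diamondback moth (You et al. 2013) and the Norway spruce (Nystedt et al. 2013) are all worth reading.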

May 1 2014

Data Mining

PREA: Personalized Recommendation Algorithms Toolkit

Abstract

As they said:

“Recommendation systems are important business applications with significant economic impact. In recent years, a large number of algorithms have been proposed for recommendation systems.”

and

“Recommendation systems are emerging as an important business application. Amazon.com, for example, provides personalized product recommendations based on previous purchases. Other examples include music recommendations in pandora.com, movie recommendations in netflix.com, and friend recommendations in Facebook.com.”

How does it work?

See this:

For more details, see the attachment below:

PREA: Personalized Recommendation Algorithms Toolkit

Apr 24 2014

Data Mining

Top 10 Algorithms for Data Mining

Algorithms are at the core of data mining; different data sets and different mining questions may require different algorithms. According to the book I’m reading these days, here are the top 10 algorithms for data mining.