Use Python for Data Mining

Here is my experience setting up a Python production environment for data mining, with configuration notes to share with you.

I mainly followed these two blog posts; you can follow them too, with some minor adjustments.

  1. Getting started with python for data scientists
  2. INSTALL PYTHON, NUMPY, SCIPY, AND MATPLOTLIB ON MAC OS X

Modules

And here is what I did:

sudo pip install nose
sudo pip install jinja2
sudo pip install tornado
sudo pip install pyzmq
sudo pip install ipython
sudo pip install pyreadline
sudo pip install pygments
sudo pip install numpy
sudo pip install scipy
sudo pip install cython
sudo pip install pandas
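
After a batch of installs like this, a quick check that every module is actually importable can save some head-scratching. The following is only a minimal sketch, assuming Python 3.4 or later; the `MODULES` list and the `missing_modules` helper are my own illustrative names, and note that some import names differ from the pip names (`pyzmq` is imported as `zmq`, `ipython` as `IPython`).

```python
import importlib.util

# Import names for the packages installed above (an assumed mapping:
# pyzmq -> zmq, ipython -> IPython, cython -> Cython).
MODULES = ["nose", "jinja2", "tornado", "zmq", "IPython",
           "pygments", "numpy", "scipy", "Cython", "pandas"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Any name it prints is a package whose install did not succeed for the interpreter that ran the script.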

Error with Pandas

But I came across a strange problem: I couldn’t install pandas on my OSX 10.9. I googled the errors and tried many solutions from Stack Overflow.

Then, on GitHub, Tom Augspurger told me that virtualenv or conda might work.

Fixed

I tried virtualenv, and it does work.

pip install virtualenv
virtualenv ENV
source ENV/bin/activate
pip install pandas
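
To confirm that the shell really is using the virtualenv's interpreter before installing pandas, a small check like the one below can help. It is a sketch, not part of virtualenv itself: `in_virtualenv` is a name I made up, and it relies on `sys.real_prefix` (set by older virtualenv releases) and `sys.base_prefix` (Python 3.3 and later).

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtualenv/venv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base = (getattr(sys, "real_prefix", None)
            or getattr(sys, "base_prefix", sys.prefix))
    return base != sys.prefix


if __name__ == "__main__":
    print("inside a virtualenv:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```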

Python 3.3 installed with pythonbrew also works, so that is another alternative.

pythonbrew switch Python-3.3.1
pip install pandas

And have fun!

Single-Library Genome Assembly

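Illumina's report compares how read length, coverage, insert size, and other factors affect assembly results. Under ideal conditions, for a simple genome, about 30X of short-insert reads plus a moderate amount of long-insert reads can cover enough of the genome and give good values for metrics such as N50.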
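
The two numbers that keep coming up here, sequencing depth (the "30X") and N50, are easy to compute by hand. Below is a minimal sketch with helper names of my own (`coverage`, `n50`), assuming Python 3 and the usual definitions: depth is total sequenced bases divided by genome size, and N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def coverage(num_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def n50(lengths):
    """N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


if __name__ == "__main__":
    # 3 million reads of 100 bp on a 10 Mb genome give 30X.
    print(coverage(3000000, 100, 10000000))
    print(n50([80, 70, 50, 40, 30, 20]))
```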

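Early Sanger sequencing projects used library insert sizes from 1k to 40k, probably to get around the effect of repeats. Later, SOAPdenovo kept a scheme of 200, 500, 2k, 5k, and 10k libraries when assembling the human genome. The exact strategy differs from genome to genome, but in general both short-insert libraries (<2k) and long-insert libraries (>2k) are needed. ABySS, for instance, used only 42X of data from a 210 library when it assembled an African human genome.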

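GAGE evaluated the results of several assemblers and includes a section titled "Effect of multiple libraries on assembly". Based on my own project experience, the multi-library strategy exists to help scaffolding. Contig assembly mainly uses the overlap information between reads: as long as sequencing is random and uniform and the depth is sufficient, short-insert reads alone can assemble contigs (consensus sequences without N) well. The contig step does not involve library insert information (insert size and paired-end relationships). Scaffolding, which comes later, needs the library information to help establish links between contigs, and what matters most there is a gradient of libraries larger than 2k. That is why software such as ALLPATHS recommends one short-insert library plus one large-insert library. For a haploid species like the jewel wasp, whose genome is also not very large, and given that the individuals are small and DNA extraction is difficult, a single specimen is not enough to build three short-insert libraries (200, 500, 800). We can try to build one or two libraries instead, and the effect on contig assembly will not be large. (The single-chromosome ant I once assembled also had only one 500 library because of sample limits, and the result was good.)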
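
The distinction between contigs (no N) and scaffolds (contigs linked across gaps) can be made concrete with a tiny example. This is just an illustrative sketch: `split_scaffold` is a hypothetical helper, and it assumes gaps in a scaffold are written as runs of `N`, which is the common convention in FASTA output.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold sequence into its contigs at runs of N (gaps)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


if __name__ == "__main__":
    # Two gaps of unknown sequence link three contigs into one scaffold.
    print(split_scaffold("ACGTNNNNGGCCNTT"))
```

The short-insert library determines how good the pieces are; the large-insert libraries determine how many of them can be linked into one scaffold.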

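We should also note the progress of newer assemblers such as fermi, which can already do de novo assembly of a human genome from one sample, one library, and 35X of data.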

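To support the follow-up analysis and discussion, I will also look up published bee and ant assembly papers for everyone to read. Hymenoptera research is quite active right now, and there is a lot we can learn from it. To move this project forward as quickly as possible, we do not have to insist on building three libraries. That is my opinion.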

Installing matplotlib is a nightmare

Building matplotlib on OSX has proved to be a nightmare because of the different types of zlib, png and freetype that may be on your system.

The recommended and supported way to build is to use a third-party
package manager to install the required dependencies, and then
install matplotlib from source using the setup.py script. Two widely
used package managers are homebrew and MacPorts. The following
example illustrates how to install libpng and freetype using
homebrew.

Example usage::

  brew install libpng freetype

If you are using MacPorts, execute the following instead:

Example usage::

  port install libpng freetype

To install matplotlib from source, execute:

Example usage::

  python setup.py install

But the build failed with a freetype error:

/usr/local/include/ft2build.h:56:10: fatal error: ‘freetype/config/ftheader.h’ file not found

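I googled it and tried the suggestions on Stack Overflow, but it still did not work.

After trying all sorts of things, I finally found the answer in this blog post:
http://blog.caoyuan.me/2012/08/matplotlib-error-mac-os-x/

And at last it worked!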

Assemble a genome

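Genome assembly is the process of reconstructing longer genomic sequences from the short reads produced by sequencing.
Different assemblers and assembly strategies differ in their algorithms and details,
but overall they all go through the following steps: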

a) Contig assembly

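First, contigs are assembled using the overlaps between reads and the coverage information.
There are many contig assembly methods and plenty of software, each with a different algorithmic emphasis, and the details are fairly involved.
Overall, though, they all first build a graph from the overlap relationships between reads and then simplify that graph.
Short reads generally use a k-mer-based De Bruijn Graph (DBG),
while traditional long reads generally use Overlap-Layout-Consensus (OLC) or a String Graph (SG).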
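
To make the "build a graph, then simplify it" idea a bit more concrete, here is a minimal sketch of the graph-building half for the De Bruijn case. It is a toy, not how any particular assembler implements it: `de_bruijn_graph` is my own helper name, it ignores reverse complements and sequencing errors, and it does no simplification at all.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns a dict mapping each (k-1)-mer prefix to the set of
    (k-1)-mer suffixes that follow it in at least one k-mer.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

In this toy example the two overlapping reads collapse into one unbranched path AC, CG, GT, TA, which spells the contig ACGTA; real assemblers spend most of their effort on the simplification step that this sketch leaves out.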


About Me

About Me !

I’m a bioinformatics engineer and amateur programmer. Here is a short introduction to me and this blog. Thanks for reading!

I’m engaged in genomics research, including genome assembly/annotation/evolution and comparative genomics.

As a mathematics graduate, I’m also interested in interdisciplinary work, specifically data mining and visualization. And I hope we can exchange our ideas and experiences about Data Mining through this blog.

I do most of my work in Python/Perl/R on OSX and Linux, and I also dabble in C/C++/Shell. I’m now trying to learn D3.js and Processing for my data visualization hobby.

Thanks for following me: @Github and @Weibo.

I live in Wuhan, a city on the Yangtze River in China. It’s beautiful apart from its bad weather (long, hot summers and cold winters). I spent my college years there, and I will always love my friends there.

Besides programming, I exercise and read regularly, to become a stronger and better version of myself!

About this blog !

This blog is proudly powered by Hexo, which takes great advantage of Node.js.

It is only a place for my scattered thoughts on reading and programming; there is no commercial intent.

Any suggestions about this blog and the topics discussed here are welcome.

As TED says: ideas are worth spreading and sharing! Many thanks!

Timeline!

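=========  ================================================  ==================
Time       Department/Works                                  Company/University
=========  ================================================  ==================
2012-Now   You Guess!                                        You Guess!
2010-2012  Bioinformatics Engineer/Analyst                   BGI
2006-2010  Student in College of Mathematics And Statistics  HUST
=========  ================================================  ==================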

About This Blog!

1.Bulletin Board

Hi, I’m Buttonwood! This is my first blog on github.

Writing blogs like Hackers!

This is really very very cool! And I’ve learned lots of things from it.

Yes, after days of struggling, my Techblog is online at last!
Thanks to the Internet, we live in an open society as well as a boom time.

It is a personal blog focused on new technology and ideas, and also a place to sum up practical experience from my work and life.

It is non-commercial, and all rights are reserved.

Besides, this is an individual experiment and a personal undertaking; it has nothing to do with any company, organization, or institution.

Any suggestions about this blog are welcome, and I’ll keep updating it.
