ATAC-seq Insert Size Plotting

Originally posted on 2017-Feb-10.

ATAC-seq is THE simplest genomic protocol to profile open chromatin regions in the genome. The original publication (Nature Methods 10, 1213-1218) has been cited 441 times at this moment of writing (1,073 times at this moment of re-posting). It has become almost a routine for a lab that is interested in transcriptional regulation.

One common QC for the data is to plot the fragment size density of your libraries. The successful construction of a ATAC library requires a proper pair of Tn5 transposase cutting events at the ends of DNA. In the nucleosome-free open chromatin regions, many molecules of Tn5 can kick in and chop the DNA into small pieces; around nucleosome-occupied regions, Tn5 can only access the linker regions. Therefore, in a normal ATAC-seq library, you should expect to see a sharp peak at the <100 bp region (open chromatin), and a peak at ~200bp region (mono-nucleosome), and other larger peaks (multi-nucleosomes). Examples from one of my data:

Regular scale:

Log scale:

This clear nucleosome phasing pattern indicates a good quality of the experiment.

Different people probably have different ways of plotting this using different codes, but making this plot can be simply achieved by a combination of one line of unix commands and Excel. You don’t need some complicated scripts to do it at all.

The 9th column of sam/bam file contains the fragment length information of the paired reads. Therefore, you can count them by:

samtools view ATAC_f2q30_sorted.bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > fragment_length_count.txt

This is what it does:

samtools view ATAC_f2q30_sorted.bam | #stream reads
awk '$9>0' | #output reads with positive fragment length, i.e. left-most reads
cut -f 9 | #get the fragment length
sort | uniq -c | #count the occurrence of each fragment length
sort -b -k2,2n | #sort based on fragment length
sed -e 's/^[ \t]*//' #remove leading blank of the standard output

In the output, the 2nd column is the fragment length, and the 1st column is the counts. Open fragment_length_count.txt with Excel, and simply plot the 2nd column against the 1st column:

Okay, I guess this is not really about ATAC-seq, it is more about command line usage, and I admit, it is not as beautiful as the plot generated by matplotlib or R.

This is kind of like a histogram with 1-bp bin, which sometimes could be very spiky without smoothing.

In addition, many beginners don’t know what to expect from the library profiles (before sequencing) when running on a gel or Bioanalyzer/tapeStation etc. From our experience, we often see at least three peaks:

  1. The sharp peak around 200 bp, representing the open chromatin;
  2. The peak around 350 bp, representing mono-nucleosome;
  3. The peak around 550 bp, representing di-nucleosome;

Sometimes, we also see more peaks larger than 550 bp (not always). It could be a combination of large tagmented fragments and the artefacts of the Bioanalyzer (it is not a normal log scale like agarose gel). Click here to see a few examples of Bioanalyzer profiles from successful ATAC-seq experiments. Check the pdf files with names rep{1..10}_bioanlayzer.pdf. Note those are libraries where the Illumina Nextera adapters (136 bp) have been added.

Running classic Windows game on modern MAC OS using wine

Originally posted on 2016-Nov-17.

It is relatively straightforward. All you need is wine, which is a free open source software that allows you to run Windows applications (*.exe files) on unix systems (Linux, Mac OSX etc).

In this post, I’m going to use the classic game Diablo II as an example. The latest patch of Diablo II is 1.14d, which can function on modern Mac OS. However, the experience is very awful. The most annoying thing to me is that the mouse cursor freezes for ~0.5 seconds whenever you open your character screen or quest log screen.

Eventually, I gave up on the Diablo II Mac client, and instead, using wine to run Windows client on Mac.

To do this, you need to be familiar and comfortable with using command line (Terminal app).

First, you need to install wine. This is an excellent post: Installing Wine on Mac OS X. Then, get Diablo II and Diablo II LOD Windows client from After that, go to the client folder in Terminal, and type:

wine Installer.exe

From now on, everything is like Windows, just follow the instruction, and install the game in C:\Program Files. When you do that, the game is actually located at ~/.wine/drive_c/Program\ Files/Diablo\ II/. If you want to play runeword mode, download the runeword mode patch from here. Make sure, you download 1.13d

Unzip it, and put the folder (Runewords) into your Diablo II folder. Now, to play runeword mode, you need to go to the Runewords folder to run the game, like this:

cd ~/.wine/drive_c/Program\ Files/Diablo\ II/Runewords/
wine ../Game.exe -direct -txt

If everything goes right, you should be able to play Diablo II. However, to this stage, the graphic is not very satisfying. One solution is to use Sven’s glide-wrapper. Go to download the latest glide-wrapper. Again, unzip it into your Diablo II folder.

To configure the glide-wrapper, go to you Diablo II folder, and run glide-init.exe:

cd ~/.wine/drive_c/Program\ Files/Diablo\ II/
wine glide-init.exe

Then, the configuration window will pop up, and click “settings” at the left hand side, tick things like this:

If you want to play window-mode, simply tick it. Then click “Extensions” at the left hand side, and tick the options like this (the following options work the best for me, but it might be dependent on machines):

After that is done, simply click “Quit” at the left hand side.

Now, to run the game with rune word mode and glide-wrapper, run the game like this:

cd ~/.wine/drive_c/Program\ Files\Diablo\ II/Runewords/
wine ../Game.exe -direct -txt -3dfx

In the game, check the graphic by typing /fps in the dialogue, then you should see that the top right corner shows Video: Glide, meaning it is working. If you still see Video: DDraw, the glide is not configured properly.

Have fun!

Getting RefSeq gene TSS from UCSC

Originally posted on 2016-Nov-09.

As a practice of using the unix command lines, let’s get the transcription start site (TSS) coordinates of all hg38 RefSeq genes from the UCSC genome browser, and change it to a proper bed format:

1. Go to the UCSC genome browser.

2. Go to the table browser: Tools –> Table Browser.

3. Choose the following options:

clade: Mammal
genome: Human
assembly: Dec. 2013 (GRCh38/hg38)
group: Genes and Genes Predictions
track: RefSeq Genes
table: refGene
region: genome
output: all fields from selected table
output file: whatever name you want (e.g. refseq_genes_hg38.txt)

4. The downloaded text file is a tab-delimited file. You can get a rough idea what it contains:

$ head -2 refseq_genes_hg38.txt

#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
0 NM_001276351 chr1 - 67092175 67134971 67093004 67127240 8 67092175,67095234,67096251,67115351,67125751,67127165,67131141,67134929, 67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134971, 0 C1orf141 cmpl cmpl 0,2,1,2,0,0,-1,-1,

5. txStart and txEnd indicate where the transcript starts or ends respectively. If it is from the + strand, then txStart is the TSS; if it is from the – strand, then txEnd is the TSS. Remember in a bed format, the start coordinates are 0-based, and the end coordinates are 1-based. Therefore, write a script called and the content look like this:

tail -n +2 refseq_genes_hg38.txt | awk '
if($4=="+") {print $3,$5,$5+1,$2 "_" $13,".",$4}
 else {print $3,$6-1,$6,$2 "_" $13,".",$4}
 }' > refseq_hg38_UCSC_TSS.bed
# Column 2 (name, $2) is the unique refseq ID
# Column 13 (name2, $13) is the gene name
# We want both of them for easy reading,
# that's why I put $2 "_" $13 there

6. Now have a look the output:

$ head refseq_hg38_UCSC_TSS.bed
chr1 67134970 67134971 NM_001276351_C1orf141 . -
chr1 67134970 67134971 NM_001276352_C1orf141 . -
chr1 67134970 67134971 NR_075077_C1orf141 . -
chr1 201283450 201283451 NM_000299_PKP1 . +
chr1 201283450 201283451 NM_001005337_PKP1 . +
chr1 8423686 8423687 NM_001042682_RERE . -
chr1 8817639 8817640 NM_001042681_RERE . -
chr1 8817639 8817640 NM_012102_RERE . -
chr1 34165841 34165842 NM_052896_CSMD2 . -
chr1 34165273 34165274 NM_001281956_CSMD2 . -

7. Sometimes, different isoforms have the same TSS. To get the unique one, do this:

sort -k1,1 -k2,2n -k3,3n -k6,6 -u refseq_hg38_UCSC_TSS.bed > refseq_hg38_UCSC_TSS_unique.bed

8. Now have a look at the final output:

$ head refseq_hg38_UCSC_TSS_unique.bed
chr1 11872 11873 NR_046018_DDX11L1 . +
chr1 17435 17436 NR_128720_MIR6859-4 . -
chr1 29369 29370 NR_024540_WASH7P . -
chr1 30364 30365 NR_036268_MIR1302-11 . +
chr1 36080 36081 NR_026818_FAM138A . -
chr1 69089 69090 NM_001005484_OR4F5 . +
chr1 140565 140566 NR_039983_LOC729737 . -
chr1 187957 187958 NR_128720_MIR6859-4 . -
chr1 206596 206597 NR_026823_FAM138D . -
chr1 451677 451678 NM_001005221_OR4F29 . -

Note: in the original post, there was a mistake in the The $5-1, $5 should be changed to $5, $5+1. This has been corrected now. Thank Rohit Satyam for pointing that out.

Common Bioinformatics Tasks With UNIX Commands

Originally posted on 2016-Oct-20.


1. Give names to each peak in a 3-column bed file:

$ cat Oct4_union_peaks.bed

chr1 3062833 3063056
chr1 3343519 3343843
chr1 3482630 3483324
chr1 3549654 3549895
chr1 3648929 3649295

$ awk 'BEGIN{OFS="\t"}{print $1,$2,$3,"Oct4_union_peak_" NR}' Oct4_union_peaks.bed

chr1 3062833 3063056 Oct4_union_peak_1
chr1 3343519 3343843 Oct4_union_peak_2
chr1 3482630 3483324 Oct4_union_peak_3
chr1 3549654 3549895 Oct4_union_peak_4
chr1 3648929 3649295 Oct4_union_peak_5

2. Print the number of fileds for each line:

awk '{print NF}' some_file.txt

3. Print the last column of a text file:

awk '{print $NF}' some_file.txt

4. In a narrowPeak format file, Get all peaks that have more than 5 fold enrichment:

awk '$7>5' Oct4_peaks.narrowPeak

5. In a tab-delimited file, output the lines where the 3rd column is less than the 2nd column:

# in a normal bed file, this should output nothing as the 3rd column is alway larger than the 2nd column
awk '$3<$2' Oct4_peaks.bed

6. Only output the read names, the DNA sequences or the quality strings of a fastq file:

# output read names
awk 'NR%4==1' Oct4_ChIP.fastq
# output DNA sequence
awk 'NR%4==2' Oct4_ChIP.fastq
# output quality strings
awk 'NR%4==0' Oct4_ChIP.fastq

7. Add a “chr” at the beginning of a bed file coordinates (useful to convert non mitochondrial Ensembl coordinates to UCSC coordinates):

awk '{print "chr" $0}' Oct4_peaks_ensembl.bed

8. Calculate the length of each region in a bed file and add the length to the end of each region:

awk 'BEGIN{OFS="\t"}{print $0, $3-$2}' Oct4_peaks.bed

9. Get the summit point (highest pileup) from a narrowPeak file and output as a 4-column bed file:

# simply output the summit point
awk 'BEGIN{OFS="\t"}{print $1,$2+$10,$2+$10+1,$4}' Oct4_peaks.narrowPeak
# output the summit point sorted by fold enrichment (highest to lowest)
sort -k7,7nr Oct4_peaks.narrowPeak | awk 'BEGIN{OFS="\t"}{print $1,$2+$10,$2+$10+1,$4}'


1. Output the record names from a fasta file:

grep '>' your_fasta.fa

2. Display one line before the match and two lines after the match (-A and -B options):


3. Only display the matching strings, not the entire line (-o option):

# search for the P7 end of adapter sequence of a Nextera library

# if you want to suppress output file name, use '-h' as well


1. Extract the 1st, 2nd, 4-8th, 10th and following columns from a tab-delimited file (default):

cut -f 1,2,4-8,10- input.txt

2. Extract the 4-8th columns of a comma-delimited file:

cut -f 4-8 -d, input.txt

3. Extract the last column and use ‘/’ as a delimiter:

cat input.txt | rev | cut -f 1 -d/ | rev


1. Kill all pending jobs in LSF:

bjobs | grep 'PEND' | awk '{print $1}' | xargs bkill

2. Remove all .sh files that have the word ‘foo’ in them:

grep -w 'foo' *.sh | cut -f 1 -d ':' | sort -u | xargs rm