Benford's Law and RNA-seq

More than two years ago, I started watching videos from the YouTube channel Numberphile. I came across Benford’s Law, and I was amazed when I saw it in this video. Basically, Benford’s Law is an observation about the relative frequency distribution of the first digits in real-world data sets, where the data span multiple orders of magnitude. Intuitively, we might expect the first digits to be uniformly distributed, so that each digit (from 1 to 9) has a relative frequency of around 1/9. However, it turns out that 1 is the most frequent digit, 2 is the second most frequent digit, and so on. Since people who fabricate data tend to make the distribution uniform, Benford’s Law can be used for fraud detection. After watching the Numberphile video, I immediately realised that RNA-seq data should follow Benford’s Law as well. In an RNA-seq sample, you always have lowly expressed genes and highly expressed genes, and their expression values span several orders of magnitude. Indeed, there are papers about using Benford’s Law to characterise gene expression patterns, such as this one and this one.
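Benford’s Law has a simple closed form: the first digit d appears with probability log10(1 + 1/d). Here is a quick sketch of the expected frequencies (awk only provides the natural log, so we divide by log(10)):

```shell
# Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)
awk 'BEGIN { for (d = 1; d <= 9; d++)
               printf "%d %.1f%%\n", d, 100 * log(1 + 1/d) / log(10) }'
```

The digit 1 is expected about 30.1% of the time, falling off to about 4.6% for the digit 9.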

This term, I started teaching junior undergraduates some basic data analysis skills. It would be nice to introduce Benford’s Law to them, and I thought they would be interested. Instead of just telling them the law, I decided to let them discover it by themselves. The first thing was to find a data set for them. I chose one from the Nature Methods paper by Mortazavi et al. in 2008, because this is the first paper that comes to my mind when talking about RNA-seq. In reality, it was a handful of papers from 2007 and 2008 that together marked the start of the RNA-seq era; see the excellent review by Stark et al., 2019.

I took a screenshot of the first page of the Mortazavi et al. paper and sent it to the students:

Then the students were asked to find the paper and download Supplementary Data Set 1 from the paper. Even though very few of them had read formal scientific literature before, they all knew how to use search engines. Therefore, it was not a big problem for them to find the paper and download Supplementary Data Set 1, a tab-delimited text file that contains the gene expression values.

Then the students were asked to take the 6th column, i.e. the “finalRPKM” column of the file, and plot the frequency distribution of the first digits of the numbers in that column, ignoring all numbers with leading 0s. With only a few numbers, this could easily be done by hand. However, there are more than 30,000 genes in that data set, so there are more than 30,000 lines in the text file. For students who know a bit of programming, this is not a problem. However, not every student has programming experience when they start their studies in biology, although they do use Microsoft Excel a lot.

To show them how to solve the problem, I demonstrated two approaches: using Excel and using a few Linux commands. In this way, they would understand that the problem is solvable even without programming experience, using whatever tools they have available, such as Excel. In addition, the problem can be solved in seconds using Linux commands, which may motivate them to learn a bit of command-line magic.

First, let’s have a look at how to get it done in Excel. You need to import the text file into Excel. In Excel (my version is 2017), do File –> Import –> Text file –> click the Import button and choose the file. Then choose Delimited, click the Next button, choose Tab and click Next again. Now, it is important to show them that they need to set the Column data format of the column that contains the gene names to Text. Otherwise, genes such as March5 or Sept7 will be converted to dates. I did a demonstration to show them that; this is a well-known problem from the early days of genomics. For example, see these two papers by Zeeberg et al. and Ziemann et al.. At the time of this writing, it seems March5 and Sept7 are now officially Marchf5 and Septin7. Maybe these changes were made to prevent the problem. After importing the file, you need to select the ‘finalRPKM’ column, click the Data tab, and use the Text to Columns option. When the Wizard shows up, choose Fixed width and click the Next button. Now you will see a lot of tick marks in the Preview of selected data, and you simply click at the first tick, like this (note the vertical line with a black arrow on top):

After that, the first digit of every number is isolated. You can use the COUNTIF function of Excel to count the frequency of 1, 2, 3, …, and 9.

The second way of achieving the same goal is through Linux commands. Many bioinformatics people use Python or Perl for text processing, but sometimes Linux commands are just better and quicker. I showed the students that you can do this with essentially one line of commands in just a few seconds:

tail -n +2 41592_2008_BFnmeth1226_MOESM22_ESM.txt | cut -f 6 | cut -c 1 | awk '$1>0' | sort | uniq -c

Now let’s break it down to see what each command does:

tail -n +2 41592_2008_BFnmeth1226_MOESM22_ESM.txt | \ # remove the header
cut -f 6 | \ # get the 6th column
cut -c 1 | \ # get the first character, i.e. the first digit
awk '$1>0' | \ # remove numbers with a leading 0
sort | uniq -c # count occurrence

Here is the result:

5411 1
2820 2
1839 3
1413 4
1050 5
 887 6
 701 7
 592 8
 530 9

If you plot that as a bar graph, it looks just like Benford’s Law, although the match is not exact. For example, the digit 1 appears ~35% of the time, whereas Benford’s Law predicts ~30.1%, but Benford’s Law is not a strict law anyway.
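To see how close the match is, the counts above can be turned into percentages and printed next to the Benford predictions. A small awk sketch (the counts are simply pasted in from the result above):

```shell
# Observed first-digit counts from the finalRPKM column (pasted from above),
# compared with the Benford prediction log10(1 + 1/d)
printf '5411 1\n2820 2\n1839 3\n1413 4\n1050 5\n887 6\n701 7\n592 8\n530 9\n' |
awk '{ count[$2] = $1; total += $1 }
     END { for (d = 1; d <= 9; d++)
             printf "%d observed=%.1f%% benford=%.1f%%\n",
                    d, 100 * count[d] / total, 100 * log(1 + 1/d) / log(10) }'
```

The observed frequencies track the predicted ones reasonably well, with the digit 1 somewhat over-represented (35.5% vs 30.1%).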

I hope that by demonstrating this, they are motivated to learn some Linux commands to kick-start their careers in modern biology.

Use Aliyun OSS For UCSC Visualisation In China Mainland

You always need to repeat some basic tasks that you did in the past when you relocate to a new place. For some of them, you need to find alternative ways of achieving the same goals, for example, a way of visualising bigWig signal files. This is very important for assessing data quality. I hate to say it, but it is often true that the most efficient way of telling whether an experiment worked or not is the least scientific way: visual inspection. Looking at bigWig signals was relatively easy in my previous institute, but it is not so straightforward for a small lab that does not have good computational support. Things become even more annoying and complicated when your lab is in mainland China, for obvious reasons.

The Problem.

The bigWig files we get nowadays are often large (>100 MB), so it is not possible to physically upload them to the UCSC genome browser. You need a server to host the files: you put your bigWig files on the server and allow them to be accessed publicly so that the UCSC genome browser can reach them. Of course, you can use services like Dropbox or Amazon S3, but they are either blocked in mainland China or extremely slow. Try uploading a bigWig file generated from only 1 million sequencing reads from your university network; it takes ages here. Therefore, we need to solve both the server problem and the internet speed problem.

The Goal.

  1. Find a server that is publicly reachable to host your files.
  2. Speed needs to be fast: the connection between your local machine and the server, and the connection between the server and the UCSC genome browser.
  3. Affordable.

The Solution.

Having looked around a bit, I think I have found a solution. This may not be THE solution, because I haven’t tried that hard, but using the Aliyun Object Storage Service (OSS), which is similar to Amazon S3, really solves all the problems mentioned above. The price is pretty affordable: for 100 GB of data, you only pay 12 RMB per month, and with 100 GB of space you can store quite a lot of experiments. The price is nothing compared to the reagents you spend on your experiments, and you don’t need to worry about the maintenance of the server, etc.

To use it, purchase the service, and login to the OSS console. Now at the top right corner, click the Access Key button:

Then a warning will pop up, but you just click “Continue to use Access Key”. Then you can get your Access Key ID and your Access Key Secret:

Now, on your local machine, open up your Terminal and download the right binary from this page. Start configuring your OSS storage by typing:

ossutil64 config

Just input the information required, and here is mine:

For the following settings, carriage return means skip the configuration. Please try "help config" to see the meaning of the settings
Please enter endpoint:
Please enter accessKeyID:your_access_key_id
Please enter accessKeySecret:your_access_key_secret
Please enter stsToken:

The stsToken is optional, so just press Enter. You have the following options for endpoints:

and a few more in other countries.

Okay, just like Amazon S3, you need to create a bucket and put your files into it. For example, I created a bucket called myatacbucket (note: only lowercase letters, numbers and - are allowed in bucket names) by:

ossutil64 mb oss://myatacbucket --acl=public-read --storage-class Standard

This creates a bucket called myatacbucket that can be read by the public. Okay, now locate the bigWig file on your local machine (my_file.bw below is just a placeholder name), and transfer it to the bucket:

ossutil64 cp my_file.bw oss://myatacbucket

Now you need to get a URL so that UCSC can reach the file. Go to your OSS console, and do your bucket name --> File Management --> tick your files --> Export URL, like this:

So, using the above example, my URL will be:

Go to UCSC (the Asia mirror is recommended if you are in mainland China), and put your URL there like usual (the following text should be in one single line):

track type=bigWig name=atac_seq description=the_best_atac_seq_ever_pileup
visibility=full maxHeightPixels=10:50:128 viewLimits=0:20 autoScale=OFF color=0,0,255

Now it is done, and the speed is quite nice. You can find more information about ossutil in its documentation.
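As a side note, for a public-read bucket you don’t strictly have to export the URL from the console: OSS serves objects at a predictable virtual-hosted address of the form https://&lt;bucket&gt;.&lt;endpoint&gt;/&lt;object&gt;. A minimal shell sketch, assuming the example bucket above and placeholder endpoint/object names:

```shell
# Assemble the public URL for an object in a public-read OSS bucket.
bucket=myatacbucket                     # bucket from the example above
endpoint=oss-cn-hangzhou.aliyuncs.com   # assumed endpoint; use your own
object=my_file.bw                       # placeholder object name

echo "https://${bucket}.${endpoint}/${object}"
```

The printed address is what you give to the UCSC genome browser.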

Why do I like CUT&RUN more than ChIP-seq

Chromatin immunoprecipitation (ChIP) is one of the gold standards for studying protein-DNA interactions in vivo. In 1984, John Lis’ lab at Cornell invented the first ChIP method, using UV to crosslink protein to DNA, and they used this approach to study the interaction of RNA polymerase with specific genes (Gilmour and Lis, 1984). However, UV crosslinking causes DNA damage that makes certain downstream analyses, such as PCR, very difficult. Most people choose to use another approach, developed later, in 1988, by Alexander Varshavsky’s lab at MIT, where formaldehyde was used for the crosslinking to study the histone-DNA interaction around the hsp70 gene (Solomon et al., 1988). Ever since then, ChIP protocols have been optimised and developed extensively, with many method variations invented along the way.

One major breakthrough came in 1999 and 2000, when two labs combined ChIP with microarrays (ChIP-chip) to study protein-DNA interactions at the chromosome- and genome-wide scale. The first ChIP-chip was performed by Nancy Kleckner’s lab in 1999 to study cohesin binding along yeast chromosome III (Blat and Kleckner, 1999). One year later, Richard Young’s lab successfully investigated two transcription factors (Gal4p and Ste12p) across the entire yeast genome (Ren et al., 2000). With the completion of the initial draft of the human genome and the development of microarray technologies, genome-wide tiling arrays in mouse and human became possible later on, and people could finally profile in vivo protein-DNA interactions at the genome-wide scale in mouse and human using ChIP-chip. Then next-generation sequencing arrived. In 2007, Keji Zhao’s lab combined ChIP with Solexa (now Illumina) sequencing (ChIP-seq) to study the genome-wide distribution of 20 histone modifications, 1 histone variant (H2A.Z), Pol II and CTCF (Barski et al., 2007). Yes, that is the famous Barski et al. paper. It has 9 authors in total, with 7 co-first authors listed alphabetically by surname :-). ChIP-seq offers higher sensitivity and resolution than ChIP-chip, and it has now almost become routine for studying the function of DNA-binding proteins in vivo.

One key step in ChIP-seq is fragmenting the genome to solubilise the chromatin and make the DNA fragments suitable for next-generation sequencing. However, this turns out to be extremely difficult after formaldehyde crosslinking. Micrococcal nuclease (MNase) is a popular choice for fragmenting the genome, but one problem with MNase is that it is only effective on native (non-crosslinked) chromatin, although people have successfully performed ChIP-seq using MNase on crosslinked cells (see e.g. Skene and Henikoff, 2015). The most popular approach for fragmenting chromatin is sonication, which physically shears the DNA into pieces. Several companies have even designed sonication machines specifically for ChIP-seq, and those machines are quite popular in the ChIP-seq community in my experience.

Sonication is really good at breaking down crosslinked chromatin, but it also has a really serious problem that many beginners do not know about: it occasionally also destroys the epitopes of your protein. We have a very efficient (in terms of shearing DNA into small pieces) sonication machine in the lab, and the results are kind of scary to me. I am not going to publish these results, so I might as well put them here. Below, you see a western blot with the indicated antibodies on crosslinked chromatin. Equal aliquots were taken out after the indicated sonication intervals and run on an 8% polyacrylamide gel.

These are just four examples from mouse T cells. Different proteins behave differently. IRF4 and CTCF seem to be relatively resilient to sonication, while STAT6 and Pol II (using the famous 8WG16 monoclonal antibody) gradually disappear with increasing sonication time. Histones are generally stable and not really affected by sonication (not shown here). This could be one reason why histone ChIP-seq is relatively easy, but transcription factor (TF) ChIP-seq is so difficult. One solution is to keep the sonication time as short as possible. Some people may say to aim for 100 - 500 bp or even 100 - 300 bp fragments. However, that recommendation is for histone ChIP-seq. For TF ChIP-seq, I would rather have larger fragments. Look at the example shown below. Equal amounts of DNA from a sonication time course were loaded on a 1% agarose gel. Lanes 1 and 7 are the NEB 2-Log DNA ladder, and lane 2 is un-sonicated chromatin. Lanes 3 - 6 are with increasing sonication time.

So, which sonication condition should you choose for your TF ChIP-seq experiments? Unless you have tested it before and know your protein epitope is stable during sonication, I would start with the conditions in lane 3. It is all about finding the balance between shearing the DNA and maintaining protein integrity. It is not a very pleasant procedure.

Okay, finally, let’s talk about CUT&RUN: the original article by Skene and Henikoff, 2017, and the detailed protocol by Skene et al., 2018. First, I divide ChIP-seq into two main parts (actually three if you count sequencing): the ChIP part and the library preparation part. There are quite a few kits developed by different companies to make library preparation easier and more sensitive. The recent awesome method variations ChIPmentation (Schmidl et al., 2015) and TAF-ChIP (Akhtar et al., 2018) cleverly utilise the transposase Tn5 to make the whole library preparation even easier than any commercial kit. However, all those improvements in the past ten years have been made to the library preparation part. The ChIP part of the procedure has remained almost unchanged. This is where CUT&RUN kicks in.

The idea of CUT&RUN is not completely new. It was inspired by a publication in Molecular Cell in 2004 from Ulrich K. Laemmli’s lab (Schmid et al., 2004). Yes, from THE Laemmli, who refined SDS-PAGE and published the second most cited paper of all time :-). In their Molecular Cell paper, Schmid et al. introduced two methods to profile chromatin-associated proteins: ChIC (chromatin immunocleavage) and ChEC (chromatin endogenous cleavage). CUT&RUN is based on the idea of ChIC; I’m not going to talk about ChEC here, although it has also been adapted for sequencing (ChEC-seq). Since its publication in 2004, the ChIC paper has been cited only 73 times according to Google Scholar at the time of this writing, so the method does not seem to be widely used. However, the idea of ChIC is so simple and so brilliant. In short, you do the following:

  1. Prepare nuclei or permeabilise cells.
  2. Add antibody for the protein of interest.
  3. Add the Protein A-MNase fusion protein (pA-MN); protein A will bind to the antibody (IgG) and bring MNase to where the protein of interest is located. This is the clever bit: MNase is strictly Ca2+ dependent. As long as there is no Ca2+ in the buffer, MNase will not cut DNA at this stage.
  4. Wash off unbound pA-MN.
  5. Add Ca2+ to activate MNase, and it will cut the DNA around where the protein of interest is bound.
  6. Collect the small DNA fragment for the analysis (Southern blot in the original publication).

That’s it! I’m going to say it again: the idea of ChIC is simply brilliant! Last year, Steven Henikoff’s lab did the community a favour and adapted this method for sequencing, calling it CUT&RUN (Cleavage Under Targets and Release Using Nuclease) (Skene and Henikoff, 2017). The obvious advantages of CUT&RUN over ChIP-seq are:

  • Crosslinking free.
  • Sonication free, yeah!!!
  • No need for immunoprecipitation (IP), which is lossy.
  • Since you do not need to do IP, theoretically, it should work with low-input samples. Indeed, it can work with as few as 100 cells, or even single cells (uliCUT&RUN by Hainer et al., 2018).
  • Higher resolution (enzyme cutting always gives better resolution than sonication).
  • Quick and simple (no overnight IP and reverse-crosslinking).

Very often, when a new ChIP-seq method variation comes out, it is only tested with proteins in yeast or with histone modifications and CTCF in mammalian cells. Those tests are informative, but you have to admit: they are relatively easy to do. What makes me feel positive about CUT&RUN is that other transcription factors (apart from CTCF), such as MYC and MAX, have been shown to work with the method.

Now, all I need to do is to establish my own lab, try it and get it to work. I am sure it will be a lot of fun :-)

Update: new methods that build upon the idea of CUT&RUN, in which the MNase is replaced by the transposase Tn5, were developed in 2019. These methods offer an even quicker and simpler way of profiling protein-DNA interactions at the low-input and single-cell level. These methods include:

I have personally tried the CUT&Tag method, because it has a very detailed protocol available online, and Dr. Steve Henikoff is also actively answering questions on the protocol website. The method worked pretty well in my hands (at least for H3K27me3 for now). Here is a screenshot of our data: H3K27me3 in K562 cells around the HOXD cluster.

Project scg_lib_structs

Being able to perform genomic studies on material obtained from single cells is just awesome. In 2009, Tang et al. presented the first study of mRNA-seq from single cells (scRNA-seq). Two years later, Navin et al. successfully sequenced the genomes of single cells. Ever since then, the field of single cell genomics has been rapidly evolving and has proven to be powerful in many aspects of biology.

Many methods (mainly for scRNA-seq) have been invented along the way, which is a good thing. However, I’m a bit bombarded by and lost in those methods. Many of them are full of technical jargon, and it is not straightforward to figure out how they actually work and what the differences among them are. To help people understand how a method works, authors tend to present a schematic view of how the sequencing library is generated. For example, this is a very nice summary of 8 different scRNA-seq methods taken from a recent review by Ziegenhain et al.:

It is very clear and very helpful, but my main problem with this kind of schematic view is that it often omits many important details and does not help with troubleshooting. From my own experience, when I try a new scRNA-seq method, I will most likely fail in the first few attempts. After getting the sequencing data, I often use grep to have a look at the sequences in the fastq files, and grep is pretty good for troubleshooting genomic experiments. However, there are still a few things that I need to figure out:

  • What are the adapter sequences to grep? i.e. the grey bar in the figure above.
  • Since DNA is double stranded, should I grep the adapter sequences as they are, or should I do the reverse complement?
  • Since paired end sequencing is often used, should I grep in Read 1 or Read 2?
  • Reads often contain partial adapters; should I grep the 5’- or 3’-ends of the adapters or of the reads?

The answers to those questions are not obvious from just looking at the schematic view of the library generation. What really helps is to draw from scratch the actual DNA sequences from the first step (often reverse transcription) to the final library construction step (often PCR amplification). Supplementary Figure S1 from the SCRB-seq paper and the Technical Note from 10x Genomics are very good examples.
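The second question above (grep the adapter as-is, or its reverse complement?) is easy to settle on the command line. Below is a minimal sketch; the adapter shown is the common Illumina adapter prefix, used here only as an example, and read_1.fastq.gz is a placeholder file name:

```shell
# Example adapter: the common Illumina adapter prefix (an example here;
# substitute the actual adapter of the method you are troubleshooting)
adapter=AGATCGGAAGAGC

# Reverse complement: reverse the string, then swap complementary bases
revcomp=$(printf '%s' "$adapter" | rev | tr ACGT TGCA)
echo "$revcomp"    # GCTCTTCCGATCT

# On real data, count both orientations in Read 1 and Read 2 and compare,
# e.g. (read_1.fastq.gz is a placeholder file name):
#   zcat read_1.fastq.gz | grep -c "$adapter"
#   zcat read_1.fastq.gz | grep -c "$revcomp"
```

Whichever orientation gives the much higher count tells you how the adapter sits in that read.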

Therefore, to help myself understand all of the methods and with future troubleshooting, I started the project scg_lib_structs (which stands for “Single Cell Genomics Library Structures”), an on-paper library preparation done by myself. Today, I have finally finished it (well … for now … I will continue once new methods are invented), and it mostly contains scRNA-seq methods. My main aim is to create something simple and straightforward to see, so I’ve decided to use HTML to write a webpage for each method. It is kind of stupid, since I have to manually align the sequences, which involves heavy pressing of the Space and ⌫ Delete keys (you will see what I mean if you open the HTMLs with a text editor), but I don’t know any other way of doing this.

There is a lot of repetition in generating those pages, because the basic chemistry among the methods is not that different. We recently wrote a review that summarised the main differences among scRNA-seq methods; you can check all the text boxes and Figure 2 from our review. Not all methods were included in the review, since some were not published when we started writing it. The scg_lib_structs project page has a more up-to-date version of the comparison.

All files from this project are here:

GitHub Pages here:

Set Up Public Links For UCSC Genome Browser (bigDataUrl) Visualisation At The WTSI

The best way of visualising sequencing read pileup/coverage data (very useful for ChIP-seq, DNase-seq & ATAC-seq) is to get the pileup/coverage files in bigWig format and visualise the signal as custom tracks in the UCSC genome browser - the best genome browser, period! The bigWig files are often large, especially when you have high-coverage experiments. Therefore, you need a place to host your files, and it needs to be somewhere the UCSC genome browser can reach.

At Sanger, it is not so obvious how to do this, but it is not difficult. The information is kind of hidden, and you need to find the right person to ask. At the time of this writing, there are two easy ways of achieving this goal.

Method 1: use web-bfint:/data

There is a host called web-bfint. You can simply access the host by ssh your_sanger_id@web-bfint and transfer files from/to it with scp (e.g. scp file_name your_sanger_id@web-bfint:/some/path) as long as you are inside the Sanger network. When you log in to web-bfint, you will see a welcome message:

Welcome to Ubuntu 16.04 (xenial) (4.4.0-87-generic x86_64) 

* Managed by WTSI Ansible. 
  See the documentation here:

Last login: Thu Jun  7 18:38:55 2018 from *.*.*.*

Once you are in web-bfint, there are two places that will be useful to you:

/data/scratch/project
# for short term deposition of files for collaborators and research
# data will be removed after a while
# (I think it is every six months, not entirely sure though)

/data/production
# for long term files associated with formal journal publications or archival


You create a ticket via ServiceDesk, asking them to create a directory there for your team under either /data/scratch/project or /data/production or both. Check with the IT person in your team; maybe you already have one there. For example, our team directories are /data/scratch/project/team205 and /data/production/teichmann. Each of those directories has an associated public address, and anybody can access those places via those addresses from anywhere, even outside the Sanger network. You and your team members can create subdirectories for yourselves.

For example, I have a bigWig file somewhere on the Sanger farm (my_file.bw below is just a placeholder name), and I want to visualise it via the UCSC genome browser. First, I need to transfer that file to my team directory on web-bfint. I locate the file on the farm, and do:

scp my_file.bw xc1@web-bfint:/data/production/teichmann/xi/

After the transfer, if I enter the file’s public address in Chrome/Firefox, I should be able to see the file there. Note: files are synced hourly, so you might experience a delay between putting files in the directory and being able to see them via the ftp address.

To visualise the file, I can just go to UCSC genome browser to add a custom track by (just an example, the following text should be in one single line):

track type=bigWig name=chip_seq description=the_best_chip_seq_ever_pileup
visibility=full maxHeightPixels=10:50:128 viewLimits=0:20 autoScale=OFF color=0,0,255

That’s it.

Method 2: use S3 storage

The ftp method is easy, but it has a problem. Sometimes, especially when you have a large amount of data, the UCSC genome browser shows an error: ftp server error on cmd=[PASS]. You then have to keep refreshing the page, sometimes several times, which is annoying.

Another way is to use the S3 storage provided under the Sanger Flexible Computing Environment. Again, create a ticket via ServiceDesk saying that you want to use S3 storage. You can request a maximum of 100 GB, which is quite small but can still host quite a large number of bigWig files.

After they have created an S3 account for you, you will have a .s3cfg file in your home directory on the Sanger farm, and the content will look like this (I omitted the security part):

cat ~/.s3cfg 

encrypt = False
host_base =
host_bucket = %(bucket)
progress_meter = True
use_https = True
access_key = *************
secret_key = *****************

You will need the tool s3cmd, which is already installed on the Sanger farm. S3 storage is so-called object storage: it does not have a hierarchical structure like a Unix file system. Instead, you create buckets and put your files/objects in the buckets you create. The name of a bucket must be unique within the entire institute, so it is better to include your Sanger user name in the bucket name.

For example, I want to create a bucket named xc1_mSp_scATAC to put all my scATAC-seq bigWig files in, so, on the Sanger farm, I do:

s3cmd mb s3://xc1_mSp_scATAC

Then, I have a bigWig file somewhere on the farm (again, my_file.bw below is just a placeholder name). To visualise it, I first need to put it in the S3 bucket I just created. This can be achieved by locating the bigWig file on the farm and doing:

s3cmd put my_file.bw s3://xc1_mSp_scATAC

By default, the file/object in the bucket is private. Then, I need to make the file public so that UCSC genome browser can reach it. This can be done by:

s3cmd setacl --acl-public s3://xc1_mSp_scATAC/my_file.bw

Now it is done: the file in the bucket has a public address that the UCSC genome browser can reach (now you see why the bucket name must be unique across the entire institute, right?).

To visualise the file, I can just go to UCSC genome browser to add a custom track by (similar example as the previous one, and again, the following text should be in one single line):

track type=bigWig name=atac_seq description=the_best_atac_seq_ever_pileup
visibility=full maxHeightPixels=10:50:128 viewLimits=0:20 autoScale=OFF color=0,0,255

For more s3cmd usage, such as listing buckets/files, creating and deleting buckets/files, you can simply do s3cmd --help.

Enjoy the visualisation.