# RPKM, FPKM and TPM, clearly explained

from StatQuest

It used to be that when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). However, TPM (Transcripts Per Million) is now becoming quite popular. Since there seems to be a lot of confusion about these terms, I thought I’d use a StatQuest to clear everything up.

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:

1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
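The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rpkm`, its parameters, and the toy counts and lengths are all made up for the example.

```python
def rpkm(counts, lengths_kb):
    """Return RPKM values for one sample.

    counts     -- read counts per gene (hypothetical input)
    lengths_kb -- gene lengths in kilobases, in the same order as counts
    """
    # Step 1: total reads / 1,000,000 is the "per million" scaling factor.
    per_million = sum(counts) / 1_000_000
    # Step 2: normalize for sequencing depth, giving reads per million (RPM).
    rpm = [c / per_million for c in counts]
    # Step 3: normalize for gene length, giving RPKM.
    return [r / length for r, length in zip(rpm, lengths_kb)]


# Toy sample with made-up numbers: three genes, 1,000,000 reads in total.
counts = [100_000, 300_000, 600_000]
lengths_kb = [2.0, 3.0, 6.0]
rpkm_values = rpkm(counts, lengths_kb)
print(rpkm_values)
```

Note that the second and third genes end up with the same RPKM even though one has twice as many reads, because it is also twice as long.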

FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
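To make the read-versus-fragment distinction concrete, here is a small sketch that counts fragments instead of reads. The `(fragment_id, gene)` tuples are a hypothetical stand-in for mapped reads; real pipelines work from alignment files, and the function name `count_fragments` is an assumption for this example.

```python
def count_fragments(mapped_reads):
    """Count fragments per gene from mapped reads.

    mapped_reads -- iterable of (fragment_id, gene) tuples, one per mapped
                    read (a hypothetical, simplified representation)
    """
    fragments_by_gene = {}
    for fragment_id, gene in mapped_reads:
        # A set keeps each fragment once, even if both of its reads mapped.
        fragments_by_gene.setdefault(gene, set()).add(fragment_id)
    return {gene: len(ids) for gene, ids in fragments_by_gene.items()}


reads = [
    ("frag1", "A"), ("frag1", "A"),  # both reads of the pair mapped
    ("frag2", "A"),                  # only one read of the pair mapped
    ("frag3", "B"), ("frag3", "B"),  # both reads of the pair mapped
]
fragment_counts = count_fragments(reads)
print(fragment_counts)
```

Gene A has three mapped reads but only two fragments, and it is the fragment count that would be fed into the FPKM calculation.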

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:

1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.
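The TPM steps can be sketched the same way. Again, this is only an illustration under assumed names: `tpm`, its parameters, and the toy numbers are not from any particular tool.

```python
def tpm(counts, lengths_kb):
    """Return TPM values for one sample.

    counts     -- read counts per gene (hypothetical input)
    lengths_kb -- gene lengths in kilobases, in the same order as counts
    """
    # Step 1: normalize for gene length first, giving reads per kilobase (RPK).
    rpk = [c / length for c, length in zip(counts, lengths_kb)]
    # Step 2: sum of RPK values / 1,000,000 is the "per million" scaling factor.
    per_million = sum(rpk) / 1_000_000
    # Step 3: normalize for sequencing depth, giving TPM.
    return [r / per_million for r in rpk]


# Toy sample with made-up numbers: read counts proportional to gene length.
counts = [200, 400, 100]
lengths_kb = [2.0, 4.0, 1.0]
tpm_values = tpm(counts, lengths_kb)
print(tpm_values)
print(sum(tpm_values))
```

Because the scaling factor is built from the RPK values themselves, the TPMs in a sample add up to one million (up to floating-point rounding).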

So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.
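Writing both units as formulas may make the difference easier to see. The notation here is introduced only for illustration: $n_i$ is the read count for gene $i$, $\ell_i$ is its length in kilobases, and $N = \sum_j n_j$ is the total number of reads in the sample.

$$
\mathrm{RPKM}_i = \frac{n_i}{\ell_i} \cdot \frac{10^6}{N},
\qquad
\mathrm{TPM}_i = \frac{n_i}{\ell_i} \cdot \frac{10^6}{\sum_j n_j / \ell_j}
\quad\Longrightarrow\quad
\mathrm{TPM}_i = \frac{\mathrm{RPKM}_i}{\sum_j \mathrm{RPKM}_j} \cdot 10^6
$$

RPKM divides by the total read count $N$, while TPM divides by the sum of the length-normalized counts. Within one sample the two differ only by a constant factor, but that factor changes from sample to sample.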

When you use TPM, the sum of all TPMs in each sample is the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.

Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in Sample 2 is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the TPMs in each sample always add up to the same number (so the denominator required to calculate the proportions is the same, regardless of which sample you are looking at).

With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.
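A quick numerical check of this point, using made-up counts for four genes in three samples. The helper names `rpkm` and `tpm` and all the numbers are assumptions for illustration only.

```python
def rpkm(counts, lengths_kb):
    # Depth first (reads per million), then gene length.
    per_million = sum(counts) / 1_000_000
    return [c / per_million / length for c, length in zip(counts, lengths_kb)]


def tpm(counts, lengths_kb):
    # Gene length first (reads per kilobase), then depth.
    rpk = [c / length for c, length in zip(counts, lengths_kb)]
    per_million = sum(rpk) / 1_000_000
    return [r / per_million for r in rpk]


# Hypothetical gene lengths (kb) and read counts for genes A, B, C, D.
lengths_kb = [2.0, 4.0, 1.0, 10.0]
samples = {
    "Sample 1": [10, 20, 5, 0],
    "Sample 2": [12, 25, 8, 0],
    "Sample 3": [30, 60, 15, 1],
}

rpkm_sums = {name: sum(rpkm(c, lengths_kb)) for name, c in samples.items()}
tpm_sums = {name: sum(tpm(c, lengths_kb)) for name, c in samples.items()}

for name in samples:
    print(name, "RPKM total:", round(rpkm_sums[name]),
          "TPM total:", round(tpm_sums[name]))
```

With these toy numbers the RPKM totals come out to roughly 428571, 450000, and 425472, while every TPM total is one million. A TPM value is therefore a share of the same whole in every sample, whereas the same RPKM value is a share of a different whole in each one.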

Source – StatQuest

1. Attention: “However, TPM (Transcripts Per Kilobase Million)” should read “However, TPM (Transcripts Per Million)”.

• I was just pointed here by a colleague to help me understand the benefit of TPM over RPKM (this is not my field) and I think this correction is mistaken. TPM is measuring the transcription frequency of a specific gene; the length of the gene is absorbed into the calculation and shouldn’t appear in the units. I’m not sure if TeX will render here, but I’ll give it a shot:

In the TPM calculation, $RPK_A$ (the reads per kilobase for gene A) is $n_A/\ell_A$, where $n_A$ is the number of reads that map to gene A and $\ell_A$ is the length of gene A. This measures the transcription rate for gene A. Then the scaling factor is $(\sum_i RPK_i) / (10^6)$, where $i$ ranges over all of the genes, including A. So the final value TPM of gene A is $(n_A/\ell_A)/(\sum_i n_i/\ell_i)\times 10^6$, which measures the *relative* rate of transcription of gene A (with the decimal point moved 6 spaces to the right). Both $\ell_A$ and $\ell_i$ have kilobase units, which cancel out.

(I thought this might just be an accident of terminology, but other sources seem to expand TPM as “transcripts per million” as well. I also just realized that this comment might be trying to make the same correction I am; I’m not sure if the post itself was changed in response to this comment or remains as it was posted, so apologies for any misplaced attribution.)

2. If you search for the page “What the FPKM? A review of RNA-Seq expression units” and read what it has to say on TPM, it says you should never compare TPM between samples and that it’s only for within sample comparisons. Please comment.

• I tend to agree with “you should never compare TPM between samples and that it’s only for within sample comparisons”

Otherwise, I don’t think “Count up all the RPK values in a sample and divide this number by 1,000,000.” is interpretable across samples anymore. This is basically a sum of sequencing depths of all genes, which is fundamentally different from the total number of mapped reads.

In my opinion, RPKM and TPM seem to be for different purposes.

• Daniel J McGoldrick

The authors actually state “TPM is probably the most stable unit across experiments, though you still shouldn’t compare it across experiments”. You munged the quote and its meaning. There is no “never”, and the implication that RPKM or FPKM would be better is certainly false.

3. I think you need to replace ‘gene’ with ‘transcript’. It’s not gene length that counts, but transcript length.

4. Hi,
I currently work with qPCR, but just recently was introduced to RNA-Seq ways to report results when a paper about the whole transcriptome of the organism I work with came out. It happens that I would like to compare my qPCR results with the results reported in the transcriptome paper (which are in RPKM) and I don’t know if I can. I’m reading a lot of papers, trying to understand it better, but I couldn’t come to a conclusion yet. Could you give me a hand, please?

Is it possible to compare qPCR results to RNA-Seq results?

Many thanks!

• You can’t compare the “numbers”, but you can compare the results of the analysis or use the evidence to support your claims.

5. “TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM”. If the only difference is the order of operations, then TPM is always equal to RPKM, so why do we need TPM at all?

• This is similar to why the order of operations like multiplication, addition, and brackets matters. Look at the toy examples. Within one sample, TPM is just RPKM rescaled by a constant, but that constant differs from sample to sample, so the two units yield different proportions when you compare a gene across samples.

6. Is there any paper that discusses the efficacy of TPM (like your video does)?

7. RPKM is obtained by first normalizing for sequencing depth and then for gene length in kb. Is there any reason why it wasn’t coined RPMK instead? RPMK seems to reflect the order of operations more correctly.

8. Can we directly correlate FPKM/TPM to expression? In other words, do high FPKM/TPM values mean high expression for a gene?

Or how do I find the highly expressed genes within an RNA-seq sample?