In case you don't know about it, DNA is a chain of four types of molecules called nucleotides: adenine, thymine, cytosine and guanine. Each nucleotide is represented by a single letter, the first letter of its name, so "TATA" could be part of a DNA sequence. DNA is made of two strands twisted around each other into a double helix. The two strands must be "complementary": if the first nucleotide on one strand is an "A" then the other strand must start with a "T", if it is a "T" then the other must be an "A", if it is a "C" then the other must be a "G", and if it is a "G" then the other must be a "C". The same goes for every nucleotide position along the strands.
Erwin Chargaff discovered two simple patterns in the frequencies of the nucleotides, which hold in most organisms.
Chargaff's first parity rule says that if you count each nucleotide on each strand, the number of A nucleotides on one strand is equal to the number of T nucleotides on the other strand, and so on for each complementary pair. This follows directly from the complementarity of the two strands described above.
The second parity rule says that the first rule also applies within a single strand, but only approximately. So the number of A nucleotides on a strand is approximately equal to the number of T nucleotides on the same strand, and so on for each complementary nucleotide. It is less well understood why this should happen. It cannot simply be that the nucleotides are distributed uniformly at random, because then all four nucleotides would be about equally frequent: the A and T counts would match the G and C counts as well as each other, which is not the case. The number of A nucleotides is approximately equal to the number of T nucleotides, but not to the number of C or G nucleotides.
The second parity rule was later found to be a special case of a more general rule, which says that on a given strand of DNA, the frequency of a sequence of nucleotides is approximately equal to the frequency of the reverse of the complement of that sequence. For example, the number of times that "TAG" appears in a strand of DNA is approximately equal to the number of times that "CTA" appears in the same strand.
The research paper being discussed adds some new rules about the frequencies of the nucleotides.
Let's start with some notation.
- Let w be a nucleotide sequence such as "ATCGAT", S be a DNA strand and S' be the other strand of the DNA.
- |p| is the length of the string p (could be a word or a strand). For example |"ATA"| = 3.
- F(w, S) is the frequency of w in S: the number of times w occurs in S (overlapping occurrences included) divided by |S|-|w|+1. In other words, a window of length |w| is slid along S one position at a time, and the number of windows that match w is divided by the total number of window positions. For example F("AA", "AATAAA") = 3/5. (This was confirmed in private conversation with the authors of the paper.)
- C(w) is the complementary word of w nucleotide to nucleotide. For example C("ATCG") = "TAGC".
- R(w) is the reverse of w. For example R("ATCG") = "GCTA".
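The notation above can be sketched in Python as follows. This is only my own illustration, not code from the paper; the function names simply mirror the symbols F, C and R, and the `COMPLEMENT` dictionary is a helper I introduced.

```python
# Complement of each nucleotide letter.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def C(w):
    """Complement of w, nucleotide by nucleotide."""
    return "".join(COMPLEMENT[n] for n in w)


def R(w):
    """Reverse of w."""
    return w[::-1]


def F(w, S):
    """Frequency of w in S using a sliding window of length |w|.

    Overlapping occurrences are counted, which is why str.count
    (which skips overlaps) is not used here.
    """
    windows = len(S) - len(w) + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if S[i:i + len(w)] == w)
    return hits / windows


print(F("AA", "AATAAA"))  # 3 matching windows out of 5
print(C("ATCG"), R("ATCG"), R(C("TAG")))
```

The last line reproduces the examples given in the text: the complement and the reverse of "ATCG", and the reverse complement of "TAG".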
Using this notation, Chargaff's rules can be represented as:
1: F(w, S) = F(C(w), S')
2: F(w, S) ≈ F(R(C(w)), S) (generalized version)
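As a rough sketch of how these two rules could be checked, the Python below tests rule 1 exactly and measures the gap between the two sides of rule 2. The strand is a made-up toy string, not real DNA, so it says nothing about whether rule 2 actually holds; the helper names `rule1_exact` and `rule2_gaps` are my own, and S' is assumed to be the position-by-position complement of S as described at the start.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def C(w):
    return "".join(COMPLEMENT[n] for n in w)


def R(w):
    return w[::-1]


def F(w, S):
    # Sliding-window frequency, overlapping occurrences included.
    windows = len(S) - len(w) + 1
    if windows <= 0:
        return 0.0
    return sum(S[i:i + len(w)] == w for i in range(windows)) / windows


def all_words(k):
    """All 4**k words of length k."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def rule1_exact(S, k):
    """Rule 1: F(w, S) == F(C(w), S') for every word of length k."""
    S_prime = C(S)  # assumption: S' is the position-wise complement of S
    return all(F(w, S) == F(C(w), S_prime) for w in all_words(k))


def rule2_gaps(S, k):
    """Rule 2: |F(w, S) - F(R(C(w)), S)| for every word of length k."""
    return {w: abs(F(w, S) - F(R(C(w)), S)) for w in all_words(k)}


S = "ATGCGATTACAGGCTTAACC"  # arbitrary toy strand, not real data
print(rule1_exact(S, 2))
print(max(rule2_gaps(S, 2).values()))
```

Rule 1 holds exactly for any strand because it is built into how S' is defined. Rule 2 is an empirical observation, so the gaps are only expected to be small on long, real sequences.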
The authors of the paper organized all possible words of a given length into a table, which they called a "Math table". Each row holds a word, its complement, its reverse and its reverse complement, and each word may appear only once in the whole table, so a cell is left blank when its word has already been used elsewhere. The words in the first column, which generate the other columns, form a "generating set" or "G", and each word in G is labelled "g". According to the authors of the paper, there is no special rule for deciding which words go into G.
For example here is a Math table for words of length 2:
By summing the frequencies of each column in the table it was determined that:
Eq1: SUM(F(g, S) for g in G) ≈ SUM(F(C(g), S) for g in G)
Eq2: SUM(F(R(g), S) for g in G) ≈ SUM(F(C(R(g)), S) for g in G)
Note: I'm not sure if the way G is chosen makes a difference or if the blanks in the Math table affect the summation.
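To make the construction concrete, here is one way a Math table could be built in Python. The greedy rule used (walk the words in alphabetical order and start a new row with the first word not yet used) is purely my assumption, since the paper gives no rule for choosing G, and `math_table` is a hypothetical helper name. Blank cells are represented by `None`.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def C(w):
    return "".join(COMPLEMENT[n] for n in w)


def R(w):
    return w[::-1]


def math_table(k):
    """Rows of (g, C(g), R(g), C(R(g))) covering every word of length k once.

    A cell is None (blank) when its word already appears in the table.
    The choice of G here is greedy and alphabetical (an assumption).
    """
    used = set()
    table = []
    for letters in product("ACGT", repeat=k):
        g = "".join(letters)
        if g in used:
            continue
        row = []
        for word in (g, C(g), R(g), C(R(g))):
            if word in used:
                row.append(None)  # blank: word already placed
            else:
                row.append(word)
                used.add(word)
        table.append(tuple(row))
    return table


for row in math_table(2):
    print(" ".join(word if word else "--" for word in row))
```

For words of length 2 this particular choice gives six rows, four of which contain blanks: "AA", "AT", "CC" and "CG" each equal their own reverse or reverse complement, so their last two cells are empty. A different choice of G would give a different but equally valid table.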
The authors of the paper say that other identities were discovered using the Math table, but don't mention them.
Since SUM(F(g, S) + F(C(g), S) + F(R(g), S) + F(C(R(g)), S) for g in G) = 100% (assuming R(g) returns blank and hence F(R(g), S) and F(C(R(g)), S) give 0 if R(g) is blank in the Math table),
SUM(F(g, S) + F(C(g), S) for g in G) + SUM(F(R(g), S) + F(C(R(g)), S) for g in G) = 100%
Using Eq1 and Eq2, we can turn this equation into
SUM(2F(g, S) for g in G) + SUM(2F(R(g), S) for g in G) ≈ 100%
2SUM(F(g, S) for g in G) + 2SUM(F(R(g), S) for g in G) ≈ 100%
Eq3: SUM(F(g, S) for g in G) + SUM(F(R(g), S) for g in G) ≈ 50%
Eq4: SUM(F(C(g), S) for g in G) + SUM(F(C(R(g)), S) for g in G) ≈ 50%
Eq3 was confirmed on DNA from a variety of different species.
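The derivation above can be sketched numerically in Python. The code sums the frequencies of each column of a Math table, checks that the four column sums add up to 100% (which is exact, since every word appears in the table exactly once), and prints the left-hand sides of Eq3 and Eq4. The table construction (greedy, alphabetical) and the toy strand are my own assumptions, so the printed values illustrate the computation only and are not evidence for or against Eq3.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def C(w):
    return "".join(COMPLEMENT[n] for n in w)


def R(w):
    return w[::-1]


def F(w, S):
    # Sliding-window frequency, overlapping occurrences included.
    windows = len(S) - len(w) + 1
    if windows <= 0:
        return 0.0
    return sum(S[i:i + len(w)] == w for i in range(windows)) / windows


def math_table(k):
    """Greedy Math table (assumed construction); None marks a blank cell."""
    used, table = set(), []
    for letters in product("ACGT", repeat=k):
        g = "".join(letters)
        if g in used:
            continue
        row = []
        for word in (g, C(g), R(g), C(R(g))):
            row.append(None if word in used else word)
            used.add(word)
        table.append(tuple(row))
    return table


def column_sums(S, k):
    """Sums of F over the g, C(g), R(g) and C(R(g)) columns; blanks count as 0."""
    sums = [0.0, 0.0, 0.0, 0.0]
    for row in math_table(k):
        for col, word in enumerate(row):
            if word is not None:
                sums[col] += F(word, S)
    return sums


S = "ATGCGCGATTACAGGCTTAACCGGATATCG"  # arbitrary toy strand, not real data
g_sum, c_sum, r_sum, rc_sum = column_sums(S, 2)
print("total:", g_sum + c_sum + r_sum + rc_sum)  # 1 up to rounding
print("Eq3 left-hand side:", g_sum + r_sum)
print("Eq4 left-hand side:", c_sum + rc_sum)
```

Because the total is always 100%, the Eq3 and Eq4 sides always add up to 100% as well; what is empirical is that each of them comes out close to 50%.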