Changing genomes: The size problem

The idea

There are plenty of science fiction stories in which a virus delivers an entirely different genome to the cells of some character, causing him or her to change accordingly. Nice idea, eh?

Well, replacing a cell's genome with a different one will not automatically produce deep changes in the organism. For example, it is not realistic to expect it to grow more arms, feet, wings, etc. Phenotype elements of this magnitude and complexity usually have to develop during the embryonic or fetal stages. (Of course, sufficiently well-engineered extra genes might solve this problem, too.) However, we are coming closer and closer to the time when a partial or even complete genome replacement might be possible. It is worth contemplating at least one of the problems on the way.

This problem is size. It would be nearly impossible to fit an entire human genome, or a large part of it, within a virus. Pandoravirus salinus, the virus with the largest genome currently known, is about 0.001 mm in size, which makes it too large to successfully infect a lot of human cells. (For example, some brain cells are as small as 0.005 mm, only 5 times larger. Erythrocytes are even smaller, but since they have no nucleus and neither carry nor need a genome, they do not count.) And it contains about 2.5 Megabases of genetic info, while the human genome is about 3234 Megabases, more than a thousand times larger. It appears that a virus will simply be unable to do the trick.
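To put the mismatch in numbers, here is a minimal Python sketch that uses only the approximate figures quoted above. The constant names are illustrative assumptions, not part of any real library.

```python
# Approximate figures quoted in the text above.
PANDORAVIRUS_GENOME_MB = 2.5   # Pandoravirus salinus genome, in Megabases
HUMAN_GENOME_MB = 3234.0       # human genome, in Megabases
VIRUS_SIZE_MM = 0.001          # size of the virus particle
SMALL_CELL_SIZE_MM = 0.005     # size of a small brain cell

# How many times larger the human genome is than the viral one.
genome_ratio = HUMAN_GENOME_MB / PANDORAVIRUS_GENOME_MB

# How many times larger a small cell is than the virus particle.
size_ratio = SMALL_CELL_SIZE_MM / VIRUS_SIZE_MM

print(f"Human genome is about {genome_ratio:.0f} times the viral genome")
print(f"A small cell is about {size_ratio:.0f} times the virus in size")
```

The first ratio comes out a little under 1300, which is the "more than a thousand times" gap that the rest of this article tries to close.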

Of course, there are ways around this. Let’s consider some of them.

The obvious ways

The pandoravirus consists largely of proteins and lipids, which form a shell around its DNA, commonly called a capsid. The capsid is needed to protect the virus from the environment and to help it infect cells. A skillfully designed virus, meant to be transmitted in lab or hospital conditions by medical means, will need a far smaller capsid. This can reduce the virus size to a degree.

Not all of the human genome is useful info. Only about 2% of it encodes proteins. The rest is “noncoding” DNA: introns and other sequences often dismissed as “genetic garbage”. The problem is that a lot of this “garbage” is actually needed too. Big parts of it play an important role in regulating gene expression. Other parts are “spare copies” of existing genes, which can come into play if the existing genes are damaged. It is strongly suspected that at least 40% of the genome is not really dispensable. Future genetic engineering can probably cut that volume to about 2-3% of the genome size, but this is still roughly 65-100 Megabases, far more than a virus can carry to every human cell.
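Taking the percentages above at face value, the arithmetic can be checked with a short Python sketch. The function name and the 2.5 Megabase capacity (borrowed from the pandoravirus genome size) are assumptions made for illustration.

```python
HUMAN_GENOME_MB = 3234.0   # human genome size quoted in the text, in Megabases
VIRUS_CAPACITY_MB = 2.5    # assumed capacity: the pandoravirus genome size

def fraction_to_mb(fraction: float, genome_mb: float = HUMAN_GENOME_MB) -> float:
    """Convert a fraction of the genome into Megabases."""
    return fraction * genome_mb

scenarios = {
    "protein-coding only (2%)": 0.02,
    "engineered minimum, upper end (3%)": 0.03,
    "not dispensable (40%)": 0.40,
}

for label, fraction in scenarios.items():
    size_mb = fraction_to_mb(fraction)
    times_over = size_mb / VIRUS_CAPACITY_MB
    print(f"{label}: {size_mb:.0f} Mb, about {times_over:.0f}x the capacity")
```

Even the most optimistic case, roughly 65 Megabases, is more than twenty times what the largest known viral genome holds.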

A smart trick might be to use single-stranded RNA instead of double-stranded DNA, which would cut the virus payload roughly in half. While this is not much, combined with other tricks it might sometimes make the difference between being able and not being able to do the task.

Another trick might be not to replace those parts of the genome that are the same in the source and the target genome. Most reasonable genetic modifications will need to replace fewer than 10 genes, a very small part of the roughly 20,000 in the human genome. Even huge replacements, comparable to replacing the entire genome of an animal with that of a plant, can get away with affecting only about 20-30% of all protein-encoding genes. The result might not be perfect, but it will be good enough.

A gene that encodes a specific protein sometimes differs across species, but for most genes the differences between even an animal and a plant version of the same gene will be very small. A good genetic replacement mechanism might be able to replace only the differing parts of the genes, thus eliminating the need to carry a lot of duplicated material. This can bring the payload size down enough to make drastic genome changes fit into a conveniently sized virus.
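The "replace only the differing parts" idea is essentially what software calls a diff or patch. Here is a minimal Python sketch using the standard library's `difflib.SequenceMatcher`. The two sequences are made up for illustration, and the function names are hypothetical; a real mechanism would of course work on molecules, not strings.

```python
from difflib import SequenceMatcher

def make_patch(source: str, target: str) -> list[tuple[int, int, str]]:
    """List the (start, end, replacement) edits that turn source into target."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # identical stretches need no payload at all
    ]

def apply_patch(source: str, patch: list[tuple[int, int, str]]) -> str:
    """Rebuild the target by splicing the replacements into the source."""
    pieces, pos = [], 0
    for start, end, replacement in patch:
        pieces.append(source[pos:start])  # keep the unchanged stretch
        pieces.append(replacement)        # insert the differing part
        pos = end
    pieces.append(source[pos:])
    return "".join(pieces)

# Made-up sequences that differ in two positions, for illustration only.
source_gene = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
target_gene = "ATGGCCATTGCAATGGGCCGCTGAAAGGGTGCCAGATAG"

patch = make_patch(source_gene, target_gene)
payload_size = sum(len(replacement) for _, _, replacement in patch)
print(f"Full gene: {len(target_gene)} bases, patch payload: {payload_size} bases")
```

The patch also has to carry the positions of the edits, which this sketch does not count, so the real saving would be somewhat smaller than the raw base count suggests.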

A lot of proteins can be “optimized” for size. For example, the active site of an enzyme usually takes up a very small part of the protein. The rest, usually more than 99% of the protein, serves mostly to ensure the needed spatial conformation of the active site. The same goes for the enzyme's regulatory site. A powerful enough modelling system will probably be able to cut the size of the non-essential part of most proteins several times over, sometimes by more than an order of magnitude. This will cut the volume of nucleic acid needed to encode that protein by the same factor.

Finally, the changes needed can be delivered not in a single virus, but packed into a series of consecutively introduced viruses. This technique not only increases the volume of genetic info that can be delivered, but also makes it possible to conduct the process more gradually, thus providing better control over it. (Or to achieve changes larger than a single delivery could manage, for example a complete change of biochemistry.)
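As a rough illustration of serial delivery, here is a minimal Python sketch that splits a payload into virus-sized chunks. The 2.5 Megabase capacity and the 100 Megabase payload are assumed round figures, and both function names are hypothetical.

```python
import math

def deliveries_needed(payload_mb: float, capacity_mb: float) -> int:
    """Number of consecutive viruses needed to carry the whole payload."""
    if capacity_mb <= 0:
        raise ValueError("capacity must be positive")
    return math.ceil(payload_mb / capacity_mb)

def split_payload(sequence: str, capacity: int) -> list[str]:
    """Cut a sequence into consecutive chunks of at most `capacity` symbols."""
    return [sequence[i:i + capacity] for i in range(0, len(sequence), capacity)]

# Assumed figures: a 100 Mb payload and 2.5 Mb per virus.
print(deliveries_needed(100.0, 2.5))

# Toy example of the chunking itself, on a made-up sequence.
chunks = split_payload("ATGGCCATTGTAATGGGCC", 8)
print(chunks)
```

The sketch ignores the overhead each virus needs for its own machinery and for telling the cell where its chunk belongs, so the real number of deliveries would be higher.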

Okay. These tricks are already more than sufficient for an (effectively) complete replacement of the genome of a cell – even a human one. But why should this stop us? Why not have a vastly richer genome, with all the opportunities it may give? Or why not have many different genomes, each of them packed with abilities far beyond the generic human genome, and be able to activate whichever one we choose?

This could be great. For example, we might be able to take many different forms at will, carrying not only the initial and the final genome, but also extra genes responsible for the physical modifications, and even several intermediate genomes to smooth the transition between very different forms. Great, eh? 🙂

However, it might also need far better packaging of the genetic information than everything listed above can achieve. Is such a packaging possible?

I think so, and I will be bold enough to describe a way to do it.

A more compact encoding

DNA encodes sequences of amino acids, the building blocks of all proteins in the human body. (The proteins, in turn, serve to synthesize all the other materials that the human body produces, uses and is built of: lipids, carbohydrates, etc.) Every amino acid is encoded by three nucleotides, each with a molecular mass of about 350 to 380 g/mol. So, DNA needs about 1000 g/mol to encode one amino acid.

This is a rather wasteful encoding. It gives a lot of opportunities to detect and repair damage to the genetic info, and to preserve it through many generations. However, we want to hold this info for just a short time, until the virus in which it is packaged arrives at the target cell and enters it. So, we could do with a far more economical encoding.

Imagine a chain of atoms with a valency of three or higher. Two of each atom's bonds connect it to its neighbors in the chain, and the rest can be used to attach atoms or groups that encode the information and/or play a structural role. A classic example would be a chain of carbon atoms, but many other elements are suitable, too.

Far from all elements with a valency of three or higher are appropriate. For example, some of them can be poisonous to the living cell. (This can be circumvented by ensuring that they end up bound in harmless compounds when the chain is broken down.) Others have chemical properties that make them less suitable. For example, oxygen and fluorine could in theory show up to six or seven valences, but in practice they strongly prefer only two or one. This will be far harder to get around. Happily, many other possibilities exist.

Some elements would really hate to form long homogeneous chains all by themselves. However, this can often be circumvented by interlacing them with atoms of other elements. These atoms might have three valences or more (e.g. N in C-N-C-N chains), or only two valences (e.g. O in P-O-P-O chains). In the second case, the amount of info that can be encoded by a chain of a given length will be lower, as every second link cannot carry additional differentiating groups. However, lone atoms are typically small and will not add much to the mass per amino acid, and even they can carry some info if different two-valency atoms are used.

How much can we improve the ratio of info to mass and size? Let’s take one of the many possible examples: a chain of carbon atoms. To play on the safe side, we will assume that only one of the two free valences of each chain atom is used for storing info. The other one is reserved for binding groups that stabilize the chain. (If it is also used for info purposes, the amount of info per chain link can double, e.g. one link might encode two amino acids.)

The proteins of the human organism are built from 20 different amino acids. Other organisms may use up to three more. Finally, there must be a “stop” signal code, which terminates the reading of the chain at the end of this “gene”. So, if we plan to be universal, we must be able to attach one of at least 24 different groups to the carbon atom in order to encode one amino acid per chain link.

To be bio-compatible, we will have to build these groups only from atoms that are widely present in living organisms. (Some rarer elements will still be tolerated by the organism, but we decided to play on the safe side.) So, this limits us to H, O, C, N and P.

Our task will be to build 24 different groups with one free valence that are as small as possible and use only these elements. To cut a long story short, we can easily do this, getting them to an average size of about 35 g/mol or less. Choosing the smallest groups to encode the most frequently used amino acids can bring the average size in use down to about 30 g/mol or less. However, not every one of the smallest possible groups will be compatible with the breakdown chemistry, or easily distinguishable from the others by the enzymes that will read this chain. Again, let’s play on the safe side and assume that the average size in use will be about 40 g/mol per unit.

These 40 g/mol should be added to the roughly 12 g/mol of the carbon atom that is the chain link. In addition, we said that some structural elements will be bound to the carbon atoms too. Assuming about another 20-21 g/mol for these elements per link, we come to a little over 70 g/mol per encoded amino acid. That is more than 12 times less than in DNA. Not bad for playing on the safe side!
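The per-link budget can be tallied in a few lines of Python. This is a minimal sketch using the round figures from the text; the variable names are made up for illustration, the carbon atom is taken at its standard atomic mass of about 12 g/mol, and 20.5 g/mol is simply the midpoint of the 20-21 g/mol range.

```python
# All masses in g/mol; round figures from the text unless noted.
INFO_GROUP = 40.0           # conservative average for the info-carrying group
CHAIN_CARBON = 12.0         # the carbon atom forming the chain link
STRUCTURAL_GROUP = 20.5     # midpoint of the assumed 20-21 range
DNA_PER_AMINO_ACID = 1000.0 # three nucleotides, rounded as in the text

# Mass of one chain link, which encodes one amino acid.
compact_per_amino_acid = INFO_GROUP + CHAIN_CARBON + STRUCTURAL_GROUP

# How many times lighter the compact encoding is than DNA.
improvement = DNA_PER_AMINO_ACID / compact_per_amino_acid

print(f"Compact encoding: {compact_per_amino_acid:.1f} g/mol per amino acid")
print(f"Improvement over DNA: about {improvement:.1f} times")
```

If the second free valence were also used for info, as mentioned earlier, the mass per amino acid would drop further, though not by a full half, since the chain carbon is shared.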

Of course, such a payload will need biochemical mechanisms that can translate it into ordinary DNA in order to be usable by the living cell. (Unless the living cell is to be rebuilt on an equally optimized different chemistry, or on a non-chemical mechanism of functioning, but that is another story.) For this, the virus will also have to carry DNA or RNA for about 50-100 enzymes, specially designed to perform the translation. This amount of nucleic acid, however, is not a problem to fit in a convenient virus, especially if some of the tricks mentioned in the first part are used. 🙂

Additional compression

The genetic info can be squeezed further. A lot of different enzymes have only minor structural differences, with over 90% of their structure being the same. In addition, most of the differences often do not affect the enzymes' activity, so the differences of real significance are even smaller. Repeating a large amount of encoded info can easily be avoided by a multistage translation process.

In fact, gene expression in the human body is already multistage. After DNA is transcribed into RNA, the result is often not yet the RNA that encodes the desired protein. It has to be further processed by enzymes, most often to cut out some parts, to get the final, translatable RNA. Why invent strange stuff when we can just learn from Nature and take a mechanism that is proven to work? Except that we will be using it the other way around: to insert chain pieces instead of deleting them.

For a start, we will have to add one more encoding group, to mark the beginning and the end of the “to be replaced” sequences. (In fact, we can be reasonably safe even without an additional one, by matching long sequences that are not found in existing proteins.) Alternatively, we can have special encoding groups that each stand for such a sequence. These groups do not need to be very small, since we will use them rarely and each will stand for a large amount of info. However, making enzymes that recognize them might be harder, so initially we might have to use marker sequences built from more ordinary encoding groups instead.

Then, we should have some enzymes that find a “to be replaced” mark inside an encoding chain. (The chain can be our virus payload chain type, or DNA, or intermediate RNA, or any other form of intermediate chain.) These enzymes will replace that mark with a long sequence that encodes something needed in several places. This way, we need only one record of a long sequence, no matter how many proteins contain it.
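In software terms this is dictionary substitution: store a shared piece once and leave a short mark wherever it belongs. Here is a minimal Python sketch, with made-up protein strings and a hypothetical `#` symbol standing in for the “to be replaced” mark.

```python
MARK = "#"  # hypothetical mark symbol; must not be an amino acid letter

def compress(proteins: list[str], shared: str) -> list[str]:
    """Replace every occurrence of the shared sequence with the mark."""
    return [protein.replace(shared, MARK) for protein in proteins]

def expand(chains: list[str], shared: str) -> list[str]:
    """Do what the enzymes would do: put the shared sequence back."""
    return [chain.replace(MARK, shared) for chain in chains]

# Made-up sequences, for illustration only.
shared_part = "HEAGHKLGLSHSNDPK"
proteins = [
    "MKT" + shared_part + "LLV",
    "GA" + shared_part + "PW" + shared_part,
]

chains = compress(proteins, shared_part)
original_size = sum(len(p) for p in proteins)
stored_size = sum(len(c) for c in chains) + len(shared_part)  # one record only

print(chains)
print(f"Original: {original_size} units, stored: {stored_size} units")
```

The saving grows with the length of the shared sequence and with the number of places it is used, which is exactly the situation described for families of similar enzymes.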

What I just described is a two-stage translation process. There is no problem in adding more stages. These can differ in the inlay encodings used, or in the info chain types (e.g. C-N based instead of C-C), etc.

Using several consecutive stages, or an open-ended number of stages (continuing as long as there are marks to replace), can achieve very complex and effective forms of info decompression, thus enabling very high degrees of compression. With all the possible tricks applied, data about natural protein-encoding DNA, not optimized by the tricks from the first part, could be compressed down to less than 1% of its original size. Even protein-encoding data already optimized as far as the first part's tricks allow could be compressed down to 10-20% of its size, maybe even less.
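The open-ended variant, where expansion continues as long as marks remain, can be sketched in Python as a loop. The `[name]` mark syntax, the library contents and the function names are all assumptions made for illustration; the stage limit is a guard against marks that refer to each other in a circle.

```python
import re

MARK = re.compile(r"\[(\w+)\]")  # hypothetical mark syntax: [name]

def expand_stage(chain: str, library: dict[str, str]) -> str:
    """One stage: replace every mark with its library sequence."""
    return MARK.sub(lambda match: library[match.group(1)], chain)

def expand_all(chain: str, library: dict[str, str], max_stages: int = 10):
    """Repeat stages until no marks remain; return the chain and stage count."""
    stages = 0
    while MARK.search(chain):
        if stages >= max_stages:
            raise RuntimeError("marks still present; possible circular reference")
        chain = expand_stage(chain, library)
        stages += 1
    return chain, stages

def units(chain: str) -> int:
    """Count symbols, treating each mark as a single encoding unit."""
    return len(MARK.sub("#", chain))

# Made-up library: the "domain" entry itself contains marks.
library = {"core": "GSGSGS", "domain": "MK[core]LV[core]"}
payload = "[domain]AA[domain]"

expanded, stages = expand_all(payload, library)
stored_units = units(payload) + sum(units(entry) for entry in library.values())

print(expanded)
print(f"{stages} stages, {stored_units} stored units for {len(expanded)} expanded")
```

Because library entries may contain marks themselves, each extra level of nesting multiplies the expansion, which is where the very high compression ratios would have to come from.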

All this may enable us to transfer tens, maybe hundreds of different genomes into cells. What can we use these for?

The prospects

Curing all existing genetic diseases, including aging and death from old age, is not even the beginning of the potential of this technology. It can bring us abilities that we have only dreamed of. Having enzymes to digest and use nearly any organic matter. Or even inorganic. Or even to extract energy from radioactive elements, etc. Being able to switch our bodies' speed and energy needs between superhuman speed and strength and complete hibernation, or anywhere in between. Being able to create thousands of different body parts and/or organs, specialized for what we need or just like. Changing our bodies completely: becoming birds, should we want to fly, or dolphins or sharks for a life at sea, or entirely designed organisms for deep space, and back… The only limit will be our imagination and the skill of the engineers who implement our dreams.

Of course, this potential is for both good and bad. For miracles, but also for bioterror. However, I believe that the weight of the power that this knowledge gives us will also teach us to be up to the responsibility it requires. For example, it may finally force us to put more effort into understanding others and considering their interests, too. It is already time for that, with or without genetic miracles at hand.
