Genetics

8-31-05

GENOTYPE TO PHENOTYPE

Two years ago marked the 50th anniversary of the discovery of the molecular structure of DNA, and last year saw the completion of the Human Genome Project: the nucleotide base pair sequencing of our entire genome, about 3 billion base pairs of DNA containing about 25,000 genes. Somehow, the DNA base pair sequences of our collection of genes make us what we are. The same is true for the millions of other species, our evolutionary relatives, with which we share this planet.

"Genes", the fundamental elements of heredity transmitted from parents to offspring, consist of stretches of DNA (along a chromosome) that "code for" a protein (or structural RNA molecule). Diploid human cells contain 46 chromosomes (22 autosomal pairs plus XX or XY) with a total of 6 billion base pairs of DNA (so, the haploid human genome size is 3 billion base pairs). A person's genome consists of two same-or-different versions (alleles) of each autosomal gene in a diploid cell, and one version of each autosomal gene in a haploid (sex) cell.

1. Review: What is the structure of DNA, and how do new nucleotides get added during DNA replication?

Figure 2.5 shows the double helical structure (base pairs, antiparallel strands, diameter and unit length).

Figure 2.18 shows how new nucleotides get added during growth of a strand of DNA.

2. How do mRNA copies of DNA sequence get made by transcription?

See Figure 1.13 and gene expression handout.

RNA gets synthesized in the 5' to 3' direction, by the action of the enzyme called RNA polymerase, which opens up the DNA double helix and uses one of the strands (reading it in the 3' to 5' direction) as a template for synthesizing the new RNA strand by temporary base pairing. Behind the enzyme, the DNA strands come back together.
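The base-pairing rule behind transcription can be sketched in a few lines of Python. This is only an illustration: the function name `transcribe` and the nine-base template are made up for the example, and the sketch ignores promoters, stop signals, and the enzyme itself. The template is written 3' to 5', so reading it left to right gives the RNA in its 5' to 3' order.

```python
# Base pairing used when RNA is built on a DNA template:
# template A pairs with U in RNA (not T); T with A; G with C; C with G.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (written 5'->3') made from a DNA template written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


# Hypothetical template strand, written 3'->5'.
template = "TACGGCATT"
rna = transcribe(template)
print(rna)  # AUGCCGUAA
```

Note that the RNA has the same sequence as the other (non-template) DNA strand, with U in place of T.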

What makes RNA polymerase start at the right locations (i.e., the beginnings of genes) on the DNA? RNA polymerase molecules bind to specific "promoter" sites along the DNA and then start the actual synthesis process at nearby "transcription start" sites. At the end of a gene, another stretch of DNA sequence functions as a "transcription stop" signal; there the enzyme falls off the DNA and releases the RNA molecule that it has synthesized (the "primary transcript").

Still in the nucleus (for eukaryotes), the primary transcript gets modified in several ways (including removal of any internal non-coding regions, "introns") to become an mRNA molecule.
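Intron removal can be pictured as cutting marked intervals out of a string and joining what is left. The sketch below assumes the intron positions are already known. The function name `splice`, the coordinates, and the toy sequence are all hypothetical, and the other modifications of the primary transcript are not modeled.

```python
def splice(primary_transcript: str, introns: list[tuple[int, int]]) -> str:
    """Remove intron intervals (0-based start, end-exclusive) and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(primary_transcript[position:start])  # exon before this intron
        position = end                                     # skip over the intron
    exons.append(primary_transcript[position:])            # final exon
    return "".join(exons)


# Toy primary transcript: exon 1 + one intron + exon 2 (made-up sequences).
exon1, intron, exon2 = "AUGCC", "GUAAGUUUCAG", "GUAA"
primary = exon1 + intron + exon2
mrna = splice(primary, [(len(exon1), len(exon1) + len(intron))])
print(mrna)  # AUGCCGUAA
```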

3. How does a specific mRNA get translated to make a protein of specific amino acid sequence?

See Figures 1.15 and 1.16, Table 1.1 (page 20), and gene expression handout.

When an mRNA molecule enters the cytoplasm (from the nucleus), a ribosome binds to the 5' end of the mRNA and "finds" the first AUG sequence. The specific tRNA that recognizes this sequence then joins the complex. This tRNA (which recognizes the "start codon", AUG) has methionine covalently attached to it. The methionine was put onto this tRNA earlier by a specific "charging enzyme" (in this case, "methionyl-tRNA synthetase").

Once this initiation complex is formed, the second tRNA can diffuse in and hydrogen bond to the second codon. Then there is an energy-driven formation of a covalent "peptide bond" between the two amino acids, and a shift in the position of the ribosome relative to the mRNA. After this positional shift, the next tRNA can come in, and the process continues, with one more peptide bond formed in each cycle of the steps. So, codons are read in the 5' to 3' direction, and the protein is synthesized in the "amino" to "carboxyl" direction. This continues until a "stop codon" (UAG, UAA, or UGA) is reached, at which point the ribosome-mRNA complex falls apart, releasing the newly made protein.
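The reading of codons can be sketched as a short Python function: find the first AUG, then read three bases at a time until a stop codon. This is a simplified illustration, not a model of the ribosome or the tRNAs. The names `CODON_TABLE` and `translate` and the example mRNA are assumptions made for the sketch. The table is the standard genetic code in one-letter amino acid symbols, with `*` marking the three stop codons.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon (or the end of the mRNA)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is made in this sketch
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG, or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical mRNA, written 5'->3': GG | AUG CCG UGG UAA | GC
print(translate("GGAUGCCGUGGUAAGC"))  # MPW
```

The output string is written in the amino-to-carboxyl direction, starting with methionine (M).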

Numerous ribosomes (right behind each other) can carry out this process on a single mRNA molecule, thus resulting in the synthesis of numerous copies of the protein.

Problem S-1: "Gene Size": What is the approximate size of a "typical" human gene? Here are some basic facts that will allow you to calculate this. 1. A "typical" human protein (i.e., of average size) has a molecular weight of about 60,000 daltons. 2. The 20 different amino acids found in proteins have individual molecular weights that average out to about 120 daltons. 3. An average of about 10% of a typical human gene is "coding", while the remaining about 90% is "non-coding" (the region before the translational start codon, plus the sum of all of the intron regions, plus the region after the translational stop codon).

Using the data given above, calculate the following:

(a) What is the number of amino acids in an "average size" human protein molecule?

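(b) What is the approximate size of the actual coding region (in base pairs) of the gene for this protein?

(c) What is the approximate total size (in base pairs) of the gene for this protein, including its non-coding regions?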

(d) Assuming that the sizes of these genes average out to the number calculated in part (c), what is the overall fraction (expressed as a percent to the nearest integer) of our genome that is "regions within genes" and what is the overall fraction that is "regions not within genes"?

(e) What is the overall fraction (expressed as a percent to the nearest integer) of our genome that is "coding regions within genes"?
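One way to organize the arithmetic for Problem S-1 is sketched below in Python. The variable names are invented for the sketch. The inputs are the round numbers given in these notes (60,000 daltons, 120 daltons, 10% coding, about 25,000 genes, a haploid genome of 3 billion base pairs), so the results are rough estimates only. The sketch assumes three base pairs per amino acid, ignores the stop codon, and takes the whole-gene size to be what the missing part (c) asks for.

```python
PROTEIN_DALTONS = 60_000          # "typical" human protein
AMINO_ACID_DALTONS = 120          # average amino acid
CODING_PERCENT = 10               # percent of a typical gene that is coding
GENE_COUNT = 25_000               # approximate number of human genes
GENOME_BP = 3_000_000_000         # haploid genome size in base pairs

# (a) amino acids per protein
amino_acids = PROTEIN_DALTONS // AMINO_ACID_DALTONS

# (b) coding region: three base pairs per codon (stop codon ignored)
coding_bp = amino_acids * 3

# Whole gene, coding plus non-coding regions
gene_bp = coding_bp * 100 // CODING_PERCENT

# (d) percent of the genome within genes, and not within genes
genic_percent = 100 * GENE_COUNT * gene_bp / GENOME_BP
non_genic_percent = 100 - genic_percent

# (e) percent of the genome that is coding sequence within genes
coding_percent_of_genome = 100 * GENE_COUNT * coding_bp / GENOME_BP

for label, value in [
    ("amino acids per protein", amino_acids),
    ("coding region (bp)", coding_bp),
    ("gene size (bp)", gene_bp),
    ("% of genome within genes", genic_percent),
    ("% of genome not within genes", non_genic_percent),
    ("% of genome that is coding", coding_percent_of_genome),
]:
    print(f"{label}: {value}")
```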