Transmission of genetic information occurs by means of DNA (or RNA in some viruses). Information is coded in the DNA molecule by means of nitrogenous bases, which occur in pairs. Two kinds of pairs are possible: A-T (adenine + thymine as the nitrogenous bases) and C-G (cytosine + guanine as the nitrogenous bases). The precise order of these base pairs codes the information, much like the zeros and ones code information for computers.
Code contained in the DNA is expressed (produces an effect in the organism) through the mechanisms of transcription and translation. During transcription, a messenger RNA (mRNA) molecule is produced, which has bases complementary to those in the DNA. Again, using the computer analogy, a "1" in the DNA is represented as a "0" in the RNA, and vice versa. In the next step (translation), the information coded in the mRNA molecule is used to assemble a protein, by specialized intracellular organelles called ribosomes. Proteins are chains of amino acids, and the order of the amino acids is specified by the mRNA code. Different mRNA molecules produce different proteins.
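The two steps described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are made up for this sketch, the codon table is a tiny excerpt of the standard genetic code, and strand direction is ignored for simplicity.

```python
# Complement rules for copying a DNA template strand into mRNA
# (RNA uses uracil, U, where DNA uses thymine, T).
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

# A deliberately tiny excerpt of the standard genetic code.
CODON_TABLE = {
    "AUG": "Met",
    "UUU": "Phe",
    "GGC": "Gly",
    "GCA": "Ala",
    "UAA": "STOP",
}


def transcribe(template):
    """Build the mRNA that is complementary to a DNA template strand."""
    return "".join(DNA_TO_MRNA[base] for base in template)


def translate(mrna):
    """Read the mRNA three bases at a time and list the amino acids."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


template = "TACAAACCGATT"  # hypothetical template strand
mrna = transcribe(template)
print(mrna)             # AUGUUUGGCUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly']
```

Changing a single base in the template can change one codon, and therefore one amino acid in the protein, which is why a single-base mutation can sometimes have a visible effect.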
Proteins are the workhorses of the cell. They are responsible for much of the functionality within the cell, for example:
- as enzymes which catalyze various metabolic processes, such as energy generation, the production of all other molecules manufactured by the cell, secretory products such as hormones, etc.
- as structural elements, for example, by forming channels in the cell membrane, which allow the exchange of specific molecules between the cell and its external environment.
- as contractile elements, by providing motive power to move things from one place to another within the cell, or to contract the entire cell (for example, muscle cells).
Because of the central role of proteins in so many cellular functions, the control of protein synthesis by DNA in fact controls all facets of cell behavior, and on a higher level, the organism itself.
Regions of the DNA that code for proteins are called genes. Genes usually compose only a small part of the DNA. The larger fraction of DNA is non-coding. Sometimes the non-coding regions are referred to as "junk DNA", but this is a misnomer. Non-coding DNA is not all junk; in fact, it mediates a number of regulatory functions on the coding regions. We are still in the process of discovering the functions of these regions, so the fact that the function of a specific region is not known does not mean that it has no function.
Changes in the DNA code are called mutations. Since DNA is the mode of transmission of genetic information from parent to offspring, mutations in the DNA are heritable, that is, the progeny inherits the mutations and passes them along to their offspring, in turn. A mutation in a single base pair (for example, the base pair A-T is switched to C-G) is called a Single Nucleotide Polymorphism, or SNP.
The effect of an SNP depends on a number of factors. If the SNP mutation affects a non-coding region, it may not show any effect in the phenotype (meaning, the organism shows no effect of the mutation). At other times, an effect may be observed, but it might be an effect which has no influence on the survival of the organism. Such effects are not under selection pressure, because they neither hinder nor aid survival. SNPs in critical regions will have an effect on survival. If they are serious enough, they might kill the organism, or harm its chances of reproductive success. In such cases, there is a negative selection pressure on the SNP, and in the natural course of affairs the SNP will die out, since organisms which contain the mutation are not successful at competing with those who are free of it. Very rarely, an SNP may have a beneficial effect. Beneficial mutations are under positive selection pressure, and tend to propagate strongly through future generations. It is estimated that humans have about 10 million SNPs (across the total population of humans; no single human has all the SNPs), of which about 1% are probably functional (have some effect on the phenotype, and therefore may be under selection pressure).
Since SNPs can last through thousands of generations, they can be used as markers to trace an individual's ancestry. If a certain SNP is present in a number of individuals, it can be deduced that they have a common ancestor. Of course, it's possible that the same SNP may have occurred twice, in two unrelated populations, although the chances for this are low. Because of this, and other confounds, geneticists don't rely on a single SNP. A more complete DNA analysis which takes into account several known SNPs will provide more consistent results.
Humans have 46 chromosomes, in 23 pairs. We say "pairs" because humans are diploid, meaning that each chromosome has a corresponding partner chromosome. Pairs 1 to 22 are known as the autosomes, while pair 23 is termed the sex chromosomes. The two sex chromosomes in females are alike: both are of type "X". In males, the two chromosomes in this pair differ: one is "X" while the other is "Y".
The Structure of Chromosomes
All cells in the body are diploid, except the sex cells. Sex cells, or gametes (the ovum and the sperm) are haploid, containing only 1 part of each pair (23 chromosomes total, instead of 46). Since males have two different types of chromosomes in pair 23, sperm can have either the X chromosome or the Y chromosome. Sperm cells are therefore a mix of two different types, X and Y, and the sex of the offspring depends upon whether the ovum was fertilized by sperm containing an X chromosome (in which case the offspring is female) or a Y chromosome (which produces male offspring). Upon fertilization, the haploid ovum and sperm combine. Each contributes 23 chromosomes, which results in a total of 46 chromosomes in the zygote, which is the same number as each of its parents.
The process of fertilization creates two problems for geneticists trying to analyze SNPs in a population. The first problem is that since somatic cells are diploid while gametes are haploid, each gamete (ovum or sperm) contains only half the DNA of the parent. That is, it contains only one chromosome from each pair. The two chromosomes in a pair are complementary, but not identical. Since it cannot be predicted which chromosome of a given pair will be contained in the ovum and sperm that takes part in the fertilization process, there is no way of knowing which of the parent's traits will be inherited by the offspring.
For example, consider chromosome pair number 17 in the mother. It consists of 2 chromosomes; let's call them 17A and 17B. Suppose there is an SNP in chromosome 17A. However, since gametes are haploid, each of the mother's ova will contain either a copy of 17A or 17B, but not both. Which type of ovum gets fertilized is random, so the child may or may not inherit the SNP.
The second problem is that during the formation of gametes, the two chromosomes of each pair exchange portions of their DNA, in a process called recombination, or crossing-over. Therefore, the chromosomes passed on to the offspring are not identical to those of the parent. Each one is a patchwork containing DNA from both of that parent's own parents. There is no way to tell exactly which part of the DNA came from which ancestor, especially when the ancestors are long dead and their DNA cannot be analyzed and compared. Since geneticists are trying to use SNPs to trace ancestry over tens or hundreds of generations, it's not just a question of parents, but all ancestors going back over several generations.
Both of these problems can be avoided by using the following methods.
- since the Y chromosome has no matching partner (males carry one X and one Y), the bulk of the Y chromosome does not take part in recombination. The Y chromosome in the son is an essentially identical copy of the Y chromosome in the father. Since the father has only a single type of Y chromosome, the first problem is also avoided because all Y-bearing sperm will contain only that type of Y.
- besides the nucleus, DNA is also present in mitochondria. All mitochondria in the zygote normally come from the mother only, since the mitochondria of the sperm are not passed on to the offspring. Mitochondrial DNA does not take part in recombination, so mitochondrial DNA in the offspring is identical to that from the mother.
This provides geneticists with a means to avoid the randomness created by not knowing which of the two chromosomes of each pair were inherited by the offspring, and also the randomness from the recombination process. If you only compare SNPs in a population by comparing the Y chromosome, or mitochondrial DNA, both problems are avoided.
Any specific inherited SNP is known as a haplotype. It's called haplotype because it's only present in one chromosome of the pair, that is, in a haploid state. Haplotypes can be in any of the 46 chromosomes, but because of the reasons described above, when doing population genetics the term usually refers to the Y-chromosome haplotype, that is, an SNP present in a Y-chromosome. Using Y-chromosome haplotypes, the ancestry of any male can be traced through his paternal lineage. A group of people who share similar haplotypes is known as a haplogroup. All individuals in a given haplogroup have a common ancestor at some point in time. If a sufficient number of individuals are examined for their haplotypes, maps can be created, showing the distribution of various haplogroups across the world. These maps are called Haplotype Maps, and they provide a variety of interesting information, such as the ethnic makeup of a population, the history of past migrations of people to different parts of the world, etc.
In the same way, the maternal lineage can be traced back through mitochondrial DNA. The term "haploid" has no significance for mitochondrial DNA, but because of the parallels to the process of grouping individuals by common ancestor, the term "haplotype" is conventionally applied to this as well. Thus there are Y-chromosome haplotypes, which trace patrilineal ancestry, and mitochondrial haplotypes, which trace matrilineal ancestry.
How Haplotypes are Created
As mentioned above, the fate of an SNP can vary. SNPs that confer neither advantage nor disadvantage in terms of survival and reproduction are transmitted to offspring, and their subsequent fate depends upon genetic drift. Otherwise, they will be selected for or against, depending upon whether they aid or harm the individuals containing them. SNPs that last (those that don't kill the person or disable him/her seriously) divide the population into two groups when they appear: those who have the SNP and those who don't. To give an example, let's say an SNP appears in an individual "P" at some point in time. This SNP will be inherited by his children and their progeny. The progeny will form a haplogroup, all tracing their ancestry back to ancestor P. Let's call this haplogroup "Haplogroup A". The general population at this time is divided into two segments: Haplogroup A, and non-Haplogroup A, the difference being that people in Haplogroup A are descended from individual P, and everyone else is not.
Subsequently, a different SNP appears in individual "Q" who was in non-Haplogroup A. This SNP is inherited by his progeny, who now form Haplogroup B. The population now consists of Haplogroup A, Haplogroup B and non-Haplogroup A-B. Some time later, a third SNP appears in an individual "R" who belonged to Haplogroup A. It's inherited by his progeny. They were already Haplogroup A, but this new mutation separates them from other people in Haplogroup A who didn't have that mutation. We'll call these people Haplogroup A+C. The C haplotype is only present in descendants of "R", who are a subset of everyone in Haplogroup A, which descended from a more distant ancestor, "P".
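The P, Q and R example can be written out as a small paternal family tree. The sketch below is purely illustrative: apart from P, Q and R, the names of the men are invented, and each man's haplotype is found by walking up his father's line and collecting every SNP that first appeared in one of his ancestors.

```python
# Hypothetical paternal tree: each man maps to his father.
# "root" stands for the ancestral population before any of the SNPs arose.
FATHER = {
    "root": None,
    "P": "root",
    "Q": "root",
    "other": "root",
    "p_son": "P",
    "R": "p_son",
    "r_son": "R",
    "q_son": "Q",
}

# The SNP that first appeared in a given man, using the labels from the text.
NEW_SNP = {"P": "A", "Q": "B", "R": "C"}


def haplotype(man):
    """Collect the SNPs a man carries by following his paternal line upwards."""
    snps = set()
    while man is not None:
        if man in NEW_SNP:
            snps.add(NEW_SNP[man])
        man = FATHER[man]
    return snps


for name in ["p_son", "q_son", "r_son", "other"]:
    print(name, sorted(haplotype(name)))
```

In this sketch, `r_son` carries both A and C, `p_son` carries only A, `q_son` carries only B, and `other` carries none of the three, which matches the groups described in the text.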
Over time, the accumulation of SNPs can divide the population into many types and subtypes, as shown in the diagram below:
The development of SNPs (Single Nucleotide Polymorphisms) over time, and the evolution of populations with particular SNP markers
Over time, it becomes difficult to tell who the ancestral population is, since everyone has accumulated a lot of SNPs. However, it's easy to tell who is related to whom, since any two people sharing a haplotype must have a common ancestor at some point in time. Since people have accumulated multiple SNPs through various ancestors over a long period of time, a given individual belongs to several haplogroups.
If we create a haplotype map of a large sample of people from all parts of the world, it's possible to show how different groups within the population are related by ancestry. These haplotype maps contain information about the order in which SNPs appeared through time. For example, if we find 3 groups of people, who have Haplotype A, Haplotype A + C, and Haplotype non-A-non-C, then we can deduce that Haplotype C must have appeared after Haplotype A. This is because:
- the presence of haplotype A+C shows that either Haplotype C first appeared in an individual of Haplotype A, or Haplotype A first appeared in an individual of Haplotype C.
- if Haplotype A first appeared in an individual of Haplotype C, then there should be some people in the population who are Haplotype C but not Haplotype A, because only one individual in Haplotype C had the mutation that led to Haplotype A, so it exists only in the progeny of that individual. There should be other individuals of Haplotype C who did not have that mutation, and their progeny won't have Haplotype A. However, Haplotype C alone (without A) does not exist in the population.
- conversely, if Haplotype A appeared before Haplotype C, then there should be individuals around who have only Haplotype A but not Haplotype C. This is in fact the case, therefore Haplotype A arose before Haplotype C.
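The deduction in the list above can be expressed as a simple check over the haplotypes observed in a sample. This is a minimal sketch under the same idealized assumptions as the text (each SNP arose only once and was never lost); the function name and the sample data are made up for illustration.

```python
def arose_first(x, y, population):
    """Return whichever of SNPs x and y must have appeared first, or None.

    `population` is a list of sets, one set of SNP labels per sampled man.
    """
    both = any(x in h and y in h for h in population)
    x_alone = any(x in h and y not in h for h in population)
    y_alone = any(y in h and x not in h for h in population)

    if not both:
        return None  # the two SNPs never occur together, so no order follows
    if x_alone and not y_alone:
        return x     # y is only ever seen on top of x, so x is older
    if y_alone and not x_alone:
        return y
    return None      # the sample does not settle the question


# The three groups from the text: A only, A + C, and neither.
sample = [{"A"}, {"A", "C"}, set()]
print(arose_first("A", "C", sample))  # A
```

Real analyses use many SNPs at once and have to allow for sampling gaps and repeated mutations, so this check is only the core of the idea.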
Please note that the example above is just an example; it's not meant to describe the actual order in which modern haplotypes evolved. Since modern haplotypes are also designated by letters of the alphabet, there might be some confusion on this point.
The actual order in which haplotypes evolved is shown by the map to the right. Using such maps, it is possible to trace the last common ancestor of all men today, sometimes called the "Y-chromosomal Adam". Note that the last common ancestor is not the "first" man who ever lived. Indeed, evolution says that there was no such thing as "first man". A species does not start from a single individual, but rather from a sub-population that becomes geographically separated from its parent population, and evolves separately from it, gradually accumulating sufficient differences from the parent population over time so that it can be called a new and different species.
Since examining haplotype maps alone only tells us the order in which the mutations occurred and not the exact date of mutations, more information is needed before we can chronologically identify a "Y-chromosomal Adam". This information comes from our understanding of the rate at which mutations accumulate, usually by studying DNA which does not change in other ways (such as through recombination). Y chromosomal DNA and mitochondrial DNA are good study subjects in this respect, and by observing and analyzing such DNA across a large range of individuals of several species, it's possible to estimate the rate at which mutations accumulate. This rate of change is sometimes called the molecular clock, and it remains an active area of research.
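As a rough illustration of how a molecular clock turns mutation counts into dates, the sketch below divides the number of differences between two Y chromosomes by the rate at which differences are expected to build up. The function name, the mutation rate and the counts are all hypothetical placeholders chosen for easy arithmetic, not measured values, and real estimates involve far more careful statistics.

```python
def years_since_common_ancestor(differences, sites, rate_per_site_per_year):
    """Estimate the time back to the common paternal ancestor of two men.

    Both lineages accumulate mutations independently after they split,
    so differences between them build up at twice the single-lineage rate.
    """
    expected_differences_per_year = 2 * sites * rate_per_site_per_year
    return differences / expected_differences_per_year


# Hypothetical numbers: 20 differences found across 10 million compared
# sites, with an assumed rate of 1e-9 mutations per site per year.
estimate = years_since_common_ancestor(20, 10_000_000, 1e-9)
print(round(estimate))  # 1000
```

Because the estimate is inversely proportional to the assumed rate, any uncertainty in the rate carries straight through to the date, which is one reason the dates quoted for haplogroups are so approximate.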
From these calculations, it appears that Y-chromosomal Adam lived about 60,000-90,000 years ago in Africa. The date is very approximate, since such calculations are not exact. But even allowing for a large fudge factor, it can be seen that this was a long time after humans first evolved. Modern humans first appeared in Africa sometime between 195,000 and 160,000 years ago. So humans had been around for about a hundred thousand years before the lifetime of Y-chromosomal Adam.
The concept of "Y-chromosomal Adam" simply means that all men today are descended from a single male ancestor who lived in Africa about 60,000 - 90,000 years ago. What happened to the lineages of the other men who lived at this time, and were his contemporaries? Surely there must have been many other men who also had children. What happened to those kids?
The answer is simple if you consider how the Y-chromosome is transmitted, from father to son. Daughters do not have the Y-chromosome. Consider a man who produces one son and one daughter. His son inherits his Y-chromosome, the daughter does not. If that son then has a son of his own, the Y-chromosome is transmitted to the third generation, but if he doesn't, then it is lost. It takes only a single generation in which there are no offspring, or the offspring are all daughters, for the Y-chromosome of that lineage to be lost forever. Somewhere in the 60,000+ years between us and Y-chromosomal Adam, this happened to all other lineages. There was a generation which produced no offspring, or which produced only daughters.
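The gradual loss of lineages described above can be seen in a toy simulation. The sketch below makes strong simplifying assumptions that are not part of the original discussion: the number of men stays constant, generations do not overlap, and every man in a new generation has a father picked at random from the previous one. Each man is tracked only by the label of his founding paternal ancestor.

```python
import random


def surviving_lineages(n_founders=50, generations=200, seed=0):
    """Count the distinct founding Y lineages left after each generation."""
    rng = random.Random(seed)
    # Generation 0: every man is his own founder.
    men = list(range(n_founders))
    history = [len(set(men))]
    for _ in range(generations):
        # Each man in the next generation inherits the lineage label of a
        # randomly chosen father; men who are never chosen leave no sons.
        men = [rng.choice(men) for _ in range(n_founders)]
        history.append(len(set(men)))
    return history


history = surviving_lineages()
print(history[0], history[10], history[50], history[-1])
```

The count of surviving lineages can never go up, only stay the same or fall, because a lineage that leaves no sons in one generation cannot reappear later. Given enough generations, a single founder's lineage is all that remains, even though none of the founders was special in any way.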
The specific Y-chromosome haplogroups that exist today are shown in the diagram above. Following is a brief description of some of them. Please remember that the percentages will not add up to 100%. In fact, in many cases they will exceed 100%, since any single man can belong to multiple haplogroups at the same time, being descended from a number of patrilineal ancestors at different times in his family history.
Groups A and B: these are found only in Sub-Saharan Africa (and people who have ancestry in that region but have since moved, such as many Africans affected by the slave trade). Since these are the oldest haplogroups, it makes sense that they are found where humans first originated. The first haplogroup that branched off was Haplogroup A, with the defining mutation M91. Today, Haplogroup A is found among the Khoisan people of Africa, as well as some Ethiopians and Nilotes. Humans who are not Haplogroup A are lumped into Haplogroup BT, which is composed of Haplogroups B and CT. The super-group BT is defined by mutations M42, M94, M139 and M299, all of which date to about 55,000 years ago. Haplogroup B was the next to diverge, defined by mutation M60, and is today found in Africa among the Pygmies and Hadzabe people in central Tanzania.
Haplogroups CT and CF: Everyone who doesn't belong to Haplogroup A or B belongs to the super-group CT, which is defined by mutations M168 and M294. This is the branch of humanity that first left Africa, although the mutation that led to CT probably happened in Africa about 50,000 - 60,000 years ago, in a man from East Africa, who's been called the "Eurasian Adam", since his descendants include all Eurasian and Asian people today. Haplogroup CT was probably the group involved in the early migration out of Africa, which spread along the southern coast of the Arabian peninsula, Iran, Pakistan, India, and all the way to southeast Asia. Here are some maps showing these early migrations. As you can see, the dates calculated from the genetic evidence correspond only roughly to dates determined by the archeological evidence. These are not yet exact sciences in terms of dating, so we have to reckon with some degree of uncertainty. Haplogroup CF diverged soon after, also in Africa, and is defined by mutation P143. Both super-groups CT and CF are found all over the world, including the Americas.
Haplogroup C: emerged from CT, via CF, and is defined by mutations M130 and M216. It's divided into various subgroups:
- C1 (M8, M130, M216): presently found in Japan
- C2 (M38): presently found in southeast Asia, Polynesia
- C3 (M217, P44): presently found in native Americans, and groups related to them in Asia, such as Mongols, Kazakhs, Eskimos and other Asian people of the far north, and paleo-Siberians
- C4 (M347): presently found in the aboriginal inhabitants of Australia
- C5 (M356): found in some parts of India
Haplogroup C is therefore spread over a large region of Asia and Australia, as well as the North and South American continents. However, it represents the paths of the early human migrations, which were mostly coastal. So the actual number of people with Haplogroup C is not that large, since it's mostly prevalent only in people originating from the coastal regions, and the few Native Americans and Australian Aborigines left alive today.
Haplogroup DE: This is a small paragroup defined by mutations M1, M145 and M203, which originated about 65,000 years ago. There are 3 subtypes: the base paragroup DE, which is extremely rare, Haplogroup E (mostly Africa) and Haplogroup D (mostly east Asia). The Asian lineages (DE/D) are found primarily in Tibet and the Andaman Islands, but are absent in India. Since India was a major source for the radiation of humans, the absence of DE in India raises some confusing questions about the exact relationship between DE, E and D. One theory is that harmful mutations in the DE line subsequently led to its disappearance in India and other parts of Asia. Haplogroup D is present in India, in some isolated northeastern tribes. Haplogroup E, which probably originated in east Africa, is fairly common there today. E1b1a, a subclade of E, is probably the most common of them all, and very prevalent in Sub-Saharan Africa. E1b1b is the only E subclade which is found outside Africa (other than regions inhabited by descendants of the slave trade), including North African Arabs and Berbers, eastern Europeans, and southern Europeans (probably due to the Moorish conquests).
Haplogroup F: This is the largest Haplogroup existing today, if we include its subclades (haplogroups which are sub-branches of F). This group is defined by a whole range of mutations, including P14, M89, M213, P133, etc. About 90% of all men alive today belong to some branch of Haplogroup F, which basically means everyone outside Africa, who does not belong to Haplogroup C (and a very small number of DE/D individuals). This group originated somewhere in the region between north Africa, the Levant, and the Arabian peninsula, probably about 40,000 - 50,000 years ago. It corresponds to the second wave of emigration out of Africa (or out of the near east, if it originated there), the first being Haplogroup C. The major subclades in Haplogroup F are G, H and IJK, which are discussed in more detail below. But in addition, there are also minor subclades, which are found in relatively small populations in southern India and Sri Lanka, Korea, and the Yunnan province in southwestern China.
Haplogroup G: this originated from Haplogroup F, probably somewhere in the middle east, about 20,000 years ago. It's defined by mutation M201. Today, it's found in various parts of central Asia, the middle east, India, south Asia (especially the Malay archipelago). About 11% of men in southern Russia and the Ukraine belong to Haplogroup G, while up to 60% of Ossetian males in the Caucasus region have it. The average incidence in Europe is about 5%, though it is mostly concentrated in the south. In Asia, it's found most often in Iran, Turkey, the Pashtun parts of Afghanistan, as well as smaller numbers farther south in southern Asia and north Africa.
Haplogroup H: this is another subclade of Haplogroup F, defined by mutation M69. It probably arose in India, about 20,000 to 30,000 years ago. Today, it is found almost exclusively in the Indian subcontinent, and among the Roma people (gypsies) of Europe, who are thought to be migrants from India. The incidence ranges from 27% in India and 25% in Sri Lanka to about 10% - 20% among different ethnicities in Pakistan, and about 6-12% in Nepal. Outside the subcontinent, it is very rare. Some other people who have it include about 6% of the Kurds in Turkmenistan, 4% from a sample of Iranians in Samarkand, 2% from a sample of Uzbeks in Bukhara, 3% from a sample of Uzbeks in Khorezm, as well as some Tajiks, Serbians, Syrians, etc. It may well have spread to these regions from India as little as a thousand years ago, during the period of the Khwarizm empire.
Haplogroup IJ: This is a paragroup of Haplogroups I and J. No actual individuals belonging to IJ have been found, but its presence is inferred from the commonalities of Haplogroups I and J. Haplogroups I and J are quite common across western Asia and Europe, and through migration, the Americas. Some of the subclades are:
- Haplogroup I1: mostly Scandinavia and northwestern Europe, some also found in eastern Europe
- Haplogroup I2: southern Slavic people - Balkans (Serbia, Croatia, Bosnia, Slovenia), Moldavia, southern Russia, Sardinia
- Haplogroup J1: middle east, Dagestan, semitic people in north and east Africa
- Haplogroup J2: Mesopotamia (Iraq, Syria, Anatolia), southern Europe, central and southwest Asia, North Africa, the Caucasus, south Asia
A variant of Haplogroup J2 is also found in parts of the Mediterranean, including southern Italy and Greece. This is possibly the result of ancient Greek colonies which once spread around the edges of the Mediterranean.
Haplogroup K: this is defined by mutation M9, and probably arose in southwest Asia about 40,000 years ago. This is a very diverse group today, and contains a number of major clades (L - T), as well as some minor clades. The major clades are described separately below, so this is only a brief discussion of the minor clades K1 - K4. These clades are found in northern and eastern Eurasia, and Melanesia. There is some small distribution across southwest Asia, north Africa and Oceania as well, and some distribution in the Americas. All of these subclades are found at relatively small frequencies. There was a major reclassification of Haplogroup K in 2008. The original subclades K1 and K7 were assigned to a newly reconstructed Haplogroup M, while the original subclades K3, K4 and K6 were renamed K1, K2 and K3. The descriptions above refer to the new system.
Haplogroup L: this is defined by mutation M20, and probably first appeared in south Asia about 30,000 years ago. It's commonest in parts of south Asia, including India (7-15%), Pakistan and Baluchistan (28%), the Kalash and Burusho people of Pakistan (12-23%). It's found at lower frequencies in west Asia, including Iran, Iraq and Turkey (3-4%), Armenians and Balkarians (5-9%), the Caucasus (about 9% of Avars), and in central Asia - about 9% among Pamiris, Bukharans, and Tajiks, 2-4% among Uzbeks, Uyghurs, Tartars. It's usually divided into 3 subclades, with L1 being the commonest in India and Sri Lanka, L2 in central and southwest Asia, and L3 among the Burusho/Kalash/Pashtuns of Pakistan.
Haplogroup M: This was modified in 2008 as described above. The defining mutation is P256, and it appeared about 10,000 - 30,000 years ago. The subclade M1 is found mostly in Papua, M2 in the Solomon Islands and Fiji, and M3 in Melanesia. M1 is the commonest Y-chromosome haplogroup in Western New Guinea.
Haplogroup S: is defined by mutations M230, P202 and P204. This was originally known as Haplogroup K5 and was reassigned to Haplogroup S in 2008. It's a small group that occurs mostly in the highland regions of Papua New Guinea, as well as in adjacent regions of Indonesia and Melanesia.
Haplogroup T: is defined by mutation M70. This was originally Haplogroup K2, and has now been reassigned to T. This is a low frequency group with a spotty distribution, occurring in small numbers across various parts of Asia and Europe. Some regions include:
- Southern India: especially among a few Dalit tribes of Dravidian people in the south
- Sicily: about 18% of Sicilians
- Spain: about 16% of Spaniards in the Ibiza region
- Africa: found in Somalia, Cameroon, and Egypt
- Smaller frequencies are found in isolated regions of Europe, including southwestern Russia, Kuban Cossacks, etc.
Haplogroup NO: this is considered part of the paragroup NOP. It's thought that the NO mutation (M214) occurred in a man who lived somewhere near the Aral Sea about 35,000-40,000 years ago. Collectively, this is a very large group, spread through much of north (N) and eastern (O) Eurasia. Haplogroup NO (without the N or O mutations) is quite rare, and found sporadically among some eastern Asian people, including the Japanese, Koreans, Han Chinese, Huis and Yaos. This ancestral group to N and O is almost extinct, for reasons that are not well understood. However, it has produced two very successful groups N and O, which are described below.
Haplogroup N: is defined by mutation M231, and it first appeared in east Asia about 15,000 to 20,000 years ago, during the last ice age. Since then, it has been widely spread across northern Eurasia, possibly by small bands of some Uralic language speaking men. Today, the highest frequencies are found in the Finnic and Baltic people of northern Europe and the Ob-Ugric and Samoyedic people of western Siberia. Haplogroup N shows evidence of several bottlenecks and founder effects, that is, it did not spread in a continuous manner but rather due to the migration of small bands of men. The group probably arose first in eastern Asia, and these people migrated northwards through north China and into the Altai region of eastern Siberia. From there, various subclades spread westwards, reaching as far as Scandinavia. Finns (60%) and Lithuanians (40%) show high frequencies of N. Various subclades of N are also found in Sino-Tibetan people, Japanese and Koreans, Han Chinese, Vietnamese, as well as some Turkic people of west Asia.
Haplogroup O: This is an older cousin of N, and probably arose in eastern central Asia or Siberia about 35,000 years ago. It's defined by mutation M175. Today, it is very widespread, comprising about 80-90% of all men in east and southeast Asia. It is also somewhat localized, occurring in much lower frequencies outside these regions. Among the various subclades of O there are:
- O1a: southern Han Chinese, Kradai, Austronesians
- O2a: Malays, Indonesians, some other southeast and east Asians (Cambodians, Vietnamese, Taiwanese aborigines), as well as tribals from Andhra Pradesh in India.
- O2b: Koreans, Japanese, south east Asians
- O3: Sino-Tibetans, as well as east Asians
There is no significant distribution of any O subclade outside east and southeast Asia.
Haplogroup P: This is defined by mutation M45, and is a large group consisting of the patrilineal ancestors of most Europeans and almost all the indigenous people of the Americas, as well as about 1/3 to 2/3 of the males from different populations in central and south Asia. It probably arose somewhere in central Asia about 35,000 to 40,000 years ago. Besides the two major subgroups deriving from P (Haplogroups Q and R), there are some minor subclades, which are found in central and east Asia, Siberia, and the Russian Far East. These populations are somewhat rare. The two major groups include Q (Siberian + Native American) and R (European + South Asian), and they are dealt with separately below.
Haplogroup Q: is defined by mutation M242. It arose in Siberia about 15,000 to 20,000 years ago. Today, it consists of two major populations in Siberia and North America, as well as several minor subclades spread across Eurasia. About 15,000 years ago, a population of Haplogroup Q individuals migrated from Siberia into the Americas and later became established there. They underwent a subsequent mutation (M3), which puts them in a separate subclade within Q. In Asia and Europe, this group occurs in roughly a triangular region with Norway, Iran and northern China as the 3 corners of the triangle. Frequencies of Q in this region are generally small, ranging from about 2% to 10%. Additionally, there is an ancestral form of Q as well as two distinct subclades (Q4 and Q5) that are only found in the Indian subcontinent. It's believed that there was a migration of the original Q population in central Asia to India, and the subclades Q4 and Q5 developed and became localized in India.
Haplogroup R: This is defined by the M207 mutation, and first arose about 27,000 years ago in south or western Asia, with the probable origin being in either India, Pakistan or Iran. It has spread through much of western Asia, India and Europe since then, as well as some parts of northern Africa. This group is divided into two major subclades, R1 and R2. R1 is further divided into R1a and R1b. The history and spread of these subclades is somewhat uncertain. However, this is one commonly accepted reconstruction:
- R appeared in the central Asian steppes about 35,000 to 30,000 years ago. Shortly after, the subclade R1 appeared, which moved westwards into Europe in small pockets, from which the subclades R1a and R1b later appeared.
- R2 appeared sometime after in the ancestral population in the central Asian steppes. By some unknown route, it reached India about 25,000 years ago, and spread throughout India (both northern and southern Indian populations). About 90% of R2 individuals are found in India, the other 10% being distributed in the Caucasus and central Asia. The greater frequency of R2 among the upper Hindu castes seems to indicate that it may have entered India more than once. In fact, the distributions of R1a and R2 in India are almost identical, which suggests that both groups moved to India at roughly the same time.
- R1a is odd, having two different distribution ranges, one in southeastern Europe (in the area of Ukraine), and another in northern India. The original theory was that it originated in Europe or central Asia, from the ancestral R population which probably lived in central Asia. However, more recent studies suggest that it goes back much farther in time in India, perhaps as much as 36,000 years. So it's possible that it arose in northern India, and then spread northwards towards Europe about 30,000 years ago. This fits in with the archeological record of migration from India to Europe around that time.
- R1b is the most common western European haplogroup. It probably arose either in southwestern Asia or Europe, and is found pretty much everywhere along the Atlantic coast of Europe (as well as the Americas due to migration). Much smaller frequencies are found in populations from western and central Asia, as far south as Pakistan.
Because of the much higher number of genetic tests done on Europeans, the R1b subclade has been divided into over a dozen smaller subclades, which can be used to trace the movement of people across Europe.
Mitochondrial DNA Lineage
Mitochondria are intracellular organelles that are responsible for the aerobic pathway of cellular respiration, which includes the tricarboxylic acid (TCA) cycle and oxidative phosphorylation. Mitochondria are unusual in that they contain extranuclear DNA.
Structure of a Mitochondrion
Lynn Margulis published a paper in 1967 suggesting that mitochondria were in fact once free-living cells that had become symbiotically associated with eukaryotic cells. This is known as the endosymbiotic theory, and was later expanded to include chloroplasts as well as mitochondria. Both chloroplasts and mitochondria contain DNA, which is distinct from the cell's own nuclear DNA. During cellular reproduction, mitochondria also replicate separately from the nucleus, and in doing so, their DNA is copied to daughter mitochondria.
This process of mitochondrial DNA replication does not involve any other cell as a DNA donor or for recombination; therefore the mitochondrial DNA of the daughter cell is an exact copy of the mitochondrial DNA of the parent cell (except for mutations which may result from errors in the replication process). Additionally, although sperm do carry a few mitochondria, these are normally destroyed after fertilization, so in practice the entire complement of mitochondria in the zygote comes from the mother. For this reason, mitochondria can be used to trace the matrilineal ancestry of individuals and populations, in much the same way as the Y-chromosomal lineage. Just as the Y-chromosomal lineage is passed on only by men, having no contribution from the mother, the mitochondrial lineage is passed on only by women, having no contribution from the father. It is transmitted from mother to offspring, and those offspring who are female in turn transmit it to their own children. Again, if the rate at which mutations accumulate can be calculated (the molecular clock mentioned above), then it would be possible to relate populations on the basis of their mtDNA ancestry.
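The molecular clock reasoning can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption (a constant mutation rate along both lines of descent); the function name and the numbers fed into it are made up for the example and are not measured values.

```python
def divergence_time_years(differences_per_site, rate_per_site_per_year):
    """Estimate years since two lineages shared a common ancestor.

    Mutations accumulate independently along both lines of descent,
    so the expected difference between two sequences is
    2 * rate * time, which gives time = difference / (2 * rate).
    """
    return differences_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical, illustrative inputs (NOT measured values):
example_rate = 2.0e-8        # substitutions per site per year
example_difference = 0.004   # fraction of compared sites that differ

estimated_years = divergence_time_years(example_difference, example_rate)
print(f"Estimated time to common ancestor: {estimated_years:,.0f} years")
```

With these made-up inputs the estimate comes out at 100,000 years. Real studies calibrate the rate against independently dated events, and the result is only as good as that calibration.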
In this way, everyone alive today can be linked back to the last common mt-DNA ancestor, a woman known as the "Mitochondrial Eve". Mitochondrial Eve lived about 8000 generations (roughly 170,000 years) ago, so she was not contemporary with Y-chromosomal Adam, who lived much later. In fact, the date 170,000 years ago almost coincides with the dawn of modern humans, who first appeared 160,000 to 195,000 years ago, according to the archeological record. So Mitochondrial Eve was among the early fully modern humans (Homo sapiens sapiens), living somewhere in Africa. Probabilistic studies point to a date of around 140,000 years ago for Mitochondrial Eve, with 120,000 years as the very latest possible date.
Like Y-Chromosomal Adam, Mitochondrial Eve is a necessary mathematical consequence of the way single-parent lineages are inherited. For any population over a period of time, some lineages will become more common and others will die out. The mitochondrial DNA lineage depends on direct transmission from mother to daughter, so any woman who has no children, or who has only sons, ends her own mtDNA line; when this happens to every woman carrying a particular lineage, that lineage is wiped from humanity forever. Over a period of time, as more and more lineages are wiped out due to chance, fewer lineages will remain, until all humans alive can be traced back to a single matrilineal ancestor. The "last common matrilineal ancestor" depends very much on the present population on which it's based. For example, 10,000 years from now, more of our ancestral lineages will have been wiped out, and therefore the "last common matrilineal ancestor" or Mitochondrial Eve may change -- she may be some other woman, who lived more recently than the current Mitochondrial Eve.
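The way chance alone whittles many maternal lineages down to one can be seen in a toy simulation. The sketch below assumes a constant number of women per generation, each inheriting her mtDNA from a randomly chosen woman of the previous generation; the population size, generation count and function name are arbitrary choices for illustration, not figures from human history.

```python
import random


def surviving_lineages(num_women, generations, seed=0):
    """Count founding maternal lineages left after each generation.

    Simplified model: the number of women is constant, and each woman
    inherits her mtDNA from a mother picked at random from the
    previous generation.
    """
    rng = random.Random(seed)
    # Each founding woman starts her own lineage, labelled 0..num_women-1.
    current = list(range(num_women))
    counts = [len(set(current))]
    for _ in range(generations):
        current = [rng.choice(current) for _ in range(num_women)]
        counts.append(len(set(current)))
    return counts


lineage_counts = surviving_lineages(num_women=50, generations=2000)
print("Lineages at start:", lineage_counts[0])
print("Lineages at end:  ", lineage_counts[-1])
```

The count can never go up, because a lost lineage has no one left to pass it on. Given enough generations a single founder's lineage remains, and that founder plays the role of "Mitochondrial Eve" for the final population.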
As with Y-Chromosomal Adam (mentioned earlier), there are some things to remember about Mitochondrial Eve. She is our last common maternal ancestor. This does not mean that she is our only ancestor, nor does it mean that she is our only female ancestor. Since it takes a male plus a female to reproduce, there is a whole line of paternal ancestry that is completely distinct from the maternal lineage. So it's obvious that she can't be our only ancestor, because half our genes have come from our fathers. Nor is she our only female ancestor. For example, people inherit genes from their father's mother, through their father -- and from his father's mother in turn, and so on. But our father's mother does not pass on her mitochondrial DNA to us. All mitochondrial DNA comes only from one's own mother. So while we are all descended in matrilineal lineage from Mitochondrial Eve, we do in fact have other female and male ancestors besides Mitochondrial Eve.
Since Mitochondrial Eve is simply a mathematical concept (last common matrilineal ancestor), the same concept can be applied to any mode of lineage. When applied to the Y-Chromosome, it identifies a last common patrilineal ancestor (the Y-Chromosomal Adam), who lived about 60,000 - 90,000 years ago. In a similar fashion, it could be applied to our last common female ancestor on our father's side, or the last common male ancestor on our mother's side. These lineages would similarly identify certain individuals in the past who are the common ancestors of all humanity today, for the type of lineage specified. These ancestors would all be different people, just as Mitochondrial Eve and Y-Chromosomal Adam lived tens of thousands of years apart. The problem is that while we have easy ways to identify Mitochondrial Eve and Y-Chromosomal Adam, we don't have easy ways to identify the "last common female ancestor on the father's side", because the nuclear DNA inherited through such mixed lines is reshuffled by recombination in every generation, and sorting out those effects has not been done.
It's also good to remember that the "Adam" and "Eve" terminologies can be quite misleading. As mentioned, these individuals were not contemporaries; they lived tens of thousands of years apart, in different regions of the African continent. Nor were they the only people alive at the time. They were both part of large populations of individuals. The other individuals in their groups also reproduced, and their descendants are also part of our history. For all we know, some of their unbroken maternal or paternal lines may have survived until as recently as 20 years ago. It's just that if you take a snapshot of humanity at this particular time, then all mtDNA lineages lead back to the individual known as Mitochondrial Eve. A thousand years ago, a similar snapshot may have pointed to a different Mitochondrial Eve, and a thousand years from now it may point to yet another Mitochondrial Eve. It is completely dependent on the present population at any given time.
Mitochondrial DNA Haplogroups
Mitochondrial DNA Lineage
The chart on the right shows a simplified version of the mitochondrial DNA haplogroups present today, and their relationship to each other. The time scale runs from top to bottom, with the older lineages towards the top. Such lineage maps are always a work in progress, as more and more people across the world get their mtDNA sequenced, and this added information allows geneticists to better identify SNPs and categorize people based on specific mutations in their mtDNA. The latest "build" of the human mtDNA tree can be obtained from the PhyloTree site.
Looking at the chart on the right, we can see that there are a relatively small number of "super groups" that exist today (shown at the bottom), but that many of them have been subdivided into a large number of subclades. While the major groups and the divisions between them are well-established, the precise order of emergence of the subclades is not so clear, and therefore they are not shown in a tree structure (which would indicate when and where they diverged from the parent tree), but rather in clumps which don't assign relative dates or orders of emergence.
Supergroup L: Almost all L lineages are found only in Africa (and places with migrants from Africa), showing that they emerged before the Out-of-Africa migrations. The oldest mtDNA lineage known is L0, which split off from the parent group very soon after the time of Mitochondrial Eve, probably 150,000 - 170,000 years ago. There are dozens of subdivisions of L0, found in different parts of Africa. The oldest of these are L0d and L0k, which are found today mostly in the Khoisan people. For this reason, many studies that try to identify what the earliest humans might have looked like turn to the Khoisan people.
L1 emerged very soon after L0, probably in East Africa, and is today found in sub-Saharan Africa. L2 is a bit more recent, having emerged 87,000 - 107,000 years ago. It's also found in sub-Saharan Africa, in about 1/3 of the population. L4, L5 and L6 are much smaller haplogroups, all found in East Africa today.
L3 is the most important haplogroup in terms of the number of people. It's believed that L3 arose in East Africa (where it still exists today) about 84,000 to 104,000 years ago. Its importance lies in that L3 people were the ones who first left Africa in the Out-of-Africa migrations. Today, everyone alive who is not native African belongs to L3 or to some group descended from L3. This makes it by far the most numerous ancestral group present today.
There is another interesting note about L3. The commonly accepted theory is that modern human behavior, including such things as abstract and decorative art, as well as more finely crafted stone tools, appeared fairly recently, about 40,000 years ago. The best preserved and studied sites from this period are in Europe. However, there is a much more ancient phase during the middle paleolithic, from Africa, that was also characterized by such "modern" innovations. This includes sites at the southern tip of the African continent, at Still Bay and Howiesons Poort (see Blombos Cave). Over 70,000 years ago, tens of thousands of years before the evolution of modern behavior in Europe, there were groups of humans living along the coast of South Africa, who created intricate decorative patterns on ochre, strung shells together into necklaces, and made much finer tools than their contemporaries. This culture did not last long; the developments were lost to their descendants, and did not re-emerge until about 40,000 years ago. The interesting correlation is that this development happened during a period of marked expansion of the L3 haplogroup in Africa, between 61,000 and 86,000 years ago. Other haplogroups did not expand until much later -- as late as 20,000 years ago. But the expansion of L3 correlates both with the appearance of modern behavior in South Africa, and the movement of humans out of Africa, from East African sites. What this means remains to be settled, but it seems that the increase in population density among L3 probably had something to do with both events -- the development of modern behavior, as well as the migrations.
Spread of Mitochondrial DNA Haplogroups from Africa
Supergroup M: As mentioned above, L3 was the first and only haplogroup to leave Africa in ancient times, and everyone alive today who is not of recent African origin belongs to haplogroup L3 or one of its two descendants: the supergroups M and N. Some of the subclades within supergroup M include:
- M1: The original M (now represented as M1) occurs near the region where humans left Africa (the second migration, across the Red Sea). It is found in northeast Africa, mainly Ethiopia and Somalia, as well as in some Arab (Saudi Arabia, Yemen, Oman) and Indian populations, and probably reflects gene flow across the Red Sea.
- E: southeast Asia, including Malaysia, Indonesia (especially Borneo), the Philippines, Taiwanese aborigines, and Papua New Guinea
- G: northeast Asia, including northeast Siberia, northeast China, and some in central Asia
- CZ: branch which led from northeast Asia to the Americas. C includes the people who made it to the Americas, found mostly along the eastern coasts of both North and South America. Z includes the people who stayed in Asia, including northern Chinese, some Koreans, Saami, some central Asians.
- Q: far east and Pacific - including Melanesian, Polynesian and some New Guinea populations.
- D: also part of the new world migrations - found in Siberians and northeast Asians, as well as across native American populations.
As can be seen in the chart above, groups E, G, C, Z and Q are more closely related to each other, while D is somewhat different. Although all of them descend from Supergroup M, it's thought that D branched out before the remaining members of Supergroup M branched out into the other subclades listed above.
Supergroup N: This was the second supergroup to split off from L3. There are a large number of subclades arising from N, which are related as shown in the mtDNA chart above. Briefly, there are several direct subclades of Supergroup N, including Y, S, A, I, X and W. Then there is a branching from N to R, and several subclades of R, including B, F, J, P and T. R further branches into R0 and U, each with their own subclades (for R0 = HV, H, and V; for U = K). This sort of relationship simply indicates a time-dependent hierarchy of branching, with some mutations appearing earlier, and then subsequent mutations taking place in subpopulations created by the earlier mutations.
Here's a list of the subclades of N:
- A: native Americans in both North and South America, some Japanese and Koreans.
- S: some Australian aborigines
- Y: the Ainu people of Japan, Nivkhs, a small number of southern Siberians (1%)
- I: about 10% of the people in northern and eastern Europe
- X: southern Siberians, southern Europe and southwest Asia, some native Americans
- W: south and southeast Asia, some eastern Europeans
Group R is a large heterogeneous group descended from N. It can be divided into a west Eurasian branch (includes almost all Europeans, as well as a large number of middle eastern people) and an east Eurasian branch. It includes the following subclades:
- B: part of the east Eurasian branch, consists of some Chinese, Mongolians, Tibetans, south Siberians, Koreans, Japanese, some Amerindians
- F: also part of the east Eurasian branch, found further south (in southeast Asia), especially Vietnam
- JT: part of the west Eurasian branch, probably arose near Lebanon. Found in much of the middle east, eastern Europe, west Pakistan (near the Indus), southern Europe (the Mediterranean region), about 25% of Bedouin populations in Africa. Branch J is more middle eastern, branch T is more European/Mediterranean/Indus.
- P: found today mostly in the southern Pacific, including New Guinea, Melanesia, and some Australian aborigines.
Group R0 arose from R and includes the following subclades:
- H: the most common haplogroup in Europe today; also found in the Arabian peninsula and north-east Africa (Somalia, Ethiopia)
- V: more western and northern, including parts of the middle east west of the Arabian peninsula, some parts of western Europe
In addition to R0, the following also arose from R:
- U: mostly found in Scandinavia and the Baltic Republics today, and to a smaller extent in the Mediterranean.
- K: this is a subtype of U, but is a fairly large subclade on its own, including about 1/3 of Ashkenazi Jews, as well as many people around the Alps and in the British Isles. It's found to a lesser degree in north Africa, south Asia, and the middle east.
Both mitochondrial and Y-chromosomal haplotype maps are very much works in progress. In actual numbers, very few people around the world have had their DNA sequenced to check for haplotypes. Representations of frequencies of occurrence are therefore inexact, being calculated from small samples. This outline simply presents a summary of the state of affairs now, but it continues to change as new information becomes available. For the latest mtDNA tree, you can always check the PhyloTree site.
Comparison of Mitochondrial and Y-Chromosomal Lineages
The movement of people across the globe as determined by mtDNA and Y-chromosomal lineages shows some degree of overlap, as can be expected when groups of people migrate and colonize new areas. However, we cannot expect the correspondence to be exact for several reasons.
Small populations often have pronounced founder effects. What this means is that if, for example, 100 people out of an original population of 1000 people migrate to a new area and start a new colony there, the genetics of this new population will not be identical to the original population. This is because simply as a matter of statistics, a subset of 100 people out of 1000 may not be a fair representation of all the alleles in the original population, at their original frequencies. Some alleles present in the original population may be entirely absent in the few people who migrate. Others may be present at different frequencies than in the original population. Since the future genetic makeup of the colony is determined by the genetics of its founders, it will therefore differ from the original population. In this way, some mitochondrial or Y-chromosomal lineages may be lost entirely in a migratory group, while others may become more attenuated or more pronounced.
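The 100-out-of-1000 example can be tried out directly. The sketch below uses an invented source population with four lineages (the labels H1 to H4 and their counts are hypothetical) and draws 100 founders at random.

```python
import random
from collections import Counter


def lineage_frequencies(population):
    """Return each lineage's share of the population."""
    total = len(population)
    return {name: count / total for name, count in Counter(population).items()}


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical source population of 1000 people carrying four lineages.
source_population = ["H1"] * 600 + ["H2"] * 300 + ["H3"] * 90 + ["H4"] * 10

# 100 migrants leave to found a colony (sampling without replacement).
founders = rng.sample(source_population, 100)

source_freq = lineage_frequencies(source_population)
founder_freq = lineage_frequencies(founders)

for name in sorted(source_freq):
    print(f"{name}: source {source_freq[name]:.3f}, "
          f"founders {founder_freq.get(name, 0.0):.3f}")
```

Running this with different seeds shows the founder frequencies wandering away from the source frequencies. The rare lineage (H4 here) is the one most likely to be missing from the colony altogether.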
Genetic drift is another way in which lineages may change. This is a purely stochastic effect. As a consequence of random factors inherent in reproduction (who reproduces, which of their alleles make it into the haploid gametes that actually take part in fertilization, etc.), the genetic composition of a population tends to drift, or change, over time. This effect is much more pronounced in smaller populations than in large ones, and therefore can be a significant factor soon after a migration, which usually involves small groups. Due to genetic drift, some mitochondrial or Y-chromosomal lineages may be lost over time.
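Why drift matters more in small populations can be illustrated with a short simulation. This is a simplified sketch: it assumes each member of the next generation independently inherits a given lineage with probability equal to that lineage's current frequency, and the population sizes, starting frequency and function name are arbitrary illustrative choices.

```python
import random


def mean_one_generation_shift(pop_size, start_freq=0.5, replicates=200, seed=2):
    """Average absolute change in a lineage's frequency after one generation.

    Each member of the next generation independently inherits the
    lineage with probability start_freq (binomial sampling).
    """
    rng = random.Random(seed)
    total_shift = 0.0
    for _ in range(replicates):
        carriers = sum(rng.random() < start_freq for _ in range(pop_size))
        total_shift += abs(carriers / pop_size - start_freq)
    return total_shift / replicates


small_shift = mean_one_generation_shift(pop_size=20)
large_shift = mean_one_generation_shift(pop_size=2000)
print(f"Mean shift, population of 20:   {small_shift:.4f}")
print(f"Mean shift, population of 2000: {large_shift:.4f}")
```

The typical one-generation shift scales roughly with one over the square root of the population size, so the group of 20 drifts about ten times faster than the group of 2000.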
Population bottlenecks are another reason why mtDNA and Y-chromosomal lineages don't show as much overlap as expected. Bottlenecks are events when for some reason (such as famine, drought, plagues, etc.) the population is drastically reduced for a period of time. Bottlenecks are basically random events, so they may purely by chance affect one lineage more than another.
Finally, the patterns of migration of men and women aren't always the same. A disproportionately large number of either men or women may be part of a migration. For example, many migration events primarily consisted of groups of men moving from one area to another, where they took local wives. In such cases, a new Y-chromosomal lineage was introduced into the area, but no corresponding mtDNA lineage was introduced because no women took part in the migration. Conversely, it was the custom in many parts of the world for tribes to raid each other to steal women. This introduced women from a different area into the gene pool of the raiding tribe, and therefore introduced new mtDNA lineages. It did not introduce corresponding Y-chromosomal lineages, because men from the foreign tribes were not brought back for reproduction. Even today, in many parts of the world, it is the custom that the wife moves into the husband's family, and becomes part of it. This is especially common where arranged marriages take place, and it often involves the physical movement of the woman to a geographically different area, so that she becomes part of a new population, bringing her genetics into the new gene pool.
For all of these reasons, mtDNA and Y-chromosomal lineages may indicate different patterns of migrations. However, there are still large areas of overlap, some of which are shown in the chart below.
| Y-chromosomal haplogroup | Associated mtDNA haplogroup | Population |
| A | L0 | Khoisan people in South Africa |
| B | L1, L2 | Pygmies in Africa |
| E1, E2, E1b1a | L3 | Sub-Saharan Africa, especially Bantus |
| O, N, C3 | CZ/C/Z, D, G, A, B, F | East Asia and Siberia |
| K, M | R, P, Q | Oceania |
| R, I, J, E1b1b | HV/H/V, JT/J/T, U/K | Europe, West Asia, North Africa, African Horn |
| Q, C3 | A, X, Y, C, D | East Siberia, Americas |
As can be seen, there is a strong association between certain mtDNA and Y-chromosomal Haplogroups in some populations. This is generally stronger in populations that have not mixed much, have not migrated, or have migrated in relatively large groups which included both men and women.
Also, the ages of different lineages tend to correlate. For example, Y-chromosomal Haplogroup A is the oldest Y-chromosomal lineage, while mtDNA Haplogroup L0 is the oldest mtDNA lineage. Both are found in the Khoisan people, as would be expected if (1) they are genetically closest to the ancestral populations of all humans, (2) they haven't interbred much with other people, and (3) they have been relatively geographically isolated, not having migrated much. These people include the bushmen of the Kalahari, and their life style and physical environment probably have some role to play in their isolation and the preservation of their haplogroups.
SNPs and Variation
Since we've talked about SNPs - differences between individuals at the single nucleotide level - it might seem natural to conclude that these are in fact the genetic differences that exist between people: the reason why you are different from your siblings or your friends, or why Asians are different from Africans or Europeans. On this view, we are different because each of us has some combination of SNPs that is unique to us; conversely, we are similar to other people in our country but less similar to people in some other country because we share certain SNPs with some people but not with others.
This is to some extent true, but not the whole story. The International HapMap Project estimated that the total number of SNPs across the entire human population is about 10 million, of which about 1% are functional (meaning they actually produce a detectable effect). If the human genome is 3 billion base pairs, then 10 million SNPs account for roughly 0.3%. However, it's important to remember that nobody has all the possible SNPs - each person only has some of them. For this reason, the difference between any two random people is only about 0.1%, on a nucleotide-by-nucleotide comparison. This is the reason for the commonly quoted 99.9% number, that any two people are 99.9% identical in terms of DNA. And since only about 1% of the SNP differences are functional, you could further elaborate that to say only 1/100th of those SNP differences matter in terms of producing physical differences between any two people.
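The arithmetic in this paragraph is easy to check. The short calculation below uses only the round figures quoted above, and assumes (as the text does) that the 1% functional fraction applies to the differences between any two people; the variable names are just labels for this example.

```python
genome_size_bp = 3_000_000_000   # base pairs in the human genome
total_known_snps = 10_000_000    # SNPs across the whole human population
pairwise_difference = 0.001      # 0.1% difference between two random people
functional_fraction = 0.01       # about 1% of SNPs are functional

snp_share_of_genome = total_known_snps / genome_size_bp
differing_sites = pairwise_difference * genome_size_bp
functional_differences = differing_sites * functional_fraction

print(f"SNP sites as share of genome: {snp_share_of_genome:.2%}")
print(f"Sites differing between two people: {differing_sites:,.0f}")
print(f"Of those, functional: {functional_differences:,.0f}")
```

So the 0.1% figure still means about 3 million differing nucleotides between two people, of which roughly 30,000 would be expected to have a detectable effect.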
However, as mentioned before, this is not the whole story. People can genetically differ in other ways besides SNPs. The most important of these is copy number variation, where long stretches of DNA are repeated. These stretches of repeated DNA can be thousands to even millions of base pairs long, and every one of us has such repeating regions. But we don't all have the same number of repeats of each repeating region, and this is a huge source of individual differences between people. For example, one person may have 4 repeats of a certain repeating region we'll call "A", while another person may have 7 repeats of the same region. These repeats can occur in both non-coding and coding parts of the DNA, so they can and do have consequences for the phenotype when the duplication involves genes. Copy number variation actually accounts for a greater proportion of the variation between people than SNPs - about 0.4% versus the 0.1% for SNPs. That's not to say it accounts for more phenotype differences than SNPs; that would have to be proven separately, and that's not been done. But it certainly does produce physical differences between people, from appearance to susceptibility to disease.
Epigenetics is another source of variation between people, which doesn't involve differences in DNA sequences, but rather other factors that cause the DNA to behave differently in different people, in a heritable fashion.
So while SNPs are a great way to study variation among populations, they are certainly not the only way. Techniques for studying other kinds of variability are only now being refined, and it may take some time before we see any large scale statistics.