Confusion about a gene's description

I have a very basic biology question. I am reading the description of gene FAM166A here, and I have no idea what "sequence similarity 166" means. What does 166 stand for, what is this gene's sequence "similar to"? And what does member A mean?


The FAM symbols are defined in the HUGO Gene Nomenclature Committee guidelines.

The FAM symbol is an anonymous and temporary identifier that is given to groups of poorly characterised genes which share more than 40% amino acid sequence identity (see Section 4.1).

There are two human genes/proteins in the sequence similarity 166 group: FAM166A and FAM166B. The number 166 is simply the identifier assigned to this group, and the letters A and B distinguish its members. These share some sequence similarity (more than 40%), but the function of this similar sequence is currently unknown. Once further characterisation has been performed for any FAM group, its members will likely receive more meaningful names.
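To make the "more than 40% amino acid sequence identity" criterion concrete, here is a minimal Python sketch of one common way to compute percent identity from two already-aligned protein sequences. The function name, the gap handling (gapped columns are skipped) and the toy sequences are assumptions for illustration only; they are not real FAM166A/FAM166B data, and HGNC may compute identity differently.

```python
# Threshold quoted in the answer above; the constant name is hypothetical.
FAM_IDENTITY_THRESHOLD = 40.0


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over aligned columns without gaps.

    Both inputs must come from the same pairwise alignment, so they have
    equal length and use '-' for gaps. Other definitions of percent
    identity (e.g. dividing by the shorter sequence length) also exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Made-up toy alignment, purely illustrative.
seq_a = "MKT-LLVAGS"
seq_b = "MKTALIVSGS"

identity = percent_identity(seq_a, seq_b)
same_group_candidate = identity > FAM_IDENTITY_THRESHOLD
```

In practice the alignment itself would come from a dedicated alignment tool; this sketch only covers the final counting step.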


R.C. Lewontin, Alexander Agassiz Professor Emeritus of Zoology at Harvard University, has written a number of books and articles on evolution and human variation, including Biology as Ideology: The Doctrine of DNA and The Triple Helix: Gene, Organism, and Environment

Over the last thirty five years a major change has taken place in our biological understanding of the concept of human “race,” largely as a consequence of an immense increase in our knowledge of human genetics. As a biological rather than a social construct, “race” has ceased to be seen as a fundamental reality characterizing the human species. Nevertheless, there appear from time to time claims that racial categories represent not arbitrary socially and historically defined groups but objective biological divisions based on genetic differences. The most recent widely noticed rebirth of such claims is an essay by Armand Marie Leroi on the Op-Ed page of The New York Times (March 14, 2005), an essay that illustrates both the classical confusions about the reality of racial categories and the more recent erroneous conclusions about the relevance of such racial identifications for medical practice.

There are four facts about human variation upon which there is universal agreement. First, the human species as a whole has immense genetic variation from individual to individual. Any two unrelated human beings differ by about 3 million distinct DNA variants.

Second, by far the largest amount of that variation, about 85%, is among individuals within local national or linguistic populations, within the French, within the Kikuyu, within the Japanese. There is diversity from population to population in how much genetic variation each contains, depending upon how much immigration into the population has occurred from a variety of other groups and also on the size of the population. The United States, with a very large population whose ancestors came from all over the earth including the original inhabitants of the New World, is genetically very variable whereas small populations of local Amazonian tribes are less genetically variable, although they are by no means genetically uniform. Despite the differences in amount of genetic variation within local populations, the finding that on the average 85% of all human genetic variation is within local populations has been a remarkably consistent result of independent studies carried out over twenty-five years using data from both proteins and DNA.

Of the remaining 15% of human variation, between a quarter and a half is between local populations within classically defined human “races,” between the French and the Ukrainians, between the Kikuyu and the Ewe, between the Japanese and the Koreans. The remaining variation, about 6% to 10% of the total human variation, is between the classically defined geographical races that we think of in an everyday sense as identified by skin color, hair form, and nose shape. This imprecision in the proportion of variation assigned to differences among populations within “races,” as compared to variation among “races,” arises precisely because there is no objective way to assign the various human populations to clear-cut races. Into which “race” do the Hindi and Urdu speakers of the Indian sub-continent fall? Should they be grouped with Europeans or with Asians or should a separate race be assigned to them? Are the Lapps of Finland and the Hazari of Afghanistan really Europeans or Asians? What about Indonesians and Melanesians? Different biologists have made different assignments and the number of “races” assigned by anthropologists and geneticists has varied from 3 to 30.

Third, a small number of genetic traits, such as skin color, hair form, nose shape (traits for which the genes have not actually been identified) and a relatively few proteins like the Rh blood type, vary together so that many populations with very dark skin color will also have dark tightly curled hair, broad noses and a high frequency of the Rh blood type R0. Those who, like Leroi, argue for the objective reality of racial divisions claim that when such covariation is taken into account, clear-cut racial divisions will appear and that these divisions will correspond largely to the classical division of the world into Whites, Blacks, Yellows, Reds and Browns. It is indeed possible to combine the information from covarying traits into weighted averages that take account of the traits' covariation (technically known as "principal components" of variation). When this has been done, however, the results have not borne out the claims for racial divisions. The geographical maps of principal component values constructed by Cavalli, Menozzi and Piazza in their famous The History and Geography of Human Genes show continuous variation over the whole world with no sharp boundaries and with no greater similarity occurring between Western and Eastern Europeans than between Europeans and Africans! Thus, the classically defined races do not appear from an unprejudiced description of human variation. Only the Australian Aborigines appear as a unique group.

A clustering of populations that does correspond to classical continental "races" can be achieved by using a special class of non-functional DNA, microsatellites. By selecting among microsatellites, it is possible to find a set that will cluster together African populations, European populations, and Asian populations, etc. These selected microsatellite DNA markers are not typical of genes, however, but have been chosen precisely because they are "maximally informative" about group differences. Thus, they tell us what we already knew about the differences between populations of the classical "races" from skin color, face shape, and hair form. They have the added advantage of allowing us to make good estimates of the amount of intermixture that has occurred between populations as a result of migrations and conquests.

The every-day socially defined geographical races do identify groups of populations that are somewhat more closely similar to each other genetically. Most important from the standpoint of the biological meaning of these racial categories, however, most human genetic variation does not show such "race" clustering. For the vast majority of human genetic variations, classical racial categories as defined by a combination of geography, skin color, nose and hair shape, an occasional blood type or selected microsatellites make no useful prediction of genetic differences. This failure of the clustering of local populations into biologically meaningful "races" based on a few clear genetic differences is not confined to the human species. Zoologists long ago gave up the category of "race" for dividing up groups of animal populations within a species, because so many of these races turned out to be based on only one or two genes so that two animals born in the same litter could belong to different "races."

In his article, Leroi is inconsistent and shifting in his notion of race. Sometimes it corresponds to the classical social definitions of major races, but elsewhere he makes “race” coincident with a small local group such as the Negritos or Inuit. In this shifting concept of “race” he goes back to the varying use of the term in the 19th century. Then people spoke of the “Scots race,” “the Irish race” and the “race of Englishmen.” Indeed “race” could stand for a family group defined by male inheritance, as in the description of the last male in a family line as “the last of his race.” This inconsistent usage arises from the fact that there is no clear criterion of how much difference between groups of genetically related individuals should correspond to the category “race.” If it had turned out that groups of related populations were clearly different in the great majority of their genes from other groups, then racial categories would be clear and unambiguous and they would have great predictive power for as yet unstudied characters. But that is not the way it has turned out, at least for the human species.

The fourth and last fact about genetic differences between groups is that these differences are in the process of breaking down because of the very large amount of migration and intergroup mating that was always true episodically in the history of the human species but is now more widespread than ever. The result is that individuals identified by themselves or others as belonging to one “race,” based on the small number of visible characters used in classical race definitions, are likely to have ancestry that is a mixture of these groups, a fact that has considerable significance for the medical uses of race identification.

A common claim, repeated by Leroi, is that racial categories are of considerable medical use, especially in diagnostic testing because some genetic disorders are very common in ancestral racial populations. For example, sickle cell anemia is common among West Africans, who were brought as slaves to the New World, and Tay-Sachs disease is common among Ashkenazi Jews. So, it is argued, racial information can be a useful diagnostic indicator. Certainly classical “race” contains some medically relevant information in some cases, as for example “white” as opposed to “African American” if the contrast is between Finland and West Africa, but not if it is a contrast between a “white” Mediterranean and an “Asian” Indian. There is a confusion here between race and ancestry. Sickle cell anemia is in high frequency not only in West Africans but also in some “white” Middle Eastern and Indian populations. Moreover, a person with, say, one African great-grandparent, but who is identified by herself and others as “white” has a one in eight chance of inheriting a sickle-cell mutation carried by that ancestor. There are, in addition, a number of other simply inherited hemoglobin abnormalities, the thalassemias, that are in high frequency in some places in the Mediterranean (Sardinia), Arabia and southeast Asia. The highest frequency known for a thalassemia (80%) is in Nepal, but it is rare in most of Asia. The categorization of individuals simply as “white” or “Afro-American” or “Asian” will result in a failure to test for such abnormal hemoglobins because these abnormalities do not characterize the identified “race” of the patient. Even group identities below the level of the conventional races are misleading. Two of my incontrovertibly WASP grandchildren have a single Ashkenazi Jewish great-grandparent and so have a one in eight chance of inheriting a Tay-Sachs abnormality carried by that ancestor.
For purposes of medical testing we do not want to know whether a person is “Hispanic” but rather whether that person’s family came from a Caribbean country such as Cuba, that had a large influx of West African slaves, or one in which there was a great deal of intermixture with native American tribes as in Chile and Mexico, or one in which there was only a negligible population of non-Europeans. Racial identification simply does not do the work needed. What we ought to ask on medical questionnaires is not racial identification, but ancestry. “Do you know of any ancestors who were (Ashkenazi Jews, or from West Africa, from certain regions of the Mediterranean, from Japan)?” Once again, racial categorization is a bad predictor of biology.

There has been an interesting dialectic between the notion of human races and the use of race as a general biological category. Historically, the concept of race was imported into biology, and not only the biology of the human species, from social practice. The consciousness that human beings come in distinct varieties led, in the history of biology, to the construction of “race” as a subgrouping within species. For a long time the category “race” was a standard taxonomic level. But the use of “race” in a general biological context then reinforced its application to humans. After all, lots of animal and plant species are divided into races, so why not Homo sapiens? Yet the classification of animal and plant species into named races was at all times an ill-defined and idiosyncratic practice. There was no clear criterion of what constituted a race of animals or plants that could be applied over species in general. The growing realization in the middle of the twentieth century that most species had some genetic differentiation from local population to local population led finally to the abandonment in biology of any hope that a uniform criterion of race could be constructed. Yet biologists were loath to abandon the idea of race entirely. In an attempt to hold on to the concept while making it objective and generalizable, Th. Dobzhansky, the leading biologist in the study of the genetics of natural populations, introduced the “geographical race,” which he defined as any population that differed genetically in any way from any other population of the species. But as genetics developed and it became possible to characterize the genetic differences between individuals and populations, it became apparent that every population of every species in fact differs genetically to some degree from every other population. Thus, every population is a separate “geographic race” and it was realized that nothing was added by the racial category.
The consequence of this realization was the abandonment of “race” as a biological category during the last quarter of the twentieth century, an abandonment that spread into anthropology and human biology. However, that abandonment was never complete in the case of the human species. There has been a constant pressure from social and political practice and the coincidence of racial, cultural and social class divisions reinforcing the social reality of race, to maintain “race” as a human classification. If it were admitted that the category of “race” is a purely social construct, however, it would have a weakened legitimacy. Thus, there have been repeated attempts to reassert the objective biological reality of human racial categories despite the evidence to the contrary.


A survey of nomenclatures: genes, streets and stars

For human observers, the genome is merely the latest example of a vast, partially understood landscape of objects to label. Pioneering explorers, cartographers, astronomers and city planners all faced a similar task, generating nomenclatures with a variety of coherence. Street names within cities provide a good analogy to genes within species. North American cities, for instance, often share a corpus of conserved street names, some of which convey useful information (consider Church Street or College Street). This is reflected in the commonsense naming of certain 'landmark' genes and proteins (for example, those for ribosomes) in most species. Of course, it is possible for names to lose their meaning (though the church is demolished, Church Street remains). Similarly, some genes whose names describe their function later turn out to perform an entirely different activity.

A central-planning approach to naming, in cities and genes alike, is sensible but often bland. Genes in many newly sequenced organisms are named according to a rigorous system (for example, sequential open reading frame numbering), just as certain newer cities adhere to a rational system (consider Pierre l'Enfant's plan for Washington DC or the numbered grid of Manhattan).

In astronomy, the names of the primal heavenly bodies - the Sun and the Moon - come down to us from prehistoric times. The constellations were named several thousand years ago on the basis of their semblance to animals and mythical beings. Roman astronomers offered up names of deities for the planets in accordance with their observed characteristics. This nomenclature sufficed until increasingly powerful telescopes revealed unending swathes of astral objects to name. Accordingly, celestial nomenclature evolved into a pseudo-consistent system of numbered galaxies, stars and other objects. The resulting bricolage of astronomical names parallels that found in gene nomenclature: a manageable set of initial core objects gives way to waves of thematic naming, until the avalanche of new genes brought on by large-scale sequencing forces us to bland, systematic identifiers.


The Evolving Definition of the Term "Gene"

This paper presents a history of the changing meanings of the term "gene," over more than a century, and a discussion of why this word, so crucial to genetics, needs redefinition today. In this account, the first two phases of 20th century genetics are designated the "classical" and the "neoclassical" periods, and the current molecular-genetic era the "modern period." While the first two stages generated increasing clarity about the nature of the gene, the present period features complexity and confusion. Initially, the term "gene" was coined to denote an abstract "unit of inheritance," to which no specific material attributes were assigned. As the classical and neoclassical periods unfolded, the term became more concrete, first as a dimensionless point on a chromosome, then as a linear segment within a chromosome, and finally as a linear segment in the DNA molecule that encodes a polypeptide chain. This last definition, from the early 1960s, remains the one employed today, but developments since the 1970s have undermined its generality. Indeed, they raise questions about both the utility of the concept of a basic "unit of inheritance" and the long implicit belief that genes are autonomous agents. Here, we review findings that have made the classic molecular definition obsolete and propose a new one based on contemporary knowledge.

Keywords: function gene gene networks regulation structure theory.


Scientists discover link between genes and being transgender

Researchers at the Hudson Institute of Medical Research in Melbourne, Australia, have discovered that gender dysphoria may have a biological basis.

The study examined the genetic variations of 380 transgender women and compared them to those of non-transgender men.

Within the transgender women, they found a significant over-representation of four genes that are involved in processing sex hormones. This variation suggests a potential biological reason why certain people experience gender dysphoria.

Those behind the study propose that these genetic variations can affect the male brain’s ability to process androgen, meaning that the brain develops differently in a way that is less “masculine” and more “feminine,” contributing to gender dysphoria in transgender women.

The study has been deemed the largest and most comprehensive of its kind by lead author Professor Vincent Harley.

Speaking to the Australian Broadcasting Corporation, Harley said: “While it should not hinge on science to validate people’s individuality and lived experience, these findings may help to reduce discrimination.”

Proving a link between gender dysphoria and genetics may have the potential to improve diagnosis and increase social acceptance for those who are transgender.

A 2017 report conducted by the Telethon Kids Institute found that approximately three in four trans people between the ages of 14 and 25 experience anxiety or depression and four in five have engaged in self-harm. The report also uncovered the alarming statistic that almost half of all young trans people surveyed have attempted suicide.

Fran was one of the transgender women who participated in the Hudson Institute’s study. She told the ABC, “The mental journey of self-acceptance has really been one of the dominating features of my life.”

This study adds to a growing field of research that suggests there is a biological basis to transgender identities. However, despite providing a deeper look into what makes one feel like a male or female, Harley insists that genes are “not the only factors involved in determining gender identity.”


Materials and methods

SO and SOFA have been built and are maintained using the ontology-editing tool OBO-Edit. The ontologies are available at [34].

The FlyBase D. melanogaster [35] data was derived from the GadFly [36] relational database and converted to Chaos-XML using the Bio-chaos tools. The features were annotated to the deepest concept in the ontology possible, given the available information. For example, the degree of information in annotations was sufficiently deep to describe the transcript features with the type of RNA such as mRNA, or tRNA. It was therefore possible to restrict the analysis to given types of transcript. CGL tools were used to validate each of the annotations, iterate through the genes and query the features. EM-operators were applied to the part features of genes.

Other organism data was derived from the genomes section of GenBank [37]. GenBank flat files were converted to SO-compliant Chaos-XML using the script cx-genbank2chaos.pl (available from [19]) and BioPerl [23]. The BioPerl GenBank parser, Bio::SeqIO::genbank was used to convert GenBank flat files to Bioperl SeqFeature objects. Feature_relationships between these objects were inferred from location information using the Bioperl Bio::SeqFeature::Tools::Unflattener code. GenBank Feature Table types were converted to SO terms using the Bio::SeqFeature::Tools::TypeMapper class, which contains a hardcoded mapping for the subset of the GenBank Feature Table which is currently used in the genomes section of GenBank. The same Perl class was used to type the feature_relationships according to SO relationship types. The EM analysis was performed over the Chaos-XML annotations using the CGL suite of modules to iterate over the parts of each gene.


Molecular biology

Molecular biology, field of science concerned with studying the chemical structures and processes of biological phenomena that involve the basic units of life, molecules. The field of molecular biology is focused especially on nucleic acids (e.g., DNA and RNA) and proteins (macromolecules that are essential to life processes) and on how these molecules interact and behave within cells. Molecular biology emerged in the 1930s, having developed out of the related fields of biochemistry, genetics, and biophysics; today it remains closely associated with those fields.

Various techniques have been developed for molecular biology, though researchers in the field may also employ methods and techniques native to genetics and other closely associated fields. In particular, molecular biology seeks to understand the three-dimensional structure of biological macromolecules through techniques such as X-ray diffraction and electron microscopy. The discipline particularly seeks to understand the molecular basis of genetic processes: molecular biologists map the location of genes on specific chromosomes, associate these genes with particular characters of an organism, and use genetic engineering (recombinant DNA technology) to isolate, sequence, and modify specific genes. These approaches can also include techniques such as polymerase chain reaction, western blotting, and microarray analysis.

In its early period during the 1940s, the field of molecular biology was concerned with elucidating the basic three-dimensional structure of proteins. Growing knowledge of the structure of proteins in the early 1950s enabled the structure of deoxyribonucleic acid (DNA)—the genetic blueprint found in all living things—to be described in 1953. Further research enabled scientists to gain an increasingly detailed knowledge not only of DNA and ribonucleic acid (RNA) but also of the chemical sequences within these substances that instruct the cells and viruses to make proteins.

Molecular biology remained a pure science with few practical applications until the 1970s, when certain types of enzymes were discovered that could cut and recombine segments of DNA in the chromosomes of certain bacteria. The resulting recombinant DNA technology became one of the most active branches of molecular biology because it allows the manipulation of the genetic sequences that determine the basic characters of organisms.


Voices of the new generation: science in a state of benign confusion

Building a research group in China (when you have not mastered the language) means you face the same problems as elsewhere, with an extra bit of confusion.

My computational biology research group at Fudan University, which I started in September 2018, works on the global microbiome. We endeavour to understand the ecological processes behind the microbial ecology of the whole of the Earth, trying to answer questions such as how genes and species emerge, how they evolve and what they do. We approach this issue computationally, by analysing large datasets of metagenomes.

My job interview at Fudan University, in Shanghai, was my first visit to China. Moving here was not my first international experience: I have studied and worked in Portugal, the USA and Germany, and I can speak and write in several European languages. However, in China I have become familiar with a quotidian feeling of benign confusion.

As my Mandarin can generously be described as rudimentary, and few people, outside of the university environment, speak fluent English, communication is often a mystifying dumb-show. I get by with gestures, automated translation, my mispronounced Mandarin and copious smiles. Confusion is inevitable. However, if you display humility rather than entitlement, people are very friendly and understanding of your ignorance. Hence, benign confusion.

To a surprising degree, my job is very similar to what it would be elsewhere. A typical pre-COVID-19 day involved picking up a flat white at the hip campus café on the way in to the office, answering emails, working on grants and manuscripts, meeting with trainees and, if time permits, reading papers. I do it in English, submitting to the same journals and conferences as my international colleagues. Twitter and Slack groups are invaluable for keeping up with the scientific conversation.

Grants are available from a variety of Chinese sources: national grants, city grants and I have even received some funding from our neighbourhood. (For perspective, with over one million inhabitants, our neighbourhood is larger than some of the smaller European countries.) Like everywhere, to apply for a grant you write a proposal arguing for the scientific importance of your ideas and your ability to test them. Although grants can be submitted in English, much of the information about them is only available in Chinese. Therefore, I must rely on my department and colleagues to share this information.

When there are calls for international collaboration, my contacts outside of China provide me with extra opportunities. For example, we are currently part of a JPIAMR-funded collaboration to work on antimicrobial resistance, with partners in Europe and Pakistan. As part of these consortia, I often have to deal with impedance mismatches between administrative systems. We naturally have to translate contracts and documents, which increases paperwork. The real problems start, however, when one partner requires a document that has no equivalent in China, or vice versa. Although some of these issues would exist even within Europe, they are much greater between such different administrative systems.

Nevertheless, many of the problems I face here are the same as those faced by junior group leaders everywhere. For example, once funding is secured, how do you attract postdocs and students when you are competing with more established researchers for talent? Indeed, for the first six months on the job, my ‘group’ consisted of just myself, working alone at my computer. At times, it felt like an anticlimactic continuation of my postdoc.

Eventually, I was lucky to have a few talented students and one extraordinary postdoc join my group. Slowly, the texture of the job changed. Then, one day, I took a break and went to the departmental lounge to get myself a cup of coffee. As I walked in, I saw that ‘my’ students and postdoc were sitting together with a visiting speaker, discussing science. At that moment, I realized that I had finally established a research group.


Genes are inherited through both asexual reproduction and sexual reproduction. In asexual reproduction, resulting organisms are genetically identical to a single parent. Examples of this type of reproduction include budding, regeneration, and parthenogenesis.

Sexual reproduction involves the contribution of genes from both male and female gametes that fuse to form a distinct individual. The traits exhibited in these offspring are transmitted independently of one another and may result from several types of inheritance.

  • In complete dominance inheritance, one allele for a particular gene is dominant and completely masks the other allele for the gene.
  • In incomplete dominance, neither allele is completely dominant over the other resulting in a phenotype that is a mixture of both parent phenotypes.
  • In co-dominance, both alleles for a trait are fully expressed.
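The three inheritance patterns above can be illustrated with a small Python sketch of a single-gene cross. Everything here is a hypothetical teaching example: the allele symbols `R` and `W`, the flower-colour phenotypes, and the function names are assumptions chosen for illustration, not taken from the text.

```python
from collections import Counter
from itertools import product


def punnett(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a one-gene cross such as 'RW' x 'RW'."""
    counts = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that 'WR' and 'RW' are counted as the same genotype.
        counts["".join(sorted(allele1 + allele2))] += 1
    return counts


def phenotype(genotype: str, mode: str) -> str:
    """Map a genotype to a hypothetical flower colour under one mode."""
    if genotype == "RR":
        return "red"
    if genotype == "WW":
        return "white"
    # Heterozygote: this is where the three modes differ.
    if mode == "complete":      # R completely masks W
        return "red"
    if mode == "incomplete":    # blended intermediate phenotype
        return "pink"
    if mode == "codominant":    # both alleles fully expressed
        return "red and white"
    raise ValueError(f"unknown mode: {mode}")


def phenotype_counts(parent1: str, parent2: str, mode: str) -> Counter:
    """Count offspring phenotypes for a cross under the given mode."""
    counts = Counter()
    for genotype, n in punnett(parent1, parent2).items():
        counts[phenotype(genotype, mode)] += n
    return counts


for mode in ("complete", "incomplete", "codominant"):
    print(mode, dict(phenotype_counts("RW", "RW", mode)))
```

The genotype counts are the same in all three cases; only the mapping from the heterozygous genotype to a phenotype changes.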

What is ancestry?

Copyright: © 2020 Mathieson, Scally. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This research was supported by a Research Fellowship from the Alfred P. Sloan foundation [FG-2018-10647], a New Investigator Research Grant from the Charles E. Kaufman Foundation [KA2018-98559], and NIGMS award number [R35GM133708] to I.M. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Ancestry connects genetics and society in fundamental ways. For many people it has cultural, religious or even political significance, and can play a key role in shaping personal and public identities. People’s desire to discover their own ancestry drives the multibillion-dollar genealogy industry, which has grown rapidly in the era of consumer genomics. Companies such as 23andMe and Ancestry now claim tens of millions of customers worldwide. In parallel, our scientific understanding of the human past is being transformed by studies of ancient and modern genetic data, which allow us to track changes in ancestry over space and time. Sophisticated methods have been developed to infer and visualise these relationships. Thus, it seems that both scientists and the wider public are learning more and more about ancestry, and there is an optimistic sense that genetic data provide an exhaustive repository of ancestral information.

However, although frequently discussed, ancestry itself is rarely defined. We argue that this reflects widespread underlying confusion about what it means in different contexts and what genetic data can really tell us. This leads to miscommunication between researchers in different fields, and leaves customers open to spurious claims about consumer genomics products and overinterpretation of individual results.

In wider usage, the terms ancestry and ancestors often indicate a general connection to people or things in the past. But in a genetic context they have a more specific meaning: your ancestors are the individuals from whom you are biologically descended, and ancestry is information about them and their genetic relationship to you. Even here, however, confusion arises from the way that ancestry is presented and discussed. Rather than emphasising its complex structure, results are often simplified in terms of discrete categories. While convenient and sometimes useful, ultimately this is misleading about the nature of ancestry. These labels can also impose contemporary political or cultural divisions which may misrepresent ancestral relationships.

Another source of confusion is that three distinct concepts–genealogical ancestry, genetic ancestry, and genetic similarity–are frequently conflated. We discuss them in turn, but note that only the first two are explicitly forms of ancestry, and that genetic data are surprisingly uninformative about either of them. Consequently, most statements about ancestry are really statements about genetic similarity, which has a complex relationship with ancestry, and can only be related to it by making assumptions about human demography whose validity is uncertain and difficult to test.

Genealogical ancestry probably reflects the most common and intuitive understanding of the term ancestry. Consider your parents, grandparents, or even great-grandparents. You likely have a sense of these people as individuals, even if you have never met them. If one of them belonged to a particular group X, you might say that you have some “X” ancestry. You might even be able to claim ancestry in this way from more distant ancestors, based on historical or genealogical records. Thus genealogical ancestry is defined in terms of identifiable ancestors in your family tree or pedigree. Often it may be quantified: for example, if one of your eight great-grandparents belonged to X, you might describe yourself as “one eighth X”. N generations ago you have at most 2^N genealogical ancestors, and if some proportion of them belonged to X you might claim that proportion of ancestry from X.
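The arithmetic behind “one eighth X” can be written out in a few lines. The following is a minimal sketch; the function names are hypothetical, and it assumes that all ancestors in a given generation are distinct individuals (no pedigree collapse), so 2^N is an upper bound rather than an exact count.

```python
from fractions import Fraction


def max_genealogical_ancestors(n_generations: int) -> int:
    """Upper bound on the number of ancestors exactly n generations back."""
    if n_generations < 0:
        raise ValueError("n_generations must be non-negative")
    return 2 ** n_generations


def genealogical_fraction(ancestors_in_x: int, n_generations: int) -> Fraction:
    """Share of the ancestors n generations back that belong to category X."""
    total = max_genealogical_ancestors(n_generations)
    if not 0 <= ancestors_in_x <= total:
        raise ValueError("ancestors_in_x must lie between 0 and 2**n_generations")
    return Fraction(ancestors_in_x, total)


# Great-grandparents are three generations back: one of eight belongs to X.
print(genealogical_fraction(1, 3))     # 1/8
# The bound grows exponentially with the number of generations.
print(max_genealogical_ancestors(10))  # 1024
```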

There are two concepts here: the pedigree, which specifies how all your genealogical ancestors are related to each other, and the ancestry category ‘X’ which ascribes a characteristic of interest to some of them. The pedigree can be thought of as a graph, with nodes representing your genealogical ancestors connected by edges representing parent-child relationships between them (Fig 1A). Were we able to draw it in full it would be impracticably large, but in principle from a pedigree alone we can deduce facts about relatedness, for example that Charles Darwin’s wife Emma was his first cousin (and therefore approximately one eighth of their genomes were identical). Importantly, whereas the pedigree is fixed, ancestry categories can be arbitrary, reflecting aspects of ancestry that we happen to be interested in. X for example could be “British”, “English”, “Huguenot”, or any label referring to culture, geography or some other aspect of an individual’s identity.
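To illustrate how relatedness can be deduced from a pedigree alone, the sketch below stores a small pedigree as a graph and computes the kinship coefficient with the standard recursive definition. The pedigree is a hypothetical first-cousin family with placeholder labels, not a reconstruction of any real genealogy, and founders are assumed to be unrelated and non-inbred. Under those assumptions the expected shared fraction of the genome is twice the kinship coefficient.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical pedigree: child -> (father, mother).
# Individuals that never appear as keys are founders.
PEDIGREE = {
    "parent_a": ("grandfather", "grandmother"),
    "parent_b": ("grandfather", "grandmother"),
    "cousin_a": ("spouse_a", "parent_a"),
    "cousin_b": ("parent_b", "spouse_b"),
}


@lru_cache(maxsize=None)
def depth(person: str) -> int:
    """Number of generations between a person and their most distant founder."""
    if person not in PEDIGREE:
        return 0
    return 1 + max(depth(parent) for parent in PEDIGREE[person])


@lru_cache(maxsize=None)
def kinship(a: str, b: str) -> Fraction:
    """Probability that alleles drawn at random from a and b are identical by descent."""
    if a == b:
        if a not in PEDIGREE:
            return Fraction(1, 2)  # non-inbred founder
        father, mother = PEDIGREE[a]
        return Fraction(1, 2) * (1 + kinship(father, mother))
    # Recurse on the deeper individual, who cannot be an ancestor of the other.
    if depth(a) < depth(b):
        a, b = b, a
    if a not in PEDIGREE:
        return Fraction(0)  # two distinct founders are assumed unrelated
    father, mother = PEDIGREE[a]
    return Fraction(1, 2) * (kinship(father, b) + kinship(mother, b))


# Expected shared fraction for non-inbred relatives is 2 * kinship.
print(2 * kinship("cousin_a", "cousin_b"))  # 1/8
```

The value of one eighth is an expectation over the randomness of inheritance; the realised sharing between any particular pair of cousins will vary around it.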

Fig 1. A: The pedigree of a single individual. Circles indicate specific ancestors that could be used to define ancestry categories. B: At any single position in the diploid genome, genetic ancestry over the past N generations traces two paths (red and blue) through the, at most, 2^N available. C: Genetic ancestry in the form of the ARG for a single individual. Combining genetic ancestry from different positions leads to a graph, incorporating all realized genetic ancestry paths, implicitly passing through points representing specific individuals. The ARG is contained within the structure of the pedigree, with nodes corresponding to ancestors in which there was a recombination or coalescence event, and edges or lines between them representing paths of descent (through other ancestors which are not represented) for particular segments of DNA. D: The ARG is usually used in the context of the merged ARGs of multiple individuals.

Thus, to describe your genealogical ancestry requires knowledge of your ancestors, the pedigree relating them to you, and sufficient information to assign them to categories of interest. In practice however, few people have comprehensive knowledge of their ancestors beyond a handful of generations ago. Even when researching their genealogy, people tend to focus on a small number of lineages for which records exist or which are of particular interest, neglecting the exponential growth in the number of genealogical ancestors back in time. Genetic data can help with this limitation through genetic genealogy, which identifies relatives based on the distinctive patterns of genetic variation they share. Knowledge of your relatives, while not ancestry in itself, can facilitate the pooling of information about shared ancestors in combination with traditional genealogical information. However, this information may still be difficult to obtain. This limitation raises the question of whether there is a form of ancestry that could be learned from genetic data alone.

The natural definition of this kind of ancestry is genetic ancestry, which differs from genealogical ancestry in that it refers not to your pedigree but to the subset of paths through it by which the material in your genome has been inherited. Because parents transmit only half their DNA to offspring each generation, an individual’s genetic ancestry involves only a small proportion of all their genealogical ancestors [1,2]. At any given position in one of your chromosomes, your DNA is inherited through one of the many possible paths through your pedigree (Fig 1B). Different positions in the genome may have different paths of inheritance, because parental chromosomes are shuffled together during meiotic recombination. Thus the difference between genealogical and genetic ancestry can be summed up by the observation that full siblings have identical genealogical ancestry but differ in their genetic ancestry, due to differences in the transmission of chromosomal segments from their parents.
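The contrast between siblings can be made concrete with a toy simulation. This is only a sketch under simplifying assumptions: positions are treated as independent (linkage along chromosomes is ignored), each step back in the pedigree picks the mother or father with equal probability, and the pedigree contains no inbreeding, so each path labels a distinct ancestor. The function names and the path encoding ("M" for mother, "F" for father) are illustrative choices, not part of the original analysis.

```python
import random
from itertools import product


def trace_lineage(n_generations, rng):
    """Follow one copy of one position back in time, choosing mother or father at each step."""
    return "".join(rng.choice("MF") for _ in range(n_generations))


def ancestors_at_locus(n_generations, rng):
    """Return the two ancestors, n generations back, of the two copies at one position.

    One copy comes through the mother and one through the father, so the
    first step of each path is fixed and the remaining steps are random.
    """
    if n_generations < 1:
        raise ValueError("n_generations must be at least 1")
    maternal = "M" + trace_lineage(n_generations - 1, rng)
    paternal = "F" + trace_lineage(n_generations - 1, rng)
    return {maternal, paternal}


rng = random.Random(0)
N = 4

# Both siblings share the same 2**N genealogical ancestors N generations back.
all_slots = {"".join(path) for path in product("MF", repeat=N)}

# Their realised paths are drawn separately at each position.
sibling_1 = [ancestors_at_locus(N, rng) for _ in range(3)]
sibling_2 = [ancestors_at_locus(N, rng) for _ in range(3)]

for locus, (s1, s2) in enumerate(zip(sibling_1, sibling_2)):
    print(locus, sorted(s1), sorted(s2))
```

At every position each sibling reaches exactly two of the 16 possible ancestors, and the two siblings need not reach the same ones.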

The fundamental representation of genetic ancestry is a structure called an ancestral recombination graph (ARG; Fig 1C) [3]. The ARG is central to population genetics, and many methods for making inferences about demographic history proceed by either implicitly or explicitly reconstructing ARGs [4–7]. Recalling the graph structure of an individual’s pedigree, the ARG is a subset of the pedigree representing the ancestry of the DNA inherited by that individual. It contains only those edges along which inherited segments of DNA have been transmitted, and only those nodes corresponding to ancestors in which there was a recombination or coalescence event. The ARG therefore tells you which parts of your genome were inherited from which ancestors, and represents all the ancestral information that can be obtained from genetic data alone. For example, your pedigree includes many ancestors from whom you inherited no genetic material, but such ancestors are not included in the ARG, and your genome cannot provide any information about them. The ARG can also be used to represent the genetic ancestry of a sample of multiple individuals by merging the individual ARGs into a single graph (Fig 1D).

Just as for genealogical ancestry, we may want to summarize the genetic ancestry of an individual in terms of particular groups or categories of interest. If we could identify specific ancestors in the ARG (Fig 1C) then, analogous to genealogical ancestry, we could say that an individual has genetic ancestry in a given category if any edge in his or her ARG passes through an ancestor in that category. In other words, genetic ancestry in category X means that some fraction of an individual’s genome is inherited directly from an ancestor in X. Genetic ancestry in X implies genealogical ancestry in X, but not vice versa. And as with genealogical ancestry, we could extend this approach to summarizing genetic ancestry by counting the proportion of an individual’s genome inherited from ancestors in X.
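If the ancestors along each inherited segment were known, this summary would reduce to a weighted count. The sketch below assumes a hypothetical input format: a list of non-overlapping segments, each given as (start, end, ancestor), plus a mapping from ancestors to categories. In practice neither input is directly observable, so all names and values here are illustrative only.

```python
def genetic_ancestry_fraction(segments, category_of, category):
    """Fraction of the covered genome inherited from ancestors in `category`.

    segments: iterable of (start, end, ancestor) with end > start.
    category_of: dict mapping ancestor label -> category label.
    """
    total = sum(end - start for start, end, _ in segments)
    if total <= 0:
        raise ValueError("segments must cover a positive length")
    in_category = sum(
        end - start
        for start, end, ancestor in segments
        if category_of.get(ancestor) == category
    )
    return in_category / total


# Hypothetical example: 100 units of genome inherited from two ancestors.
segments = [(0, 60, "anc_1"), (60, 100, "anc_2")]
# anc_3 is a genealogical ancestor in category Z who contributed no DNA.
category_of = {"anc_1": "X", "anc_2": "Y", "anc_3": "Z"}

fraction_x = genetic_ancestry_fraction(segments, category_of, "X")
fraction_z = genetic_ancestry_fraction(segments, category_of, "Z")
print(fraction_x)  # 0.6
print(fraction_z)  # 0.0
```

The zero for category Z shows the asymmetry described above: an individual can have genealogical ancestry in a category without having any genetic ancestry in it.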

One factor motivating interest in particular ancestors or categories may be the idea that such ancestry is associated with genetic effects on certain traits. Whether this is plausible or not, genetic ancestry appears to provide an essentialist notion of ancestry that excludes any relationship that does not correspond to an inherited DNA segment. However, it turns out that determining genetic ancestry is even less practical an idea than genealogical ancestry. Whereas at least some of the pedigree may be pieced together from genealogical records, the ARG must be inferred solely from patterns of genetic variation–a very challenging problem (despite impressive recent progress [7,8]). Even if we could reconstruct the true ARG and the ancestors on each edge, we would have the same problem of needing information about membership in specific ancestry categories in order to give meaningful summaries.

The impracticality of fully determining either form of ancestry means that most analyses take an alternative approach. Typically, they aim to infer an approximate summary of genetic ancestry without reconstructing the ARG. For example, researchers may be interested in the demographic relationships between human populations but not necessarily the details of individual ancestors. Perhaps the closest we can practically get to this is an admixture graph [9,10], which relies on a concept of “population ancestry”, embedded in a graph which is similar to the ARG but relates populations rather than individuals. In any real population, individuals will differ in their ancestry and the true ARG will be extremely complex. The admixture graph focuses on the idea that, when averaged over the whole genome, these differences can be approximated as varying proportions of ancestry from multiple source populations. Since populations are explicitly represented as nodes in the admixture graph, it is more straightforward to attribute ancestry categories to them (compared to an ARG or pedigree), which makes this an appealing way to summarize ancestry. However, there are drawbacks: populations may be poorly sampled or include unrepresentative individuals, or they may not correspond to identifiable groups. It is hard to know whether the inferred sources of ancestry (sometimes called “ghost populations”) are real but unsampled populations or simply algorithmic constructs representing a simplification of more complex demography. More fundamentally, admixture graphs enforce the idea of discrete populations in a way which is at odds with the complexity of human demographic structure. For now they remain rare outside the population genetic literature, and care is needed in presenting them to a wider audience, as they represent an abstraction of ancestral demography which can easily be misinterpreted as something more concrete.

More commonly, when geneticists and consumer genomics companies talk about ancestry they are really talking about genetic similarity between populations and individuals. For example, the outputs of methods that summarize genetic variation among samples, such as principal component analysis (PCA), ADMIXTURE [11] (an implementation of the STRUCTURE model [12]) and Chromopainter [13] (based on the Li & Stephens haplotype copying model [14]), are frequently interpreted in terms of ancestry. Some of these methods allow individual genomes to be represented as combinations of reference populations, which are either explicitly defined in terms of other individuals in the dataset, or constructed implicitly as part of the algorithm. These are ‘ancestry-like’ relationships, and since the ARG contains all the information about the evolutionary genetic process which produced the differences between samples, the outputs of these methods can be seen as summaries of the ARG (Fig 2).
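As an illustration of how such a method summarizes similarity rather than ancestry, the sketch below runs a plain PCA on a toy genotype matrix using NumPy. The data are simulated under assumed allele frequencies for two hypothetical groups, and the function name is illustrative; published analyses typically also normalize each site and handle missing data, which is omitted here.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the leading principal components.

    genotypes: array of shape (n_individuals, n_sites) holding 0/1/2 allele counts.
    Returns an array of shape (n_individuals, n_components).
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)  # centre each site on its mean allele count
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data: two hypothetical groups of 10 individuals at 200 sites,
# with assumed allele frequencies of 0.1 and 0.9 respectively.
rng = np.random.default_rng(0)
group_a = rng.binomial(2, 0.1, size=(10, 200))
group_b = rng.binomial(2, 0.9, size=(10, 200))
pcs = genotype_pca(np.vstack([group_a, group_b]))

print(pcs.shape)  # (20, 2)
print(np.round(pcs[:, 0], 1))
```

The first component separates the two groups because their genotypes differ on average. Nothing in the calculation refers to a pedigree or an ARG, so reading the axes as ancestry requires additional demographic assumptions.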