17.1: Motif Representation and Information Content - Biology

Instead of a profile matrix, we can also represent motifs using information theory. The amount of information carried by a single message (event) that occurs with probability p is given by the expression −log p; with a base-2 logarithm, the result is measured in bits.

Shannon entropy is a measure of the expected amount of information contained in a message. In other words, it is the information carried by each event that could possibly occur, weighted by that event's probability. The Shannon entropy is given by the equation:

\[ H(X) = -\sum_{i} p_{i} \log_{2} p_{i} \]

Entropy is maximal when all events have an equal probability of occurring. This is because entropy tells us the expected amount of information we will learn. If each event has the same chance of occurring, we know as little as possible about the outcome in advance, so the expected amount of information we will learn is maximized. For example, a coin flip has maximal entropy only when the coin is fair. If the coin is not fair, then we already know more about the outcome of the flip, and the message reporting that outcome will, on average, contain less information.
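The coin-flip example can be checked numerically. The following is a minimal sketch in Python; the function names and the 90/10 bias are illustrative choices, not part of the original text.

```python
import math

def self_information(p: float) -> float:
    """Information, in bits, carried by an event of probability p: -log2(p)."""
    return -math.log2(p)

def shannon_entropy(probs) -> float:
    """Expected information in bits; events with p == 0 contribute nothing."""
    return sum(p * self_information(p) for p in probs if p > 0)

fair = shannon_entropy([0.5, 0.5])    # 1 bit: the maximum for two outcomes
biased = shannon_entropy([0.9, 0.1])  # roughly 0.47 bits: less is learned per flip

print(f"fair coin:   {fair:.3f} bits")
print(f"biased coin: {biased:.3f} bits")
```

The rare outcome of the biased coin is individually more informative (`self_information(0.1)` is about 3.3 bits), but it occurs so seldom that the expected information per flip still drops.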

We can model a motif by how much information we have about each position after applying Gibbs sampling or EM. In the following figure, the height of each letter represents the number of bits of information we have learned about that base. Higher stacks correspond to greater certainty about what the base is at that position of the motif, while lower stacks correspond to a higher degree of uncertainty. With four bases to choose from, the Shannon entropy of a position is at most 2 bits, so a stack can be at most 2 bits high. Another way to look at this figure is that, within a stack, the height of a letter is proportional to the frequency of the base at that position.
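One common way to turn this description into numbers is sketched below. It assumes the usual sequence-logo convention: the stack height is the maximum entropy (2 bits) minus the observed entropy at that position, and each letter receives a share of the stack proportional to its frequency. The example columns are hypothetical values chosen for illustration.

```python
import math

BASES = "ACGT"
MAX_BITS = math.log2(len(BASES))  # 2 bits for four equally likely bases

def position_information(freqs):
    """Bits learned at one position: maximum entropy minus observed entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return MAX_BITS - entropy

def letter_heights(freqs):
    """Split the stack height among the letters in proportion to frequency."""
    info = position_information(freqs)
    return {base: p * info for base, p in freqs.items()}

# Hypothetical motif columns (illustrative, not taken from the text).
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
mixed = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}

for name, column in [("conserved", conserved), ("uniform", uniform), ("mixed", mixed)]:
    print(name, round(position_information(column), 3), letter_heights(column))
```

A fully conserved column yields a 2-bit stack, a uniform column yields an empty stack, and a column split evenly between two bases yields a 1-bit stack shared equally by the two letters.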

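There is a measure of dissimilarity between probability distributions known as the Kullback-Leibler (K-L) distance, or K-L divergence. Strictly speaking it is not a metric, because it is not symmetric. It allows us to quantify how far the motif distribution diverges from a reference distribution, here the background base composition. The K-L distance is given by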

\[ D_{KL}\left(P_{\text{motif}} \mid P_{\text{background}}\right) = \sum_{i \in \{A, T, G, C\}} P_{\text{motif}}(i) \log \frac{P_{\text{motif}}(i)}{P_{\text{background}}(i)} \]

In Plasmodium, the G-C content is low. If we assume a G-C content of 20%, then we get the following representation for the above motif. C and G bases are rare in the background, so finding them enriched at a motif position is highly informative. Note that because this representation uses the K-L distance, it is possible for a stack to be higher than 2 bits.
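To see why a stack can exceed 2 bits, consider the sketch below. It assumes base-2 logarithms and assumes that the 20% G-C content is split evenly between G and C (10% each), with A and T at 40% each; the text does not specify either detail.

```python
import math

def kl_divergence(motif, background):
    """D_KL(motif | background) in bits; zero-probability motif terms are skipped."""
    return sum(
        p * math.log2(p / background[base])
        for base, p in motif.items()
        if p > 0
    )

uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Assumption: 20% G+C split evenly, so 10% each for C and G and 40% each for A and T.
at_rich_bg = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

all_g = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
all_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}

print(kl_divergence(all_g, uniform_bg))  # 2 bits against a uniform background
print(kl_divergence(all_g, at_rich_bg))  # log2(10), about 3.32 bits
print(kl_divergence(all_a, at_rich_bg))  # log2(2.5), about 1.32 bits
```

Against a uniform background the K-L distance of a fully conserved position is 2 bits, the same value the entropy-based height gives. Against the A-T-rich background, a conserved G is worth more than 2 bits and a conserved A is worth less.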



Retrozymes are a unique family of non-autonomous retrotransposons with hammerhead ribozymes that propagate in plants through circular RNAs

Catalytic RNAs, or ribozymes, are regarded as fossils of a prebiotic RNA world that have remained in the genomes of modern organisms. The simplest ribozymes are the small self-cleaving RNAs, like the hammerhead ribozyme, which have been historically considered biological oddities restricted to some RNA pathogens. Recent data, however, indicate that small self-cleaving ribozymes are widespread in genomes, although their functions are still unknown.


We reveal that hammerhead ribozyme sequences in plant genomes form part of a new family of small non-autonomous retrotransposons with hammerhead ribozymes, referred to as retrozymes. These elements contain two long terminal repeats of approximately 350 bp, each harbouring a hammerhead ribozyme that delimitates a variable region of 600–1000 bp with no coding capacity. Retrozymes are actively transcribed, which gives rise to heterogeneous linear and circular RNAs that accumulate differentially depending on the tissue or developmental stage of the plant. Genomic and transcriptomic retrozyme sequences are highly heterogeneous and share almost no sequence homology among species except the hammerhead ribozyme motif and two small conserved domains typical of Ty3-gypsy long terminal repeat retrotransposons. Moreover, we detected the presence of RNAs of both retrozyme polarities, which suggests events of independent RNA-RNA rolling-circle replication and evolution, similarly to that of infectious circular RNAs like viroids and viral satellite RNAs.


Our work reveals that circular RNAs with hammerhead ribozymes are frequently occurring molecules in plant and, most likely, metazoan transcriptomes, which explains the ubiquity of these genomic ribozymes and suggests a feasible source for the emergence of circular RNA plant pathogens.

Results and discussion

Kinase inhibitors with different binding modes

Type I, I½, and II kinase inhibitors were extracted from X-ray structures of kinase-inhibitor complexes contained in the KLIFS database [6, 7], a specialized repository for kinase structures and associated activity data, as detailed in “Methods” section. The composition of the kinase inhibitor data set is reported in Table 1.

Study design

We have aimed to compare distinct molecular and interaction representations for machine learning using different modeling strategies. For this purpose, kinase inhibitors with different binding modes were classified. This investigation was inspired by previous findings that such inhibitors could be predicted with high accuracy on the basis of chemical structure using standard machine learning approaches such as random forest (RF) [18]. These observations and the availability of large numbers of kinase inhibitors with experimentally determined binding modes provided a sound basis for a comparative study including active learning strategies to assess the information content of structural and interaction representations on a relative scale.

First, conventional RF models were derived using 90% of available inhibitors and applied to classify the test set containing the remaining 10% of the inhibitors. Moreover, an active learning strategy was implemented, which iteratively selects informative training instances in order to reduce training data to a required minimum. Hence, if successful, active learning reveals information that is essential for predictive modeling. Active learning employed a multi-class RF model starting with a corresponding data split for iterative sample selection and class label prediction, as illustrated in Fig. 1. Training instances were selected on the basis of information entropy from the compound pool, which initially corresponded to 90% of the data set. The model trained with selected instances was then used to predict the test set (10%). Further details and calculation protocols are provided in the Methods section.

Active learning strategy. Training instances are selected randomly (first iteration) or based on an entropy criterion (subsequent iterations) after predicting pool compounds. For performance evaluation, the multi-class RF model is then used to predict the external test set

Random forest predictions

Binding mode predictions were attempted with fundamentally different representations including IFPs and molecular graph-based fingerprints (see “Methods” section for details). IFPs included an 85-bit version accounting for the presence or absence of ligand interactions with 85 residue positions forming the binding site region in kinases (IFP_85), and a further expanded 595-bit version distinguishing between seven different types of interactions for inhibitors and each residue position (85 × 7 IFP_595). The 85 residues represent the complete active site region in kinases defined on the basis of many X-ray structures [6, 7]. Others have previously used smaller subsets of these residues focusing on the ATP site, which were predicted to be important for conferring kinase selectivity [20, 21]. However, in our analysis, the comprehensive representation of the binding site region was used because different inhibitor binding modes were predicted. As a representation of chemical structures, the folded (1024-bit) and unfolded (variably sized feature set) version of the extended connectivity fingerprint with bond diameter 4 (ECFP4) were generated for each inhibitor (termed ECFP4_folded and ECFP4_unfolded, respectively). ECFP4 is a topological fingerprint encoding layered atom environments.
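The relationship between the two interaction fingerprints can be sketched as follows. This is an illustration only: the interaction type names are placeholders (the text states only that there are seven types), and the residue-major bit layout is an assumption rather than the layout used in the study.

```python
N_RESIDUES = 85
# Placeholder names: the text only states that seven interaction types are distinguished.
INTERACTION_TYPES = [f"type_{k}" for k in range(7)]

def ifp_595(interactions):
    """Typed fingerprint: one bit per (residue position, interaction type) pair.

    interactions: iterable of (residue_position, interaction_type), positions 0..84.
    """
    bits = [0] * (N_RESIDUES * len(INTERACTION_TYPES))
    for residue, kind in interactions:
        # Assumed layout: the seven type bits of a residue are stored contiguously.
        bits[residue * len(INTERACTION_TYPES) + INTERACTION_TYPES.index(kind)] = 1
    return bits

def ifp_85(interactions):
    """Presence/absence fingerprint: one bit per residue position."""
    bits = [0] * N_RESIDUES
    for residue, _kind in interactions:
        bits[residue] = 1
    return bits

# Hypothetical ligand contacting residue 3 in two ways and residue 47 in one way.
example = [(3, "type_0"), (3, "type_4"), (47, "type_2")]
print(len(ifp_85(example)), sum(ifp_85(example)))    # 85 bits, 2 set
print(len(ifp_595(example)), sum(ifp_595(example)))  # 595 bits, 3 set
```

Collapsing the seven type bits of each residue with a logical OR recovers the 85-bit version, which is why IFP_595 can be viewed as a refinement of IFP_85.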

For classification, multi-class RF models were derived to distinguish between type I, I½, and II inhibitors. Figure 2 reports the Matthews correlation coefficient (MCC) and balanced accuracy (BA) values for RF models trained with IFPs, ECFP4, and combined representations over 20 independent trials. Overall, RF models on the basis of ECFP4 yielded accurate predictions, consistent with our previous observations. This was the case for the folded and unfolded ECFP4 version, with median BA and MCC values greater than 0.70 and 0.65, respectively. However, application of IFPs further increased global prediction accuracy. IFP_85 yielded median BA and MCC values of 0.85 and 0.76, respectively. In addition, IFP_595 with further refined interaction information produced comparable BA but further increased MCC values, with a median MCC of 0.81. Compared to IFPs, model performance essentially remained constant when IFP and ECFP4 representations were combined (i.e., when fingerprints of different design were concatenated). Only very minor changes were observed that were not significant. Hence, IFP contributions mostly determined prediction accuracy and the minor fluctuations or reductions were likely due to ECFP4 feature noise in combined representations.
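For readers unfamiliar with the two metrics, the sketch below computes BA as the mean per-class recall and MCC using the standard multi-class generalization based on label counts. The labels and predictions are invented for illustration, and whether the study used exactly these formulations is an assumption.

```python
import math
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

def multiclass_mcc(y_true, y_pred):
    """Multi-class MCC from the number of correct predictions and class counts."""
    s = len(y_true)                                       # total samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)  # correct predictions
    t_k = Counter(y_true)                                 # true count per class
    p_k = Counter(y_pred)                                 # predicted count per class
    numerator = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    denominator = math.sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    return numerator / denominator if denominator else 0.0

# Invented example with the three inhibitor types.
y_true = ["I", "I", "I", "I1/2", "I1/2", "II", "II", "II"]
y_pred = ["I", "I", "I1/2", "I", "I1/2", "II", "II", "II"]

print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.722
print(round(multiclass_mcc(y_true, y_pred), 3))     # 0.619
```

In this toy case the two errors are confusions between type I and type I½, which lowers the recall of both classes while type II remains perfectly predicted.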

Predictive performance of random forest models on test sets. MCC and BA value distributions are reported for RF models using different representations

As a control, permutation tests were carried out (see Methods section) to confirm that RF models indeed detected inhibitor type-specific patterns. Figure 3 shows the results of permutation tests, i.e., the distribution of MCC values for 1000 RF models trained on data with randomized (shuffled) class labels using different representations. The results show that control models had only very little predictive capacity. None of the control models approached the accuracy levels of models with non-permuted labels, which supported the significance of the results.
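The logic of such a label-permutation control can be sketched in a few lines. The example below is a toy: a nearest-class-mean rule on a single invented feature stands in for the RF model, and test accuracy stands in for MCC. Only the shuffling procedure mirrors the description above.

```python
import random

def permutation_scores(train_labels, fit_and_score, n_permutations=1000, seed=0):
    """Scores of models trained on shuffled labels; fit_and_score is caller-supplied."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = list(train_labels)
        rng.shuffle(shuffled)  # break the link between features and labels
        scores.append(fit_and_score(shuffled))
    return scores

# Toy data (invented): one feature, two well-separated classes.
train_x = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x, test_y = [0.15, 0.35, 1.15, 1.35], [0, 0, 1, 1]

def fit_and_score(labels):
    """Stand-in for 'train a model, then score it on the untouched test set'."""
    means = {}
    for c in set(labels):
        xs = [x for x, y in zip(train_x, labels) if y == c]
        means[c] = sum(xs) / len(xs)
    preds = [min(means, key=lambda c: abs(x - means[c])) for x in test_x]
    return sum(p == t for p, t in zip(preds, test_y)) / len(test_y)

observed = fit_and_score(train_y)
null = permutation_scores(train_y, fit_and_score, n_permutations=200)
print("observed:", observed, "mean of permuted:", sum(null) / len(null))
```

The test labels are never shuffled, so a model that scores well only with the true training labels must have learned a genuine feature-label relationship.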

Permutation tests. For predictions on test sets, MCC value distributions are shown for RF models trained with randomized class labels using different representations. The vertical dashed line indicates MCC = 0 and the solid colored lines mark model performance for the same individual trial

Figure 4 reports the per-class performance for different types of kinase inhibitors with RF models using basic fingerprint versions. Type II inhibitors were most accurately predicted, especially using interaction information, with a median MCC of 0.95. Furthermore, prediction accuracy was higher for type I than type I½ inhibitors, which yielded median MCC values of 0.67 (IFP_85) and 0.63 (ECFP4_folded). Thus, inhibitors with binding modes combining binding characteristics of type I and II inhibitors were most challenging to predict, as one might expect. The more accurate predictions of type II compared to type I inhibitors were likely due to the presence of unique hydrogen bonding groups present in many type II inhibitors that distinguish them from type I inhibitors [22, 23]. These signature groups or substructures and their interactions are accounted for by atom environment/fragment fingerprints and IFPs, respectively.

Per-class performance. MCC value distributions are separately shown for test set predictions of type I (blue), I½ (orange), and II (green) kinase inhibitors, respectively, with RF models using IFP_85 and ECFP4_folded, respectively

Unsupervised learning for visualization

The unsupervised machine learning method t-distributed stochastic neighbor embedding (t-SNE) was applied for further comparison of representations and data visualization. Using this non-linear dimension reduction approach, a two-dimensional (2D) embedding was constructed from a multi-dimensional feature space on the basis of Tanimoto distances to preserve local similarities (see “Methods” section). Figure 5 shows t-SNE visualizations for IFP_85 and ECFP4_folded feature spaces containing all kinase inhibitors. The 2D t-SNE representations reveal much clearer clustering of inhibitors by type for IFP_85 than ECFP4_folded, which further prioritized IFPs for modeling. For example, the t-SNE map for IFP_85 clearly separated the majority of type II inhibitors from those with other binding modes. In addition, a separate cluster of type I inhibitors of a group of phosphatidyl inositol kinases (p110a, p110d, p110g, PIK3C3, PI4KA, and PI4KB) and serine/threonine-protein kinase mTOR emerged. These kinases differ structurally from many others in the human kinome, which is also reflected by different interactions with co-crystallized inhibitors that were accounted for by IFPs. In both maps, however, type I½ inhibitors often co-localized with type I inhibitors, which also illustrated why type I½ inhibitors were overall most challenging to predict.
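The Tanimoto distance underlying these maps can be computed directly from the sets of on-bits of two fingerprints. The sketch below uses the standard definition (one minus intersection over union); the fingerprints are invented, and the helper names are illustrative.

```python
def tanimoto_similarity(a, b):
    """Shared on-bits divided by the union of on-bits of two fingerprints."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0  # two empty fingerprints count as identical

def tanimoto_distance(a, b):
    return 1.0 - tanimoto_similarity(a, b)

def distance_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto distances."""
    return [[tanimoto_distance(x, y) for y in fingerprints] for x in fingerprints]

# Invented fingerprints, each given as the indices of its on-bits.
fps = [{1, 2, 3, 4}, {3, 4, 5}, {10, 11}]
for row in distance_matrix(fps):
    print([round(d, 2) for d in row])
```

A matrix of this kind is the sort of precomputed input that an embedding method such as t-SNE can work from, so that neighbors in the 2D map reflect fingerprint similarity rather than Euclidean distance between bit vectors.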

Visualization of feature spaces. Scatter plots show 2D t-SNE representations of the IFP_85 (left) and ECFP4_folded (right) fingerprint spaces on the basis of Tanimoto distances. Inhibitors (dots) are color-coded according to binding modes: type I (blue), I½ (orange), and II (green)

Active learning

To further compare the information content of structural and interaction representations, an active learning strategy was applied combining multi-class RF modeling and entropy-based selection of training instances. RF models were iteratively built with increasing numbers of training instances for the prediction of an external test set and the remaining compound pool. While test set predictions enable the estimation of model performance, predictions of the compound pool determine the choice of instances for addition to the training set. Initially, only three compounds were randomly selected from the pool for training the first RF model (one of each inhibitor type). At subsequent iterations, 10 compounds from the pool were chosen and added for retraining the model. Compounds from the pool with the highest uncertainty in their predictions, quantified as information entropy, were selected. The information entropy concept can be applied to the predicted probabilities of three possible states: type I, I½, and II. Therefore, entropy can also be interpreted as the expected amount of information that an instance would add to the model. The model was iteratively refined and tested to optimize prediction accuracy.
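The selection step can be illustrated with a short sketch. It assumes that the model returns, for every pool compound, a probability for each of the three inhibitor types; the compound identifiers and probabilities are invented, and the base of the logarithm (here 2) does not affect the ranking.

```python
import math

def prediction_entropy(probs):
    """Entropy of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_most_uncertain(pool_probs, batch_size=10):
    """Return the IDs of the pool compounds with the highest prediction entropy.

    pool_probs: dict mapping compound ID to [P(type I), P(type I1/2), P(type II)].
    """
    ranked = sorted(
        pool_probs,
        key=lambda cid: prediction_entropy(pool_probs[cid]),
        reverse=True,
    )
    return ranked[:batch_size]

# Invented predictions for four pool compounds.
pool = {
    "cpd_a": [0.98, 0.01, 0.01],  # confident: low entropy
    "cpd_b": [0.34, 0.33, 0.33],  # nearly uniform: highest entropy
    "cpd_c": [0.50, 0.45, 0.05],
    "cpd_d": [0.70, 0.20, 0.10],
}
picked = select_most_uncertain(pool, batch_size=2)
print(picked)  # ['cpd_b', 'cpd_c']
```

In the full procedure, the selected compounds would be moved from the pool to the training set and the model retrained before the next round of selection.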

Three independent trials with two-fold external cross-validation of active learning were performed. Figure 6 shows average MCC values at increasing numbers of training samples using different representations. As a control, entropy-based active learning was compared to random sample selection from the compound pool. In Fig. 6a, MCC values are reported for the complete compound pool and training set. Since compound instances were iteratively added to the training set, the model predicts more instances from the training set and fewer from the compound pool at each iteration. At the end of this procedure, RF models were built to predict the complete training set (i.e. 90% of the total data set). These models displayed nearly perfect accuracy. The results for compound pool predictions using different representations are shown in Fig. 6a. Entropy-based selection yielded earlier optimization of MCC performance compared to random selection. Figure 6b reports MCC values for classifying the external test set. When using approximately 500 training instances, prediction performance reached a plateau with MCC values of approximately 0.8 and remained constant for further increasing numbers of training samples, ultimately including all pool compounds (approximately 1800). Prediction accuracy was higher for IFPs than ECFP4. For IFPs, there was a confined early improvement in MCC performance for entropy-based over random selection. By contrast, for ECFP4, the active learning entropy selection of training instances provided a significant advantage. Taken together, the results in Fig. 6 reveal that IFPs are information-rich representations with high redundancy. A high level of interaction redundancy captured by IFPs was indicated by early saturation of prediction performance using only limited numbers of training instances, even if randomly selected. Hence, small training sets already yielded sufficient IFP information for discriminating between different types of kinase inhibitors. Furthermore, high redundancy was indicated by the observation that IFP_595 only yielded a minor improvement in prediction accuracy compared to the basic IFP_85 version with no further specified interactions. Both ECFP4_unfolded and ECFP4_folded had lower information content than IFPs but higher dimensionality. For compound pool predictions with ECFP4, many more training examples than for IFPs were required for successful model building. Interestingly, for test set predictions, selection of training instances based on entropy also resulted in an early optimization of prediction performance, albeit at a lower level than IFPs. ECFP4 predictions with entropy-based selection reached a plateau at MCC values

Active learning performance. The MCC values for a compound pool and b test set predictions are reported for different representations using entropy-based (left) and random (right) selection of training samples. In b, shaded areas of each curve indicate standard deviations of different prediction trials

Figure 7 monitors the difference between MCC values for entropy-based and random selection at increasing numbers of training instances. For each fingerprint, a performance difference peak is observed. For ECFP4_folded, the largest difference corresponded to 0.28 MCC units and occurred for approximately 140 examples. By contrast, for ECFP4_unfolded, the largest difference was 0.4 MCC units for approximately 120 training samples. For IFPs, the maximum MCC difference was approximately 0.2 for small numbers of training instances including approximately 60 compounds (IFP_595). These findings confirmed that selection based on entropy yielded informative training instances especially for atom environment fingerprints. For the information-rich IFPs, even random selection led to early increases in predictive performance, resulting in a small peak difference between entropy-based and random selection for small numbers of training instances.

Entropy-based versus random selection. For varying training set size, the MCC value difference between entropy-based and random selection is reported for test set predictions using different representations. Shaded areas of each curve indicate standard deviations of difference calculations between corresponding predictions

Although IFPs capture more information about compound binding modes than atom environment fingerprints, predicting kinase inhibitor binding modes from chemical structure also produces overall accurate predictions and remains attractive for practical applications. This is the case because X-ray structures are required to generate IFPs for predicting new compound binding modes. However, once a structure with a new inhibitor is obtained, the binding mode can be directly determined, without the need to translate interactions into an IFP for machine learning. By contrast, once a compound structure-based model is trained and validated it can be readily used to predict binding modes of new inhibitors.

The results in Fig. 8 indicate that on the order of 500 experimentally determined structures of inhibitor binding modes were required to maximize the accuracy of predictions using the folded as well as unfolded ECFP4 versions. For these ECFP4-based predictions, entropy-based instance selection was essential for effective active learning. The results reveal promising predictions of binding modes of test inhibitors on the basis of entropy-guided selection of training samples, with an accuracy approaching 80% for approximately 500 training compounds. Prediction performance essentially remained constant for larger numbers of training instances. Hence, the number of currently available kinase inhibitors with experimentally determined binding modes by far exceeds (approx. 4-fold) the number of informative training instances required for overall accurate multi-class prediction of inhibitor binding modes on the basis of chemical structure.

Active learning on the basis of chemical structure. Test set MCC (purple) and BA (blue) performance is shown for increasing numbers of training instances, with entropy-based (solid line) and random (dashed line) selection of compounds from the pool. Shaded areas of each curve indicate standard deviations of different prediction trials

Feature analysis

The importance of individual IFP and ECFP4 features for the prediction of kinase inhibitor binding modes was also assessed (see Methods section). For each active learning step, a multi-class RF model was built and its feature importance values were estimated. Figure 9 shows the change in feature importance over different active learning iterations, i.e., different numbers of training set samples.

Feature importance analysis. Importance values for a ECFP4 and b 85-bit IFP features are reported for different numbers of training set samples (i.e. active learning iterations). In a and b, only features with a median importance of at least 20% and 10% of the maximum are shown, respectively. Importance values are color-coded as indicated. In a, the five features with largest median values across all iterations are shown in the insert at the bottom

The median importance value of each feature was calculated over all iterations. In Fig. 9, features with a median importance value of at least 20% and 10% of the maximum are shown for ECFP4 and IFP, respectively. Overall, very similar feature sets were consistently prioritized when re-training the classification models. As indicated by the observed model performance, large training sets were not required to accurately predict kinase inhibitor binding modes. However, the RF algorithm detected discriminative feature patterns early on. The analysis showed that the important features detected with 90% of the data were very similar to those prioritized using smaller training sets.
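The filtering rule described here can be sketched as follows. The sketch assumes that "the maximum" refers to the largest median importance among all features, which the text does not state explicitly, and the importance values are invented.

```python
from statistics import median

def prioritized_features(importances_by_iteration, fraction_of_max):
    """Indices of features whose median importance reaches a fraction of the top median.

    importances_by_iteration: one list of per-feature importance values per
    active learning iteration.
    """
    n_features = len(importances_by_iteration[0])
    medians = [
        median(iteration[j] for iteration in importances_by_iteration)
        for j in range(n_features)
    ]
    cutoff = fraction_of_max * max(medians)  # assumed reference: the largest median
    return [j for j, m in enumerate(medians) if m >= cutoff]

# Invented importances: 3 iterations x 4 features.
history = [
    [0.50, 0.05, 0.30, 0.15],
    [0.40, 0.04, 0.36, 0.20],
    [0.45, 0.06, 0.33, 0.16],
]
print(prioritized_features(history, 0.20))  # [0, 2, 3]
print(prioritized_features(history, 0.50))  # [0, 2]
```

Using the median over iterations rather than the value from a single model makes the selection robust to iterations in which a feature is briefly over- or under-weighted.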

Feature importance values were also assessed for RF models built with concatenated fingerprints, which included both atom environments and IFP features. In this case, features found to be most relevant for the predictions were the same IFP features as observed before. Thus, these findings revealed that the inclusion of ECFP4 features essentially retained prioritized IFP features, yielding very similar results.

Do you need data software packages when analysing qualitative data?

Qualitative data software packages are not a prerequisite for undertaking qualitative analysis, but a range of programmes is available that can assist the qualitative researcher. Software programmes vary in design and application but can be divided into text retrievers, code-and-retrieve packages, and theory builders [6]. NVivo and NUD*IST are widely used because they have sophisticated code-and-retrieve functions and modelling capabilities, which speed up the process of managing large data sets and data retrieval. Repetitions within data can be quantified, and memos and hyperlinks can be attached to data. Analytical processes can be mapped and tracked, and linkages across data can be visualised, leading to theory development [6]. Disadvantages of using qualitative data software packages include the complexity of the software and the incompatibility of some programmes with standard text formats. Extensive coding and categorising can result in data becoming unmanageable, and researchers may find that visualising data on screen inhibits conceptualisation of the data.

Molecular Interaction Maps

A Molecular Interaction Map (MIM) is a diagram convention that is capable of unambiguous representation of networks containing multi-protein complexes, protein modifications, and enzymes that are substrates of other enzymes. This graphical representation makes it possible to view all of the many interactions in which a given molecule may be involved, and it can portray competing interactions, which are common in bioregulatory networks. In order to facilitate linkage to databases, each molecular species is represented only once in a diagram. A formal description of the MIM notation can be found in Kohn et al., Molecular Biology of the Cell 17, 1-13 (2006). The updated formal specification for software implementation can be found in Luna et al., BMC Bioinformatics 12:167 (2011).

Current diagram editors implementing these symbols are PathVisio and MIMTool.

MIM Diagrams: Interactive electronic molecular interaction maps (eMIMs) allow the user to navigate through the molecular interaction network and link to molecular databases, references and annotations that contain pertinent information.

Molecular species can be located on the map by means of indexed grid coordinates and on eMIMs through interactive links. Each interaction is referenced to an annotation list where pertinent information and references can be found.

MIM Software: There are several ongoing software projects to simplify creating and editing MIM diagrams and related metadata. Some of the software components provided allow developers to speed up the development of MIM support, allow for interoperable tools, and to provide a means of mining the data contained in MIM diagrams for other uses.

  • Diagrams:
      - AKT regulation by phosphorylation/dephosphorylation reactions.
      - Cellular response to DNA double-strand breaks (DSB).
      - Heuristic MIM of signaling from EGF receptors.
      - Chromatin assembly during replication.
      - Transcriptional activation in response to low oxygen levels.
      - Regulatory response to DNA damage.
      - Network model: cell cycle regulation of the early stages of DNA synthesis.
      - Senescence regulation by cell cycle checkpoints and the epithelial-mesenchymal transition.
      - Connecting DNA damage and metabolism.
  • MIM Documentation:
      - Documentation on how to read and understand MIM diagrams. Note: eMIMs users refer to this description.
      - An XML Schema for the machine-readable format for MIM diagrams supporting the visual layout of MIM diagrams.
      - Example datasets.
  • Software components:
      - A Java-based API that binds MIMML elements to Java objects and provides JavaBeans-style methods such as "getFoo()" and "setFoo()", thereby providing a mechanism for parsing, creating, and manipulating MIMML documents.
      - A PathVisio plugin that adds the ability to draw all the MIM glyphs and to annotate diagram elements with comments, literature references, and links to external databases. It outputs to PDF, PNG, GPML, and MIMML and is available for Windows, OS X, and Linux using platform-independent Java.
      - A MIM drawing tool that outputs SBML, MIMML, and PDF files. It possesses a novel semi-automatic orthogonal drawing engine to minimize bends and crossovers when drawing interactions. Available for Windows and Linux.
      - A plugin for use with PathVisio-MIM that assists in the creation of pathway diagrams by ensuring correct usage of the MIM notation, thereby reducing ambiguity when diagrams are shared amongst biologists.

    Systems Biology Graphical Notation:

    The MIM notation was the basis for the development of the entity-relationship component (SBGN-ER) of the Systems Biology Graphical Notations (SBGN). SBGN is an international effort to standardize diagrams depicting biochemical and cellular processes studied in systems biology, including several notations designed for different purposes.

    An animated description of the steps leading to src activation by EGFR is available (pdf).

    This web site is a development of the Genomics and Pharmacology Facility, Developmental Therapeutics Branch (DTB), Center for Cancer Research (CCR), National Cancer Institute (NCI).


    The flow of genetic information is considered to be one of the five core, overarching concepts in undergraduate biology (AAAS, 2011). Meiosis is a topic that clearly falls within the category of information flow, as it explains how information encoded in DNA passes from one generation to the next. The process of meiosis is an important part of the curriculum, as it helps students understand major concepts in genetics and evolution. Much research on student understanding of meiosis has focused on identifying and describing the various misconceptions (or alternate conceptions) held by learners (Kindfield, 1994; Lewis et al., 2000; Wright & Newman, 2011; Newman et al., 2012; Ozcan et al., 2012; Smith & Knight, 2012; Kalas et al., 2013). While this research is extremely important for helping build awareness of the various difficulties that students will likely face when learning about meiosis, it does not help educators understand why these difficulties persist. To address this gap in the literature, much of our work has been devoted to investigating what aspects of conceptual understanding of meiosis are missing for students. We have previously established that learners and experts conceptualize aspects of meiosis very differently and that only experts bring a molecular level of understanding to their descriptions of the process (Newman et al., 2012; Wright et al., 2017).

    We argue that one of the reasons for student difficulties in understanding meiosis is the incredible complexity of DNA itself. Genetic information is encoded in DNA in both concrete and abstract ways, making DNA a difficult molecule to conceptualize. Plus, DNA is a molecule that is incredibly small (the helix cannot be observed directly, even with a microscope) while also being incredibly large (containing thousands or millions of subunits). While genetic information is encoded in DNA, not all parts of a DNA molecule are used at the same time, by the same cell type, or even for the same purpose. All of this complexity is difficult for a novice to grasp and integrate into a cohesive mental model. The DNA Triangle framework integrates three different scales at which DNA can be considered: chromosomal (C), molecular (M), and informational (I) (Wright et al., 2017). The C level describes the structure of chromosomes (with and without sister chromatids), identification of chromosomes by banding pattern and centromere location, representations of chromatin packing, and counting chromosomes. The I level describes how DNA encodes genetic information, such as genes or alleles, protein-coding regions, or regulatory information. Finally, the M level describes the chemistry and nucleotide sequence of DNA. In previous work (Wright et al., 2017), the DNA Triangle framework was applied to meiosis and used to understand how experts described the concepts of ploidy (how many sets of genetic information are contained in the cell), homology, and the mechanism of homologous pairing (renamed “segregation” in this article). Biology experts explained the concept of homology by linking the I and M levels, the concept of ploidy using both the C and I levels, and how proper segregation was achieved with the C and M levels (Figure 1). Students, on the other hand, focused mainly on the C level and did not, for any of the topics, bring in M-level knowledge.

    Figure 1. The DNA Triangle framework applied to meiosis. The concept of how proper Segregation is achieved links the Molecular and Chromosomal levels; the concept of Homology links the Informational and Molecular levels; and the concept of Ploidy links the Informational and Chromosomal levels. Figure modified from Wright et al. (2017).

    We then used the framework to analyze text passages from college-level introductory and mid/upper-level textbooks to better understand where students' ideas about meiosis may originate (Wright et al., 2017). While not a perfect resource, textbooks are frequently used in college science courses because they contain extensive information about the particular subject and are one medium in which scientific knowledge is transferred into teachable knowledge. The results revealed that (1) many important concepts about meiosis were missing from college-level textbooks and (2) many of the concepts were not consistently presented to students at the appropriate level of DNA, according to the framework (Wright et al., 2017). For example, homologous chromosomes in introductory books were almost always described at the chromosomal level (e.g., chromosomes with the same size and shape) but not at the molecular level (e.g., containing nearly the same sequence of DNA nucleotides). Mid- and upper-level textbooks were more likely to use molecular-level language (i.e., sequence of nucleotides, sequence of bases, base-pairing based on complementary sequences) to describe concepts of homologous chromosomes and homologous pairing; introductory-level textbooks were nearly devoid of molecular-level language. This analysis partially answers the “why” and “where” questions related to students' difficulties with meiosis. Most college-level textbooks fail to describe important concepts consistently and do not help students “see” the molecular level when describing molecular-based concepts that are important for meiosis.

    As experts are well aware, biology is not solely communicated through written or spoken words. Thus, an analysis of textbook passages alone does not give the complete picture of how meiosis is presented to learners. The discipline of biology is highly dependent on visual representations (graphs, illustrations, diagrams, etc.) that are used to communicate important ideas and processes. Visual representations are abundant in most college-level biology textbooks and, thus, should be investigated for the messages they are conveying to students. For example, a prior study showed that one commonly used introductory biology textbook contained 1214 figures (Wright et al., 2018). Many textbook figures are intended to help the learner visualize structures and processes that are not directly observable and are designed to help highlight important aspects about a process or phenomenon. Quillin and Thomas (2015) argue that teaching biology, which covers a vast expanse of time scales (chemical reactions to evolutionary change) and of size scales (atoms to ecosystems), would not be possible without the use of visual representations. Visual representations also provide learners a tool for developing scientific reasoning skills, because they give learners something to reason about (Anderson et al., 2013).

    Since figures in biology textbooks are meant to help teach students (novices) biology content, we examined chapters from several commonly used textbooks for evidence that they provide the necessary information to complete the DNA Triangle for student learners. In other words, do textbook figures make up for the gaps in written descriptions of meiosis-related concepts? We analyzed meiosis-related diagrams and illustrations from 18 different textbooks (nine introductory-level and nine mid/upper-level), resulting in a total of 112 figures. Whereas our previous study (Wright et al., 2017) examined textbook passages for descriptions of ploidy, homology, and the mechanism of homologous pairing (segregation), in the present study we examined textbook figures for illustrations of the same concepts. First, we determined whether meiosis-related textbook figures made important concepts about ploidy, homology, and segregation explicit to learners. Then we used the DNA Triangle framework to determine the extent to which the figures presented information at the three levels (M, C, and/or I).

    Differential network analysis and protein-protein interaction study reveals active protein modules in glucocorticoid resistance for infant acute lymphoblastic leukemia, Z Mousavian, A Nowzari-Dalini, Y Rahmatallah, A Masoudi-Nejad, Molecular Medicine 25 (1), 36

    Active repurposing of drug candidates for melanoma based on GWAS, PheWAS and a wide range of omics data. A Khosravi, B Jayaram, B Goliaei, A Masoudi-Nejad, Molecular Medicine 25 (1), 30

    FeatureSelect: a software for feature selection based on machine learning approaches, Y Masoudi-Sobhanzadeh, H Motieghader, A Masoudi-Nejad, BMC bioinformatics 20 (1), 170

    Network-based expression analyses and experimental validations revealed high co-expression between Yap1 and stem cell markers compared to differentiated cells, F Dehghanian, Z Hojati, F Esmaeili, A Masoudi-Nejad, Genomics 111 (4), 831-839

    GPS: Identification of disease genes by rank aggregation of multi-genomic scoring schemes, A Meshkin, A Shakery, A Masoudi-Nejad, Genomics 111 (4), 612-618

    Genome-wide DNA methylation profiling in ectopic and eutopic of endometrial tissues, N Barjaste, M Shahhoseini, P Afsharian, A Sharifi-Zarchi, . , Journal of assisted reproduction and genetics, 1-10

    Trader as a new optimization algorithm predicts drug-target interactions efficiently, Y Masoudi-Sobhanzadeh, Y Omidi, M Amanlou, A Masoudi-Nejad, Scientific Reports 9 (1), 9348

    Drug databases and their contributions to drug repurposing, Y Masoudi-Sobhanzadeh, Y Omidi, M Amanlou, A Masoudi-Nejad, Genomics

    Detection of novel biomarkers for early detection of Non-Muscle-Invasive Bladder Cancer using Competing Endogenous RNA network analysis, M Kouhsar, SA Jamalkandi, A Moeini, A Masoudi-Nejad, Scientific reports 9 (1), 8434

    DrugR+: A comprehensive relational database for drug repurposing, combination therapy, and replacement therapy, Y Masoudi-Sobhanzadeh, Y Omidi, M Amanlou, A Masoudi-Nejad, Computers in biology and medicine 109, 254-262

    Systematic analysis of genes and diseases using PheWAS-associated networks, A Khosravi, M Kouhsar, B Goliaei, B Jayaram, A Masoudi-Nejad, Computers in biology and medicine 109, 311-321

    Novel putative drugs and key initiating genes for neurodegenerative disease determined using network‐based genetic integrative analysis, Z Mortezaei, JB Cazier, AA Mehrabi, C Cheng, A Masoudi‐Nejad, Journal of cellular biochemistry 120 (4), 5459-5471

    CatbNet: A Multi Network Analyzer for Comparing and Analyzing the Topology of Biological Networks, E Pournoor, N Elmi, A Masoudi-Nejad, Current genomics 20 (1), 69-75

    LncRNA and mRNA integration network reconstruction reveals novel key regulators in esophageal squamous-cell carcinoma, S Alaei, B Sadeghi, A Najafi, A Masoudi-Nejad, Genomics 111 (1), 76-89

    Block alignment: New representation and comparison method to study evolution of genomes, MNA Lanjanian H, Nowzari A, Hosseinkhan N, Masoudi-Nejad A, Genomics,

    Cattle infection response network and its functional modules, H Beiki, A Pakdel, AN Javaremi, A Masoudi-Nejad, JM Reecy, BMC immunology 19 (1), 2

    Reconstruction of the genome-scale co-expression network for the Hippo signaling pathway in colorectal cancer, F Dehghanian, Z Hojati, N Hosseinkhan, Z Mousavian, A Masoudi-Nejad, Computers in biology and medicine 99, 76-84

    SCAN-Toolbox: Structural COBRA Add-oN (SCAN) for Analysing Large Metabolic Networks, Y Asgari, Z Zabihinpour, A Masoudi-Nejad, Current Bioinformatics 13 (1), 100-107

    Comparison of gene co-expression networks in Pseudomonas aeruginosa and Staphylococcus aureus reveals conservation in some aspects of virulence, N Hosseinkhan, Z Mousavian, A Masoudi-Nejad, Gene 639, 1-10

    Link prediction potentials for biological networks, S Sulaimany, M Khansari, A Masoudi-Nejad, International Journal of Data Mining and Bioinformatics 20 (2), 161-184

    The importance of α-CT and Salt bridges in the Formation of Insulin and its Receptor Complex by Computational Simulation, M Dehghan-Shasaltaneh, H Lanjanian, GH Riazi, A Masoudi-Nejad, Iranian journal of pharmaceutical research: IJPR 17 (1), 63

    Sequence-based 5-mers highly correlated to epigenetic modifications in genes interactions, D Salimi, A Moeini, A Masoudi-Nejad, Genes & genomics 40 (12), 1363-1371

    Task modulates functional connectivity networks in free viewing behavior, H Seidkhani, AR Nikolaev, RN Meghanathan, H Pezeshk, . NeuroImage 159, 289-301

    Biogeography, distribution and conservation status of maples (Acer L.) in Iran, M Mohtashamian, F Attar, K Kavousi, A Masoudi-Nejad, Trees 31 (5), 1583-1598

    Inhibitory effects of lactic acid bacteria isolated from traditional fermented foods against aflatoxigenic Aspergillus spp., M Ebrahimi, M Khomeiri, A Masoudi-Nejad, A Sadeghi, B Sadeghi, . Comparative Clinical Pathology 26 (5), 1083-1092

    Candidate novel long noncoding RNAs, MicroRNAs and putative drugs for Parkinson's disease using a robust and efficient genome-wide association study, Z Mortezaei, H Lanjanian, A Masoudi-Nejad, Genomics 109 (3-4), 158-164

    Systems biology study of transcriptional and post-transcriptional co-regulatory network sheds light on key regulators involved in important biological processes in Citrus sinensis, E Khodadadi, AA Mehrabi, A Najafi, S Rastad, A Masoudi-Nejad, Physiology and molecular biology of plants 23 (2), 331-342

    Micromorphological studies of leaf epidermal features in populations of maples (Acer L.) from Iran, M Mohtashamian, F Attar, K Kavousi, A Masoudi-Nejad, Phytotaxa 299 (1), 36-54

    Expectation propagation for large scale Bayesian inference of non-linear molecular networks from perturbation data, Z Narimani, H Beigy, A Ahmad, A Masoudi-Nejad, H Fröhlich, PloS one 12 (2), e0171240

    Network-based expression analysis reveals key genes related to glucocorticoid resistance in infant acute lymphoblastic leukemia, Z Mousavian, A Nowzari-Dalini, RW Stam, Y Rahmatallah, . Cellular Oncology 40 (1), 33-45

    A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata, H Motieghader, A Najafi, B Sadeghi, A Masoudi-Nejad, Informatics in Medicine Unlocked 9, 246-254

    mRNA–miRNA bipartite network reconstruction to predict prognostic module biomarkers in colorectal cancer stage differentiation, H Motieghader, M Kouhsar, A Najafi, B Sadeghi, A Masoudi-Nejad, Molecular BioSystems 13 (10), 2168-2180

    Sequential and mixed genetic algorithm and learning automata (SGALA, MGALA) for feature selection in QSAR, H MotieGhader, S Gharaghani, Y Masoudi-Sobhanzadeh, . Iranian journal of pharmaceutical research: IJPR 16 (2), 533

    Predicting brain network changes in Alzheimer's disease with link prediction algorithms, S Sulaimany, M Khansari, P Zarrineh, M Daianu, N Jahanshad, . Molecular BioSystems 13 (4), 725-735


    Meet the Escape Artists of X-Chromosome Inactivation

    While an escape from the zoo sounds newsworthy on its own, the addition of X-chromosome inactivation (XCI) makes it irresistible for our news crew. An exciting new epigenetic effort has captured the exceptional XCI escape artists across eutherian mammals.

    XCI is a dosage compensation mechanism that inactivates one of the two X chromosomes in females. Interestingly, not every gene gets inactivated, and the proportion of genes that escape varies between species. Despite being known for their calico cat mascot, the XCI experts in the lab of Carolyn Brown (University of British Columbia, Canada) have taken a trip to the zoo to expose exceptions between species.

    In this research, they leveraged several publicly available datasets. First, whole-genome sequencing and RNA-seq were used to find the ratio of inactive X (Xi) to active X (Xa) expression (Xi/Xa) for X-linked genes in humans and mice. Next, the excited examiners established a DNAm threshold for calling XCI status of X-linked genes with CpG islands for 12 species based on whole-genome bisulfite sequencing (WGBS) (humans, chimps, mice, cows, sheep, goats, and pigs), reduced representation bisulfite sequencing (RRBS) (horses) and 450k arrays (humans, chimps, bonobos, gorillas, orangutans, and dogs). Here are the extraordinary details:

    • In most species, 80-90% of X-linked genes are subject to XCI
      • Mice are an exception: they have the highest proportion of XCI genes (95%)
      • 4 genes (RPS4X, CDK16, EIF1AX, and GEMIN8) show primate-specific XCI escape
      • XCI escape for one gene, KDM5C, is specific to Artiodactyla (cows, sheep, goats, and pigs)
      • Increased LTR repeats (humans, chimps, horses)
      • Decreased LINE repeats (chimps, mice, sheep, horses)
      • Decreased DNA repeats (mice, cows, sheep)

      First author Bradley Balaton shares, “These differences follow evolutionary lines and genes escape X-chromosome inactivation when the Y chromosome homologues are conserved, and then are subject to inactivation when the Y homologue no longer exists. This opens an evolutionary aspect of the control of how genes that escape X-chromosome inactivation are regulated, and we do see some common features associated with X-chromosome inactivation status conserved across species. We also hope that our X-chromosome inactivation calls across species will be useful to researchers working with these other mammalian species.”


      Plant materials and growth conditions

      The Arabidopsis thaliana ecotype Columbia-0 (Col-0) was used as WT in this study. The T-DNA insertion mutants SALK_025449 and WiscDsLoxHs122_02H for MYB106 were ordered from the Arabidopsis Biological Resource Center (ABRC). Homozygous mutants were screened by PCR and transcriptional levels were determined by RT-PCR. Primers used are listed in Table S1. Arabidopsis seeds were surface-sterilized and germinated on half-strength Murashige and Skoog (MS) plates (half-strength MS salts, 0.8% agar, 1% sucrose, pH 5.7). Then the 7-d-old seedlings were transferred to pots and grown in a growth chamber at 22°C (16 h light/8 h dark, 200 µmol m⁻² s⁻¹). Under short-day conditions, plants were grown in a growth chamber at 22°C (8 h light/16 h dark, 100 µmol m⁻² s⁻¹). Tobacco (N. benthamiana) plants were cultivated in a growth chamber at 22°C under long-day conditions (16 h light/8 h dark).

      Construction of transgenic over-expression lines

      To generate Pro35S:MYB106-GFP over-expression lines, the coding region of MYB106 was amplified using the primers shown in Table S1 for subsequent cloning into pYJGFP (Niu et al., 2020). After confirmation by sequencing, the Pro35S:MYB106-GFP construct was transformed into myb106 mutant background lines through Agrobacterium-mediated floral dip, followed by selection of transgenic lines on 1/2 MS media containing hygromycin B and subsequent verification by western blot analysis.

      Measurement of flowering time

      The number of rosette leaves when the first flower became visible was used as an indicator of flowering time. The number of days after germination until the first flower bud emerged was also recorded as a measure of flowering time (Smyth et al., 1990).

      RNA-seq and data analysis

      Total RNA was extracted from 6-week-old flower tissues of WT and myb106-1 mutants using the ISOLATE II RNA Plant Kit (Catalog No. BIO-52077; Bioline, UK). RNA quantity and quality were assessed with a Qubit 3.0 (Catalog No. 2321610866; Thermo Fisher Scientific, USA). RNA-seq was conducted by GENEWIZ Company (USA) with three biological replicates. RNA-seq libraries were constructed with the Illumina TruSeq RNA Sample Prep Kit following the manufacturer's protocols. High-throughput sequencing was then performed on the Illumina HiSeq 2000 platform. Qualified reads were mapped to the Columbia genome using Hisat2 v.2.0.1 with default parameters (Kim et al., 2015). Gene expression was calculated with Cuffdiff v2.2.1, which reports FPKM (Acevedo et al., 2016). Using the NGS data, genes were sorted by the log2 ratio of myb106-1/WT. Candidate genes were then clustered based on their expression profiles and related pathways. Finally, a specific regulatory model of MYB106 with up- and down-regulated targets was generated. KEGG was used to analyze gene pathways.
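      The sorting step described above, ranking genes by the log2 ratio of myb106-1/WT expression, can be sketched in a few lines. This is only an illustration and not the authors' pipeline: the function name, the gene names, the FPKM values, and the pseudocount (added here so that genes with zero FPKM do not cause a division by zero) are all assumptions.

```python
import math

def rank_by_log2_ratio(fpkm_mutant, fpkm_wt, pseudocount=1.0):
    """Return (gene, log2 ratio) pairs sorted from most up- to most down-regulated.

    The pseudocount is an assumption of this sketch, not something the
    source specifies; it keeps the ratio defined when an FPKM is zero.
    """
    ratios = {}
    # Only genes measured in both samples can be compared.
    for gene in fpkm_mutant.keys() & fpkm_wt.keys():
        ratio = (fpkm_mutant[gene] + pseudocount) / (fpkm_wt[gene] + pseudocount)
        ratios[gene] = math.log2(ratio)
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Hypothetical FPKM values, chosen only to make the arithmetic easy to follow.
mutant = {"geneA": 31.0, "geneB": 3.0, "geneC": 7.0}
wild_type = {"geneA": 7.0, "geneB": 15.0, "geneC": 7.0}

ranked = rank_by_log2_ratio(mutant, wild_type)
for gene, log2_ratio in ranked:
    print(f"{gene}\t{log2_ratio:+.2f}")
```

      Positive values indicate higher expression in the mutant than in WT, negative values lower expression, and values near zero little change.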

      Quantitative real-time PCR analysis

      Total RNA was extracted from 6-week-old flower tissues. Complementary DNA (cDNA) was synthesized from 1 µg total RNA using the iScript™ genomic DNA Clear cDNA Synthesis Kit (BIO-RAD). The cDNA was then used as the template for reverse-transcription polymerase chain reaction (RT-PCR) and quantitative real-time PCR (qRT-PCR). Quantitative RT-PCR reactions were performed using SYBR Green dye (Catalog No. 4368577; Thermo Fisher Scientific) according to the manufacturer's instructions in a Bio-Rad CFX96 Real-Time PCR System (Catalog No. 1855195; Bio-Rad, USA). ACTIN2 of Arabidopsis was used as the internal control to normalize the expression levels of the target genes. Primers used are listed in Table S1.

      Dual luciferase reporter assay

      Promoter sequences of downstream candidate genes were identified from the TAIR website, and primers were designed to clone 1–2 kb sequences upstream of the start codon (ATG), including the 5′ untranslated regions. The promoters of candidate genes were then inserted upstream of the firefly LUC gene in the pGreenII0800-LUC vector, which was used as the reporter plasmid. REN under the CaMV35S promoter was used as the endogenous control (Hellens et al., 2005). The pYJGFP-MYB106 (35Spro: MYB106) was used as the effector plasmid, with pYJGFP (35Spro: GFP) as the control plasmid. Thirty-d-old tobacco leaves were infiltrated with Agrobacterium tumefaciens (GV3101) containing both the effector plasmid and reporter plasmid. Three leaf discs of 1 cm in diameter were collected at 2 and 3 dpi and frozen in liquid nitrogen. A dual-LUC reporter assay kit (Catalog No. E1910; Promega, USA) was used to measure LUC and REN activities. The binding ability of pYJGFP-MYB106 to different promoter sequences was reported as the ratio of LUC to REN.
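      As a rough illustration of the LUC/REN readout described above, the sketch below normalizes each firefly LUC reading by the REN reading from the same leaf disc and averages over replicates. The function name and all luminescence values are hypothetical, and averaging per-disc ratios is an assumption; the source states only that results were reported as the ratio of LUC to REN.

```python
from statistics import mean

def mean_luc_ren(readings):
    """Average the LUC/REN ratio over replicate (LUC, REN) readings."""
    ratios = []
    for luc, ren in readings:
        if ren <= 0:
            # Without a positive REN signal the internal control is unusable.
            raise ValueError("REN signal must be positive to normalize LUC")
        ratios.append(luc / ren)
    return mean(ratios)

# Hypothetical luminescence readings for three leaf discs per condition.
effector = [(1200.0, 400.0), (1500.0, 500.0), (900.0, 300.0)]  # 35Spro: MYB106
control = [(400.0, 400.0), (500.0, 500.0), (300.0, 300.0)]     # 35Spro: GFP

effector_ratio = mean_luc_ren(effector)
control_ratio = mean_luc_ren(control)
print(f"effector LUC/REN = {effector_ratio:.2f}, control LUC/REN = {control_ratio:.2f}")
```

      Dividing by REN corrects for differences in infiltration efficiency between discs, so the two conditions can be compared directly.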

      Protein expression and EMSA

      The coding region of the MYB106 gene was cloned into the pGEX-4T-1 vector (Amersham Biosciences). The recombinant plasmid with a glutathione S-transferase (GST) tag was transformed into the Rosetta (DE3) strain of Escherichia coli and then induced with 2.5 mmol/L isopropyl-β-D-thiogalactoside (IPTG) at 25°C for 4 h. Cell pellets were collected and lysed by sonication in phosphate-buffered saline. GST-tagged proteins were purified with GST-bind resin (Catalog No. 70541; Novagen, Germany) according to the manufacturer's instructions. Electrophoretic mobility shift assays were performed using 1 μg of purified protein with a Light Shift Chemiluminescent EMSA kit (Catalog No. 20148; Thermo Fisher Scientific). GST-MYB106 protein was incubated with biotin-labeled probes, with unlabeled probes and mutated probes used as competitors. GST protein was used as the negative control. Protein–DNA complexes were then separated on a native polyacrylamide gel, transferred onto a nylon membrane (RPN203B; GE, USA), and detected by a chemiluminescence method. The oligonucleotide sequences of the biotin-labeled and unlabeled probes were synthesized by Integrated DNA Technologies (IDT) and are listed in Table S1.

      Chromatin immunoprecipitation qPCR assay

      The chromatin immunoprecipitation assay was carried out according to the protocols described in the ChIP-seq kit (Catalog No. 01010152; Diagenode, Belgium). Briefly, 3-week-old seedlings of myb106-2 and 35S:MYB106-GFP/myb106-2 were collected (1 g, fresh weight) and cross-linked for 15 min under vacuum in crosslink buffer containing 1% formaldehyde; cross-linking was stopped by adding 100 mmol/L glycine for another 5 min. After washing twice with distilled water, the samples were used for chromatin DNA isolation and sonication, and immunoprecipitation was then performed with GFP antibodies at 4°C overnight with gentle rotation. The immunoprecipitated complexes were then precipitated with DiaMag protein A-coated magnetic beads. Finally, the precipitated DNA was eluted, de-crosslinked, and isolated using IPure Kit v2, then analyzed by qPCR. Specific ChIP-qPCR primers were designed to amplify promoter sequences of FT (Table S1). The sonicated chromatin DNA without precipitation served as the input control and the ChIP results were presented as a percentage of input DNA, while ACTIN2 was used as a negative control.
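      The source does not say how the percentage of input was computed. The sketch below uses one widely used convention and should be read as an assumption rather than the authors' calculation: the input Ct is first adjusted for the fraction of chromatin set aside as input, and the enrichment is then 100 × 2^(adjusted input Ct − IP Ct), which presumes roughly 100% amplification efficiency. The function name and the Ct values are hypothetical.

```python
import math

def percent_input(ct_ip, ct_input, input_fraction):
    """Express a ChIP-qPCR signal as a percentage of input DNA.

    input_fraction is the share of chromatin kept as input (e.g. 0.01 for 1%).
    Assumes the amount of product doubles every cycle.
    """
    if not 0 < input_fraction <= 1:
        raise ValueError("input_fraction must be in the interval (0, 1]")
    # Shift the input Ct to what it would be if 100% of the chromatin were measured.
    adjusted_input_ct = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (adjusted_input_ct - ct_ip)

# Hypothetical Ct values, chosen so the arithmetic is easy to check by hand:
# adjusted input Ct = 25 - log2(4) = 23, so enrichment = 100 * 2**(23 - 26).
example = percent_input(ct_ip=26.0, ct_input=25.0, input_fraction=0.25)
print(f"{example:.1f}% of input")
```

      Computing the same quantity for a negative-control region such as ACTIN2 gives the background against which enrichment at the FT promoter can be judged.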

      Yeast-two-hybrid assay

      The coding regions of MYB106 and six BTB/POZ genes (BPM1-6) were cloned into the pGADT7 and pGBKT7 vectors (Catalog No. 630442; Clontech, USA), respectively. The recombinant plasmids were then co-transformed into the yeast strain AH109 according to the Yeast Protocols Handbook (Clontech). Transformants were selected on Synthetic Drop-out (SD) medium lacking Trp and Leu (SD −Trp −Leu), whereas interactions were selected on SD medium lacking His, Trp, and Leu (SD −His −Trp −Leu) containing 5 mmol/L 3-amino-1,2,4-triazole (Catalog No. 61-82-5; Sigma, USA) and on SD medium lacking Leu, Trp, His, and Ade (SD −Leu −Trp −His −Ade). Yeast plates were incubated for up to 5 d at 30°C before being photographed.

      Bimolecular fluorescence complementation assay

      The coding regions of MYB106 and BPMs were subcloned into 35S-SPYCE(M) and 35S-SPYNE(R)173 vectors, respectively. The resulting plasmids were introduced into Agrobacterium tumefaciens strain GV3101 cells, which were co-infiltrated together with the P19 strain into true leaves of 4-week-old N. benthamiana. Yellow fluorescent protein fluorescence was observed 3 d after infiltration using a confocal microscope (SP8; Leica, Germany) (Niu et al., 2016).


      Co-immunoprecipitation assay

      Protein extraction and Co-IP were conducted following established protocols (Miao and Jiang, 2007). Maxi-preparation plasmids of BPM1, BPM2, BPM4, and MYB106 were transformed into Arabidopsis protoplasts derived from 5-d-old Plant System Biology Dark-type suspension culture cells. Protoplasts were then incubated at 26°C for transient protein expression for about 10-14 h before being harvested with 250 mmol/L NaCl. The transformed protoplasts were then re-suspended in ice-cold 1× IP buffer (25 mmol/L HEPES, 150 mmol/L NaCl, 2 mmol/L ethylenediaminetetraacetic acid (EDTA), 1 mmol/L MgCl2, 0.8% TritonX-100, 2 mmol/L dithiobis (succinimidyl propionate), 1× Complete Protease Inhibitor Cocktail, pH 7.4) and further lysed on ice using a syringe with needle. Cell lysates were filtered through 0.45 µm hydrophilic Durapore membrane syringe filters before incubation with GFP-Trap magnetic beads for 2 h at 4°C. After incubation, the beads were washed three times with wash buffer (25 mmol/L HEPES, 150 mmol/L NaCl, 2 mmol/L EDTA, 1 mmol/L MgCl2, 0.8% TritonX-100, 1× Complete Protease Inhibitor Cocktail, pH 7.4) and eluted by boiling in sodium dodecyl sulfate (SDS) sample buffer. Samples were then separated by SDS-PAGE (polyacrylamide gel electrophoresis) and analyzed by immunoblot.

      Protein stability assay

      For the protein stability assay of MYB106, plasmids of MYB106-HA and GFP-HA (hemagglutinin) were transiently transformed into leaf protoplasts of WT and cul3hyp, which were treated for 6 h with the proteasome inhibitor MG132 (50 μmol/L; Sigma-Aldrich) or with dimethylsulfoxide as a control. Total protein was then extracted using lysis buffer (50 mmol/L Tris-HCl at pH 7.4, 150 mmol/L NaCl, 0.5 mmol/L EDTA, 1% (v/v) Triton X-100, 5% (v/v) glycerol, and 1× Complete Protease Inhibitor Cocktail). MYB106 protein was detected with HA antibodies by western blot.

      Watch the video: A Primer on Sequence Motifs by Dr. Jaime Castro Mondragón (August 2022).