Title: Characteristics of human and viral RNA binding sites and site clusters recognized by SRSF1 and RNPS1
Type
Dataset
Rogan, PK, Mucaki, EJ, Shirley, BC (2020): Characteristics of human and viral RNA binding sites and site clusters recognized by SRSF1 and RNPS1. Zenodo. Dataset. https://zenodo.org/record/4315165
Summary
This dataset was developed for the following article:
Rogan PK, Mucaki EJ and Shirley BC. A proposed molecular mechanism for pathogenesis of severe RNA-viral pulmonary infections [version 1; peer review: awaiting peer review]. F1000Research 2020, 9:943 (https://doi.org/10.12688/f1000research.25390.1)
Section 1. Extended Data Tables
This archive contains the extended data tables for the research article "A proposed molecular mechanism for pathogenesis of severe RNA-viral pulmonary infections". These tables provide SRSF1, RNPS1 and hnRNP A1 binding site and information-dense cluster counts across various RNA viral genomes [including multiple SARS-CoV-2 and influenza strains] and the human transcriptome, the estimated SARS-CoV-2 doubling time necessary for viral genome SRSF1 binding site availability to exceed sites within the host transcriptome, and an analysis of influenza, dengue, and aplastic anemia patients misdiagnosed as irradiated by established radiation gene signatures. These tables are:
Section 1 - Table 1. RNPS1 and hnRNPA1 binding sites and Information-Dense Clusters for RNPS1 and hnRNPA1 in RNA Virus Genomes
Section 1 - Table 2A. Detailed Analysis of Information-Dense Clusters for SRSF1 (Replicate 1) in RNA Virus Genomes
Section 1 - Table 2B. Detailed Analysis of Information-Dense Clusters for SRSF1 (Replicate 2) in RNA Virus Genomes
Section 1 - Table 2C. Detailed Analysis of Information-Dense Clusters for RNPS1 in RNA Virus Genomes
Section 1 - Table 2D. Detailed Analysis of Information-Dense Clusters for hnRNP A1 in RNA Virus Genomes
Section 1 - Table 3. Binding Site Analysis of Multiple Coronavirus Strains (Both Strands)
Section 1 - Table 4A. Binding Site Analysis of Multiple Influenza A (H3N2) Strains (Negative Strand Only)
Section 1 - Table 4B. Binding Site Analysis of Multiple Influenza A (H3N2) Strains (Both Strands)
Section 1 - Table 5. SRSF1, RNPS1 and hnRNPA1 Binding Sites and Information-Dense Clusters by Gene
Section 1 - Table 6A. Transcriptome-Wide Information Dense Clusters Intersecting DRIP- and DRIPc-seq Intervals
Section 1 - Table 6B. Exome-Wide Information Dense Clusters within DRIP- and DRIPc-seq Intervals
Section 1 - Table 6C. Transcriptome-Wide Scan of Strong Binding Sites Intersecting DRIP- and DRIPc-seq Intervals
Section 1 - Table 6D. Exome-Wide Scan of Strong Binding Sites within DRIP- and DRIPc-seq Intervals
Section 1 - Table 7. Rate of False Positives for Influenza, Dengue Virus and Aplastic Anemia Using Radiation Signatures
Section 1 - Table 8. Radiation Model Genes Contributing to False Positives for Patients with Influenza A, Dengue Virus, and Aplastic Anemia
Section 1 - Table 9A. Doubling Time of SARS-CoV-2 Needed to Exceed Host Transcriptome SRSF1 Binding Sites (Positive-Strand Sites Only)
Section 1 - Table 9B. Doubling Time of SARS-CoV-2 Needed to Exceed Host Transcriptome SRSF1 Binding Sites (Both Strands Considered)
Section 2. All SRSF1, hnRNPA1 and RNPS1 binding site tracks for human and viral genomes
We provide bedGraph tracks that give the location and strength of binding sites (and binding site clusters) for SRSF1, RNPS1 and hnRNPA1 across the human transcriptome (GRCh37), the human exome (including +/-300nt surrounding each exon; non-intergenic only), and all viral genomes investigated in this study (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [two strains]). Note that if no clusters were found for a particular viral genome, no file for that genome is present in the Zenodo archive.
The folder “Cluster-to-DRIPseq-Intersection-Tracks” contains tracks that indicate where binding site clusters have been identified, intersected with DRIP-seq and DRIPc-seq intervals, which indicate where there is evidence of R-loop formation in the human genome. The DRIP-seq dataset (GSE68845) is not strand specific. The DRIPc-seq dataset (GSE70189) is strand specific, and this has been taken into account in the intersection (e.g. tracks only list positive-strand clusters found in positive-strand DRIPc-seq intervals).
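The strand handling described above can be illustrated with a minimal Python sketch. It is not the archived Perl program: the `Interval` type, the function names, and the choice to treat any overlap as an intersection (rather than full containment) are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+", "-", or "." when the strand is unknown


def overlaps(a: Interval, b: Interval, strand_specific: bool) -> bool:
    """True if a and b share at least one base (assumed overlap rule)."""
    if a.chrom != b.chrom:
        return False
    # DRIPc-seq style comparison: strands must match.
    if strand_specific and a.strand != b.strand:
        return False
    return a.start < b.end and b.start < a.end


def intersect(clusters: List[Interval], intervals: List[Interval],
              strand_specific: bool) -> List[Interval]:
    """Return the clusters that overlap at least one interval."""
    return [c for c in clusters
            if any(overlaps(c, i, strand_specific) for i in intervals)]
```

Calling `intersect(clusters, dripc_intervals, strand_specific=True)` mimics the DRIPc-seq case, and `strand_specific=False` mimics the DRIP-seq case. The nested loop is quadratic, so genome-scale data would need a sorted sweep or an interval index instead.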
Due to their size, the human transcriptome and exome tracks which indicate the location of individual binding sites are split into two separate files (separated by strand). While the custom tracks containing human binding site information are designed to be uploaded to the UCSC Genome Browser, files containing transcriptome-wide binding site information may be too large to be uploaded and may require further filtering (e.g. by chromosome).
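One way to reduce a large track before upload is to keep only the lines for a single chromosome. The following Python sketch assumes a standard four-column bedGraph file whose header lines begin with `track`, `browser` or `#`. The function names and any file names passed to them are hypothetical.

```python
from typing import Iterable, Iterator


def filter_bedgraph(lines: Iterable[str], chrom: str) -> Iterator[str]:
    """Yield header lines plus the data lines located on `chrom`."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(("track", "browser", "#")):
            yield line  # keep custom-track header and comment lines
            continue
        # Data line: chrom, start, end, value (whitespace separated).
        if stripped.split()[0] == chrom:
            yield line


def filter_file(src_path: str, dst_path: str, chrom: str) -> None:
    """Stream src_path to dst_path, keeping only one chromosome."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(filter_bedgraph(src, chrom))
```

Because the file is streamed line by line, memory use stays small even for transcriptome-wide tracks.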
To be classified as a cluster, binding sites on the same strand must have Ri values which sum to >50 bits, each binding site must have a neighboring site within 25nt, and all binding sites in the cluster must have Ri greater than a minimum bit threshold. For human transcriptomes and exomes, this bit minimum was set to Rsequence. The bit minimum for viral binding sites was set to 0.1 * Rsequence. The information density-based clustering algorithm utilized in this work is described in Lu and Rogan 2018 (https://f1000research.com/articles/7-1933/v2) and archived source code is available through Zenodo (https://dx.doi.org/10.5281/zenodo.1892051).
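The three criteria above can be sketched in Python as follows. This is only an illustration of the stated thresholds, not the published algorithm of Lu and Rogan 2018, which should be consulted for the actual method. The `Site` type, measuring distance between site coordinates, and filtering weak sites before chaining neighbours are all assumptions.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    position: int  # coordinate of the binding site (nt)
    ri: float      # individual information content (bits)


def find_clusters(sites: List[Site], min_ri: float,
                  max_gap: int = 25, min_total: float = 50.0) -> List[List[Site]]:
    """Group same-strand sites into clusters using the stated criteria."""
    # Criterion 3: every site must exceed the minimum bit threshold
    # (Rsequence for human data, 0.1 * Rsequence for viral data).
    kept = sorted((s for s in sites if s.ri > min_ri), key=lambda s: s.position)

    # Criterion 2: chain sites whose nearest kept neighbour is within max_gap.
    chains: List[List[Site]] = []
    current: List[Site] = []
    for site in kept:
        if current and site.position - current[-1].position > max_gap:
            chains.append(current)
            current = []
        current.append(site)
    if current:
        chains.append(current)

    # Criterion 1: the summed Ri must exceed min_total bits; a lone site has
    # no neighbour within max_gap and is therefore not a cluster.
    return [c for c in chains
            if len(c) > 1 and sum(s.ri for s in c) > min_total]
```

The function expects sites from a single strand, so each strand would be processed in a separate call.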
Section 3. Binding site clusters - lollipop plots
Lollipop plots present the genomic coordinates and information densities of clusters across the human transcriptome, human exome, and viral genomes (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [one strain]). The height of each "lollipop" corresponds to the information density of a cluster. Labels above the "lollipops" give the start and end genomic coordinates (GRCh37) of the cluster, followed by the number of sites in the cluster enclosed in brackets. Lollipop plots associated with human transcriptomes/exomes each contain a single gene. Influenza has 8 segments and each segment requires its own plot; the other viral genomes examined are each presented in a single plot.
File naming convention for human plots:
RBP_Gene.png e.g. RNPS1_ADK.png
File naming convention for viral plots (elements in square brackets do not always appear):
Virus[.InfluenzaSegment].RiThreshold.Strand.RBP.png e.g. Wuhan-Hu-1.complete-genome.4.2-bits.PosStrand.hnRNPA1.png
The specified Ri threshold indicates that all binding sites which comprise a cluster have Ri greater than or equal to the threshold.
Section 4. Ri(b,l) matrices for all binding sites scanned
This section provides the information theory-based position weight matrices for the RNA binding proteins (RBPs) used in this study: SRSF1, hnRNPA1 and RNPS1. We investigated binding using two different RNPS1 binding models. While similar, these two models contain binding site information on opposing sides of the binding site motif, which is why we found it prudent to scan with both models.
Structure of each file:
Line #1: Start position, End position and Rsequence [average strength of sequences used to generate the model]
Subsequent lines describe the information on each position of the binding site:
First four columns: Ri contribution of nucleotide at this position of the matrix [A, C, G, T]
Column #5: Position of the matrix
Last four columns: Number of binding sites used to generate model with a particular nucleotide at this position of the matrix [A, C, G, T]
Example:
-2.965775 1.282153 0.034225 -4.906891 0 1 19 8 0
At zero position of the matrix (first nucleotide), a ‘C’ would have a positive contribution to binding site strength, a ‘G’ would be relatively neutral, and an ‘A’ or ‘T’ would negatively contribute to binding site strength.
Generation of Ri(b,l) matrices and computation of Ri values can be accomplished with the Delila package (https://alum.mit.edu/www/toms/delila/delilaprograms.html).
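For readers who only want to inspect a matrix file, the layout described above can be read with a short Python sketch. It is not part of the Delila package. It assumes whitespace-separated fields and integer start and end positions. In the demonstration text, only the row for position 0 comes from the example above; the header values and the second row are invented for illustration. The Ri of a candidate site is the sum of the per-position weights of its nucleotides.

```python
from typing import Dict, Tuple

BASES = "ACGT"

# Header and second row are hypothetical; the first row is the example above.
DEMO_MATRIX = """0 1 3.2
-2.965775 1.282153 0.034225 -4.906891 0 1 19 8 0
0.500000 -1.000000 0.250000 -0.750000 1 12 4 9 3
"""


def parse_matrix(text: str) -> Tuple[int, int, float, Dict[int, Dict[str, float]]]:
    """Return (start, end, Rsequence, weights[position][base])."""
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    start, end, rsequence = int(rows[0][0]), int(rows[0][1]), float(rows[0][2])
    weights: Dict[int, Dict[str, float]] = {}
    for fields in rows[1:]:
        position = int(fields[4])  # column 5 holds the matrix position
        weights[position] = dict(zip(BASES, map(float, fields[:4])))
    return start, end, rsequence, weights


def site_ri(sequence: str, weights: Dict[int, Dict[str, float]]) -> float:
    """Sum the weight of each nucleotide over the matrix positions."""
    positions = sorted(weights)
    if len(sequence) != len(positions):
        raise ValueError("sequence length must match the matrix width")
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA letters
    return sum(weights[p][base] for p, base in zip(positions, seq))
```

With the demonstration matrix, `site_ri("CA", parse_matrix(DEMO_MATRIX)[3])` adds the 'C' weight at position 0 to the 'A' weight at position 1.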
Section 5. Ri and intersite distance - histograms
Two sets of histograms present Ri distribution and intersite distance distribution across the human transcriptome, human exome, and viral genomes (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [one strain]).
File naming convention for human plots (elements in square brackets do not always appear):
[IntersiteDistancesThreshold-]Human-[DRIPc]-AllChrs-RBP[-RiThreshold].png e.g. IntersiteDistances500-Human-AllChrs-hnRNPA1-4.6-bits.png
File naming convention for viral plots (elements in square brackets do not always appear):
[IntersiteDistancesThreshold-]Strand-RBP-Virus[.InfluenzaSegment][-RiThreshold].png e.g. IntersideDistances1000-PosStrandOnly-SRSF1-top50000sitesReplicate1-HIV-1-Strain-B.png
Intersite distance thresholds of 500 or 1000 were assigned for all intersite distance histograms. Any distances above the corresponding threshold were excluded from the plot. Plots presenting Ri distributions contain a dashed line indicating Rsequence if it is visible within the scope of the plot.
Section 6. Perl Scripts and Descriptions
This archive contains all Perl scripts discussed in this archive's associated manuscript and a document file which describes them ("Perl-Script-Descriptions-Page.docx"). The programs and their general functions are as follows:
“ClusterToDRIPseqAnalysisProgram.pl” – reports which information-dense clusters are located within DRIPc- and/or DRIP-seq intervals (individually and by gene)
“ClusterToDRIPseqAnalysisProgram.GeneDensityFinder.pl” – uses the output from script “ClusterToDRIPseqAnalysisProgram.pl” to determine the number and the density of information-dense clusters within a gene (total clusters within the gene and those within DRIPc-seq intervals)
“calculateIntersiteDistance.pl” – determines the distance between all binding sites in the same gene from a list of genomic coordinates
“removeOutliersHigherThanN.pl” – discards intersite distances computed by script “calculateIntersiteDistance.pl” that are greater than a specified threshold
“getStatisticsOnCol.pl” – calculates the count, geometric mean, median, arithmetic mean, and standard deviation of values from the output of script “removeOutliersHigherThanN.pl”
“ScanDataSummaryProgram.pl” – determines the number of binding sites (above a specified Ri threshold) found within known genes (the program also reports the total expression of those genes using external A549 and pneumocyte expression datasets) from binding site coordinate data
“TotalBindingSitePerCellCalculator.pl” – estimates the number of binding sites expressed in a single A549 or pneumocyte cell at any given time.
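The three intersite distance steps listed above (compute distances, discard values above a threshold, summarize) can be sketched in Python. This is not a port of the archived Perl scripts. It assumes that distances are taken between adjacent sites within each gene and that the standard deviation is the sample standard deviation. The gene names in the usage note are hypothetical.

```python
import statistics
from typing import Dict, List


def intersite_distances(sites_by_gene: Dict[str, List[int]]) -> List[int]:
    """Distances between adjacent binding sites within each gene."""
    distances: List[int] = []
    for positions in sites_by_gene.values():
        ordered = sorted(positions)
        distances.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return distances


def drop_above(values: List[int], threshold: int) -> List[int]:
    """Discard distances greater than the threshold (e.g. 500 or 1000)."""
    return [v for v in values if v <= threshold]


def summarize(values: List[int]) -> Dict[str, float]:
    """Count, geometric mean, median, arithmetic mean and sample SD.

    Needs at least two values, all greater than zero.
    """
    return {
        "count": len(values),
        "geometric_mean": statistics.geometric_mean(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }
```

For example, `summarize(drop_above(intersite_distances({"GENE_A": [100, 110, 150], "GENE_B": [5000, 5002, 7000]}), 500))` summarizes only the distances of 500 nt or less.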
More information
- DOI: 10.5281/zenodo.4315165
- Language: en
Subjects
- SARS-CoV-2, COVID19, RNA binding proteins, Coronavirus, molecular mechanisms, SRSF1, RNPS1
Dates
- Publication date: 2020
- Issued: December 10, 2020
Notes
Other: Also see Infographic: Rogan, Peter; Klesc, Ryan; Mucaki, Eliseos; C. Shirley, Ben (2020): A proposed molecular mechanism for pathogenesis of severe RNA-viral pulmonary infections. figshare. Figure. https://doi.org/10.6084/m9.figshare.12718799.v1
Other: References: Rogan et al. A proposed molecular mechanism for pathogenesis of severe RNA-viral pulmonary infections. F1000Research (2020) https://doi.org/10.12688/f1000research.25390.1
Rights
- https://creativecommons.org/licenses/by/4.0/legalcode Creative Commons Attribution 4.0 International
- info:eu-repo/semantics/openAccess Open Access
Format
electronic resource
Related items
Relationship | Uri |
---|---|
IsVersionOf | https://doi.org/10.5281/zenodo.3737089 |
IsPartOf | https://zenodo.org/communities/covid-19 |
IsPartOf | https://zenodo.org/communities/zenodo |