StructuRly 0.1.0

StructuRly is an R package containing a shiny application that produces detailed, interactive graphs of the results of a Bayesian cluster analysis obtained with the population genetics software most commonly used to investigate population structure, such as STRUCTURE or ADMIXTURE. These programs are widely used to infer the admixture ancestry of samples from genetic markers such as SNPs, AFLPs, RFLPs and microsatellites (SSRs). More generally, StructuRly can generate graphs from any file containing the admixture information of each sample (encoded as proportions ranging from 0 to 1). We developed StructuRly to give researchers detailed graphical outputs for interpreting their statistical results through a user-friendly interface that requires no programming knowledge. A typical StructuRly output can display the ID of each sample, the original membership assigned by the researcher to the sampled populations (or subpopulations) and the label of the sampling site; the latter is a variable that population analysis software can use to support the data analysis algorithm. StructuRly outputs are also interactive, which lets the user extract even more information from a single chart.

The shiny application also offers further features to:

  • support the statistical genetic analysis with essential information about the molecular markers and diversity indices, and through the calculation of $P_{gen}$ (if you have haploid or diploid data) or of the Hardy-Weinberg equilibrium for every locus. For the Hardy-Weinberg equilibrium, the $p$-value of the $\chi^2$-test can be calculated for any level of ploidy (>= 2), while the exact $p$-value from the Monte Carlo test is currently available only for diploids (more details are available in the pegas package manual);

  • upload datasets with raw genetic data to analyze them through the principal coordinates analysis (MDS) and hierarchical cluster analysis algorithms, and view and download the dendrograms based on different distance matrices and linkage methods;

  • produce and customize tables ready to be imported into the STRUCTURE software for the Bayesian analysis;

  • import the results of the STRUCTURE and ADMIXTURE population analysis directly into StructuRly in different formats, without having to re-structure the dataset with other software (such as R);

  • produce an interactive barplot and triangle plot, the most well-known STRUCTURE graphical outputs. Both graphs can show the admixture ancestry of the samples subdivided in a maximum of 20 different clusters;

  • visually compare the partition obtained from the hierarchical cluster analysis and the one from the Bayesian (STRUCTURE) or maximum likelihood (ADMIXTURE) analysis through a confusion matrix and estimate an agreement value of the two partitions with two different agreement indices;

  • visualize and download the R code used inside the shiny application to produce all the plots.
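As a rough illustration of the Hardy-Weinberg $\chi^2$-test mentioned in the list above, here is a minimal base-R sketch for a single diploid, biallelic locus. It is not the pegas routine used by the application; the genotype counts and all variable names are made up for the example.

```r
# Hypothetical genotype counts at one diploid, biallelic locus (illustrative values).
obs <- c(AA = 50, AB = 40, BB = 10)
n <- sum(obs)

# Frequency of allele A estimated from the genotype counts.
p_freq <- (2 * obs[["AA"]] + obs[["AB"]]) / (2 * n)
q_freq <- 1 - p_freq

# Expected genotype counts under Hardy-Weinberg equilibrium.
expected <- c(AA = p_freq^2, AB = 2 * p_freq * q_freq, BB = q_freq^2) * n

# Chi-square statistic with 1 degree of freedom:
# 3 genotype classes - 1 - 1 allele frequency estimated from the data.
chi_sq <- sum((obs - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

print(c(chi_sq = chi_sq, p_value = p_value))
```

A large $p$-value here means the observed genotype counts are compatible with Hardy-Weinberg proportions; loci with more alleles or higher ploidy need the more general treatment provided by dedicated packages.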

Installation

You can install the released version of StructuRly from GitHub in RStudio with:

install.packages(pkgs = "devtools")

library(devtools)

install_github(repo = "nicocriscuolo/StructuRly", dependencies = TRUE)

Once the package and its dependencies are installed, you can load it and run the software in the default browser with the following functions:

library(StructuRly)

runStructuRly()

If you have trouble installing StructuRly, you can follow the instructions at this link.

System requirements

StructuRly works on macOS, Windows and Linux operating systems. Install an up-to-date version of R (>= 3.5) and RStudio; StructuRly can then be launched in any browser (Internet Explorer, Safari, Chrome, etc.). In its current version, it also runs locally and therefore offline. If you need any information about the usage of the STRUCTURE or ADMIXTURE software (e.g. instructions to launch the software, preparation of the input files and how to export the outputs), please visit their websites at the following links:

Moreover, the user can launch the Terminal (to start an ADMIXTURE population analysis) or the STRUCTURE software directly from the user interface of StructuRly (this function is currently available for macOS and Linux users). For these buttons to work, both programs must be installed on your computer.

N. B.: If you use a Linux-based machine, you may need specific system libraries to configure R properly and to install some StructuRly dependencies. To install these libraries, follow the instructions displayed in the R console when you load the dependency packages.

Online version

If you are not familiar with R or RStudio, you can access StructuRly directly from the web at the following link: https://nicocriscuolo1618.shinyapps.io/StructuRly/.

Data input

StructuRly is divided into three sections depending on the input file chosen. For every type of file, a header is mandatory for each variable, and its name depends on the type of variable present in the input dataset. Within a StructuRly session, whenever you replace the uploaded file with a new one (inside the same section), remember to re-define the type of separator (e.g. comma, semicolon or tab) and to indicate whether your data have quotation marks before producing new outputs.
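To see why the separator and quotation settings matter, the following base-R sketch reads a small delimited file. How StructuRly parses files internally is not described here, so `read.table()` is only an assumption used for illustration, and the file contents are made up.

```r
# Write a small semicolon-separated example file with quoted strings (made-up data).
tmp <- tempfile(fileext = ".csv")
writeLines(c('"Sample_ID";"Pop_ID";"K1";"K2"',
             '"S1";1;0.9;0.1',
             '"S2";2;0.2;0.8'), tmp)

# sep and quote must match the file, mirroring the options chosen in the interface.
dat <- read.table(tmp, header = TRUE, sep = ";", quote = "\"",
                  stringsAsFactors = FALSE)
str(dat)
```

Reading the same file with the wrong separator would collapse every row into a single column, which is the kind of mismatch the separator and quotation options are meant to prevent.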

Data format

In the first section of StructuRly, you can import both .txt and .csv files. Since the second section also accepts the output file obtained after the population analysis performed with ADMIXTURE, here you can also import a .Q file and a .fam file (if the latter is available).

In StructuRly you can also export a table ready to be imported into the STRUCTURE software. If you need detailed references about the structure of this dataset and how to perform the population analysis with STRUCTURE, you can find them at this link. If you want to use your raw genetic data to produce an input table for the ADMIXTURE software, you have to convert your matrix into a .ped or .bed file. You can do that with the PLINK software, as illustrated step by step at this link. More information about these data formats is available here.

Download sample datasets

Examples of the .txt, .csv, .Q and .fam files that you can import into StructuRly are available at the following repository link: Sample datasets (the .Q and .fam files were obtained from an ADMIXTURE analysis of the sample files downloadable directly from the ADMIXTURE website).
To download the sample datasets from GitHub, right-click on the desired file and choose Download linked file. The sample datasets come in pairs of files: one contains the raw genetic data and the other the results of the STRUCTURE analysis performed on those data. They differ in format and content to cover different use cases, in particular:

  • Sample1: this dataset in .txt format contains randomly generated values for triploid genetic loci (with different names) in 500 samples, with allele sizes ranging from 150 to 500 base pairs. The additional information available is the Sample ID, the Population ID and the Location ID (see Section 1);
  • Sample2: this dataset in .csv format contains information on diploid simple sequence repeat (SSR) loci sampled in 95 Olea europaea specimens in Criscuolo et al., 2019. It also contains the Sample ID and the Population ID;
  • Sample3: the last sample dataset is in .Q format and contains the results of the ADMIXTURE analysis of a genetic dataset available on the ADMIXTURE website. Moreover, the .fam file is available to add the Sample ID and the Population ID to the original dataset.

Section 1: Import raw genetic data

The input for this section can contain three optional variables present in the following order and whose header must be precisely the one shown below:

  • Sample_ID: the variable that contains the ID of each sample, so each name in this column must be different from the others (although it is good practice to use only numbers and letters, the ID characters can also be separated by the following symbols: “_” and “-”);

  • Pop_ID: a categorical variable, coded as an integer, that indicates the putative population defined by the user for each sample (e.g.: 1, 2, 3, etc.);

  • Loc_ID: another categorical variable, again coded as an integer, that indicates the site of origin of each sample; if this variable is present in the table produced with StructuRly and then imported into STRUCTURE for population analysis, the Bayesian algorithm can use it to support the inference.

The remaining variables in the dataset imported in this section are mandatory and must contain numerical values for the markers used. Depending on the ploidy of the organism analyzed, each locus must have as many columns as alleles, in particular:

  • for haploid organisms, each column must have a unique name of a locus coded with alpha-numeric characters, but it must not contain the dot symbol (e. g.: “Locus_1”, “Locus_2”, “UDO”, “BRAC8792”, etc.);

  • for diploid and polyploid organisms, the column header must contain the name of the locus followed, this time, by the dot symbol (“.”) and the identification number of the allele, which must start from 1 (e.g.: “Locus_1.1”, “Locus_1.2”, “UDO.1”, “UDO.2”, etc.). Below there is an image that represents data stored in a spreadsheet that, once converted to .txt or .csv format, can be appropriately read by StructuRly:

image_1
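The header rules above can be sketched in base R as follows. The genotype values are made up, and the way the locus names and the ploidy are derived from the headers is only an illustration of the naming convention, not the application's own code.

```r
# Made-up diploid genotypes (allele sizes in base pairs) for three samples.
raw <- data.frame(
  Sample_ID = c("S_01", "S_02", "S_03"),
  Pop_ID    = c(1L, 1L, 2L),
  Loc_ID    = c(1L, 2L, 2L),
  Locus_1.1 = c(152L, 152L, 160L),
  Locus_1.2 = c(156L, 152L, 164L),
  UDO.1     = c(201L, 205L, NA),
  UDO.2     = c(203L, 205L, NA)
)

# Locus names are the part of each header before the ".<allele number>" suffix.
marker_cols <- setdiff(names(raw), c("Sample_ID", "Pop_ID", "Loc_ID"))
loci <- unique(sub("\\.[0-9]+$", "", marker_cols))
ploidy <- length(marker_cols) / length(loci)

# Export as a tab-separated .txt file; missing values are written as NA.
out_file <- tempfile(fileext = ".txt")
write.table(raw, out_file, sep = "\t", row.names = FALSE, quote = FALSE, na = "NA")
```

With two columns per locus the sketch recovers a ploidy of 2; a haploid dataset would simply use headers without the dot suffix.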

N. B.: in the Sample_ID, Pop_ID and Loc_ID columns, avoid using “NA” as the name of a sample, of a putative population or of a collection site, because StructuRly could interpret those characters as a missing value and the plot would not display the correct information. This also applies to the preparation of the input datasets for Section 2.

Missing values

When you produce the file for this section of StructuRly, the missing values must be indicated only with the abbreviation NA. The cells of the reactive table (in the table panel named “Input table”) that contain missing values will appear empty, while they are codified as -9 in the table that can be produced and downloaded by StructuRly to be imported into STRUCTURE.

N. B.: if your data refer to diploid or polyploid organisms and one or more of your samples has a missing value at a specific locus, the NA value must be present for all the alleles of that locus.
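A minimal base-R sketch of the two rules above: checking that NA is used consistently across all alleles of a locus, and recoding NA as -9 for a STRUCTURE table. The data and the helper names (`partial_na`, `structure_tbl`) are hypothetical, and this is not the code used inside the application.

```r
# Made-up diploid data; sample S3 is inconsistent at locus UDO (only one allele is NA).
geno <- data.frame(
  Sample_ID = c("S1", "S2", "S3"),
  Locus_1.1 = c(152L, NA, 160L),
  Locus_1.2 = c(156L, NA, 164L),
  UDO.1     = c(201L, 205L, NA),
  UDO.2     = c(203L, 205L, 207L)
)

marker_cols <- setdiff(names(geno), "Sample_ID")
locus_of <- sub("\\.[0-9]+$", "", marker_cols)

# For each locus, flag samples where some but not all alleles are NA.
partial_na <- sapply(split(marker_cols, locus_of), function(cols) {
  n_missing <- rowSums(is.na(geno[, cols, drop = FALSE]))
  n_missing > 0 & n_missing < length(cols)
})
rownames(partial_na) <- geno$Sample_ID
print(partial_na)

# Recode NA as -9, the missing-data code used in the table exported for STRUCTURE.
structure_tbl <- geno
structure_tbl[marker_cols][is.na(structure_tbl[marker_cols])] <- -9
```

Any `TRUE` in `partial_na` points to a sample and locus that should be fixed in the input file before it is uploaded.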

Section 2: Import population analysis

Here the user can import a dataset obtained directly from the population analysis of their genetic data. The characteristics of this input file are not very different from those of the file imported in the previous section:

  • the three optional variables (Sample_ID, Pop_ID and Loc_ID) can again be included in the import file, in this precise order, with the only difference that in this case the categories of the variables Pop_ID and Loc_ID do not have to be numbers and can also be character strings;

  • the other, mandatory variables are the admixture proportions calculated by the population software mentioned above; there are as many of them as the clusters chosen by the user before running the Bayesian analysis. Each of these variables must be identified by a header containing the letter “K” and the number of the corresponding cluster in sequence (e.g.: “K1”, “K2”, “K3”, etc.), i.e. in the same order as in the dataset exported by the software.

Below there is an example of this type of file structure. In this case the Loc_ID column is not present; in fact, the three information variables are not mandatory for the datasets imported in Sections 1 and 2:

image_2

  • If you obtained the results of your population analysis with the ADMIXTURE software, there are two ways to prepare the dataset for StructuRly. From the analysis of a file in .bed or .ped format you will get a .Q file. You can either import it into R, modify it as you like (for example, by adding the columns identifying the name of the samples or the population), export it in .txt or .csv format and then import it into StructuRly, or you can import the .Q file directly into StructuRly. This file only contains the variables with the values of the ancestry admixture: if you want to add metadata to this dataset you will have to import into StructuRly the .fam file, which generally accompanies .bed and .ped files. StructuRly will automatically use the first two variables of the .fam file, which are generally used to hold the sample identifier and the user-defined population respectively.
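The first route described above (editing the .Q file in R before importing it) could look like the base-R sketch below. The file contents are simulated, the column order of the .fam file follows the description given above, and the object names are hypothetical.

```r
# Simulate a small ADMIXTURE .Q file: whitespace-separated proportions, no header.
q_file <- tempfile(fileext = ".Q")
writeLines(c("0.90 0.10", "0.25 0.75", "0.50 0.50"), q_file)

# Simulate the matching .fam file (six columns, no header); following the text
# above, the first column is taken as the sample ID and the second as the population.
fam_file <- tempfile(fileext = ".fam")
writeLines(c("S1 PopA 0 0 0 -9", "S2 PopA 0 0 0 -9", "S3 PopB 0 0 0 -9"), fam_file)

q_tbl <- read.table(q_file, header = FALSE)
names(q_tbl) <- paste0("K", seq_len(ncol(q_tbl)))  # K1, K2, ... in file order

fam <- read.table(fam_file, header = FALSE, stringsAsFactors = FALSE)

# Basic sanity checks: one .fam row per sample, proportions summing to 1.
stopifnot(nrow(fam) == nrow(q_tbl))
stopifnot(all(abs(rowSums(q_tbl) - 1) < 1e-6))

admix <- data.frame(Sample_ID = fam[[1]], Pop_ID = fam[[2]], q_tbl)

# Export a .csv file with the headers expected in this section.
out_file <- tempfile(fileext = ".csv")
write.csv(admix, out_file, row.names = FALSE)
```

The resulting table has the optional metadata columns first and the K columns after them, which matches the layout expected for Section 2.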

Section 3: Compare partitions

The third section uses the input files of the first two sections to compare the partitions obtained from the hierarchical and Bayesian cluster analyses. The imported datasets must refer to the same data, and the number of observations must be the same in both files. In the partition derived from the admixture ancestry analysis, each sample is assigned to the population (STRUCTURE or ADMIXTURE cluster) in which it has its highest ancestry value. This partition can therefore contain at most the number of clusters chosen for the population analysis; if no sample has its highest ancestry in a particular cluster, that cluster has no observations assigned to it and is not shown in the comparison plot and table.
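The assignment rule described above can be sketched in base R as follows. The admixture proportions and the hierarchical partition are made up, and the confusion matrix is built with `table()` purely for illustration; the agreement indices computed by the application are not reproduced here.

```r
# Made-up admixture proportions for six samples and K = 3 clusters.
admix <- data.frame(
  K1 = c(0.80, 0.70, 0.10, 0.20, 0.45, 0.60),
  K2 = c(0.15, 0.20, 0.85, 0.70, 0.40, 0.30),
  K3 = c(0.05, 0.10, 0.05, 0.10, 0.15, 0.10)
)

# Assign each sample to the cluster with its highest ancestry value.
admix_part <- names(admix)[max.col(as.matrix(admix), ties.method = "first")]

# Made-up hierarchical partition of the same six samples (e.g. as returned by cutree()).
hier_part <- c(1, 1, 2, 2, 2, 1)

# Confusion matrix of the two partitions.
conf_mat <- table(Hierarchical = hier_part, Admixture = admix_part)
print(conf_mat)
```

In this example no sample has its highest ancestry in K3, so the cluster does not appear among the columns of the confusion matrix, which is the situation described in the paragraph above.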

Outputs download

The following image shows the main output downloaded from StructuRly, the barplot of the ancestry admixture. The sample labels on the X axis are colored according to the population indicated in the user input file, while the symbols at the top of the plot indicate the sampling site. In StructuRly there are 25 different symbols available but you can also simply decide to split the entire plot on the basis of the different categories inside the Pop_ID and Loc_ID variables.

image_3
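For readers who want a quick static approximation of this barplot outside the application, here is a minimal sketch using base R graphics. It is not the code StructuRly uses for its plots; the data, colors and output file are assumptions chosen for the example.

```r
# Made-up admixture proportions for four samples and K = 2 clusters.
admix <- data.frame(
  Sample_ID = c("S1", "S2", "S3", "S4"),
  K1 = c(0.9, 0.6, 0.2, 0.1),
  K2 = c(0.1, 0.4, 0.8, 0.9)
)

# barplot() stacks the rows of a matrix, so clusters go in rows and samples in columns.
prop <- t(as.matrix(admix[, c("K1", "K2")]))
colnames(prop) <- admix$Sample_ID

# Draw the stacked barplot into a PDF file.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 6, height = 4)
barplot(prop, col = c("steelblue", "tomato"), border = NA, space = 0,
        las = 2, ylab = "Admixture proportion", legend.text = rownames(prop))
dev.off()
```

Each bar is one sample and each colored segment is the proportion of ancestry assigned to one cluster, so every bar has a total height of 1.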

All StructuRly outputs can be downloaded as images in various high-quality formats directly from the user interface. However, to download the Triangle plot, which is produced with a specific function of the plotly package (and not with ggplot2), you need to install the orca software on your computer and follow the instructions at this link. If you do not install orca, you can still download the Triangle plot with the plotly commands displayed directly on the interactive plot.

N. B.: for a dataset with a high sample number (> 500) remember to re-size your plot (width, height and resolution) to better distinguish the bars and the relative IDs.

Example

Here’s a link to the YouTube video of StructuRly showing an example of how to use the software. Moreover, the flowchart below, accessible from the Instructions panel of the application, outlines a tutorial for using the software.

image_4

Known bugs and limitations

  • in the interactive barplot, the X-axis labels are not colored according to the different populations entered by the user in the input file. To view colored labels, download the image in one of the different formats available. This bug is related to the functions of a third-party package and has been reported to the GitHub community at this link;

  • for more than 40 predefined populations present in the dataset to produce the barplot, the colors used to distinguish such populations within the labels of the X-axis of the barplot could start to repeat;

  • there is a limit of 50 MB for data uploading;

  • in the comparison plot, the separation lines of the heatmap cells are not visible in the interactive graph because of a bug in the package used to produce this output. Again, download the output to view the complete chart;

  • when using a large number of populations (> 60) or collection sites defined a priori by the user, the graphics output produced when the observations of the barplot are divided into sections according to the different populations or collection sites (or both) may not be accurate. Moreover, in this case, you could see a slight overlap between the axis title and the axis text of the barplot;

  • when using the online version of StructuRly, to download the code that produces a plot, click on “Download R Code”, select the code in the dashboard and copy it (Cmd + C on macOS, CTRL + C on Windows and Ubuntu). The button to copy content directly to the user clipboard in an interactive session is still in development.

The minor bugs affecting some characteristics of the graphs appear only in the interactive plots; the downloaded files are not affected.

Citation

StructuRly was first presented at the International BBCC meeting held in Naples (Italy) in November 2018, and its implementation is described in the paper StructuRly: a novel shiny app to produce comprehensive, detailed and interactive plots for population genetic analysis. If you use this package for your research, please cite:

  • Criscuolo, N. G. & Angelini, C. StructuRly: A novel shiny app to produce comprehensive, detailed and interactive plots for population genetic analysis. PLoS One 15, e0229330 (2020). https://doi.org/10.1371/journal.pone.0229330.

Contact

For additional information about StructuRly, please consult the documentation or email us.