Procedure for generating static content for Ensembl Metazoa
Generation of static content is handled via a single BASH wrapper script. This pipeline produces the full set of Ensembl static content for any genome CoreDB loaded from RefSeq or NCBI GenBank.
- NB: Cores not loaded from RefSeq may require some manual editing of their static content.
>>> sh \
<RunStage> \
<Input cores> \
<MYSQL Host> \
<Unique Run Identifier>
>>> sh All CoreDB.list.txt staging-2 StaticContentE112
The static content wrapper performs the following processing stages:
- Download Wikipedia summary information for each input species. [BASH]
- Download NCBI assembly and annotation summary files via the NCBI datasets client. [BASH]
  - Implemented via Singularity and a datasets-cli SIF image.
- Generation of static content .md files (, *, * [Perl]
- Download species images and the associated Wikimedia Commons license information. [BASH, Perl]
- Generate a formatted list of the species processed for ''. See ensembl-static. [Bash]
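The Wikipedia download stage can be pictured with a minimal sketch. It assumes the public Wikipedia REST endpoint page/summary/{title}. The helper name wiki_summary_url and the output file name are hypothetical, and the wget commands auto-generated by the wrapper may differ.

```shell
#!/bin/sh
# Hypothetical helper (not part of the wrapper): build the Wikipedia REST
# summary URL for a species name, replacing spaces with underscores.
wiki_summary_url() {
    title=$(printf '%s' "$1" | tr ' ' '_')
    printf 'https://en.wikipedia.org/api/rest_v1/page/summary/%s\n' "$title"
}

# Example fetch (requires network access; the output file name is illustrative):
# wget -q -O Bombus_terrestris.wiki.json "$(wiki_summary_url 'Bombus terrestris')"
```

Keeping the URL construction in its own function makes it easy to print and inspect the URLs before any download is attempted.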
To run the main static content wrapper and generate species markdown content, image resources, etc., users must provide the following input parameters:
- Run-stage option: ['All', 'Wiki', 'NCBI', 'Static', 'Image', 'LicenseUsage', 'WhatsNew', 'Tidy'].
- List of input core DBs: Flat text file listing one core database per line.
- Source MySQL server where the cores are hosted: Typically a staging host server.
- Unique run identifier: (e.g. StaticContentE112).
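Before launching a run, it can help to sanity-check the inputs. The following is a minimal sketch only: the helper names is_valid_stage and count_cores are hypothetical, and the wrapper itself may validate its arguments differently.

```shell
#!/bin/sh
# Hypothetical pre-flight checks; not part of the wrapper itself.

# Succeeds only for the run-stage options listed above (case-sensitive).
is_valid_stage() {
    case "$1" in
        All|Wiki|NCBI|Static|Image|LicenseUsage|WhatsNew|Tidy) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the number of non-blank lines (core databases) in a list file.
count_cores() {
    grep -c -v '^[[:space:]]*$' "$1"
}

# Example use:
# is_valid_stage "All" || { echo "Unknown run-stage" >&2; exit 1; }
# echo "Cores to process: $(count_cores CoreDB.list.txt)"
```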
StaticContent_MD_Output-* (Dir)
Main output directory containing all markdown .md static content. One subdirectory per species.
Log_Outputs_and_intermediates (Dir)
Directory containing run log files, including the auto-generated scripts used to pull information from Wikipedia and WikiCommons.
Source_Images_wikipedia (Dir)
- Species image files obtained from Wikipedia. One per species, if available. Image file names follow the 'species.production_name' convention.
- Images will need to be subsampled, which can typically be done using ImageMagick.
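As a rough illustration of the subsampling step, the sketch below wraps ImageMagick's 'convert' command. The 600x600 bounding box, the resize_image helper and the DRY_RUN switch are all assumptions made for this example, not values or names taken from the pipeline. ImageMagick 7 installations expose the same operation through the 'magick' command.

```shell
#!/bin/sh
# Hypothetical helper: shrink one image to fit inside a bounding box.
# Usage: resize_image <source> <destination> [box]
resize_image() {
    src=$1
    dest=$2
    box=${3:-600x600}   # assumed default size, adjust as required
    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the command that would be run.
        echo "convert $src -resize ${box}> $dest"
        return 0
    fi
    # The trailing '>' shrinks only images larger than the box and keeps
    # the aspect ratio; smaller images are left at their original size.
    convert "$src" -resize "${box}>" "$dest"
}

# Example loop (file extension and output directory are illustrative):
# for f in Source_Images_wikipedia/*.jpg; do
#     resize_image "$f" "resized/$(basename "$f")"
# done
```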
Full JSON dumps from Wikipedia (page/summary/{title}). One JSON file per core_db processed.
NCBI RefSeq genome assembly reports obtained via the 'datasets' client (.json). One per species.
Commons_Licenses (Dir)
Full JSON dump files of Wikimedia Commons licensing meta information (see Commons: API/MediaWiki).
Log_Outputs_and_other_intermediates (DIR)
- wget commands that generated Wikipedia JSON files.
Formatted TSV of Wikimedia licensing meta information for the species images downloaded from Wikipedia.
Main Wikipedia landing page URLs for each species. Useful for checking the web content for that species.
Intermediate output of all species image resource URLs.
List of all species found to be lacking Wikipedia summary information at the time of processing. NB: In such cases, a template '' file is generated, which must be edited manually to include the species information.
Summary log of static markdown file generation. Use 'cat' on this file to view the log information with colourised text.
Checkpoint file used to prevent rerunning of completed stages. Use this to control the workflow if needed.
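The idea behind a checkpoint file can be sketched as follows. This is an illustration of the general technique only: the file name, the one-stage-per-line format and the helper names are assumptions, and the wrapper's actual checkpoint format is not described here.

```shell
#!/bin/sh
# Hypothetical checkpoint file holding one completed stage name per line.
CHECKPOINT_FILE=${CHECKPOINT_FILE:-./checkpoints.example.txt}

# Succeeds if the stage is already recorded as complete.
stage_done() {
    [ -f "$CHECKPOINT_FILE" ] && grep -qxF "$1" "$CHECKPOINT_FILE"
}

# Records a stage as complete.
mark_done() {
    echo "$1" >> "$CHECKPOINT_FILE"
}

# Runs a command for a stage unless the stage is already complete.
# The stage is recorded only if the command succeeds.
# Usage: run_stage <stage> <command> [args...]
run_stage() {
    stage=$1
    shift
    if stage_done "$stage"; then
        echo "skip $stage"
        return 0
    fi
    "$@" && mark_done "$stage"
}
```

With this layout, deleting a stage's line from the checkpoint file forces that stage to run again on the next invocation.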
A script to automatically update the Ensembl static content repository with static content files generated from the main wrapper above. Requires a forked repo of 'ensembl-static' and a specific release e.g. 'release/eg/60'
- A script to retrieve species images and associated licensing information from Wikipedia. Accepts a flat text file of Wikipedia page titles (one or more lines, one title per line).
E.g. Input: A flat text file containing the text 'Bumblebee', which will pull images from the Wikipedia page of that title.