Basic Use
cazy_webscraper downloads specific record data from CAZy, configured through either command-line arguments or a YAML configuration file.
To get help in the terminal for cazy_webscraper, use:
cazy_webscraper -h
Please see the full documentation at ReadTheDocs.
- -c, --config: path to a YAML configuration file.
- --classes: comma-separated list of CAZyme classes to filter download from CAZy.
- -d, --database: path to cazy_webscraper database. If one does not already exist, it will be created.
- --cazy_synonyms: path to JSON file containing accepted CAZy class name synonyms.
- -f, --force: force writing of plaintext files to existing output directory.
- --families: comma-separated list of CAZy families to filter download from CAZy.
- --genera: comma-separated list of genus names to filter download from CAZy.
- --kingdoms: comma-separated list of Kingdom names to filter download from CAZy.
- -h, --help: show help message in terminal.
- -l, --log: path to log file.
- -n, --nodelete: do not delete content in plaintext output directory.
- -o, --output: path to output directory for plaintext output. If it does not exist, the directory will be created.
- -r, --retries: number of retry attempts for CAZy HTTP queries, if an error is raised.
- -s, --subfamilies: comma-separated list of CAZy subfamilies to filter download from CAZy.
- --species: comma-separated list of species names to filter download from CAZy.
- --strains: comma-separated list of strain names to filter download from CAZy.
- -v, --verbose: report verbose messages.
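As an illustration of how the comma-separated filter options combine, the following is a minimal Python sketch that assembles a command line from lists of filter values. The helper name build_command is hypothetical, and the sketch assumes the cazy_webscraper executable is on the PATH and that the flags behave as listed above. It only builds and prints the command; it does not run it.

```python
import shlex


def build_command(families=None, genera=None, kingdoms=None, output=None):
    """Assemble an argument list for cazy_webscraper (hypothetical helper)."""
    cmd = ["cazy_webscraper"]
    # Each filter flag takes one comma-separated value, as described above.
    for flag, values in (
        ("--families", families),
        ("--genera", genera),
        ("--kingdoms", kingdoms),
    ):
        if values:
            cmd += [flag, ",".join(values)]
    if output:
        cmd += ["--output", output]
    return cmd


command = build_command(
    families=["GH1", "PL28"],
    kingdoms=["Bacteria"],
    output="cazy_output",  # illustrative directory name
)
# Prints: cazy_webscraper --families GH1,PL28 --kingdoms Bacteria --output cazy_output
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run to start the download from a script.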
To specify a CAZy family, use the standard CAZy notation for the family, not only its number. For example, GH1 is understood by cazy_webscraper, but 1 is not.
If a parent family, e.g. GH3, is specified and --subfamilies is enabled, all proteins catalogued under GH3 and its subfamilies will be retrieved.
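The notation rules above can be checked before starting a download. The sketch below is not part of cazy_webscraper: the regular expression, the helper name parse_family, and the subfamily form with an underscore suffix (for example GH3_1) are assumptions made for illustration.

```python
import re

# Assumed notation: class prefix, family number, optional "_<n>" subfamily suffix.
FAMILY_PATTERN = re.compile(r"^(GH|GT|PL|CE|AA|CBM)(\d+)(?:_(\d+))?$")


def parse_family(name):
    """Return (parent_family, is_subfamily), or None if the notation is not recognised."""
    match = FAMILY_PATTERN.match(name)
    if match is None:
        return None
    prefix, number, subfamily = match.groups()
    return f"{prefix}{number}", subfamily is not None


print(parse_family("GH1"))    # ('GH1', False)
print(parse_family("1"))      # None, because the class prefix is missing
print(parse_family("GH3_1"))  # ('GH3', True), a subfamily whose parent is GH3
```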
A YAML configuration file can also be used to specify cazy_webscraper arguments, to support transparency and reproducibility of analyses. An example is shown below.
# All members of named CAZy classes will be recovered, unless the taxon-
# specific filters are active.
classes: # Only members of named classes will be recovered
Glycoside Hydrolases (GHs):
  - "GH1"
  - "GH2"
GlycosylTransferases (GTs):
Polysaccharide Lyases (PLs):
  - "PL28"
Carbohydrate Esterases (CEs):
Auxiliary Activities (AAs):
Carbohydrate-Binding Modules (CBMs):
# Taxon-specific filters
genera: # If specified, only members of named genera will be recovered
  - "Trichoderma"
species: # If specified, only members of named species will be recovered
strains: # If specified, only members of named strains will be recovered
kingdoms: # If specified, only members of named Kingdoms will be recovered
  - "Bacteria"
Each requested family must be listed on a separate line and the name surrounded by double or single quotation marks.
All proteins catalogued under any of the named classes will be retrieved, unless the taxon-specific filters are active. If taxon-specific filters are active, then only sequences corresponding to those filters will be retrieved.
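To make the filtering behaviour concrete, here is a minimal sketch of how a parsed configuration might be split into requested families and active taxon filters. It is not cazy_webscraper's own code. It assumes the example configuration has been parsed into a dictionary (for instance with yaml.safe_load from PyYAML), that the class names appear as top-level keys, and that empty keys are parsed as None. The function name summarise_config is hypothetical.

```python
TAXON_KEYS = ("genera", "species", "strains", "kingdoms")


def summarise_config(config):
    """Split a parsed configuration into requested families and active taxon filters."""
    families = []
    taxon_filters = {}
    for key, values in config.items():
        if not values:  # an empty YAML key is parsed as None
            continue
        if key in TAXON_KEYS:
            taxon_filters[key] = list(values)
        elif key != "classes":
            families.extend(values)
    return families, taxon_filters


# Dictionary assumed to result from parsing the example configuration.
config = {
    "classes": None,
    "Glycoside Hydrolases (GHs)": ["GH1", "GH2"],
    "GlycosylTransferases (GTs)": None,
    "Polysaccharide Lyases (PLs)": ["PL28"],
    "Carbohydrate Esterases (CEs)": None,
    "Auxiliary Activities (AAs)": None,
    "Carbohydrate-Binding Modules (CBMs)": None,
    "genera": ["Trichoderma"],
    "species": None,
    "strains": None,
    "kingdoms": ["Bacteria"],
}

families, taxon_filters = summarise_config(config)
print(families)       # ['GH1', 'GH2', 'PL28']
print(taxon_filters)  # {'genera': ['Trichoderma'], 'kingdoms': ['Bacteria']}
```

With this example, three families are requested and two taxon filters are active, so only matching records from Trichoderma or Bacteria would be expected in the output.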
cazy_webscraper understands synonyms for the CAZy class names:
- "Glycoside Hydrolases (GHs)":
  - "Glycoside-Hydrolases", "Glycoside-Hydrolases", "Glycoside_Hydrolases", "GlycosideHydrolases", "GLYCOSIDE-HYDROLASES", "GLYCOSIDE-HYDROLASES", "GLYCOSIDE_HYDROLASES", "GLYCOSIDEHYDROLASES", "glycoside-hydrolases", "glycoside-hydrolases", "glycoside_hydrolases", "glycosidehydrolases", "GH", "gh"
- "GlycosylTransferases (GTs)":
  - "Glycosyl-Transferases", "GlycosylTransferases", "Glycosyl_Transferases", "Glycosyl Transferases", "GLYCOSYL-TRANSFERASES", "GLYCOSYLTRANSFERASES", "GLYCOSYL_TRANSFERASES", "GLYCOSYL TRANSFERASES", "glycosyl-transferases", "glycosyltransferases", "glycosyl_transferases", "glycosyl transferases", "GT", "gt"
- "Polysaccharide Lyases (PLs)":
  - "Polysaccharide Lyases", "Polysaccharide-Lyases", "Polysaccharide_Lyases", "PolysaccharideLyases", "POLYSACCHARIDE LYASES", "POLYSACCHARIDE-LYASES", "POLYSACCHARIDE_LYASES", "POLYSACCHARIDELYASES", "polysaccharide lyases", "polysaccharide-lyases", "polysaccharide_lyases", "polysaccharidelyases", "PL", "pl"
- "Carbohydrate Esterases (CEs)":
  - "Carbohydrate Esterases", "Carbohydrate-Esterases", "Carbohydrate_Esterases", "CarbohydrateEsterases", "CARBOHYDRATE ESTERASES", "CARBOHYDRATE-ESTERASES", "CARBOHYDRATE_ESTERASES", "CARBOHYDRATEESTERASES", "carbohydrate esterases", "carbohydrate-esterases", "carbohydrate_esterases", "carbohydrateesterases", "CE", "ce"
- "Auxiliary Activities (AAs)":
  - "Auxiliary Activities", "Auxiliary-Activities", "Auxiliary_Activities", "AuxiliaryActivities", "AUXILIARY ACTIVITIES", "AUXILIARY-ACTIVITIES", "AUXILIARY_ACTIVITIES", "AUXILIARYACTIVITIES", "auxiliary activities", "auxiliary-activities", "auxiliary_activities", "auxiliaryactivities", "AA", "aa"
- "Carbohydrate-Binding Modules (CBMs)":
  - "Carbohydrate-Binding-Modules", "Carbohydrate_Binding_Modules", "Carbohydrate_Binding Modules", "CarbohydrateBindingModules", "CARBOHYDRATE-BINDING-MODULES", "CARBOHYDRATE_BINDING_MODULES", "CARBOHYDRATE_BINDING MODULES", "CARBOHYDRATEBINDINGMODULES", "carbohydrate-binding-modules", "carbohydrate_binding_modules", "carbohydrate_binding modules", "carbohydratebindingmodules", "CBMs", "CBM", "cbms", "cbm"
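One way to picture the synonym handling is as a lookup table that maps every accepted spelling back to the standard class name. The sketch below is an assumption about how such a table could be built, not the tool's actual implementation. It uses only a small subset of the synonyms listed above, and the names CLASS_SYNONYMS, build_lookup and standard_class_name are hypothetical.

```python
# A small subset of the synonyms listed above, keyed by the standard class name.
CLASS_SYNONYMS = {
    "Glycoside Hydrolases (GHs)": ["Glycoside-Hydrolases", "GlycosideHydrolases", "GH", "gh"],
    "Carbohydrate-Binding Modules (CBMs)": ["CarbohydrateBindingModules", "CBM", "cbm"],
}


def build_lookup(synonyms):
    """Invert {standard name: [synonyms]} into {accepted name: standard name}."""
    lookup = {}
    for standard_name, names in synonyms.items():
        lookup[standard_name] = standard_name  # the standard name maps to itself
        for name in names:
            lookup[name] = standard_name
    return lookup


LOOKUP = build_lookup(CLASS_SYNONYMS)


def standard_class_name(name):
    """Return the standard CAZy class name, or None if the name is not recognised."""
    return LOOKUP.get(name)


print(standard_class_name("gh"))         # Glycoside Hydrolases (GHs)
print(standard_class_name("CBM"))        # Carbohydrate-Binding Modules (CBMs)
print(standard_class_name("Glycoside"))  # None
```

A mapping of the same shape could be loaded from a JSON file with json.load, which is how a custom synonym file passed via --cazy_synonyms might be represented, although the exact file layout expected by the tool is not described here.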