GRCh37-style chromosome names fail panel schema checks #172

Open
lordkev opened this issue Dec 31, 2024 · 1 comment
Labels
bug Something isn't working

lordkev commented Dec 31, 2024

Description of the bug

If I use a panel CSV with chromosome names such as 1 instead of chr1, it fails the schema check that requires chr to be a string.

For example:

panel,chr,vcf,index
1kgp3,1,/data/1KG/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,/data/1KG/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi

results in:

The following invalid input values have been detected:

* --panel (phaseimpute-1kgp3-reference.csv): Validation of file failed:
	-> Entry 1: Error for field 'chr' (1): Chromosome must be provided as a string and cannot contain spaces

If I surround the chromosome name with single quotes, it is accepted, but the pipeline then reports that the chromosome is missing from the reference panel.

For example:

panel,chr,vcf,index
1kgp3,'1',/lilnasx/data/1KG/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,/lilnasx/data/1KG/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi

results in:

WARN: Chr : [1, 2, 3, 4, ...] are missing from reference panel
WARN: The following contigs are absent from at least one file : [1, 2, 3, 4, ...] and therefore won't be used
No regions left to process
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

I believe this is related to this issue: nextflow-io/nf-schema#81
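
For anyone checking their own panel file, here is a minimal sketch that lists the rows likely to be affected. It assumes that the values made up of digits only are the ones read as numbers rather than strings, which is what the error above suggests but is not confirmed here. The function name and the sample panel are hypothetical.

```shell
# List panel rows whose chr field (second column) is made up of digits only.
# Assumption: these are the values a CSV reader would infer as integers.
numeric_chr_rows() {
    awk -F',' 'NR > 1 && $2 ~ /^[0-9]+$/ { print NR ": " $2 }' "$1"
}

# Hypothetical sample panel, for illustration only
sample=$(mktemp)
printf 'panel,chr,vcf,index\np,1,a.vcf.gz,a.vcf.gz.tbi\np,chr2,b.vcf.gz,b.vcf.gz.tbi\np,X,c.vcf.gz,c.vcf.gz.tbi\n' > "$sample"

# Each flagged row is reported as "<line number>: <chr value>"
flagged=$(numeric_chr_rows "$sample")
echo "$flagged"
rm -f "$sample"
```

Names such as chr2 or X are not flagged, since they cannot be mistaken for numbers.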

Since only CSV files are currently accepted as panel input, the only way I have found to run the pipeline successfully is to add validation.lenientMode = true to my config file and to modify the panel channel creation with a map that manually converts the chr names to strings.

Command used and terminal output

nextflow run -r b02be7c nf-core/phaseimpute --panel phaseimpute-1kgp3-reference.csv --steps panelprep --genome GRCh37 --outdir results -profile docker

Relevant files

No response

System information

No response

@lordkev lordkev added the bug Something isn't working label Dec 31, 2024
@LouisLeNezet
Collaborator

Hi @lordkev,
I ran into this issue a little while ago.
This is indeed a problem with CSV files.
I would recommend using a JSON or YAML file instead, as they are explicit about the data type.
Unfortunately, adding ' around the number doesn't really solve the problem.

You can use the following code to convert your CSV to JSON:

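# CSV must hold the path to the panel CSV file
awk -F',' '
    BEGIN {
        print "["
    }
    # Skip the header line and any empty lines
    NR > 1 && NF > 0 {
        # Print the separator before every entry except the first,
        # so no line count is needed up front
        if (n++ > 0) {
            print ","
        }
        printf "  {\"panel\": \"%s\", \"chr\": \"%s\", \"vcf\": \"%s\", \"index\": \"%s\"}", $1, $2, $3, $4
    }
    END {
        # Terminate the last entry, then close the array
        if (n > 0) {
            print ""
        }
        print "]"
    }
    ' "$CSV" > panel.json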
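
To confirm that the conversion kept every chromosome name as a string, a small check along these lines may help. It is only a sketch: it assumes one entry per line, as the awk command above produces, and the function name and sample file are hypothetical.

```shell
# Count entries whose "chr" value is not a quoted JSON string.
# Assumes one entry per line; "|| true" keeps the exit status at zero
# when grep finds no match and prints 0.
count_unquoted_chr() {
    grep -cE '"chr": *[^" ]' "$1" || true
}

# Hypothetical sample: the first entry is a string, the second a bare number
tmp=$(mktemp)
printf '[\n  {"panel": "1kgp3", "chr": "1"},\n  {"panel": "1kgp3", "chr": 2}\n]\n' > "$tmp"

unquoted=$(count_unquoted_chr "$tmp")
echo "entries with non-string chr: $unquoted"
rm -f "$tmp"
```

A count of 0 on the real panel.json means every chr value is quoted and should be read as a string.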
