| Title: | What the Package Does (One Line, Title Case) |
|---|---|
| Description: | What the package does (one paragraph). |
| Authors: | Sam El-Kamand [aut, cre] (ORCID: <https://orcid.org/0000-0003-2270-8088>) |
| Maintainer: | Sam El-Kamand <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.0.9000 |
| Built: | 2026-05-21 10:48:24 UTC |
| Source: | https://github.com/CCICB/vcf2mafR |
This function extracts the digits from the cDNA_position column. Note that transcript length isn't encoded in the cDNA_position column unless the --total_length flag is on when running VEP (this turns the format into cdna_position/total_length). It is not possible to turn this flag on using VEP online. If the total length can't be found, the function returns 0, so for the online version transcripts cannot be sorted by length.
cdna_to_transcript_length(cdna)
| Argument | Description |
|---|---|
| cdna | vector representing cDNA_position VEP annotations. |
numeric transcript length parsed from the VEP cDNA_position annotation
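The parsing described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `parse_total_length()` is a hypothetical helper, and it assumes the annotation looks like `position/total_length` when --total_length was used.

```r
# Hypothetical helper for illustration only; not part of vcf2mafR.
# Assumes cDNA_position looks like "123/4567" or "89-90/1200" when VEP was run
# with --total_length, and like "123" (no slash) otherwise.
parse_total_length <- function(cdna) {
  cdna <- as.character(cdna)
  # TRUE only where a "/<digits>" suffix is present (NA gives FALSE)
  has_total <- grepl("/[0-9]+$", cdna)
  # Default to 0 when no total length is encoded
  out <- numeric(length(cdna))
  out[has_total] <- as.numeric(sub("^.*/([0-9]+)$", "\\1", cdna[has_total]))
  out
}

parse_total_length(c("123/4567", "89-90/1200", "123", NA))
# returns 4567 1200 0 0
```

Returning 0 rather than NA mirrors the behaviour described above, so a later sort on transcript length simply treats all transcripts as equal when the length is unavailable.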
Converts a VEP-like (chr-pos-ref-alt-consequence-sample style) dataframe to a minimal MAF dataframe. The input dataframe should include the columns:
chr
pos (1-based)
sample
ref
alt
consequence (Sequence Ontology terms)
gene
df2maf( data, ref_genome, keep_all = TRUE, col_chrom = "chr", col_pos = "pos", col_sample_identifier = "sample", col_ref = "ref", col_alt = "alt", col_consequence = "consequence", consequence_dictionary = c("SO", "PAVE", "AUTO"), missing_to_silent = FALSE, col_gene = "gene", col_center = NULL, col_entrez_gene_id = NULL, col_dbSNP_rsid = NULL, col_dbSNP_validation_status = NULL, col_matched_normal_sample_identifier = NULL, col_sequencer = NULL, col_sequence_source = NULL, col_mutation_status = NULL, verbose = TRUE )
| Argument | Description |
|---|---|
| data | a data.frame with 1 row per mutation in a cohort (data.frame) |
| ref_genome | name of the reference genome used to call variants (string) |
| keep_all | keep all columns in the original data.frame? (flag). If FALSE, only includes the minimal required columns and any column explicitly mapped using |
| col_chrom | name of column describing chromosome of the mutation (string) |
| col_pos | name of column describing the 1-based position of the mutation (string) |
| col_sample_identifier | name of column describing the sample containing the mutation (string) |
| col_ref | name of column describing the reference allele (string) |
| col_alt | name of column describing the alternate allele (string) |
| col_consequence | name of column describing the consequence of the mutation, in Sequence Ontology (SO) terms (e.g. those that VEP would use) or PAVE ontology terms (string) |
| consequence_dictionary | which dictionary is used to describe variant consequences (SO / PAVE / etc) |
| missing_to_silent | assume any missing (NA) or empty ("") consequences are 'Silent' mutations |
| col_gene | name of column containing Hugo_Symbol of the gene affected by the mutation (string) |
| col_center | name of column containing the genome sequencing center reporting the variant (string) |
| col_entrez_gene_id | name of column containing entrez gene IDs (string) |
| col_dbSNP_rsid | name of column describing the dbSNP rsid of the variant, or "novel" if there is no dbSNP record (string) |
| col_dbSNP_validation_status | name of column describing the validation status of the variant. Elements must be one of: by1000genomes, by2Hit2Allele, byCluster, byFrequency, byHapMap, byOtherPop, bySubmitter, alternate_allele (string) |
| col_matched_normal_sample_identifier | name of column describing the matched normal sample identifier (string) |
| col_sequencer | name of column describing the instrument used to produce data (string) |
| col_sequence_source | name of column describing the molecular assay type used to produce the analytes used for sequencing (string). Elements are usually one of 'WGS', 'WGA', 'WXS', 'RNA-seq', etc |
| col_mutation_status | name of column describing the mutation status (string). Elements must be one of: None, Germline, Somatic, LOH, Post-transcriptional modification, or Unknown |
| verbose | verbose (flag) |
Alternate column names can be used, but require col_chrom, col_pos, col_
a maf-like data.frame (data.table)
# Start with a vcf/maf-like 'chr-pos-ref-alt-consequence-sample' data.frame
df = read.csv(system.file(package = "vcf2mafR", "testfiles/test_so.tsv"), sep = "\t")
# Convert to a MAF dataframe
df2maf(df, ref_genome = "hg38")
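Before calling df2maf it can help to confirm that the input has the columns listed in the description. The sketch below uses only base R; `missing_df2maf_columns()` is a hypothetical helper and the column names are the documented defaults, so adjust `required` if you remap columns with the `col_*` arguments.

```r
# Hypothetical pre-flight check; not part of vcf2mafR.
# Default column names expected by df2maf, as listed in its description.
required_cols <- c("chr", "pos", "sample", "ref", "alt", "consequence", "gene")

missing_df2maf_columns <- function(data, required = required_cols) {
  # Columns that are required but absent from the data.frame
  setdiff(required, colnames(data))
}

# Made-up single-variant example that lacks a 'gene' column
variants <- data.frame(
  chr = "chr1", pos = 12345L, sample = "S1",
  ref = "A", alt = "T", consequence = "missense_variant",
  stringsAsFactors = FALSE
)

missing_df2maf_columns(variants)
# returns "gene"
```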
Extract Sample Names From VCF filepath
paths_to_sample_names( filenames, extract = c("before_dot", "before_underscore") )
| Argument | Description |
|---|---|
| filenames | paths to VCF files |
| extract | which part of the filename to extract as the sample name. See details |
| Extraction | Explanation |
|---|---|
| before_dot | sample_name.otherinfo.vcf |
| before_underscore | samplename_other_info.vcf |
sample names extracted from filename (character)
path <- "/path/to/sample_name.otherinfo.vcf"
paths_to_sample_names(path)
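The two extraction modes can be approximated in base R as follows. This is a sketch of the idea rather than the package's code: `extract_sample_name()` is a hypothetical function, and it assumes the sample name is everything before the first dot or first underscore of the file's basename.

```r
# Hypothetical re-implementation for illustration; not the package's code.
extract_sample_name <- function(filenames,
                                extract = c("before_dot", "before_underscore")) {
  extract <- match.arg(extract)
  # Drop everything from the first separator onwards
  pattern <- if (extract == "before_dot") "\\..*$" else "_.*$"
  sub(pattern, "", basename(filenames))
}

extract_sample_name("/path/to/sample_name.otherinfo.vcf")
# returns "sample_name"
extract_sample_name("/path/to/samplename_other_info.vcf", "before_underscore")
# returns "samplename"
```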
List all valid MAF columns based on GDC specification
valid_maf_columns()
Names of each GDC MAF column in the order they appear (character)
valid_maf_columns()
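One possible use of this list is to spot columns in a MAF-like data.frame that are not part of the specification. The sketch below is an assumption about how you might do that: `nonstandard_columns()` is a hypothetical helper, and `example_maf_columns` is only an illustrative subset of GDC MAF column names so that the snippet runs without the package.

```r
# Illustrative subset only: a few GDC MAF column names.
# With vcf2mafR installed you would pass valid_maf_columns() instead.
example_maf_columns <- c(
  "Hugo_Symbol", "Entrez_Gene_Id", "Center", "NCBI_Build",
  "Chromosome", "Start_Position", "End_Position"
)

# Hypothetical helper: names in 'data' that are not in 'valid'
nonstandard_columns <- function(data, valid) {
  setdiff(colnames(data), valid)
}

# Made-up MAF-like table with one custom column
maf_like <- data.frame(
  Hugo_Symbol = "TP53", Chromosome = "chr17",
  Start_Position = 7675088L, my_custom_score = 0.9
)

nonstandard_columns(maf_like, example_maf_columns)
# returns "my_custom_score"
```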
Expects a vepped VCF. VEP must be run with a couple of options:
Identifiers: Gene symbol & Transcript Version & (HGVS: optional)
Transcript annotation: Transcript biotype & 'Identify canonical transcripts'
(Optional) If using the command-line version of VEP, also make sure to use --total_length (helps in transcript choice)
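One way to check that a VCF was annotated with these options is to inspect the CSQ field list in its header. The sketch below is an assumption-laden illustration: `csq_fields()` is a hypothetical helper, the header line is a shortened made-up example, and the mapping of the options to the CSQ field names SYMBOL, BIOTYPE and CANONICAL reflects VEP's usual output naming rather than anything this package guarantees.

```r
# Hypothetical helper: extract CSQ sub-field names from VCF header lines.
csq_fields <- function(header_lines) {
  csq_line <- grep("^##INFO=<ID=CSQ,", header_lines, value = TRUE)
  if (length(csq_line) == 0) return(character(0))
  # The field list follows "Format: " inside the quoted Description
  format_string <- sub('^.*Format: ([^"]*)".*$', "\\1", csq_line[1])
  strsplit(format_string, "|", fixed = TRUE)[[1]]
}

# Shortened, made-up header line in the style VEP writes
header <- paste0(
  '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations ',
  'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Feature|BIOTYPE|',
  'CANONICAL|cDNA_position">'
)

# Fields assumed to correspond to the options listed above
setdiff(c("SYMBOL", "BIOTYPE", "CANONICAL"), csq_fields(header))
# returns character(0), i.e. nothing is missing
```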
vcf2df( vcf, tumor_id = vcf_tumor_id, normal_id = vcf_normal_id, vcf_tumor_id = "TUMOR", vcf_normal_id = "NORMAL", debug_mode = FALSE, verbose = TRUE )
| Argument | Description |
|---|---|
| vcf | path to a VEP-annotated VCF file |
| tumor_id | desired value of Tumor_Sample_Barcode in maf. By default will use the name of the tumor sample in the VCF (string) |
| normal_id | desired value of Matched_Norm_Sample_Barcode in maf. By default will use the name of the normal sample in the VCF (string) |
| vcf_tumor_id | the sample ID describing the tumor in the VCF (string) |
| vcf_normal_id | the sample ID describing the normal in the VCF (string) |
| debug_mode | run in debug mode (flag) |
| verbose | verbose (flag) |
There are a couple of different types of inputs we supp
data.frame
path_vcf <- system.file(package = "vcf2mafR", "testfiles/test_b38.vepgui.vcf")
vcf2df(path_vcf)
Expects a vepped VCF. VEP must be run with a couple of options:
Identifiers: Gene symbol & Transcript Version & (HGVS: optional)
Transcript annotation: Transcript biotype & 'Identify canonical transcripts'
(Optional) If using the command-line version of VEP, also make sure to use --total_length (helps in transcript choice)
vcf2maf( vcf, ref_genome, tumor_id = vcf_tumor_id, normal_id = vcf_normal_id, vcf_tumor_id = "TUMOR", vcf_normal_id = "NORMAL", missing_to_silent = TRUE, verbose = TRUE, debug_mode = FALSE )
| Argument | Description |
|---|---|
| vcf | path to a VEP-annotated VCF file |
| ref_genome | name of the reference genome used to call variants (string) |
| tumor_id | desired value of Tumor_Sample_Barcode in maf. By default will use the name of the tumor sample in the VCF (string) |
| normal_id | desired value of Matched_Norm_Sample_Barcode in maf. By default will use the name of the normal sample in the VCF (string) |
| vcf_tumor_id | the sample ID describing the tumor in the VCF (string) |
| vcf_normal_id | the sample ID describing the normal in the VCF (string) |
| missing_to_silent | assume any missing (NA) or empty ("") consequences are 'Silent' mutations |
| verbose | verbose (flag) |
| debug_mode | run in debug mode (flag) |
a maf compatible data.frame
path_vcf_vepped <- system.file("testfiles/test_b38.vepgui.vcf", package = "vcf2mafR")
vcf2maf(vcf = path_vcf_vepped, ref_genome = "b38")
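The effect of `missing_to_silent` can be pictured with a small base R sketch. It is an illustration under assumptions, not the package's code: `missing_to_silent_sketch()` is a hypothetical function, and it assumes the replacement is applied to a character vector of variant classifications.

```r
# Hypothetical illustration of the missing_to_silent idea; not the package's code.
missing_to_silent_sketch <- function(classification) {
  # Treat both NA and the empty string as missing
  is_missing <- is.na(classification) | classification == ""
  classification[is_missing] <- "Silent"
  classification
}

missing_to_silent_sketch(c("Missense_Mutation", NA, ""))
# returns "Missense_Mutation" "Silent" "Silent"
```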
Take multiple VCF files, each representing a single tumour-normal pair, and combine them into one MAF
vcfs2maf( vcfs, ref_genome, tumor_id = paste0("Tumor", seq_len(length(vcfs)), times = length(vcfs)), vcf_tumor_id = "TUMOR", vcf_normal_id = "NORMAL", parse_tumor_id_from_filename = TRUE, extract = c("before_dot", "before_underscore"), verbose = TRUE )
| Argument | Description |
|---|---|
| vcfs | paths to VCF files (character) |
| ref_genome | name of the reference genome used to call variants (string) |
| tumor_id | desired value of Tumor_Sample_Barcode in maf. By default will use the name of the tumor sample in the VCF (string) |
| vcf_tumor_id | a vector of the tumor sample names to expect in the VCFs |
| vcf_normal_id | the sample ID describing the normal in the VCF (string) |
| parse_tumor_id_from_filename | should the tumour ID be parsed from the filename? (flag) |
| extract | which part of the filename to extract as the sample name. See details |
| verbose | verbose (flag) |
a maf-compatible data.frame
vcf_filepaths = dir(system.file(package='vcf2mafR', 'testfiles/cohort_of_vcfs/'), full.names = TRUE)
vcfs2maf(vcf_filepaths, ref_genome = "b38")
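Conceptually, building a cohort MAF means labelling each per-sample table with its sample ID and stacking the results. The base R sketch below shows that pattern only; `combine_sample_tables()` is a hypothetical helper operating on made-up data.frames, and how vcfs2maf combines its inputs internally is not documented here.

```r
# Hypothetical sketch of the "label then stack" pattern; not the package's code.
combine_sample_tables <- function(tables, sample_ids) {
  stopifnot(length(tables) == length(sample_ids))
  # Tag every row of each table with its sample identifier
  tagged <- Map(function(tbl, id) {
    tbl$Tumor_Sample_Barcode <- rep(id, nrow(tbl))
    tbl
  }, tables, sample_ids)
  # Stack the tables; all must share the same columns
  combined <- do.call(rbind, tagged)
  rownames(combined) <- NULL
  combined
}

# Two made-up per-sample tables
s1 <- data.frame(Chromosome = c("chr1", "chr2"), Start_Position = c(100L, 200L))
s2 <- data.frame(Chromosome = "chr3", Start_Position = 300L)

cohort <- combine_sample_tables(list(s1, s2), c("Tumor1", "Tumor2"))
cohort$Tumor_Sample_Barcode
# returns "Tumor1" "Tumor1" "Tumor2"
```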
This function takes a vector of VEP annotated biotypes and returns their ranks based on priority.
vep_rank_biotypes(biotypes)
| Argument | Description |
|---|---|
| biotypes | A character vector of VEP annotated biotypes. |
A numeric vector where values represent the priority of each biotype (lower number = higher priority).
vep_rank_biotypes(c("protein_coding", "miRNA", "pseudogene"))
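Rank-by-priority lookups of this kind are commonly written with `match()` against an ordered vector. The sketch below shows the idea; `rank_by_priority()` is a hypothetical helper and `example_priority` is a made-up ordering, not the priority table the package uses.

```r
# Made-up priority order for illustration; the package's real order may differ.
example_priority <- c("protein_coding", "lncRNA", "miRNA", "pseudogene")

# Hypothetical helper: position in 'priority' is the rank (1 = highest priority).
# Values not found in 'priority' get NA.
rank_by_priority <- function(values, priority) {
  match(values, priority)
}

rank_by_priority(c("protein_coding", "miRNA", "pseudogene"), example_priority)
# returns 1 3 4
```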
This function takes a vector of VEP annotated consequences and returns their ranks based on priority. If there are multiple consequences in one string (e.g. splice_region_variant&TFBS_ablation), this function will automatically return the priority for the most severe event.
vep_rank_consequences(consequences)
| Argument | Description |
|---|---|
| consequences | A character vector of VEP annotated consequences. |
A numeric vector where values represent the priority of each consequence (lower number = higher priority).
vep_rank_consequences(c("missense_variant", "synonymous_variant", "splice_region_variant&TFBS_ablation"))
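The "most severe event wins" rule for `&`-separated strings can be sketched as: split each string, rank every term, and keep the smallest rank. The code below is an illustration under assumptions; `rank_consequence_strings()` is a hypothetical function and `example_severity` is a short made-up ordering, not the package's full consequence table.

```r
# Short illustrative ordering (most severe first); not the package's table.
example_severity <- c(
  "splice_acceptor_variant", "stop_gained", "missense_variant",
  "splice_region_variant", "synonymous_variant", "TFBS_ablation"
)

# Hypothetical helper: rank of the most severe term in each string.
rank_consequence_strings <- function(consequences, order = example_severity) {
  vapply(strsplit(consequences, "&", fixed = TRUE), function(terms) {
    ranks <- match(terms, order)
    # NA when none of the terms is in the ordering
    if (all(is.na(ranks))) NA_integer_ else min(ranks, na.rm = TRUE)
  }, integer(1))
}

rank_consequence_strings(c("missense_variant", "splice_region_variant&TFBS_ablation"))
# returns 3 4
```

In the second string, splice_region_variant (rank 4) outranks TFBS_ablation (rank 6), so the combined annotation gets rank 4.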