A collection of commands for a basic GATK-based workflow, with brief explanations

0 Software preparation

Software  Download URL
fastp https://github.com/OpenGene/fastp/releases
bwa https://github.com/lh3/bwa/releases
samtools https://github.com/samtools/samtools/releases
gatk https://github.com/broadinstitute/gatk/releases
annovar https://annovar.openbioinformatics.org/en/latest/user-guide/download/

* Software versions change over time, and these links may stop working.

1 FASTQ file processing

1.1 Quality control of raw FASTQ files

fastp -i [R1_In] -I [R2_In] \
      -o [R1_Out] -O [R2_Out] \
      --json [Name].json \
      --html [Name].html

1.2 Aligning FASTQ reads to the reference genome

Thread_Num=[number]
bwa mem -t $Thread_Num -R "[Header Information]" [Genome] [R1] [R2] | \
samtools view -b -S -@ $Thread_Num - | \
samtools sort -@ $Thread_Num -o [Name].sorted.bam -
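
The -R argument of bwa mem takes a complete read group header line, and the GATK steps that follow rely on the read group (in particular its SM field) being present. A minimal sketch of building that string is given here; the function name make_read_group and the default library and platform values are assumptions for illustration, not part of the original workflow.

```shell
# Build a read group header line for `bwa mem -R`.
# $1 = sample name (used for both ID and SM)
# $2 = library name  (hypothetical default: lib1)
# $3 = platform      (hypothetical default: ILLUMINA)
make_read_group() {
    # The doubled backslash makes printf emit a literal backslash followed by "t";
    # bwa mem converts that two-character sequence into a real TAB itself.
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s' \
        "$1" "$1" "${2:-lib1}" "${3:-ILLUMINA}"
}
```

It could then be used as: bwa mem -t $Thread_Num -R "$(make_read_group [Name])" [Genome] [R1] [R2]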

1.3 BQSR

gatk MarkDuplicates -I [Name].sorted.bam -O [Name].marked.bam -M [Name].metrics

gatk FixMateInformation -I [Name].marked.bam -O [Name].fixed.bam -SO coordinate

gatk BaseRecalibrator \
     -R [Genome] -I [Name].fixed.bam \
     --known-sites [db_snp] \
     --known-sites [db_indel] \
     -O [Name].recal.table

gatk ApplyBQSR -R [Genome] -I [Name].fixed.bam -bqsr [Name].recal.table -O [Name].bqsr.bam
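
The commands above read [Genome] through its index files: the GATK tools look for a .fai index and a .dict sequence dictionary next to the FASTA, and bwa mem in 1.2 needs the files written by bwa index. The sketch below only reports which of these are missing; the function name missing_reference_indexes is hypothetical, and it assumes an uncompressed FASTA with a single extension such as .fa or .fasta.

```shell
# Print the index files that are still missing for a reference FASTA.
# $1 = path to the reference FASTA ([Genome] in the commands above)
missing_reference_indexes() {
    # samtools faidx writes <genome>.fai
    [ -e "$1.fai" ] || echo "$1.fai"
    # gatk CreateSequenceDictionary writes <genome minus its extension>.dict
    [ -e "${1%.*}.dict" ] || echo "${1%.*}.dict"
    # bwa index writes several files; <genome>.bwt is checked as a marker
    [ -e "$1.bwt" ] || echo "$1.bwt"
}
```

Anything it prints can be generated with samtools faidx [Genome], gatk CreateSequenceDictionary -R [Genome], and bwa index [Genome].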

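For paired tumor/normal data, add local realignment around indels after the [Fix Mate Information] step. The two tools below exist only in GATK 3.x (they were removed in GATK4); GATK3 here stands for however GATK 3.x is launched on your system, for example java -jar GenomeAnalysisTK.jar.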

GATK3 -T RealignerTargetCreator \
      -R ${Ref} \
      -known ${Ind1} \
      -known ${Ind2} \
      -I $Directory_Output/${SampleName[$i]}/$Chr/${SampleName[$i]}.$Chr.fixed.bam \
      -o TMP/${SampleName[$i]}.$Chr.realn.intervals
           
GATK3 -T IndelRealigner \
      -R ${Ref} \
      -known ${Ind1} \
      -known ${Ind2} \
      -targetIntervals TMP/${SampleName[$i]}.$Chr.realn.intervals \
      -I $Directory_Output/${SampleName[$i]}/$Chr/${SampleName[$i]}.$Chr.fixed.bam \
      -o $Directory_Output/${SampleName[$i]}/$Chr/${SampleName[$i]}.$Chr.realn.bam

1.4 SNP calling
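
As a minimal sketch of this step, assuming germline short-variant calling with GATK4 HaplotypeCaller on the recalibrated BAM from 1.3: the function name call_snps and the output name [Name].raw.vcf.gz are hypothetical, and tumor/normal pairs would normally go to a somatic caller instead.

```shell
# Call germline short variants on one recalibrated BAM (illustrative sketch).
# $1 = reference FASTA ([Genome])
# $2 = sample name ([Name]); expects $2.bqsr.bam from the BQSR step
call_snps() {
    gatk HaplotypeCaller \
         -R "$1" \
         -I "$2.bqsr.bam" \
         -O "$2.raw.vcf.gz"   # hypothetical output name
}
```

Example call: call_snps [Genome] [Name]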

1.5 Variant filtering
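
One possible way to do this, sketched under assumptions: keep SNP records with gatk SelectVariants, then flag low-quality calls with gatk VariantFiltration. The thresholds are the generic hard-filtering starting points commonly quoted for SNPs and should be tuned to the data; the function name filter_snps, the filter name, and the input name [Name].raw.vcf.gz are hypothetical.

```shell
# Generic SNP hard-filter expression (an assumption, adjust for your data).
SNP_FILTER='QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0'

# $1 = reference FASTA ([Genome])
# $2 = sample name ([Name]); expects $2.raw.vcf.gz as input (hypothetical name)
filter_snps() {
    # Step 1: keep SNP records only.
    gatk SelectVariants -R "$1" -V "$2.raw.vcf.gz" \
         --select-type-to-include SNP -O "$2.snp.vcf.gz" &&
    # Step 2: records matching the expression get the filter name written
    # into the FILTER column; they are flagged, not removed.
    gatk VariantFiltration -R "$1" -V "$2.snp.vcf.gz" \
         --filter-expression "$SNP_FILTER" --filter-name "snp_hard_filter" \
         -O "$2.vcf"
}
```

Example call: filter_snps [Genome] [Name]. The result is written to [Name].vcf, the file name that the annotation command in 1.6 takes as input.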

1.6 Gene annotation

table_annovar.pl [Name].vcf [ANNOVAR Home Path]/humandb/ \
            -buildver [version] \
            -out [Name].annovar \
            -protocol [refGene,cytoBand,exac03,avsnp147,dbnsfp30a] \
            -operation [gx,r,f,f,f] \
            -nastring . -vcfinput --otherinfo
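
table_annovar.pl expects -protocol and -operation to carry the same number of comma-separated entries, one operation per database. The sketch below checks this before a long annotation run is started; the helper names count_fields and check_annovar_lists are hypothetical.

```shell
# Print the number of comma-separated fields in $1.
count_fields() {
    printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Succeed only if the protocol list ($1) and the operation list ($2)
# have the same number of entries.
check_annovar_lists() {
    [ "$(count_fields "$1")" -eq "$(count_fields "$2")" ]
}
```

Example: check_annovar_lists "refGene,cytoBand,exac03,avsnp147,dbnsfp30a" "gx,r,f,f,f" || echo "protocol/operation mismatch"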