Applying PSMC to Neandertal data
Guidelines to apply the psmc to Neandertal data
In 2014, it was published The complete genome sequence of a Neanderthal from the Altai Mountains in nature. As a part of their study, authors did a psmc analysis with the data. I describe here how I was able to reproduce the analysis at July 3, 2015. The tools I used where:
First we download and compile the tools
git clone https://github.com/samtools/samtools.git
git clone https://github.com/samtools/htslib.git
git clone https://github.com/samtools/bcftools.git
cd bcftools
make
cd ../samtools
make
The read sequences of the AltaiNeandertal where aligned to the the GRCh37 build of the human reference
Assume we have all the aligned chromosomes in a folder called “bam”, named chr1.bam, chr2.bam and so on.
We must first uncompress and index the reference file
gunzip ./human_g1k_v37.fasta.gz
./samtools/samtools faidx human_g1k_v37.fasta
This will produce a file human_g1k_v37.fasta.fai
Now I run the following pipeline:
./samtools/samtools mpileup -C50 -uf ./human_g1k_v37.fasta ./bam/chr1.bam | \
./bcftools/bcftools call -c | ./bcftools/vcfutils.pl vcf2fq -d 10 -D 100 | \
gzip > ./chr1.fq.gz
Note: Samtools is also able to open a BAM (not SAM) file on a remote FTP or HTTP server (see the doc). So in the last command we could replace ./bam/chr1.bam by the link to the bam file in the server (i.e. http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/bam/AltaiNea.hg19_1000g.1.dq.bam)
By using the last pipeline I was able to have one fq file per chromosome. The .fq file produced by vcfutils.pl contains lowercase and uppercase letters (eg ‘nnATgagtttAGnn’) wich is a bit confusing given that a fasta sequence (normally) should contains only uppercase characters (see the fastq Format Specification). After a bit of goggling it seems that lowercase indicates low coverage.
In order to get the input sequence for psmc, I used the fq2psmcfa tool that comes with the psmc software. For each chromosome
for i in {1..22};
do ../psmc/utils/fq2psmcfa -q20 ./chr$i.fq.gz > chr$i.psmcfa;
done
And finally, I put all the chromosomes together in a single file.
for i in {1..22};
do cat ./chr$i.psmcfa >> AltaiNea.psmcfa;
done
Now I can run the psmc analysis. I use the deffault parameters. Assuming you have a compiled version of the psmc inside a folder called psmc and your data (the file AltaiNea.psmcfa) is in the current directory, you can do:
./psmc/psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o ./AltaiNea.psmc ./AltaiNea.psmcfa
This will produce the psmc inferred history (it could take a while). The output will be written in a file named AltaiNea.psmc
To make a plot and see the inferred demographic history, you can type
./psmc/utils/psmc_plot.pl -p ./AltaiNea.pdf ./AltaiNea.psmc
This will produce a pdf file with a plot of the demographic history.