File talk:OSNP PGR.png
Part of the goal is to validate the original studies' haplotype blocks on the 23andMe (v3) chip and to determine replacements for the SNPs not available on that platform. When only SNPs from the original study are included, it's fairly straightforward to replicate the original haplotype blocks. With 1000 Genomes data it's possible, but it requires some unbecoming statistical abuse, namely fine-tuning the confidence intervals until they produce matching block divisions. With the full OpenSNP 23andMe (v3) dataset, however, it doesn't appear possible at all, using either the four-gamete rule or fine-tuned confidence intervals.
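For reference, the pairwise test behind the four-gamete rule can be sketched roughly as follows. This is only an illustration, not HaploView's actual code: the function name, the 0/1 allele coding and the 0.01 minimum frequency are assumptions made for the example, and it expects already-phased haplotypes.

```python
from collections import Counter


def four_gamete_violation(hap_a, hap_b, min_freq=0.01):
    """Return True if all four two-SNP haplotypes (gametes) are observed.

    hap_a and hap_b hold the alleles of two biallelic SNPs on the same
    phased chromosomes, coded 0/1 (None = missing). A gamete only counts
    if its frequency is at least min_freq (an assumed threshold).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("both SNPs must cover the same chromosomes")
    # Keep only chromosomes where both alleles are known.
    pairs = [(a, b) for a, b in zip(hap_a, hap_b)
             if a is not None and b is not None]
    if not pairs:
        return False
    counts = Counter(pairs)
    total = len(pairs)
    common = [g for g, n in counts.items() if n / total >= min_freq]
    # Four distinct gametes suggest a historical recombination between
    # the two SNPs, so they should not share a block under this rule.
    return len(common) >= 4
```

A block under this rule is then a run of consecutive SNPs in which no pair shows the violation. Because a single miscalled genotype can create a spurious fourth gamete, poor-quality SNPs will tend to split blocks.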
I suspect that the assumption that 23andMe has already run quality control on the raw data, using their larger and more complete dataset, is partly to blame. SNPs such as rs25531 indicate that the raw data files may contain SNPs that haven't had even rudimentary quality control run on them, and many previously available SNPs have indeed been dropped by 23andMe on quality-control grounds. This is a problem partly because the "raw" data files aren't really raw data; a lot of quality information has been stripped from them. Filtering on call rate alone is also problematic, because the dataset includes SNPs that haven't been reported on every kit.
The first idea that comes to mind is to filter the SNPs down to those included in 23andMe's current v3 raw data. Unfortunately this would also throw away a lot of information from, for example, v2+v3 kits, which would be useful for phasing, so a more detailed effort to identify SNPs dropped on quality-control grounds is probably warranted. There is still the problem that after vcf-merge, no-calls and missing (untested) genotypes are treated as equal, so filtering to only the SNPs tested on every kit would be the most expedient option for now, though undesirable for eventual combination with v4 and FTDNA chips.
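One way around the no-call versus untested ambiguity would be to compute call rates from the individual raw files before merging, so that the denominator for each SNP is only the kits that actually report it. The sketch below assumes the usual 23andMe raw layout (tab-separated rsid, chromosome, position, genotype, with `#` comment lines and `--` for a no-call); the function names are made up for illustration.

```python
from collections import defaultdict

NO_CALL = "--"  # no-call marker in 23andMe raw files


def parse_kit(lines):
    """Map rsid -> genotype for one raw file, skipping comment lines."""
    genotypes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # malformed line, ignore
        rsid, _chrom, _pos, genotype = fields[:4]
        genotypes[rsid] = genotype
    return genotypes


def per_snp_call_rate(kits):
    """Return rsid -> (call rate among kits reporting the SNP, kit count)."""
    tested = defaultdict(int)
    called = defaultdict(int)
    for kit in kits:
        for rsid, genotype in kit.items():
            tested[rsid] += 1  # the SNP appears in this kit's file
            if genotype != NO_CALL:
                called[rsid] += 1  # and it got an actual genotype
    return {rsid: (called[rsid] / tested[rsid], tested[rsid])
            for rsid in tested}
```

With the kit count kept alongside the rate, a SNP present on only some chip versions is not penalised for kits that never tested it, and the "tested on every kit" filter becomes a simple comparison of that count against the number of kits.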
Another issue is related individuals. I'm reviewing the BEAGLE documentation to determine whether this has much effect on BEAGLE's model, since it is simultaneously determining IBD anyway. HaploView's LD plots are almost certainly affected by unmarked relationships. The easiest way to resolve this is probably to repeat the analysis on the founder population only, but that is going to be a lot easier to address after the quality issues are dealt with.
Right now the results support that haplotype identification can be done with tag SNPs as in the original studies, although the currently missing SNPs are going to cause some false positives. Including the whole "long haplotype" across all four haplotype blocks does significantly reduce the false positives, though. The results also support that the risk allele of rs11224561 is in a separate haplotype block from the two SNPs previously implicated as causal variants, which was left as an open question in [PMID 21148628].