Preprint reviews by Hugues Roest Crollius

Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication

Carlos Nossa, Paul Havlak, Jia-Xing Yue, Jie Lv, Kim Vincent, H Jane Brockmann, Nicholas H Putnam

Review posted on 30th November 2013

The manuscript describes an approach to reconstruct chromosome-scale genome assemblies. In short, the strategy relies on the simultaneous high-throughput sequencing of two parents and 34 offspring genomes, followed by the identification of 23-mer sequences in reads. Unique 23-mers are chained into contigs, and these are assembled into scaffolds using read-pair information. High-confidence SNPs are identified and used to cluster scaffolds based on their segregation patterns. The approach is applied here to the genome of the Atlantic horseshoe crab Limulus polyphemus. The assembly is made of 91,320 markers (i.e. small scaffolds) grouped in ~2000 genetic bins, themselves forming about 32 linkage groups. The authors use different statistical tools to estimate the average sequence divergence among individuals and the distribution of allele frequencies. They map the scaffolds to ancestral chordate linkage groups, and they generate data that suggest a whole genome duplication in the horseshoe crab lineage.

In general we are enthusiastic about the new GBS method. This kind of map fills a gap between genetic maps, which often provide chromosome-scale resolution but little functional insight into the genome, and highly fragmented genome sequence assemblies, which cover most genes but have no long-range contiguity. The minor, tied and major alleles are recovered from the raw data at the expected frequencies, which supports the general principle of the approach.

Major Compulsory Revisions

Our main concerns with the manuscript are twofold. First, there is a lack of clarity in stating the aims of the work and in describing the results. The genomic resource is in effect a hybrid between a high-density genetic map (with ~2000 distinct genetic intervals or bins) and a draft sequence of the genome (with ~90,000 scaffolds). Within each bin, about 45 scaffolds on average are clustered with no known order or orientation. We think the authors should state this clearly in the abstract, so that the reader has a clear understanding of where the manuscript is going and of the advantages and limits of the resource. Indeed, many parts of the results are difficult to follow, or too often assume a wealth of background knowledge on the part of the reader.

The second point concerns the evidence provided in support of a whole genome duplication (WGD). There are four lines of evidence: (i) duplicated Hox clusters; (ii) 25 paralogons of >6 duplicated markers separated by <300 (500?) markers, covering 30% of the map; (iii) duplicated markers from these paralogons that connect to the same ALG; and (iv) a Ks peak modelled from a distribution of a few high-confidence paralogous markers. However, the lack of precision in the description of the methods gives a general impression of weakness to this evidence. For example, how do we know that several markers per paralogon are not exons of the same L. polyphemus gene, in which case a given paralogon could be created by one or a few duplicated genes? We need to know that each of the 2716 duplicated markers maps to a different gene in the tick genome. Similarly, the 45% of duplicated markers that map to the same linkage group can be explained if the two members of each pair are in fact exons of the same gene. Indeed, we are not told whether the two markers used to identify paralogous genes actually overlap on their orthologous tick gene. Furthermore, the Ks peak is not very convincing. This might be because each “coding” sequence reconstructed by exonerate is likely to be short (a single exon each?), leading to a high variance of Ks measures and a lot of noise. Regarding the Hox clusters, they could partially be explained by the same reasons given above and, in any case, may represent a specific situation in the horseshoe crab. Finally, ancient WGDs are often followed by a striking radiation of new species (50,000 vertebrates after 1R/2R, 25,000 fish after 3R), presumably because of genetic incompatibility leading to speciation. Was this the case in an ancestor of the horseshoe crab? This should perhaps be discussed.
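To make the concern about noisy Ks estimates concrete, here is a minimal sketch of how sampling noise scales with sequence length. It is a deliberate simplification and not the authors' Ks estimator: it treats the observed proportion of differing synonymous sites as a binomial proportion and ignores any multiple-hit correction. The function name and the numerical values are hypothetical.

```python
import math

def proportion_standard_error(p, n_sites):
    """Binomial standard error of an observed proportion of differing sites.

    p       -- assumed true proportion of differing synonymous sites
    n_sites -- number of synonymous sites compared (hypothetical values)
    """
    if not 0.0 <= p <= 1.0 or n_sites <= 0:
        raise ValueError("p must be in [0, 1] and n_sites must be positive")
    return math.sqrt(p * (1.0 - p) / n_sites)

# A short, single-exon alignment versus one four times longer.
se_short = proportion_standard_error(0.2, 100)
se_long = proportion_standard_error(0.2, 400)
print(se_short, se_long)
```

Under this simplified model the standard error shrinks only with the square root of the number of sites, so quadrupling the aligned length halves the noise. Single-exon alignments would therefore be expected to broaden any Ks peak.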

We would strongly recommend a re-writing of the manuscript with these recommendations in mind. Some specific areas that need more attention:

Page 4 - Assembly and Mapping – this part is too condensed to be understood on its own (reading the Methods is necessary to understand not only the fine details, but also the entire philosophy of the paper). This part dives straight into specialised jargon that is poorly described even in the Methods.

Page 4 line 102: The term “paired-end reads” should appear clearly.

Page 4: The sentence “we tabulated high-quality 23-mers occurring in two sequencing runs (and thus also in at least two individuals)” cannot be understood as is. Do the authors mean sequencing reads? With what technology? Why do two sequencing reads/runs imply two individuals?

Page 4: To understand the expected frequencies of 25%, 50%, 75% of parental alleles (minor, major, and tied alleles) most readers could do with a small schematic drawing to remind them of these simple genetic principles.
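As a minimal illustration of the principle behind these frequencies (our own sketch, not taken from the manuscript): two diploid parents carry four haplotypes in total, and under Mendelian segregation each parental haplotype is expected to contribute one quarter of the offspring haplotypes. The function name is hypothetical, and the mapping of "minor", "tied" and "major" to 25%, 50% and 75% is our assumption.

```python
from fractions import Fraction

N_PARENTAL_HAPLOTYPES = 4  # two diploid parents

def expected_allele_fraction(n_carrier_haplotypes):
    """Expected fraction of pooled offspring haplotypes carrying an allele
    that is present on n of the four parental haplotypes, assuming
    Mendelian segregation and equal sampling of all offspring."""
    if not 0 <= n_carrier_haplotypes <= N_PARENTAL_HAPLOTYPES:
        raise ValueError("carrier count must be between 0 and 4")
    return Fraction(n_carrier_haplotypes, N_PARENTAL_HAPLOTYPES)

# Assumed labels: allele on 1, 2 or 3 of the four parental haplotypes.
LABELS = {1: "minor", 2: "tied", 3: "major"}
for n_carriers, label in LABELS.items():
    print(label, expected_allele_fraction(n_carriers))
```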

Page 4 line 112: what is an edit distance?

Page 4 line 122: “across the four parental haplotypes” would be a useful clarification.

Page 4 line 126: We do not understand how the “map bins” are obtained. We do not see which part of the Methods corresponds to this step of the analysis.

Page 5 line 137: The authors mention that they model the simulated Ciona data as a single stretched exponential distribution. Figure 2 mentions a sum of two stretched exponentials. Which one is correct? Additionally, a short description of the model in the Methods would be necessary (equation, goodness of fit statistics) to evaluate how reliable this model is at estimating the error rate without overfitting, etc.

Page 6 line 160: We need to see some quantification for the prediction that the genome is germline methylated. What is the value of the anticorrelation coefficient?

Page 6 line 165: why is there a consistency expected between genome size and recombination rate? When stating such a result, the reader should be reminded of the hypothetical/theoretical framework in which to interpret a result.

Page 6 line 166: Correlations of recombination rates between parents and with local SNP density: the reported correlations, although significant, are extremely low. This should be discussed. Why is this biologically relevant?

Page 6 line 179: why the tick genome? How phylogenetically distant is it from the horseshoe crab? Why is it the best outgroup?

Discussion:

We think the authors should mention a major limitation of this approach, which is that it requires a parental pair and their offspring (tens of individuals to sample a sufficient number of meioses), and thus can only be applied to species that can be bred in a controlled environment and produce sufficiently large numbers of offspring.

Line 358: what is a library in this context?

Line 368: what is a “run” in this context?

Lines 480-486: The multiplication of coverage according to SNP density during genotyping requires more explanation. The strategy of summing the reads across a given marker to increase the support for individual SNPs does not seem comparable to the case where one has a reference genome at hand. Indeed, here, the physical linkage between SNPs AND the support for a given SNP are estimated at the same time. In an extreme case, two neighbouring SNPs may be linked by one read (weak linkage) but individually supported by 3 reads each. Summing reads across the two SNPs would illegitimately raise the support for the genotype.
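The extreme case described above can be written out as a small sketch. This is only our illustration of the concern, not the authors' genotyping procedure; the read identifiers, SNP names and function name are hypothetical.

```python
def support_summary(reads_by_snp):
    """Compare per-SNP read support, naively summed support, and the number
    of reads that actually link all SNPs of a marker together.

    reads_by_snp -- dict mapping a SNP name to the set of read IDs covering it
    """
    per_snp = {snp: len(reads) for snp, reads in reads_by_snp.items()}
    summed_support = sum(per_snp.values())
    # Reads covering every SNP are the only direct evidence of linkage.
    linking_reads = set.intersection(*reads_by_snp.values())
    return per_snp, summed_support, len(linking_reads)

# Two neighbouring SNPs, three reads each, only read "r3" spans both.
reads_by_snp = {
    "snp_a": {"r1", "r2", "r3"},
    "snp_b": {"r3", "r4", "r5"},
}
per_snp, summed_support, n_linking = support_summary(reads_by_snp)
print(per_snp, summed_support, n_linking)
```

In this example the summed support is twice the per-SNP support, while the assumption that both SNPs belong to the same marker rests on a single read.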

Line 664: The authors should indicate precisely what conditions were used for the clustering of paralogs into “paralogons”. The text in the Results says that the max-gap is 300 markers (line 210), whereas the Methods say 500?
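To show why the exact max-gap value matters, here is a simplified one-dimensional sketch of max-gap clustering. It is not the authors' implementation, which we assume clusters pairs of markers along two regions. The function name and marker positions are hypothetical, and we assume that a gap equal to the threshold is still allowed.

```python
def max_gap_clusters(positions, max_gap, min_size=1):
    """Chain sorted marker positions into clusters, starting a new cluster
    whenever two consecutive markers are more than max_gap apart.
    Clusters with fewer than min_size markers are discarded."""
    clusters = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters

# Hypothetical map positions of duplicated markers (in marker units).
positions = [0, 250, 650, 900, 1300]
clusters_300 = max_gap_clusters(positions, max_gap=300)
clusters_500 = max_gap_clusters(positions, max_gap=500)
print(len(clusters_300), len(clusters_500))
```

With these hypothetical positions the same markers form three clusters at a max-gap of 300 but a single cluster at 500, so the number and extent of paralogons can depend strongly on which value was actually used.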

Minor Essential Revisions:

Page 4 line 91: why is L. polyphemus of commercial interest?

Page 4 line 120: Typo: it should read “1.3 billion”, not “million”, according to Table 1.

Line 306: How do the authors know that the parental pair was monandrous?

Line 280: that have given rise.

Figure 10: The unit should be indicated on the Y axis.

Level of interest: An article of importance in its field

Quality of written English: Acceptable

Statistical review: Yes, and I have assessed the statistics in my report.

Declaration of competing interests: I declare that I have no competing interests

Camille Berthelot and Hugues Roest Crollius
