Carlos Nossa, Paul Havlak, Jia-Xing Yue, Jie Lv, Kim Vincent, H Jane Brockmann, Nicholas H Putnam
The manuscript “Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication” is an interesting and important contribution to the field of comparative genomics. The paper details a valuable contribution to the technique of mapping-by-sequencing, including a novel combination of de novo assembly and mapping, which should help to propel other important study organisms into the genomic era. The study also reveals an interesting pattern of gene linkage and evolution consistent with ancient whole-genome duplications, thus expanding our understanding of ecdysozoan diversity and deep evolution. A noteworthy aspect of the study is the apparent rigor of the analysis; the choices made here are sound and well-supported. My feedback thus focuses mostly on presentation issues as outlined below. First though, I make some general comments about the details of genome assembly presented in the Methods.
In the section on SNPmer identification, it is not clear to me why false or missed pairing should have a limited effect on subsequent analysis (line 399). If such errors affect scaffolding, haplotype phasing and linkage analysis, and many loci in the genome have clearly been duplicated, it seems to me that together these issues could wreak havoc when reconstructing the genome. Perhaps the authors could elaborate on why they believe these types of errors are not a critical issue, and provide some evidence (such as a simulation) to support that view.
In the Contigging and Scaffolding section, I was confused by the description of how initial scaffolds were created by independent runs of Bambus. The caveat here is that I am not a specialist in this area, and it may be clear to someone versed in the art; nevertheless, I hope the authors will consider how they might improve the clarity of their explanation so that a non-specialist may more easily follow the logic and mechanics of the process. By comparison, I found the following sections on SNP phasing and genotype inference, likelihood maximization and linkage group construction to be quite lucid (although I note a concern enumerated below about the EM step). So it may be that I simply lack the background needed to understand this section.
I have no major compulsory revisions to request, and just a few minor essential revisions. Most of my comments are discretionary, in that they affect the presentation rather than the content of the manuscript; it is a challenge to summarize such a complex and multifaceted project in a few graphics.
Minor Essential Revisions (in no particular order):
Much of the ‘biological story’ of this work centers around the presence of the putative Ancestral Linkage Groups (ALG) found in metazoan genomes. Given its importance, and the well-known liabilities of reciprocal best-blast hits in the presence of multiple paralogs, it would be reasonable to leverage software created for this type of problem, such as proteinortho (http://www.bioinf.uni-leipzig.de/Software/proteinortho/) or orthoMCL (http://www.orthomcl.org). The e-value used in the Ixodes reciprocal blasting seems slightly low, and these approaches may provide more power for the analysis.
Along these lines, I find it difficult to extract immediate and relevant biological insight from figures 4 and 5. I appreciate that creating effective graphical representations of data like these is not trivial; I would suggest a couple of tweaks. A chief concern is that the figures look like random scatterplots at first glance, which is the opposite of the intended effect (from what I gather). The legends as they stand are quite short, and it could be beneficial to the naïve reader to state the expected pattern both in the complete absence of synteny, and with completely syntenic alignments. To clarify the pattern, consider using greyed-out grid lines, and eliminating the smallest third of the Limulus linkage groups, where there is much less information (and the patterns are difficult or impossible to see). Particularly for figure 5, the heavy black lines depicting Markov-derived segmentation of linkage groups should be set apart from the linkage group boundaries themselves by a different color (or possibly a stippled/dashed line). Alternatively, or in addition, focus the graphics on selected linkage group/chromosome comparisons that most clearly illustrate the patterns of interest. As they stand, the figures need improving in order to convey the intended message. Figure 8, by comparison, is much more compelling.
Figure 3 provides a nice summary of a lot of data but could use some work. In the rendering I have on my pdf, I cannot legibly make out the smallest linkage groups (nor, from what I can tell, are they important enough to merit inclusion, except for completeness); the columns are unlabeled; the meaning of “pi” is not explained in the legend; and placing the number of the linkage group directly over the corresponding recombination frequency block effectively (at least for the small groups) obscures them. The most salient point I draw from the figure is that male and female recombination rates differ spatially and across linkage groups, which is not surprising. The more salient and novel observations are the p-values for enrichment of ALG, the ALG groups themselves (for which I did not see a key, although this is of debatable importance) and the location of hox genes, which are more of an afterthought in this figure. For this reason, I urge you to consider (1) whether you are drawing the reader’s attention to the most important aspects of the data and (2) whether you are effectively conveying them. Might a more comparative depiction of hox gene complements showing order and presence/absence (such as e.g. Genome Biol. Evol. 4(9):937-953) be a revealing way to show the data? It may also be useful to depict the ALG and the hox separately.
In my pdf, the legend for figure 1 simply says “Fitting Poisson distributions to Limulus 23mer frequencies (filtered as described below)”, which I gather is incomplete. There is no text at all under the main paper heading “Figures”; rather, figures 1 and 2 are embedded in the main text and the rest follow the references. This may well be a production issue out of the authors’ hands, but I thought it worth mentioning.
In the Methods section “Error model calibration”, did the authors attempt simulations at coverage much lower than the reported 20X and 5X levels for parental and offspring genotypes? It is clear why these levels were chosen (to approximate the amplified coverage of actual data due to multiple redundant/informative individuals in the mapping panel), but less clear is the effect on incorrect genotype calls when coverage is lower (as it often is, stochastically). It would be useful to others to report such data here, if available.
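To make the concern about low coverage concrete, here is a minimal sketch of the kind of check I have in mind. It is a toy model of my own, not the authors' error model: read depth at a site is assumed to be Poisson-distributed, each read samples either allele with probability 0.5, sequencing error is ignored, and a heterozygous site counts as "missed" when every read shows the same allele (or there are no reads). The function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def het_dropout_rate(mean_cov, n_sites=20000, seed=0):
    """Simulated fraction of heterozygous sites that look homozygous.

    Assumptions (mine, for illustration only): Poisson depth,
    unbiased allele sampling, no sequencing error.
    """
    rng = random.Random(seed)
    missed = 0
    for _ in range(n_sites):
        depth = poisson(mean_cov, rng)
        ref_reads = sum(rng.random() < 0.5 for _ in range(depth))
        # Missed if only one allele (or no read at all) was observed.
        if ref_reads == 0 or ref_reads == depth:
            missed += 1
    return missed / n_sites


def expected_dropout(mean_cov):
    """Closed form under the same assumptions.

    Reference and alternate read counts are independent Poisson(c/2),
    so P(both alleles seen) = (1 - exp(-c/2))**2.
    """
    return 1.0 - (1.0 - math.exp(-mean_cov / 2.0)) ** 2


if __name__ == "__main__":
    for cov in (1.0, 2.0, 5.0, 20.0):
        print(cov, het_dropout_rate(cov), expected_dropout(cov))
```

Even this crude model suggests that the miscall rate changes steeply between roughly 1X and 5X, which is why a table of error rates at lower simulated coverage would be informative.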
As with any manuscript, there are typos and similar errors to clean up. For me, it is always jarring to see these in an otherwise stellar report. An abbreviated list (because I did not keep close track):
- line 197: Figure 3, not Figure 1
- line 262: misspelled ‘quantitative’
- lines 270-273: consider moving this paragraph below the following paragraph for flow
- line 280: misspelled ‘gave’
- line 402: ‘happened’
- line 475: ‘sections and above’
- lines 598-600: incomplete sentence, or remove ‘which’
- line 652: misspelled ‘homeobox’
- references 10 and 35 need correction
In the results section “Genomic distribution of paralogous genes”, it would be useful to define (for the general reader) the difference between segment creation by duplication versus fission: how do the mechanisms and outcomes differ? On lines 232-233, you state, “We examined the relationships among paralogons for evidence of successive rounds of duplication”. You then describe a reasonable approach that compares the clustering coefficient of the observed graph with that of a random graph, and find it to be much higher in the observed than in the random data (0.19 compared to 0.034). No interpretation is then given of how these numbers answer the question above. Can you provide more analysis or interpretation of what you can conclude from this calculation? Or was it simply a way to quantify that paralogons share more clustering than expected by chance?
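For readers unfamiliar with this kind of comparison, the following is a minimal sketch of how I understand it, under assumptions of my own: the graph is undirected with paralogons as nodes, the statistic is the mean local clustering coefficient, and the null model is a random graph with the same number of nodes and edges. The manuscript may define any of these differently, and the function names are hypothetical.

```python
import random
from itertools import combinations


def build_adjacency(nodes, edges):
    """Undirected adjacency sets; edges must be unique, without self-loops."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 count as 0)."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def clustering_vs_random(nodes, edges, n_random=200, seed=0):
    """Return (observed, mean of null, empirical one-sided p-value)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    observed = average_clustering(build_adjacency(nodes, edges))
    all_pairs = list(combinations(nodes, 2))
    null = []
    for _ in range(n_random):
        # Null model: same node set and edge count, edges placed at random.
        shuffled = rng.sample(all_pairs, len(edges))
        null.append(average_clustering(build_adjacency(nodes, shuffled)))
    p_value = (1 + sum(c >= observed for c in null)) / (1 + n_random)
    return observed, sum(null) / len(null), p_value


if __name__ == "__main__":
    # Toy graph: two triangles, one extra edge, and four isolated nodes.
    toy_nodes = range(12)
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (6, 7)]
    print(clustering_vs_random(toy_nodes, toy_edges))
```

An excess over the null shows only that paralogons are linked in triangles more often than chance would predict; tying that excess to a specific number of duplication rounds requires the further interpretation requested above.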
The nomenclature used to distinguish Limulus orthologs in the phylogenetic trees (figures 6 and 7) is distracting; the codes used have no meaning to the reader and only raise needless questions. I would suggest something more straightforward and refined.
Level of interest: An article of outstanding merit and interest in its field
Quality of written English: Acceptable
Statistical review: Yes, and I have assessed the statistics in my report.
Declaration of competing interests: I declare that I have no competing interests.