A Synthetic Biology Project (8/14/2015)


Fri, Aug 14, 2015 at 2:25 PM

Customer: I'm making DNA construct libraries from a set of parts (~36). Each construct has 4-6 positions, and the number of variants differs from position to position. For example, with 5 positions I maximise the number of possible combinations by using 7-8 variants per position, which gives ~20,000 possible constructs.

number of positions: 5

number of variants part 1: 7

number of variants part 2: 7

... ...

number of possible combinations: 19208
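
The count above can be checked with a few lines of Python. This is only a sketch: the message gives the variant numbers for parts 1 and 2 and elides the rest, so the split used here (four positions with 7 variants and one with 8, i.e. 7^4 x 8) is an assumption chosen because it reproduces 19208. The name `count_combinations` is hypothetical.

```python
from math import prod

# Assumed split: only parts 1 and 2 (7 variants each) are stated in the message;
# the other three values are a guess that happens to reproduce the stated total.
variants_per_position = [7, 7, 7, 7, 8]


def count_combinations(variants):
    """Number of distinct constructs when each position is chosen independently."""
    return prod(variants)


total = count_combinations(variants_per_position)
print(total)  # 19208
```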

I can make either all possible combinations and then sequence to verify that I really made them all, or I make subsets from that combinatorial space and then sequence those subsets. The subset size would be 1000-5000 constructs.

What I need to do is basically compare the DNA sequence that I put in for each construct with the outcome after synthesis and sequencing. I will create paired-end reads to hopefully cover the entire constructs (they should only be 200-400 bp in length, so paired-end reads on a MiSeq should easily cover that), but for some projects there might be gaps.

What I will give you: FASTQ files from the MiSeq reads and FASTA files for my libraries and subset sequences.

What I would need you to do: stitch the paired ends together and compare each (paired) read to all possible constructs to see what I got and how often I got each construct.
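
The request above (stitch each read pair, then count how often each construct was observed) could be sketched roughly as follows. This is a minimal illustration, not the method actually used for the project: it assumes read 2 is the reverse complement of the far end of the construct, merges only on an exact overlap, and assigns a merged read only on an exact match to a reference sequence. The names `merge_pair`, `count_constructs` and the `min_overlap` parameter are hypothetical; a real pipeline would parse the FASTQ/FASTA files, tolerate mismatches and use base qualities.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Stitch a read pair on the longest exact overlap; None if there is none."""
    read2_rc = reverse_complement(read2)
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(read1), len(read2_rc)), min_overlap - 1, -1):
        if read1[-k:] == read2_rc[:k]:
            return read1 + read2_rc[k:]
    return None


def count_constructs(read_pairs, constructs):
    """Count exact matches of merged pairs against {name: sequence} references."""
    name_by_sequence = {seq: name for name, seq in constructs.items()}
    counts = Counter()
    for read1, read2 in read_pairs:
        merged = merge_pair(read1, read2)
        # Unmerged pairs (None) and unknown sequences both fall into "unassigned".
        counts[name_by_sequence.get(merged, "unassigned")] += 1
    return counts
```

Given a dictionary of reference constructs and a list of `(read1, read2)` tuples, `count_constructs` returns a `Counter` keyed by construct name, with everything that failed to merge or match collected under `"unassigned"`.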

Fri, Aug 14, 2015 at 6:52 PM

AccuraScience LB: I have a few questions:

(1) The term "position" in the description seems to mean something different from what we refer to when using this term in the sequencing field, which is nucleotide position. Could you elaborate on what this term means in this project?

(2) NGS data have considerable error rates. For example, MiSeq's reported error rate is at the 0.2% level, but some reads may have error rates that exceed 1%, and it can get even worse for bases close to the 3' end of the reads. What this implies is that a 200 bp read could contain more than one miscalled base. This is an issue people have to battle with if highly accurate sequencing results are desired. One way of dealing with it, which I find fascinating, is to use both ends of the paired-end read to sequence the same stretch of DNA, and to trust the data only if both ends produce consensus base calls. The downside of this solution is that it reduces the effective length of the measurable sequence to ~200 bp.

(3) How are those different "parts" (36 in total) connected with one another? If part of the analysis is to determine the order in which those parts get connected, then this work could involve assembly algorithms, which would be considerably more sophisticated than just checking the read sequences to see whether they were mis-synthesized.

(4) What would you be looking for as the result of this work? Would it be a count of sequences with all those parts/positions, an assessment of mis-synthesized sequences in some way, or something else?
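
Point (2) can be made concrete with a little arithmetic. The sketch below assumes a uniform per-base error rate and independent errors, which is a simplification (the reply itself notes that errors get worse toward the 3' end). The function names are hypothetical, and `consensus` assumes the two reads are already aligned, equal in length and in the same orientation.

```python
def expected_errors(read_length, per_base_error):
    """Expected number of miscalled bases under a uniform per-base error rate."""
    return read_length * per_base_error


def prob_any_error(read_length, per_base_error):
    """Probability of at least one miscalled base, assuming independent errors."""
    return 1.0 - (1.0 - per_base_error) ** read_length


def consensus(forward, reverse_rc):
    """Keep bases on which both reads agree; mask disagreements with 'N'."""
    if len(forward) != len(reverse_rc):
        raise ValueError("reads must be aligned and of equal length")
    return "".join(a if a == b else "N" for a, b in zip(forward, reverse_rc))


for rate in (0.002, 0.01):
    print(rate, expected_errors(200, rate), round(prob_any_error(200, rate), 2))
```

Under these assumptions a 200 bp read is expected to carry about 0.4 miscalled bases at a 0.2% error rate and about 2 at 1%, which is why requiring agreement between both ends is attractive when exact construct identity matters.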


Note: LB stands for Lead Bioinformatician. An AccuraScience LB is a senior bioinformatics expert and leader of an AccuraScience data analysis team.

Disclaimer: This text was selected and edited based on genuine communications that took place between a customer and an AccuraScience data analysis team at the specified dates and times. The editing was done to protect the customer's privacy and for brevity. The edited text may or may not have been reviewed and approved by the customer. AccuraScience is solely responsible for the accuracy of the information reflected in this text.