Original source: http://www.tpj.com/  Author: Lincoln Stein  Translator: tcwu
Reprinted courtesy of The Perl Journal, http://www.tpj.com/. Lincoln Stein's website is http://stein.cshl.org/.
The scene: a conference room at the largest DNA sequencing center in Great Britain, in Cambridge, England. The meeting: a high-level summit between the computer scientists of this center and those of the largest DNA sequencing center in the United States. The problem: although the two centers use almost identical laboratory techniques, almost identical databases, and almost identical data analysis tools, they still cannot exchange data or meaningfully compare results. The solution: Perl.

The Human Genome Project was launched around 1990 as an international effort to determine the complete DNA sequence of human beings and of several experimental animals. The nature of the work is both scientific and medical. By understanding the genetic makeup of organisms in exhaustive detail, we hope to understand how living things develop from a single egg into complex multicellular organisms, how food is metabolized and converted into flesh, and how the nervous system carries out its smooth operation. From the medical point of view, having the complete DNA sequence in hand should dramatically accelerate the search for the genes involved in human disease.

After six years of work, the genome project is running ahead of its original schedule. Detailed genetic maps of humans and the experimental animals have been completed (a map is an ordered series of landmarks along the DNA, and the essential first step toward a full sequence). The smallest organism, yeast, has been almost entirely sequenced, and the next smallest, a tiny worm, is not far behind. Large-scale sequencing of human DNA began at several centers a few months ago and will be in full swing within the year.

From the point of view of data processing, DNA is a very long string written in a four-letter alphabet: A, T, G, and C (the letters are abbreviations for the four chemical bases that form the rungs of the double helix).
The length of that string is impressive, but not particularly daunting: 3 x 10^9 letters, or 3 gigabytes of storage if each letter is stored in a single byte. Three gigabytes is not small, but it is manageable by today's standards. Unfortunately, that is only the space needed to store the finished result; the storage needed to get there is far larger. The fundamental problem is that current DNA sequencing technology can read at most about 500 contiguous letters at a time. To determine longer sequences, the DNA must be cut up and sequenced as many small overlapping fragments, called "reads", which are then compared by computer to find matching overlaps and reassembled. Because DNA sequence is not entirely random (similar but not identical motifs appear many times in the genome), and because sequencing technology is noisy and error-prone, each region must be sequenced five to ten times over before the reads can be reliably reassembled into the true sequence. This multiplies the amount of data to be managed. On top of this comes the information related to the laboratory work itself: who performed which experiment and when, which software (and which version of it) produced each assembled sequence, and the annotations attached to each experiment. Beyond that, people usually want to keep the raw data produced by the sequencing machines, and every run of 500 letters generates a data file some 20 to 30 kilobytes long!

Nor is that all. Determining the DNA sequence is not enough by itself. Scattered through these sequences, the functional regions are embedded in long stretches of apparently featureless DNA. Human DNA contains genes, control regions, and structural regions, and even the fossil remains of viruses that invaded the human genome long ago. Because the genes and control regions are the parts responsible for health and disease, researchers mark them with annotations as the sequences are assembled.
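The core assembly idea sketched above, finding matching overlaps between short reads and merging them, can be illustrated with a toy Perl routine. This is purely an invented sketch: real assemblers must tolerate sequencing errors and use far more efficient algorithms than this brute-force exact match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy illustration of read assembly. Returns the length of the longest
# suffix of $a that is also a prefix of $b (exact matches only; real
# assemblers tolerate errors and use much smarter algorithms).
sub overlap_length {
    my ($a, $b) = @_;
    my $max = length($a) < length($b) ? length($a) : length($b);
    for my $len (reverse 1 .. $max) {
        return $len if substr($a, -$len) eq substr($b, 0, $len);
    }
    return 0;
}

# Merge two overlapping reads into one longer sequence.
sub merge_reads {
    my ($a, $b) = @_;
    return $a . substr($b, overlap_length($a, $b));
}

my $read1 = "GATTTCAGAGTCCC";
my $read2 = "GTCCCAGATTTCCC";            # overlaps $read1 by "GTCCC"
print merge_reads($read1, $read2), "\n"; # GATTTCAGAGTCCCAGATTTCCC
```

Repeated five- to ten-fold coverage exists precisely because, with noisy data, a single apparent overlap like the one above cannot be trusted on its own.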
This annotation produces still more information that needs to be tracked and stored. Some people estimate that between 1 and 10 terabytes of information will have to be stored in order to see the human genome project through to its conclusion.

So what can Perl do? From the very beginning, researchers realized that informatics would play a large role in the genome project, and an informatics core was established at each genome center. These cores have two missions: to provide computer support and databases for their affiliated laboratories, and to develop data analysis and management software for the genome research community as a whole.
The results achieved by the informatics teams have been mixed. Some groups tried to build large monolithic systems on top of complex relational databases, and were thwarted again and again by the highly dynamic nature of biological research. By the time a system that could cooperate with a complex laboratory protocol was working, the protocol had become obsolete and been replaced by new technology, and the software engineers had to go back to the drawing board. Most groups, however, learned to build modest, loosely coupled systems whose parts could be swapped in and out without redesigning the whole.

In my group, for example, we found that many data analysis tasks involve a series of semi-independent steps. Imagine the steps one might want to perform on a newly sequenced bit of DNA (Figure 1). First there is a basic quality check: is the sequence long enough? Is the proportion of ambiguous letters below the threshold? Then comes the "vector check". For technical reasons, human DNA must be passed through bacteria before it can be sequenced (this is the "cloning" step). It does not happen often, but the human DNA is sometimes lost during this processing, and the resulting sequence consists entirely of the bacterial vector. The vector check makes sure that only human DNA gets into the database. Next comes a check for repetitive sequences. Human DNA is full of repetitive elements that make fitting the sequence jigsaw puzzle together challenging. The repetitive sequence check tries to match the new sequence against a database of known repeats. A further check attempts to match the new sequence against the large community DNA sequence databases; a successful match at this point usually provides a clue to the function of the new sequence. When these checks are done, the sequence and the information gathered about it are loaded into the laboratory's local database.
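The loosely coupled design described above can be sketched in-process as a chain of small filter routines, each annotating a record and passing it on. All the step criteria below are invented placeholders (the article's real design, described next, runs each step as a separate program connected by UNIX pipes).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each analysis step is a small filter that annotates a record hash.
# The pass/fail criteria here are invented placeholders.
sub quality_check { my $r = shift; $r->{QUALITY_CHECK} = length($r->{SEQUENCE}) >= 100 ? 'OK' : 'FAILED'; }
sub vector_check  { my $r = shift; $r->{VECTOR_CHECK}  = $r->{SEQUENCE} =~ /^AAGCTT/ ? 'SUSPECT' : 'OK'; }
sub repeat_check  { my $r = shift; $r->{REPEAT_CHECK}  = 'OK'; }   # stub

my @pipeline = (\&quality_check, \&vector_check, \&repeat_check);

my $record = { NAME => 'L26P93.2', SEQUENCE => 'GATTTCAGAG' x 12 };
$_->($record) for @pipeline;                  # run each step in turn
print "$_=$record->{$_}\n" for sort keys %$record;
```

Because each step only reads and adds annotations, steps can be reordered, removed, or replaced without touching the others, which is exactly the property the monolithic systems lacked.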
The steps in this kind of analysis resemble sections of a water pipe, so it seemed natural to use the UNIX pipe to connect them. We developed a simple, pipe-based data exchange format called "boulderio" that allows loosely coupled programs to add information to an input/output stream. Boulderio is based on tag/value pairs. A Perl module makes it easy for programs to read the input stream, pull out the tags they are interested in, do something with them, and write the modified stream to standard output. Any tags a program is not interested in are passed through to standard output unchanged, so that programs further down the pipeline can use them.

Under this design, the analysis of a new DNA sequence looks something like this (this is not exactly the set of scripts we use, but it is close):

    name_sequence.pl < new.dna | quality_check.pl | vector_check.pl | ... | load_lab_database.pl

The script name_sequence.pl gives the sequence a name and emits a boulderio stream:

    NAME=L26P93.2
    SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC...

The output of name_sequence.pl is passed to the quality check program, which looks for the SEQUENCE tag, runs the quality checking algorithm, and writes its conclusion into the data stream. The stream now looks like this:

    NAME=L26P93.2
    SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC...
    QUALITY_CHECK=OK

The stream next enters the vector check, which pulls the SEQUENCE tag out of the stream and runs the vector checking algorithm. The stream now looks like this:

    NAME=L26P93.2
    SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC...
    QUALITY_CHECK=OK
    VECTOR_CHECK=OK
    VECTOR_START=10
    VECTOR_LENGTH=300

And so on down the pipeline, until at the very end load_lab_database.pl examines all the information that has accumulated, draws some final conclusions about whether the sequence is suitable for further work, and enters all the results into the laboratory database.

A nice feature of the boulderio format is that multiple sequence records can be processed one after another in a single UNIX pipeline.
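A minimal sketch of one such boulderio-style filter is shown below. This is not the real Boulder library: every TAG=VALUE line is passed through unchanged, and when a SEQUENCE tag goes by, the filter appends its own QUALITY_CHECK annotation. The quality criterion is an invented placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Given one TAG=VALUE line, return that line plus any annotations we
# want to add to the stream. Uninteresting tags pass through untouched.
sub annotate {
    my ($line) = @_;
    my @out = ($line);
    if ($line =~ /^SEQUENCE=(\S+)/) {
        my $seq = $1;
        # Placeholder criterion: only DNA letters, at least 10 of them.
        my $ok = $seq =~ /^[ACGTNacgtn]+$/ && length($seq) >= 10;
        push @out, 'QUALITY_CHECK=' . ($ok ? 'OK' : 'FAILED');
    }
    return @out;
}

unless (caller) {                 # act as a stream filter when run directly
    while (my $line = <STDIN>) {
        chomp $line;
        print "$_\n" for annotate($line);
    }
}
```

Because unrecognized tags are simply echoed, filters like this can be inserted anywhere in the pipeline without breaking the programs downstream.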
A line containing a bare "=" marks the end of one record and the start of the next:

    NAME=L26P93.2
    SEQUENCE=GATTTCAGAGTCCCAGATTTCCCCCAGGGGGTTTCCAGAGAGCCC...
    =
    NAME=L26P93.3
    SEQUENCE=CCCCTAGAGAGAGAGAGCCGAGTTCAAAGTCAAAACCCATTCTCTCTCCTC...
    =

Records can also contain sub-records, which lets us work with structured data. Here is some example code that processes a boulderio stream. It has an object-oriented flavor: records are pulled out of the input stream, modified, and dropped back into the stream.

    use Boulder::Stream;

    my $stream = new Boulder::Stream;

    while (my $record = $stream->read_record('NAME')) {
        my $name     = $record->get('NAME');
        my $sequence = $record->get('SEQUENCE');
        # ... continue processing ...
        $record->add(QUALITY_CHECK => "OK");
        $stream->write_record($record);
    }

If you are interested in the boulderio format, you can find more information, along with the Perl library that supports it, at http://stein.cshl.org/software/boulder/.

Interestingly, several other informatics groups independently arrived at ideas similar to boulderio. For example, several groups on the worm sequencing project use a data exchange format called ".ace". Although this format was originally designed for dumping data out of, and reloading it into, ACE (a database engine for biological data), it uses a tag/value format much like boulderio's. Before long, .ace files were being processed by pipelines of Perl scripts and loaded into the database as the final step.

Perl has found uses in other parts of genome laboratories as well. Many centers, including mine, use web-based interfaces to display project status and let researchers take action; Perl scripts are a perfect engine for these CGI pages. Similarly, Perl runs e-mail database query servers, performs scheduled jobs such as preparing nightly summaries of laboratory activity, builds the instruction files that drive laboratory robots, and handles nearly every other aspect of data management in busy genome laboratories.
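The multi-record layout above can be parsed in a few lines of Perl. The sketch below is a standalone toy, not the real Boulder::Stream module (which also handles nested sub-records); it handles only flat records and returns one hashref per record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a boulderio-style stream at "=" delimiter lines and collect the
# TAG=VALUE pairs of each record into a hash. Flat records only.
sub parse_stream {
    my ($text) = @_;
    my @records;
    for my $chunk (split /^=\s*$/m, $text) {
        my %rec;
        while ($chunk =~ /^(\w+)=(.*)$/mg) {
            $rec{$1} = $2;
        }
        push @records, \%rec if %rec;
    }
    return @records;
}

my $stream = <<'END';
NAME=L26P93.2
SEQUENCE=GATTTCAGAGTCCC
=
NAME=L26P93.3
SEQUENCE=CCCCTCTAGAGAG
=
END

my @recs = parse_stream($stream);
print scalar(@recs), " records; first is $recs[0]{NAME}\n";
```

The regular expressions do all the work here, which is a small preview of why, as discussed below, Perl fits this domain so well.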
So, as far as laboratory management goes, the informatics cores have done reasonably well. Things are not so rosy on the development and distribution of software for general use. The problems are the ones that afflict any large, loosely organized software project. Despite everyone's best intentions, the project begins to drift. Programmers pursue whatever ideas interest them, groups fail to communicate, and the same problem is solved several times in different, incompatible ways. When the time comes to put all the parts together, nothing works.

This happened in the genome project as well. Despite the fact that everyone was working on the same problems, no two groups adopted the same solution. Programs to solve a given problem were written and rewritten several times over. When you cannot guarantee that your program works any better than a similar program developed elsewhere, the best you can boast of is its particular user interface and data format. A typical example is the program that assembles thousands of short DNA reads into an ordered set of overlapping segments: at least six different versions are in use at various centers, and no two share the same input or output formats.

This lack of interchangeability presents a grave dilemma for the genome centers. Without interchangeability, a group is locked into its own software. If another genome center builds a better tool for the same problem, the first center must re-engineer its whole system to use the new tool. The long-term solution is to agree on standardized data exchange formats for each category of genome software; this would also allow common modules to be swapped in and out easily. But such standards take time for everyone to agree on, and while the different groups discuss and negotiate, there is still an urgent need to use the existing tools to get today's work done.
This is where Perl came to the rescue once again. The Cambridge summit meeting quoted at the top of this article was called to deal with the data exchange problem between the two largest sequencing centers. Although the two groups are close collaborators and, on the surface, seem to be solving the same problems with the same tools, a closer look revealed that they share almost nothing.

The main software components in a DNA sequencing project include: a trace editor, to analyze, display, and allow biologists to edit the chromatograms of short DNA reads coming off the sequencing machines; a read assembler, to find overlaps among the reads and assemble them into long contiguous stretches; an assembly editor, to view the assembler's results and make changes where the assembler went wrong; and a database to keep track of all of the above.

Over the years, the two groups had each developed suites of software that worked smoothly in their own hands. Some components were developed in-house, in the pattern typical of genome centers, and as shown in Figure 2, Perl was used as the glue binding these pieces together. Between each pair of interacting modules sat one or more Perl scripts responsible for massaging the output of one component into the input the other module expected.

When the two groups tried to exchange data, however, they ran into trouble. Between them they had two trace editors, three assemblers, two assembly editors, and (thank goodness) just one database. If two Perl scripts were needed for each pairwise exchange (one for each direction), as many as 62 different scripts would be needed to handle all the possible interconversions. And every time one module's output or input format changed, 14 scripts might have to be examined and fixed.

The conclusion of the meeting is shown in Figure 3. The two groups decided to adopt a common data exchange format known as CAF, which would contain a superset of the data used by both sides' analysis and editing tools.
For each module, two Perl scripts would be written: one to convert CAF into the module's own data format, and one to convert the module's format back into CAF. This simplifies both writing and maintenance. Now only 16 scripts are needed; when one of the modules changes, only two scripts need to be examined.

This story is not unique. Perl has become the solution of choice whenever genome centers need to exchange data, or when they need to retrofit a central module to work with other centers' tools. Perl has thus become the software mainstay of computation in the genome project, the glue that holds it all together. Although the genome informatics groups are forever flirting with other high-level languages, such as Python, Tcl, and more recently Java, none has spread anywhere near as widely as Perl. How did Perl achieve such a remarkable position? I think the following factors are responsible:

Perl is remarkably good at slicing, dicing, twisting, wringing, smoothing, summarizing, and otherwise manipulating text files. Although biological science is becoming increasingly numeric, most of its information is still text: clone names, annotations, comments, bibliographic references. Even DNA sequences are text-like. Interconverting incompatible data formats is a matter of making creative guesses at the structure of text files, and Perl's powerful regular expression matching and string manipulation make this job easier than any other language can.

Perl is forgiving. Biological data is often incomplete: fields can be missing, a field expected to appear once may appear several times (for example, because an experiment was repeated), or the data was entered by hand and contains errors. Perl doesn't mind if a value is empty or holds something odd, and regular expressions can be written to pick out and correct the errors. Of course, this flexibility can also be a curse; I'll say more about that later.

Perl is component-oriented. Perl encourages people to write their software as small modules, whether in the form of Perl library modules or in the more conventional object-oriented style.
External programs can easily be incorporated into a Perl script using a pipe, a system call, or a socket. The dynamic loader introduced with Perl 5 lets people extend the Perl interpreter with C routines, or make entire compiled libraries available to it. One recent result is that the world's collected wisdom on genome analysis is being incorporated into a set of modules called "bioPerl" (see The Perl Journal).

Perl is easy to write and fast to develop in. The interpreter doesn't require you to declare all your function prototypes and data types in advance; an undefined function only causes an error when it is actually invoked, and the debugger cooperates well with Emacs and allows a comfortable interactive style of development.

Perl is a good prototyping language. Because it is quick and dirty, building a prototype of a new algorithm in Perl often makes more sense than writing it straight away in a fast production language. Sometimes it turns out that the Perl version is fast enough and the program never needs to be ported; more often, someone writes a small compute-intensive core in C, compiles it as a dynamically loaded module or an external executable, and does everything else in Perl. (An example of this approach can be found at http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper/.)

Perl is also very good for writing web CGI scripts, a role of growing importance now that laboratories publish their experiments on the web.

My experience with Perl in a genome center environment has been extremely favorable from start to finish. However, I have found that Perl has its problems too. Its relaxed programming style leads to many errors that stricter languages would catch. For example, Perl lets you use a variable before any value has been assigned to it. This is a useful feature when you need it, but a disaster when you have simply mistyped a variable name. Similarly, it is easy to forget to declare a variable local to a function, and so to unintentionally modify a global variable.
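The undeclared-variable pitfalls just described are exactly what Perl's `strict` pragma guards against; a short sketch:

```perl
#!/usr/bin/perl
# Without "use strict", a mistyped variable name silently springs into
# existence as a new, empty variable:
#
#     $sequence = "GATTC";
#     print length($seqeunce);    # typo: quietly prints 0
#
# With strict (and warnings) enabled, the same typo refuses to compile
# ('Global symbol "$seqeunce" requires explicit package name'), and
# "my" keeps variables local to their enclosing block:
use strict;
use warnings;

my $sequence = "GATTC";

# Trim trailing ambiguous bases; $seq is a lexical copy, so no global
# is modified by accident.
sub trim {
    my ($seq) = @_;
    $seq =~ s/N+$//;
    return $seq;
}

print trim("GATTCNNN"), "\n";   # prints GATTC; $sequence is untouched
```

Making `use strict; use warnings;` the first lines of every script recovers much of the discipline of stricter languages while keeping Perl's flexibility available when it is genuinely wanted.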
Finally, Perl's great deficiency is in building graphical user interfaces. Although the UNIX faithful may be content to work from the command line, most end users are not: windows, menus, and pop-ups have become obligatory. Until recently, Perl's GUI support was immature. Thanks to the efforts of Nick Ing-Simmons, however, perlTk (pTk) now makes it possible to drive a full X Windows user interface from Perl. My collaborators and I have written several pTk-based applications for Internet users at the MIT genome center, and it has been a satisfying experience from beginning to end. Other genome centers use pTk on a larger scale, and in some places it has become a mainstay of production.