High-throughput biology provides the tools to understand cellular function in a global and meaningful context. Employing a systems approach, scientists are currently focused on synthesizing data from an array of experiments that analyze genes, mRNA, proteins, metabolites, sugars and lipids. A major component of this effort involves genomic data acquired through techniques such as amplification, DNA microarray expression, genotyping, real-time PCR, RNAi and sequencing.
Determining the quality of these data, sorting and transmitting them, and then integrating disparate data types are all significant challenges scientists must overcome. To assess these barriers, The Science Advisory Board conducted a study of the hardware and software requirements of more than 600 genomics researchers. Their insights into these enabling tools are shared in this summary and can help other scientists who plan to assemble or expand their own computing and data infrastructure.
Scope of the Problem
Many scientists express frustration with their inability to adroitly manage the sheer volume of information their experiments produce. While narrowing the scope of an inquiry and anticipating the types of data it will generate may help contain the data deluge, access to appropriate computing resources is essential for extracting useful information. Although they perform exceedingly complex and tedious functions, these information-based technologies must employ user-friendly interfaces and provide rapid outputs.
Key Technologies
Most genomics researchers employ multiple techniques in their research, with amplification serving as the foundation of their data acquisition activities. The three instruments most commonly used in genomics research are DNA microarrays, fragment analysis/genotyping systems and real-time PCR. However, regardless of the instrument used to generate genomics data, more than two-thirds of researchers analyze their data using software embedded in the instrument itself. “The convenience aspect is likely the overarching reason for such high usage rates of supplier-supplied software,” explains Tamara Zemlo, Ph.D., MPH, Executive Director, The Science Advisory Board.
Scientists told The Science Advisory Board that they are most satisfied with the fragment analysis and genotyping systems that are part of these instrumentation packages. However, a majority cautioned that sequence analysis capabilities and standardized database structures and nomenclature are two features that did not fully live up to their expectations in their integrated DNA microarray and real-time PCR software packages.
Software Options
To meet these unrealized expectations, scientists rely on a hodgepodge of in-house software packages, freeware distributed throughout the scientific community, commercial packages from small independent vendors, software embedded in analytical instruments and proprietary software designed to run on enterprise solutions. In fact, 59% of scientists obtain the software they use to analyze and integrate their data through a commercial third party. “The heterogeneity of these software collections—particularly at large research organizations—reflects the rapid evolution of life science computing and the difficulty of integrating not only distinct data but also multiple application tools,” claims Zemlo. Not surprisingly, scientists are concerned with software incompatibility, slow processing times and interoperability problems with other platforms. Software incompatibility may stem from developers failing to keep pace with operating system upgrades, but it is also a pervasive characteristic of biological software in general.
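One common workaround for these interoperability problems is to exchange results in a plain-text, tool-agnostic format such as CSV, which nearly any analysis package on any operating system can import. The sketch below illustrates the idea in Python; the gene identifiers, column names and file name are hypothetical, chosen only to show the pattern.

```python
import csv

# Hypothetical expression results produced by one analysis package;
# the gene IDs and values here are illustrative only.
results = [
    {"gene_id": "GENE0001", "fold_change": 2.4, "p_value": 0.003},
    {"gene_id": "GENE0002", "fold_change": 0.7, "p_value": 0.048},
]

# Writing plain CSV sidesteps proprietary file formats: virtually any
# downstream tool, commercial or in-house, can read the result.
with open("expression_results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["gene_id", "fold_change", "p_value"])
    writer.writeheader()
    writer.writerows(results)
```

A neutral exchange format does not solve incompatibility, but it decouples data from any single vendor's software, which is precisely the uniformity researchers say they want.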
Given this strong desire for uniformity and ease of operation, researchers using freeware/shareware or in-house software to analyze their data would gladly switch to commercial software if IT suppliers could demonstrate that their products have enhanced capabilities and features. Scientists’ software decisions are not driven by concerns about installation, documentation or future upgrades and enhancements. These findings indicate that end users are unable to find exactly what they need in a “one-size-fits-all” commercial software package.
Processing Power
No matter which software package a researcher uses, if it cannot process data in a timely and efficient manner, it is essentially powerless to aid analysis. Hence many researchers are increasingly relying on storage area networks (SANs), which are specialized, high-speed networks that connect users to their data. By connecting data storage devices (e.g., disks and tapes) to servers, a SAN enables users to access vast amounts of data stored on shared devices in a centralized location.
Because the stored data does not reside directly on any of a network’s servers, more processing power and network capacity are available to users. Industrial scientists rated SANs as more useful than their academic counterparts did. Overall, researchers felt that SANs would be more useful than high-performance computing (HPC) and collaborative software. Over 50% of industrial scientists currently use a SAN, as do 43% of academic scientists. One notable difference: almost as many academic researchers did not know whether they had a SAN as currently use one.
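From the perspective of a researcher's analysis code, a mounted SAN volume behaves like any local disk, so shared datasets can be read with ordinary file operations. The Python sketch below illustrates this; the mount point /mnt/san/genomics and the file layout are hypothetical, and provisioning the volume itself is assumed to be handled by the storage administrator.

```python
from pathlib import Path

# Hypothetical mount point where the SAN volume is presented to this server.
# To the operating system -- and to this script -- it is just another filesystem.
SAN_MOUNT = Path("/mnt/san/genomics")

def list_shared_datasets(mount: Path = SAN_MOUNT):
    """Enumerate result files colleagues have written to the shared volume."""
    if not mount.is_dir():
        raise FileNotFoundError(f"SAN volume not mounted at {mount}")
    # Ordinary directory traversal and file I/O work unchanged on SAN storage,
    # because the SAN exposes block devices that the OS mounts like local disks.
    return sorted(mount.rglob("*.csv"))

if __name__ == "__main__":
    for dataset in list_shared_datasets():
        size_mb = dataset.stat().st_size / 1e6
        print(f"{dataset.name}: {size_mb:.1f} MB")
```

Because the heavy lifting happens on the storage network rather than on any one server, scripts like this can scan terabytes of shared data without monopolizing a lab's compute nodes.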
Sharing Information
Researchers stated that compatibility with different operating systems and hardware platforms is the most important distribution/transmission capability for their lab to have. Industrial scientists are more likely than academic scientists to feel it is important to be able to exchange data with others at their institution, possibly reflecting the autonomy of many academic labs. However, both academic and industrial scientists are concerned with data backup and recovery issues.
Essential to the success of any “-omics-based” research program is a powerful technology platform that facilitates the process of discovery. Many scientists struggle to create processes and infrastructure that maximize the value of their investments in informatics. Successful research programs will need to optimize the way data acquisition, analysis, integration and transmission are managed. This multistage effort will require a dynamic partnership between research scientists and their IT departments, instrumentation vendors and hardware and software suppliers.