For the first time, researchers have automatically grouped fluorescently tagged proteins from high-resolution images of cells. This technical feat opens a new way to identify disease proteins and drug targets by helping to show which proteins cluster together inside a cell.
The approach, developed at Carnegie Mellon University, outperforms existing visual methods for localizing proteins inside cells, says Professor Robert F. Murphy, whose report, “Data Mining in Genomics and Proteomics,” appears in an upcoming special issue of the Journal of Biomedicine and Biotechnology.
“Our approach really enables the new field of location proteomics, which describes and relates the location of proteins within cells,” said Murphy, a professor of biological sciences, machine learning, and biomedical engineering. “This work should provide a more thorough understanding of cellular processes that underlie disease.”
Using this approach to spot a protein cluster could help scientists identify a common protein structure that enables those proteins to gather in one part of the cell, according to Murphy. That information is critical to foiling a disease like cancer, where researchers might want to identify and disable the part of a tumor cell’s machinery needed to process proteins for cancer growth.
“Our tool represents a step forward because it is based on standardized features and not on features chosen by the human eye, which is unreliable. By automating the clustering of proteins inside cell images, we also can study thousands of images fast, objectively and without error,” Murphy said.
Murphy’s tool has two key components. One is a set of subcellular location features (SLFs) that describe a protein’s location in a cell image. SLFs measure both simple and complex aspects of proteins, such as shape, texture, edge qualities and contrast against background features. Like fingerprints, a protein’s SLFs act as a unique set of identifiers. Using a set of established SLFs, Murphy then developed a computational strategy for automatically clustering, or grouping, proteins based on SLF similarities and differences.
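To make the idea of grouping proteins by feature similarity concrete, here is a minimal sketch in Python. It is not Murphy's method: the article does not say which clustering algorithm or distance measure was used, so the sketch assumes z-scored features, Euclidean distance and average-linkage agglomerative clustering. The protein names, the three feature values per protein and the function names are all hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # A constant feature has zero spread; divide by 1.0 instead of 0.
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_proteins(features, n_clusters):
    """Group proteins by feature similarity (assumed: average linkage).

    features: dict mapping a protein name to its list of feature values.
    Returns a sorted list of clusters, each a sorted list of names.
    """
    names = list(features)
    scaled = zscore_columns([features[n] for n in names])
    vecs = dict(zip(names, scaled))
    clusters = [[n] for n in names]  # start with one cluster per protein

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        return mean([euclidean(vecs[a], vecs[b]) for a in c1 for b in c2])

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Merge the two closest clusters.
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical feature vectors (made-up numbers, not real SLF values).
toy_slfs = {
    "protein_A": [0.90, 0.10, 12.0],
    "protein_B": [0.85, 0.15, 11.0],
    "protein_C": [0.20, 0.80, 3.0],
    "protein_D": [0.25, 0.75, 4.0],
}
groups = cluster_proteins(toy_slfs, n_clusters=2)
print(groups)
```

In this toy data, proteins A and B have similar feature values, as do C and D, so asking for two clusters pairs them accordingly. Standardizing the features first keeps one large-valued feature from dominating the distance.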
For his study, Murphy used images of randomly chosen, fluorescently labeled proteins. These proteins were produced inside living cells using a technology called CD tagging, which was developed by Jonathan Jarvik and Peter Berget, both associate professors of biological sciences at Carnegie Mellon. The computational analyses were carried out together with Xiang Chen, a graduate student in the Merck Computational Biology and Chemistry program.
Chen and Murphy found that the new tool outperformed existing methods of identifying overlapping proteins within cells, such as simple visual categorization of their locations.
“Our tool outperformed clustering based on the terms developed by the Gene Ontology Consortium, the best previous way of describing protein location. We found that the Gene Ontology terms were too limited to describe the many complex location patterns we found. Of course, the other drawback of term-based approaches is that they have to be assigned manually by database curators, and this is often not consistent between different curators,” said Murphy.
Murphy and his colleagues are currently amassing more protein image data using CD tagging so that they can refine their approach further. They are also working on ways to “train” a general system that will work for different cell types.
This research was supported by the National Institutes of Health, the National Science Foundation, the Commonwealth of Pennsylvania Tobacco Settlement Fund and the Merck Computational Biology and Chemistry Program, established at Carnegie Mellon by the Merck Company Foundation.