Predicting Sub-cellular Localization using Machine-Learned
Classifiers in Proteome Analyst
Z. Lu, D. Szafron, R. Greiner, P. Lu, D. Wishart, B. Poulin, J. Anvik, C. Macdonell, and R. Eisner
Department of Computing Science, University of Alberta,
Edmonton, AB, Canada T6G 2E8
Contact: bioinfo@cs.ualberta.ca
Paper Supplementary Material
- Introductory page
-- This page defines the abbreviations and terminology used on the
following web pages. It also links to all the data used in the
experiments below and explains how we extracted the data from
Swiss-Prot_41.

- A Complete Survey of Current Sub-cellular Localization Predictors
-- The table inside lists systems for sub-cellular localization
prediction that have been developed over the past few years using
various prediction algorithms. Both accuracy and coverage are reported.
- Confusion Matrices for 6 classifiers: Animals, Plants, Fungi, Gram+,
Gram- and Archaea
-- To evaluate our classifiers, we use a standard machine learning
technique called 5-fold cross-validation. The tables inside are the
confusion matrices for the results of 5-fold cross-validation on each
of the classifiers.
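
The evaluation described above can be sketched roughly as follows. This is
a minimal illustration of 5-fold cross-validation that pools the test
predictions into one confusion matrix; it is not the PA-SUB implementation.
The function names, the toy token-voting classifier, and the example data
are all assumptions made for the sketch.

```python
from collections import Counter, defaultdict

def k_fold_confusion(examples, labels, train, k=5):
    """Return {(true_label, predicted_label): count} pooled over k folds."""
    confusion = Counter()
    for fold in range(k):
        # Every k-th example (offset by the fold number) is held out for testing.
        test_idx = [i for i in range(len(examples)) if i % k == fold]
        train_idx = [i for i in range(len(examples)) if i % k != fold]
        predict = train([examples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        for i in test_idx:
            confusion[(labels[i], predict(examples[i]))] += 1
    return confusion

def train_token_vote(train_examples, train_labels):
    """Hypothetical stand-in classifier: each token votes for the labels it was seen with."""
    votes = defaultdict(Counter)
    for tokens, label in zip(train_examples, train_labels):
        for tok in tokens:
            votes[tok][label] += 1
    # Fall back to the most frequent training label when no token is known.
    default = Counter(train_labels).most_common(1)[0][0]

    def predict(tokens):
        tally = Counter()
        for tok in tokens:
            tally.update(votes.get(tok, {}))
        return tally.most_common(1)[0][0] if tally else default

    return predict

# Invented toy data: each example is a set of annotation tokens.
examples = [{"nuclear", "dna-binding"}, {"membrane", "transport"}] * 10
labels = ["nucleus", "membrane"] * 10
matrix = k_fold_confusion(examples, labels, train_token_vote)
```

Each example is tested exactly once, by a classifier that never saw it
during training, so the counts in `matrix` sum to the number of examples.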

- Confusion Matrices for 1-organism classifiers
-- We built a single classifier, called a 1-organism classifier, from
all training data except the sequences from one specific organism. The
classifier is then applied to the sequences of that organism in
Swiss-Prot. This simulates the situation in which a classifier is used
to predict the sub-cellular locations of all the sequences in a newly
sequenced organism.
- Confusion Matrices for PA-SUB classifier built using Nair & Rost's
1161 sequences
-- To compare our classification technique to the Swiss-Prot lexical
technique of Nair and Rost (Nair and Rost, 2002), we constructed two
custom sub-cellular localization classifiers using their single
ontology and their training data.
- Confusion Matrices for PA-SUB classifier built using PSORT-B training data
-- This is another comparison of our classification technique, using
the reliable PSORT-B Gram-negative data (Gardy et al., 2003).
- Whole Proteome Coverage
-- When PA-SUB is applied to an entire organism, some sequences have no
homologs, so no features can be extracted and used by the classifier.
PA-SUB makes no sub-cellular localization prediction for these
sequences.
- Confusion Matrices for feature extraction methods
-- This shows the results of several different ways of selecting
PSI-BLAST homologs and extracting features from various Swiss-Prot
fields (KWORD, IPR and SCELL).
- Confusion Matrices for different classification techniques
-- Besides Naive Bayes, we evaluated several other kinds of
classifiers. We show the results for Artificial Neural Nets (ANN),
Support Vector Machines (SVM), Tree-augmented Naive Bayes (TAN) and
three different nearest neighbour classifiers (1NN, 3NN and 5NN).
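
The hold-out scheme behind the 1-organism classifiers listed above can be
sketched as a simple data split. The code below is only an illustration:
the record layout (`id`, `organism`, `location`) and the organism names are
hypothetical and do not reflect the actual PA-SUB data format.

```python
def split_by_organism(records, held_out):
    """Split records into (training, test) by holding out one organism."""
    # Training data: every sequence that is NOT from the held-out organism.
    training = [r for r in records if r["organism"] != held_out]
    # Test data: all sequences of the held-out organism, treated as if newly sequenced.
    test = [r for r in records if r["organism"] == held_out]
    return training, test

# Invented example records; field names are assumptions for this sketch.
records = [
    {"id": "seq1", "organism": "organism_A", "location": "nucleus"},
    {"id": "seq2", "organism": "organism_A", "location": "cytoplasm"},
    {"id": "seq3", "organism": "organism_B", "location": "nucleus"},
    {"id": "seq4", "organism": "organism_C", "location": "membrane"},
]

training, test = split_by_organism(records, held_out="organism_A")
```

A classifier trained on `training` and evaluated on `test` never sees any
sequence from the held-out organism, which is the property the 1-organism
experiment relies on.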