Prediction Method
1. Property based prediction
We examined how sequence and structural features influence protein expression/solubility in the Escherichia coli, wheat germ cell-free, and Brevibacillus expression systems. First, we defined 437 features. 396 features were derived from sequence information, such as the occurrence frequencies of single nucleotides, codons, and single amino acids. 40 features were derived from structural information, such as transmembrane regions, disordered regions, and the occurrence frequencies of single amino acids on the protein surface. Next, we assessed which features are associated with protein expression/solubility by analyzing dataset_ME and dataset_MW (details in the Dataset section). The statistical difference between positive and negative data was determined using Student's t-test, and features with p < 0.05 were selected. As a result, 61 features for expression in Escherichia coli, 43 features for solubility in Escherichia coli, 18 features for solubility in wheat germ, and 31 features for solubility in Brevibacillus were selected. Finally, we built statistical models from the selected features using libSVM 1.
Note that the statistical model for predicting expression includes nucleotide sequence information, whereas the model for predicting solubility does not. Therefore, if an amino acid sequence is given as input for predicting protein expression, the property based prediction is not performed.
1 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
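
The feature screening step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it covers only the single amino acid frequency features, the function names (aa_frequencies, student_t, feature_t_statistics) are hypothetical, and it stops at the pooled-variance t statistic and degrees of freedom. Converting these to a p-value for the p < 0.05 cutoff needs the t distribution, for example via scipy.stats.ttest_ind.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(seq):
    """Occurrence frequency of each of the 20 amino acids in one sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1  # guard against empty input
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def student_t(pos, neg):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom.

    Each group needs at least two values so that a sample variance exists.
    """
    n1, n2 = len(pos), len(neg)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / df
    if pooled == 0:
        # The statistic is undefined when both groups have zero variance.
        return float("nan"), df
    t = (mean(pos) - mean(neg)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df


def feature_t_statistics(pos_seqs, neg_seqs):
    """(t, df) per amino acid frequency feature, positive set versus negative set."""
    pos = [aa_frequencies(s) for s in pos_seqs]
    neg = [aa_frequencies(s) for s in neg_seqs]
    return {
        aa: student_t([f[aa] for f in pos], [f[aa] for f in neg])
        for aa in AMINO_ACIDS
    }
```

Features whose p-value falls below 0.05 would then be kept as inputs to the SVM; the same screening would apply to the nucleotide, codon, and structural features, which this sketch omits.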

2. Motif based prediction
We identified motifs associated with protein expression/solubility. First, we defined motifs that consider all combinations of 10 classes based on the physicochemical properties of amino acids. The motif length was optimized to give the best prediction performance. Next, we selected motifs that were found significantly often in the positive data of dataset_SE and dataset_SW. Similarly, we identified motifs that were seen significantly often in the negative data. For example, we identified 774 positive motifs and 175 negative motifs for expression in Escherichia coli. Finally, we counted the motifs associated with protein expression/solubility in a query sequence to estimate its expression/solubility. We defined a simple score as below.
  S_motif = N_positive_motif - α × N_negative_motif
N_positive_motif: the number of positive motifs seen in the query sequence
N_negative_motif: the number of negative motifs seen in the query sequence
α: the ratio of positive motifs to negative motifs in dataset_SE/dataset_SW
If S_motif > 0, the query sequence is predicted to be expressed/soluble.
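
The motif score defined above can be sketched in Python as follows. This is a minimal illustration, not the actual implementation. The class grouping AA_CLASS is hypothetical (the source uses 10 physicochemical classes but does not list them, so a smaller grouping stands in here). The function names, the toy motifs, and the choice to count every matching window rather than distinct motifs are likewise assumptions.

```python
# Hypothetical reduced alphabet, for illustration only. The source defines
# 10 physicochemical classes but does not say which residues belong to each.
AA_CLASS = {
    **dict.fromkeys("AVLIM", "h"),  # aliphatic / hydrophobic
    **dict.fromkeys("FWY", "a"),    # aromatic
    **dict.fromkeys("KRH", "p"),    # positively charged
    **dict.fromkeys("DE", "n"),     # negatively charged
    **dict.fromkeys("STNQ", "o"),   # polar, uncharged
    **dict.fromkeys("GPC", "s"),    # special cases
}


def encode(seq):
    """Translate an amino acid sequence into class letters ('x' if unknown)."""
    return "".join(AA_CLASS.get(aa, "x") for aa in seq.upper())


def motif_score(seq, positive_motifs, negative_motifs, alpha, length):
    """Score = N_positive - alpha * N_negative over all windows of `length`."""
    encoded = encode(seq)
    windows = [encoded[i:i + length] for i in range(len(encoded) - length + 1)]
    n_pos = sum(w in positive_motifs for w in windows)
    n_neg = sum(w in negative_motifs for w in windows)
    return n_pos - alpha * n_neg


def predict_positive(score):
    """A query is called expressed/soluble when the score is above zero."""
    return score > 0
```

For example, with length 2 the query "AKDE" encodes to "hpnn", giving the windows "hp", "pn", and "nn". With the toy positive motifs {"hp", "nn"}, the toy negative motif {"pn"}, and alpha = 0.5, the score is 2 - 0.5 × 1 = 1.5, so the query would be called positive.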


Dataset
We used genome-scale experimental data that assessed the overexpression and solubility of human full-length cDNA in the Escherichia coli and wheat germ cell-free expression systems. For each expression system, the experimental data were divided into two datasets based on the number of experiments per sequence. One group, designated 'dataset_S' ("S" means single), is composed of sequences for which expression/solubility was experimentally assessed once. The other group, designated 'dataset_M' ("M" means multiple), is composed of sequences for which experiments were conducted two or more times. The initial letter of the expression system (E for Escherichia coli, W for wheat germ cell-free) was appended to the dataset name. In each dataset, pairwise sequence identities were less than 25%.
The dataset size
                              expression                      solubility
expression system dataset     positive        negative        positive        negative
Escherichia coli  dataset_ME  106 (60.6%)     69 (39.4%)      69 (38.8%)      109 (61.2%)
                  dataset_SE  3,996 (52.4%)   3,597 (47.6%)   1,705 (34.6%)   3,217 (65.4%)
Wheat germ        dataset_MW  189 (99.5%)     1 (0.5%)        81 (65.9%)      42 (34.1%)
                  dataset_SW  4,974 (97.3%)   138 (2.7%)      1,860 (66.2%)   949 (33.8%)
Brevibacillus     dataset_MB  -               -               -               -
                  dataset_SB  -               -               138 (47.8%)     151 (52.2%)
Numbers in parentheses are the ratios of positive and negative data in each dataset.
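
The split into dataset_S and dataset_M can be sketched in Python as follows. This is a minimal illustration under assumptions: the record layout (one (sequence_id, outcome) pair per experiment) and the function names are hypothetical. The sketch only partitions the sequences and does not decide a single label for sequences with several experiments, because the source does not state how that was done.

```python
from collections import defaultdict


def split_by_experiment_count(records):
    """Partition sequences by how many experiments were recorded for each.

    `records` is an iterable of (sequence_id, outcome) pairs, one per
    experiment (hypothetical layout). Returns (dataset_s, dataset_m), each
    mapping a sequence id to its list of outcomes.
    """
    by_seq = defaultdict(list)
    for seq_id, outcome in records:
        by_seq[seq_id].append(outcome)
    dataset_s = {sid: o for sid, o in by_seq.items() if len(o) == 1}
    dataset_m = {sid: o for sid, o in by_seq.items() if len(o) >= 2}
    return dataset_s, dataset_m


def share(n_positive, n_negative):
    """Percentages of positive and negative data, rounded to one decimal."""
    total = n_positive + n_negative
    return (round(100 * n_positive / total, 1),
            round(100 * n_negative / total, 1))
```

As a check against the table, share(106, 69) gives (60.6, 39.4), matching the expression columns of dataset_ME.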


Experimental Condition
1. Escherichia coli expression system
Human open reading frames (ORFs) as Gateway entry clones1 were subcloned into pDEST17 (T7 promoter, amino-terminal His fusion) with LR Clonase (Invitrogen). The reaction products were transformed into E. coli BL21 Star (DE3) pLysS; the SOC expression mixture was plated on an ampicillin LB agar plate. A single colony was inoculated into LB medium and grown overnight at 37℃. The overnight culture was diluted 1:100 into SB medium, grown at 37℃ for 3 h, and cooled to 20℃. Protein expression was then induced by adding isopropyl 1-thio-beta-D-galactopyranoside to a final concentration of 0.1 mM. After 16 h at 20℃, cells were harvested and suspended in BugBuster (Novagen Inc.). A portion of the lysate was prepared as a whole cell sample. The lysate was centrifuged at 15,000 × g for 5 min. The supernatant was prepared as the soluble fraction sample.

2. Wheat germ cell-free system
The expressed proteins were fused with a carboxy-terminal His tag (destination vectors: pEW-3H). Wheat germ extract was purchased from Toyobo and Cell Free Sciences. The expressed proteins were separated into a soluble and an insoluble fraction by centrifugation at 19,000 × g for 20 min.

3. Brevibacillus
The target gene was inserted into an expression vector incorporating a promoter region, an SD region, and a secretory signal. The vector was transferred into Brevibacillus by electroporation. Cells were cultured for 24 h on a selection plate, and the clones were then selected with bleomycin, neomycin. After growth for 48 h at 33℃, the lysate was centrifuged at 15,000 × g for 10 min. The supernatant was prepared as the soluble fraction sample.


Prediction Accuracy
1. Assessment
Four criteria are defined to assess how well the two prediction methods classify protein expression/solubility.
Recall    = TP / (TP + FN)
Precision = TP / (TP + FP)
Accuracy  = (TP + TN) / (TP + FN + TN + FP)
F-score   = 2 × Recall × Precision / (Recall + Precision)
TP is the number of positives correctly predicted as positives. Similarly, FN, TN, and FP are, respectively, the number of positives incorrectly predicted as negatives, the number of negatives correctly predicted as negatives, and the number of negatives incorrectly predicted as positives.
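
The four criteria above can be computed from the confusion counts as in the following Python sketch. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the source specifies.

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (a convention assumed here)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fn, tn, fp):
    """Recall, precision, accuracy, and F-score from confusion-matrix counts."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    accuracy = _ratio(tp + tn, tp + fn + tn + fp)
    f_score = _ratio(2 * recall * precision, recall + precision)
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
    }
```

For illustrative counts TP = 6, FN = 2, TN = 8, FP = 4 (not taken from the datasets), this gives recall 0.75, precision 0.60, accuracy 0.70, and an F-score of about 0.67.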

2. Performance
The performance of the two prediction methods was compared with that of three public prediction methods. The Wilkinson and Harrison model is a statistical model that predicts protein solubility assuming the protein is overexpressed in Escherichia coli 2. PROSO is a sequence-based protein solubility evaluator that uses a machine-learning approach (http://mips.helmholtz-muenchen.de/proso/proso.seam)3. SOLpro predicts the propensity of a protein to be soluble in Escherichia coli using a two-stage SVM architecture based on multiple representations of the primary sequence (http://solpro.proteomics.ics.uci.edu/)4.

The comparison of prediction performance.
expression                               Escherichia coli        Wheat germ              Brevibacillus
method                        criterion  dataset_ME  dataset_SE  dataset_MW  dataset_SW  dataset_SB
property based prediction     Recall     0.86        0.87        -           -           -
                              Precision  0.85        0.63        -           -           -
                              Accuracy   0.83        0.66        -           -           -
                              F-score    0.86        0.73        -           -           -
motif based prediction        Recall     0.77        0.86        -           -           -
                              Precision  0.81        0.67        -           -           -
                              Accuracy   0.75        0.70        -           -           -
                              F-score    0.79        0.75        -           -           -

solubility                               Escherichia coli        Wheat germ              Brevibacillus
method                        criterion  dataset_ME  dataset_SE  dataset_MW  dataset_SW  dataset_SB
property based prediction     Recall     0.85        0.72        0.76        0.79        0.70
                              Precision  0.56        0.44        0.79        0.67        0.67
                              Accuracy   0.68        0.58        0.71        0.61        0.69
                              F-score    0.67        0.54        0.77        0.73        0.69
motif based prediction        Recall     0.52        0.65        0.86        0.92        -
                              Precision  0.53        0.59        0.78        0.77        -
                              Accuracy   0.63        0.72        0.75        0.77        -
                              F-score    0.53        0.62        0.84        0.84        -
Wilkinson and Harrison model  Recall     0.30        0.30        0.31        0.31        0.15
                              Precision  0.47        0.43        0.89        0.83        0.47
                              Accuracy   0.60        0.62        0.52        0.50        0.52
                              F-score    0.37        0.35        0.46        0.45        0.23
PROSO                         Recall     0.42        0.37        0.28        0.34        0.33
                              Precision  0.47        0.38        0.74        0.74        0.55
                              Accuracy   0.59        0.58        0.46        0.48        0.55
                              F-score    0.44        0.38        0.41        0.47        0.42
SOLpro                        Recall     0.64        0.58        0.51        0.5         0.22
                              Precision  0.42        0.37        0.71        0.68        0.34
                              Accuracy   0.52        0.50        0.54        0.54        0.42
                              F-score    0.51        0.45        0.59        0.62        0.26

For the property based prediction, two kinds of evaluation were performed. One was five-fold cross-validation using the dataset_M series. The other used the dataset_M series to evaluate a model trained on the dataset_S series. For the motif based prediction, positive and negative motifs were identified in the dataset_S series, and the prediction performance for each dataset was estimated using these motifs.
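
The five-fold cross-validation mentioned above can be sketched in Python as follows. This is a generic illustration of the splitting step only. The function name k_fold_indices, the shuffling with a fixed seed, and the round-robin assignment to folds are assumptions, since the source does not describe how the folds were formed.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once with a fixed seed, then dealt round-robin
    into k folds so that fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample is held out exactly once. A model would be trained on the train indices of each fold and scored on the matching test indices, with the per-fold results then combined. For the 175 expression entries of dataset_ME, for instance, each test fold would hold 35 sequences.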
2 Wilkinson DL. and Harrison RG. Biotechnology (NY) 1991 May;9(5):443-8.
3 Smialowski P. et al. Bioinformatics 2007 Oct 1;23(19):2536-42.
4 Magnan CN. et al. Bioinformatics 2009 Sep 1;25(17):2200-7.


Publication
1. Hirose, S. and Noguchi, T. ESPRESSO: a system for estimating protein expression and solubility in protein expression systems. Proteomics 2013 13:1444-1456.
2. Hirose S. et al. Statistical analysis of features associated with protein expression/solubility in an in vivo Escherichia coli expression system and a wheat germ cell-free expression system. J Biochem 2011 Jul;150(1):73-81.
3. Hirose S. et al. Development and evaluation of data-driven designed tags (DDTs) for controlling protein solubility. N. Biotechnology 2011 Apr 30;28(3):225-31.