Performance measurement

Root-mean-square deviation
        The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed. Basically, the RMSD represents the sample standard deviation of the differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation, and are called prediction errors when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSD is a good measure of accuracy, but only to compare forecasting errors of different models for a particular variable and not between variables, as it is scale-dependent.[1]
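        The RMSD of an estimator \hat{\theta} with respect to an estimated parameter \theta is defined as the square root of the mean square error:
        \operatorname{RMSD}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})} = \sqrt{\operatorname{E}((\hat{\theta} - \theta)^2)}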
For an unbiased estimator, the RMSD is the square root of the variance, known as the standard error.
        The RMSD of predicted values \hat{y}_t for times t of a regression's dependent variable y_t is computed for n different predictions as the square root of the mean of the squares of the deviations:
        \operatorname{RMSD} = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}
        In some disciplines, the RMSD is used to compare differences between two things that may vary, neither of which is accepted as the "standard". For example, when measuring the average difference between two time series x_{1,t} and x_{2,t}, the formula becomes
        \operatorname{RMSD} = \sqrt{\frac{\sum_{t=1}^{n} (x_{1,t} - x_{2,t})^2}{n}}
Normalized root-mean-square deviation
        Normalizing the RMSD facilitates the comparison between datasets or models with different scales. Though there is no consistent means of normalization in the literature, the range of the measured data, defined as the maximum value minus the minimum value, is a common choice:[2]
        \operatorname{NRMSD} = \frac{\operatorname{RMSD}}{y_{\max} - y_{\min}}
        This value is commonly referred to as the normalized root-mean-square deviation or error (NRMSD or NRMSE), and often expressed as a percentage, where lower values indicate less residual variance.
Another common choice is to normalize by the mean value of the measurements:[3]
        \operatorname{CV(RMSE)} = \frac{\operatorname{RMSD}}{\bar{y}}
When the RMSD is normalized by the mean measured value, it is usually called the coefficient of variation of the RMSE, CV(RMSE). It is analogous to the coefficient of variation with the RMSE taking the place of the standard deviation.
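The RMSD and its two normalizations described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample data are hypothetical and were chosen so that the arithmetic is easy to check by hand.

```python
import math

def rmsd(predicted, observed):
    """Root-mean-square deviation between two equal-length sequences."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def nrmsd_range(predicted, observed):
    """RMSD normalized by the range (max - min) of the observed values."""
    return rmsd(predicted, observed) / (max(observed) - min(observed))

def cv_rmsd(predicted, observed):
    """RMSD normalized by the mean of the observed values."""
    return rmsd(predicted, observed) / (sum(observed) / len(observed))

# Hypothetical data: every prediction is off by exactly 1 unit.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]

print(rmsd(predicted, observed))         # sqrt(4 / 4) = 1.0
print(nrmsd_range(predicted, observed))  # 1.0 divided by the range 6.0
print(cv_rmsd(predicted, observed))      # 1.0 divided by the mean 5.0
```

Because both normalized values divide by a quantity in the units of the data, they are dimensionless, which is what makes comparison across differently scaled datasets possible.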
  1. Hyndman, Rob J.; Koehler, Anne B. (2006). "Another look at measures of forecast accuracy". International Journal of Forecasting 22 (4): 679–688. doi:10.1016/j.ijforecast.2006.03.001.
  2. "Coastal Inlets Research Program (CIRP) Wiki - Statistics". Retrieved 4 February 2015.
  3. "FAQ: What is the coefficient of variation?". Retrieved 4 February 2015.
  4. J. Scott Armstrong and Fred Collopy (1992). "Error Measures For Generalizing About Forecasting Methods: Empirical Comparisons" (PDF). International Journal of Forecasting 8 (1): 69–80. doi:10.1016/0169-2070(92)90008-w.
  5. Anderson, M.P.; Woessner, W.W. (1992). Applied Groundwater Modeling: Simulation of Flow and Advective Transport (2nd ed.). Academic Press.
  6. Ensemble Neural Network Model
Cohen's kappa
        Cohen's kappa coefficient is a statistic which measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, since κ takes into account the agreement occurring by chance. [1] Some researchers[2][citation needed] have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement. Others[3][citation needed] contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess—a very unrealistic scenario.
        Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892),[4] see Smeeton (1985).[5]
The equation for κ is:
        \kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}
where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as defined by Pr(e)), κ = 0.
        The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960.[6]
        A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in terms of how Pr(e) is calculated.
        Note that Cohen's kappa measures agreement between two raters only. For a similar measure of agreement (Fleiss' kappa) used when there are more than two raters, see Fleiss (1971). The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic, not Cohen's kappa. Kappa is also used to compare performance in Machine Learning but the directional version known as Informedness or Youden's J statistic is argued to be more appropriate for supervised learning.[7]
        Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers and each reader either said "Yes" or "No" to the proposal. Suppose the agreement and disagreement counts were as follows, where A and B are the readers, the main diagonal (upper left to lower right) holds the counts of agreements, and the other diagonal holds the counts of disagreements:
                        B: Yes    B: No
        A: Yes            20         5
        A: No             10        15
Note that there were 20 proposals that were granted by both reader A and reader B, and 15 proposals that were rejected by both readers. Thus, the observed proportionate agreement is Pr(a) = (20 + 15) / 50 = 0.70
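To calculate Pr(e) (the probability of random agreement) we note that:
Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.
Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.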
Therefore the probability that both of them would say "Yes" randomly is 0.50 · 0.60 = 0.30 and the probability that both of them would say "No" is 0.50 · 0.40 = 0.20. Thus the overall probability of random agreement is Pr(e) = 0.3 + 0.2 = 0.5.
So now applying our formula for Cohen's Kappa we get:
        \kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40
Same percentages but different numbers
        A case sometimes considered to be a problem with Cohen's Kappa occurs when comparing the Kappa calculated for two pairs of raters, where the two raters in each pair have the same percentage agreement but one pair gives a similar number of ratings in each category while the other pair gives a very different number of ratings.[8] For instance, consider two cases with equal agreement between A and B (60 out of 100 in both cases), so that we would expect the relative values of Cohen's Kappa to reflect this. However, calculating Cohen's Kappa for each:
        First case: \kappa = \frac{0.60 - 0.54}{1 - 0.54} \approx 0.1304
        Second case: \kappa = \frac{0.60 - 0.46}{1 - 0.46} \approx 0.2593
we find that it shows greater similarity between A and B in the second case, compared to the first. This is because while the percentage agreement is the same, the percentage agreement that would occur 'by chance' is significantly higher in the first case (0.54 compared to 0.46).
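As a rough sketch of the calculation described in this section, the following Python function computes κ from a square table of counts. The function name is hypothetical. The example table is the grant-proposal example, with the split of the 15 disagreements into 5 and 10 inferred from the 50% and 60% "Yes" rates and with reader A assumed to be the 50% reader; κ is unchanged if the two readers are swapped.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts.

    table[i][j] is the number of items that rater A placed in category i
    and rater B placed in category j.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    # Pr(a): observed proportion of agreement (main diagonal).
    observed = sum(table[i][i] for i in range(k)) / total
    # Marginal proportions for each rater.
    a_marginal = [sum(table[i]) / total for i in range(k)]
    b_marginal = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    # Pr(e): agreement expected if the raters chose independently.
    expected = sum(a_marginal[i] * b_marginal[i] for i in range(k))
    # Undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)

# Rows: reader A (Yes, No); columns: reader B (Yes, No).
grants = [[20, 5],
          [10, 15]]
print(round(cohen_kappa(grants), 4))  # Pr(a) = 0.70, Pr(e) = 0.50
```

The same formula explains the effect discussed above: holding Pr(a) fixed at 0.60, a larger Pr(e) shrinks the numerator faster than the denominator, so κ falls.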
  1. Carletta, Jean. (1996) Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), pp. 249–254.
  2. Strijbos, J.; Martens, R.; Prins, F.; Jochems, W. (2006). "Content analysis: What are they talking about?". Computers & Education 46: 29–48. doi:10.1016/j.compedu.2005.04.002.
  3. Uebersax, JS. (1987). "Diversity of decision-making models and the measurement of interrater agreement" . Psychological Bulletin 101: 140–146. doi:10.1037/0033-2909.101.1.140.
  4. Galton, F. (1892). Finger Prints Macmillan, London.
  5. Smeeton, N.C. (1985). "Early History of the Kappa Statistic". Biometrics 41: 795.
  6. Cohen, Jacob (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement 20 (1): 37–46. doi:10.1177/001316446002000104
  7. Powers, David M. W. (2012). "The Problem with Kappa" . Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop.
  8. Kilem Gwet (May 2002). "Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity" . Statistical Methods for Inter-Rater Reliability Assessment 2: 1–10.
Terminology and derivations from a confusion matrix
true positive (TP)
eqv. with hit
true negative (TN)
eqv. with correct rejection
false positive (FP)
eqv. with false alarm, Type I error
false negative (FN)
eqv. with miss, Type II error
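sensitivity or true positive rate (TPR)
eqv. with hit rate, recall
\mathrm{TPR} = \frac{\mathrm{TP}}{P} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
specificity (SPC) or true negative rate
\mathrm{SPC} = \frac{\mathrm{TN}}{N} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}
precision or positive predictive value (PPV)
\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}
negative predictive value (NPV)
\mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}}
fall-out or false positive rate (FPR)
\mathrm{FPR} = \frac{\mathrm{FP}}{N} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = 1 - \mathrm{SPC}
false discovery rate (FDR)
\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}} = 1 - \mathrm{PPV}
accuracy (ACC)
\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}}
F1 score
is the harmonic mean of precision and sensitivity
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
Matthews correlation coefficient (MCC)
\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}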
Sources: Fawcett (2006) and Powers (2011).[1][2]
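The derived quantities listed above all follow from the four counts TP, FP, FN, and TN. The Python sketch below computes them under their standard definitions. The function name and the example counts are hypothetical, and the code does not guard against zero denominators (for instance, a test that never returns a positive result).

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Derive common performance measures from confusion-matrix counts."""
    p = tp + fn          # condition positives
    n = fp + tn          # condition negatives
    return {
        "TPR": tp / p,                    # sensitivity, recall, hit rate
        "SPC": tn / n,                    # specificity, true negative rate
        "PPV": tp / (tp + fp),            # precision
        "NPV": tn / (tn + fn),
        "FPR": fp / n,                    # fall-out, equals 1 - SPC
        "FDR": fp / (tp + fp),            # equals 1 - PPV
        "ACC": (tp + tn) / (p + n),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical counts for illustration only.
for name, value in confusion_metrics(tp=40, fp=10, fn=20, tn=30).items():
    print(f"{name}: {value:.4f}")
```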
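        Imagine a study evaluating a new test that screens people for a disease. Each person taking the test either has or does not have the disease. The test outcome can be positive (predicting that the person has the disease) or negative (predicting that the person does not have the disease). The test results for each subject may or may not match the subject's actual status. In that setting:
True positive: sick people correctly identified as sick
False positive: healthy people incorrectly identified as sick
True negative: healthy people correctly identified as healthy
False negative: sick people incorrectly identified as healthy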
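In general, positive = identified and negative = rejected. Therefore:
True positive = correctly identified
False positive = incorrectly identified
True negative = correctly rejected
False negative = incorrectly rejected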
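Let us define an experiment from P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:
                                  Condition positive      Condition negative
        Test outcome positive     True positive (TP)      False positive (FP)
        Test outcome negative     False negative (FN)     True negative (TN)
        Total                     P = TP + FN             N = FP + TN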
Sensitivity and specificity
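        Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function:
Sensitivity (also called the true positive rate, or recall) measures the proportion of positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).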
        For any test, there is usually a trade-off between the measures. For instance, in an airport security setting in which one is testing for potential threats to safety, scanners may be set to trigger on low-risk items like belt buckles and keys (low specificity), in order to reduce the risk of missing objects that do pose a threat to the aircraft and those aboard (high sensitivity). This trade-off can be represented graphically as a receiver operating characteristic curve.
        A perfect predictor would be described as 100% sensitive (e.g., all sick are identified as sick) and 100% specific (e.g., all healthy are not identified as sick); however, theoretically any predictor will possess a minimum error bound known as the Bayes error rate.
        Sensitivity relates to the test's ability to identify a condition correctly. Consider the example of a medical test used to identify a disease. Sensitivity of the test is the proportion of people known to have the disease who test positive for it. Mathematically, this can be expressed as:
        \text{sensitivity} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}
        A negative result in a test with high sensitivity is useful for ruling out disease. A high sensitivity test is reliable when its result is negative, since it rarely misdiagnoses those who have the disease. A test with 100% sensitivity will recognize all patients with the disease by testing positive. A negative test result would definitively rule out presence of the disease in a patient.
        A positive result in a test with high sensitivity is not useful for ruling in disease. Suppose a 'bogus' test kit is designed to show only one reading, positive. When used on diseased patients, all patients test positive, giving the test 100% sensitivity. However, sensitivity by definition does not take into account false positives. The bogus test also returns positive on all healthy patients, giving it a false positive rate of 100%, rendering it useless for diagnosing or "ruling in" the disease.
        Sensitivity is not the same as the precision or positive predictive value (ratio of true positives to combined true and false positives), which is as much a statement about the proportion of actual positives in the population being tested as it is about the test.
        The calculation of sensitivity does not take into account indeterminate test results. If a test cannot be repeated, indeterminate samples either should be excluded from the analysis (the number of exclusions should be stated when quoting sensitivity) or can be treated as false negatives (which gives the worst-case value for sensitivity and may therefore underestimate it).
A test with high sensitivity has a low type II error rate. In non-medical contexts, sensitivity is sometimes called recall.
        Specificity relates to the test's ability to exclude a condition correctly. Consider the example of a medical test for diagnosing a disease. Specificity of a test is the proportion of healthy patients known not to have the disease who will test negative for it. Mathematically, this can also be written as:
        \text{specificity} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}
        A positive result in a test with high specificity is useful for ruling in disease. The test rarely gives positive results in healthy patients. A test with 100% specificity will read negative for all healthy patients, and so accurately exclude disease in them. A positive result therefore signifies a high probability of the presence of disease.[3]
        A negative result in a test with high specificity is not useful for ruling out disease. Assume a 'bogus' test is designed to read only negative. This is administered to healthy patients, and reads negative on all of them. This will give the test a specificity of 100%. Specificity by definition does not take into account false negatives. The same test will also read negative on diseased patients; it therefore has a false negative rate of 100% and will be useless for ruling out disease.
A test with a high specificity has a low type I error rate.
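The two 'bogus' tests discussed above can be reproduced with a short Python sketch. The helper name and the cohort (10 diseased and 90 healthy patients) are hypothetical and serve only to show that a perfect score on one measure says nothing about the other.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for parallel lists of booleans.

    truth[i] is True if patient i has the disease; predicted[i] is True
    if the test read positive for patient i.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical cohort: 10 diseased patients followed by 90 healthy ones.
truth = [True] * 10 + [False] * 90

always_positive = [True] * len(truth)    # bogus kit that only reads positive
always_negative = [False] * len(truth)   # bogus kit that only reads negative

print(sensitivity_specificity(truth, always_positive))  # (1.0, 0.0)
print(sensitivity_specificity(truth, always_negative))  # (0.0, 1.0)
```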
Estimation of errors in quoted sensitivity or specificity
        Sensitivity and specificity values alone may be highly misleading. The 'worst-case' sensitivity or specificity must be calculated in order to avoid reliance on experiments with few results. For example, a particular test may easily show 100% sensitivity if tested against the gold standard four times, but a single additional test against the gold standard that gave a poor result would imply a sensitivity of only 80%. A common way to do this is to state the binomial proportion confidence interval, often calculated using a Wilson score interval.
        Confidence intervals for sensitivity and specificity can be calculated, giving the range of values within which the correct value lies at a given confidence level (e.g., 95%).[12]
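To make the small-sample caveat concrete, the following Python sketch computes a Wilson score interval for a binomial proportion such as sensitivity. The function name is hypothetical, and the default z = 1.96 is the usual approximation for a 95% confidence level.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z is the standard normal quantile; 1.96 corresponds to roughly 95%.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials
                               + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width

# Four true positives out of four diseased subjects: observed sensitivity 100%.
low, high = wilson_interval(4, 4)
print(f"{low:.2f} to {high:.2f}")

# One additional missed case: observed sensitivity 80%.
low, high = wilson_interval(4, 5)
print(f"{low:.2f} to {high:.2f}")
```

With only four results, the lower bound of the interval is close to 0.51 even though the observed sensitivity is 100%, which is the sense in which the quoted value alone can mislead.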
Terminology in information retrieval
        In information retrieval, the positive predictive value is called precision, and sensitivity is called recall. Unlike the Specificity vs Sensitivity tradeoff, these measures are both independent of the number of true negatives, which is generally unknown and much larger than the actual numbers of relevant and retrieved documents. This assumption of very large numbers of true negatives versus positives is rare in other applications.[2]
        The F-score can be used as a single measure of performance of the test for the positive class. The F-score is the harmonic mean of precision and recall:
        F = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
        In the traditional language of statistical hypothesis testing, the sensitivity of a test is called the statistical power of the test, although the word power in that context has a more general usage that is not applicable in the present context. A sensitive test will have fewer Type II errors.
  1. Fawcett, Tom (2006). "An Introduction to ROC Analysis". Pattern Recognition Letters 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010.
  2. Powers, David M W (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation" (PDF). Journal of Machine Learning Technologies 2 (1): 37–63.
  3. "SpPins and SnNouts". Centre for Evidence Based Medicine (CEBM). Retrieved 26 December 2013.
  4. Mangrulkar, Rajesh. "Diagnostic Reasoning I and II". Retrieved 24 January 2012.
  5. "Evidence-Based Diagnosis". Michigan State University.
  6. "Sensitivity and Specificity". Emory University Medical School Evidence Based Medicine course.
  7. Baron, JA (Apr–Jun 1994). "Too bad it isn't true.....". Medical decision making : an international journal of the Society for Medical Decision Making 14 (2): 107. doi:10.1177/0272989X9401400202. PMID 8028462.
  8. Boyko, EJ (Apr–Jun 1994). "Ruling out or ruling in disease with the most sensitive or specific diagnostic test: short cut or wrong turn?". Medical decision making : an international journal of the Society for Medical Decision Making 14 (2): 175–179. doi:10.1177/0272989X9401400210. PMID 8028470.
  9. Pewsner, D; Battaglia, M; Minder, C; Marx, A; Bucher, HC; Egger, M (Jul 24, 2004). "Ruling a diagnosis in or out with "SpPIn" and "SnNOut": a note of caution". BMJ (Clinical research ed.) 329 (7459): 209–13. doi:10.1136/bmj.329.7459.209. PMC 487735. PMID 15271832.
  10. Gale, SD; Perkel, DJ (Jan 20, 2010). "A basal ganglia pathway drives selective auditory responses in songbird dopaminergic neurons via disinhibition". The Journal of neuroscience : the official journal of the Society for Neuroscience 30 (3): 1027–1037. doi:10.1523/JNEUROSCI.3585-09.2010. PMC 2824341. PMID 20089911.
  11. Macmillan, Neil A.; Creelman, C. Douglas (15 September 2004). Detection Theory: A User's Guide. Psychology Press. p. 7. ISBN 978-1-4106-1114-7.
  12. "Diagnostic test online calculator calculates sensitivity, specificity, likelihood ratios and predictive values from a 2x2 table - calculator of confidence intervals for predictive parameters".
Accuracy and precision
        Accuracy and precision are defined in terms of systematic and random errors. The more common definition associates accuracy with systematic errors and precision with random errors. Another definition, advanced by ISO, associates trueness with systematic errors and precision with random errors, and defines accuracy as the combination of both trueness and precision.
Common definition
        In the fields of science, engineering, industry, and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's true value.[1] The precision of a measurement system, related to reproducibility and repeatability, is the degree to which repeated measurements under unchanged conditions show the same results.[1][2] Although the two words precision and accuracy can be synonymous in colloquial use, they are deliberately contrasted in the context of the scientific method.
        A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy. The result would be a consistent yet inaccurate string of results from the flawed experiment. Eliminating the systematic error improves accuracy but does not change precision.
        A measurement system is considered valid if it is both accurate and precise. Related terms include bias (non-random or directed effects caused by a factor or factors unrelated to the independent variable) and error (random variability).
        The terminology is also applied to indirect measurements—that is, values obtained by a computational procedure from observed data.
        In addition to accuracy and precision, measurements may also have a measurement resolution, which is the smallest change in the underlying physical quantity that produces a response in the measurement.
        In numerical analysis, accuracy is also the nearness of a calculation to the true value; while precision is the resolution of the representation, typically defined by the number of decimal or binary digits.
        Statistical literature prefers to use the terms bias and variability instead of accuracy and precision. Bias is the amount of inaccuracy and variability is the amount of imprecision.
Accuracy is the proximity of measurement results to the true value; precision, the repeatability, or reproducibility of the measurement
        In industrial instrumentation, accuracy is the measurement tolerance, or transmission of the instrument and defines the limits of the errors made when the instrument is used in normal operating conditions.[3]
        Ideally a measurement device is both accurate and precise, with measurements all close to and tightly clustered around the true value. The accuracy and precision of a measurement process is usually established by repeatedly measuring some traceable reference standard. Such standards are defined in the International System of Units (abbreviated SI from French: Système international d'unités) and maintained by national standards organizations such as the National Institute of Standards and Technology in the United States.
        This also applies when measurements are repeated and averaged. In that case, the term standard error is properly applied: the precision of the average is equal to the known standard deviation of the process divided by the square root of the number of measurements averaged. Further, the central limit theorem shows that the probability distribution of the averaged measurements will be closer to a normal distribution than that of individual measurements.
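The relationship stated above, that the precision of an average equals the process standard deviation divided by the square root of the number of measurements, can be written as a one-line function. The name and the numbers below are hypothetical and only illustrate the scaling.

```python
import math

def standard_error(process_sd, n_measurements):
    """Standard error of the mean of n independent measurements
    drawn from a process with known standard deviation."""
    return process_sd / math.sqrt(n_measurements)

# Hypothetical instrument with a known standard deviation of 2.0 units.
for n in (1, 4, 16, 64):
    print(n, standard_error(2.0, n))
```

Quadrupling the number of averaged measurements halves the standard error, so gains in precision from averaging diminish as n grows.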
With regard to accuracy we can distinguish:
        A common convention in science and engineering is to express accuracy and/or precision implicitly by means of significant figures. Here, when not explicitly stated, the margin of error is understood to be one-half the value of the last significant place. For instance, a recording of 843.6 m, or 843.0 m, or 800.0 m would imply a margin of 0.05 m (the last significant place is the tenths place), while a recording of 8,436 m would imply a margin of error of 0.5 m (the last significant digits are the units).
        A reading of 8,000 m, with trailing zeroes and no decimal point, is ambiguous; the trailing zeroes may or may not be intended as significant figures. To avoid this ambiguity, the number could be represented in scientific notation: 8.0 × 10^3 m indicates that the first zero is significant (hence a margin of 50 m) while 8.000 × 10^3 m indicates that all three zeroes are significant, giving a margin of 0.5 m. Similarly, it is possible to use a multiple of the basic measurement unit: 8.0 km is equivalent to 8.0 × 10^3 m. In fact, it indicates a margin of 0.05 km (50 m). However, reliance on this convention can lead to false precision errors when accepting data from sources that do not obey it.
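The significant-figures convention can be illustrated with Python's decimal module, which preserves trailing zeroes in a written number. This is a sketch under one assumption that should be stated loudly: every digit in the input string is treated as significant, so a bare "8000" is read as significant to the units place, which is exactly the ambiguity described above. The function name is hypothetical.

```python
from decimal import Decimal

def implied_margin(reading):
    """Half the value of the last written place of a finite decimal string.

    Assumes every written digit is significant.
    """
    # The exponent of the stored value is the power of ten of the last digit.
    exponent = Decimal(reading).as_tuple().exponent
    return Decimal(5) * Decimal(10) ** (exponent - 1)

print(implied_margin("843.6"))    # last place is tenths
print(implied_margin("8436"))     # last place is units
print(implied_margin("8.0E3"))    # last place is hundreds
print(implied_margin("8.000E3"))  # last place is units
```

Passing the reading as a string matters here: converting to a binary float first would discard the trailing zeroes that carry the precision information.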
Precision is sometimes stratified into:
In binary classification
        Accuracy is also used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition.
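        That is, the accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined.[6] To make the context clear by the semantics, it is often referred to as the "Rand accuracy".[citation needed] It is a parameter of the test.
        \text{accuracy} = \frac{\text{number of true positives} + \text{number of true negatives}}{\text{number of true positives} + \text{false positives} + \text{false negatives} + \text{true negatives}}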
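On the other hand, precision or positive predictive value is defined as the proportion of the true positives against all the positive results (both true positives and false positives):
        \text{precision} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}}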
An accuracy of 100% means that the measured values are exactly the same as the given values.
Also see Sensitivity and specificity.
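Accuracy may be determined from sensitivity and specificity, provided prevalence is known, using the equation:
        \text{accuracy} = \text{sensitivity} \times \text{prevalence} + \text{specificity} \times (1 - \text{prevalence})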
        The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. It may be better to avoid the accuracy metric in favor of other metrics such as precision and recall.[citation needed] In situations where the minority class is more important, F-measure may be more appropriate, especially in situations with very skewed class imbalance.
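        Another useful performance measure is the balanced accuracy[citation needed] which avoids inflated performance estimates on imbalanced datasets. It is defined as the arithmetic mean of sensitivity and specificity, or the average accuracy obtained on either class:
        \text{balanced accuracy} = \frac{\text{sensitivity} + \text{specificity}}{2} = \frac{0.5 \cdot \mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} + \frac{0.5 \cdot \mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}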
        If the classifier performs equally well on either class, this term reduces to the conventional accuracy (i.e., the number of correct predictions divided by the total number of predictions). In contrast, if the conventional accuracy is above chance only because the classifier takes advantage of an imbalanced test set, then the balanced accuracy, as appropriate, will drop to chance.[7] A closely related chance corrected measure is:
        \text{informedness} = \text{sensitivity} + \text{specificity} - 1 = 2 \cdot \text{balanced accuracy} - 1
        A direct approach to debiasing and renormalizing Accuracy is Cohen's kappa, whilst Informedness has been shown to be a Kappa-family debiased renormalization of Recall.[9] Informedness and Kappa have the advantage that chance level is defined to be 0, and they have the form of a probability. Informedness has the stronger property that it is the probability that an informed decision is made (rather than a guess), when positive. When negative this is still true for the absolute value of Informedness, but the information has been used to force an incorrect response.[8]
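The accuracy paradox and the two chance-aware alternatives can be seen in a small Python sketch. The function names are hypothetical, as is the scenario: a classifier that always predicts the negative class on a population with 5% prevalence.

```python
def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Accuracy as the prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)

def balanced_accuracy(sensitivity, specificity):
    """Arithmetic mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2

def informedness(sensitivity, specificity):
    """Chance-corrected measure; 0 at chance level, 1 for a perfect test."""
    return sensitivity + specificity - 1

# Hypothetical always-negative classifier: it misses every positive
# (sensitivity 0) and is right on every negative (specificity 1).
sens, spec, prev = 0.0, 1.0, 0.05

print(accuracy_from_rates(sens, spec, prev))  # 0.95
print(balanced_accuracy(sens, spec))          # 0.5
print(informedness(sens, spec))               # 0.0
```

The 95% accuracy reflects only the class imbalance, while balanced accuracy and informedness both report chance-level performance.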
  1. JCGM 200:2008 International vocabulary of metrology — Basic and general concepts and associated terms (VIM)
  2. Taylor, John Robert (1999). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books. pp. 128–129. ISBN 0-935702-75-X.
  3. Creus, Antonio. Instrumentación Industrial[citation needed]
  4. BS ISO 5725-1: "Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions.", p.1 (1994)
  5. BS 5497-1: "Precision of test methods. Guide for the determination of repeatability and reproducibility for a standard test method." (1979)
  6. Metz, CE (October 1978). "Basic principles of ROC analysis" . Semin Nucl Med. 8 (4): 283–98.
  7. Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. (2010). "The balanced accuracy and its posterior distribution". Proceedings of the 20th International Conference on Pattern Recognition: 3121–24.
  8. Powers, David M W (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation" . Journal of Machine Learning Technologies 2 (1): 37–63.
  9. Powers, David M. W. (2012). The Problem with Kappa. Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop.
F1 score
        In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.
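The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:
        F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
The general formula for positive real β is:
        F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}
The formula in terms of type I and type II errors:
        F_\beta = \frac{(1 + \beta^2) \cdot \text{true positive}}{(1 + \beta^2) \cdot \text{true positive} + \beta^2 \cdot \text{false negative} + \text{false positive}}
        Two other commonly used F measures are the F_2 measure, which weights recall higher than precision, and the F_{0.5} measure, which puts more emphasis on precision than recall.
        The F-measure was derived so that F_\beta "measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision".[1] It is based on Van Rijsbergen's effectiveness measure
        E = 1 - \left(\frac{\alpha}{P} + \frac{1 - \alpha}{R}\right)^{-1}
Their relationship is F_\beta = 1 - E where \alpha = \frac{1}{1 + \beta^2}.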
While the F-measure is the harmonic mean of Recall and Precision, the G-measure is the geometric mean.[2]
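The F-measure family and the G-measure can be sketched in Python as follows. The function names and the counts are hypothetical. The example checks that the precision/recall form and the count-based form of the general F-measure agree.

```python
import math

def f_beta(precision, recall, beta=1.0):
    """General F-measure; beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """The same measure written in terms of error counts."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def g_measure(precision, recall):
    """Geometric mean of precision and recall."""
    return math.sqrt(precision * recall)

# Hypothetical counts: precision = 40/50, recall = 40/60.
tp, fp, fn = 40, 10, 20
precision, recall = tp / (tp + fp), tp / (tp + fn)

for beta in (0.5, 1.0, 2.0):
    print(beta, f_beta(precision, recall, beta), f_beta_from_counts(tp, fp, fn, beta))
print(g_measure(precision, recall))
```

Because recall is lower than precision in this example, the score falls as β increases and recall is weighted more heavily.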
  1. Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth.
  2. Powers, David M W (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation" . Journal of Machine Learning Technologies 2 (1): 37–63.
Receiver operating characteristic
        In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate against the false positive rate at various threshold settings. (The true positive rate is also known as sensitivity in biomedical informatics, or recall in machine learning. The false positive rate is also known as the fall-out and can be calculated as 1 − specificity.) The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
        ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
        The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields and was soon introduced to psychology to account for perceptual detection of stimuli. ROC analysis since then has been used in medicine, radiology, biometrics, and other areas for many decades and is increasingly used in machine learning and data mining research.
        The ROC is also known as a relative operating characteristic curve, because it is a comparison of two operating characteristics (TPR and FPR) as the criterion changes.[1]
Basic concept
        A classification model (classifier or diagnosis) is a mapping of instances between certain classes/groups. The classifier or diagnosis result can be a real value (continuous output), in which case the classifier boundary between classes must be determined by a threshold value (for instance, to determine whether a person has hypertension based on a blood pressure measure). Or it can be a discrete class label, indicating one of the classes.
        Let us consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive (p) or negative (n). There are four possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP); however if the actual value is n then it is said to be a false positive (FP). Conversely, a true negative (TN) has occurred when both the prediction outcome and the actual value are n, and false negative (FN) is when the prediction outcome is n while the actual value is p.
        To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but actually does not have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.
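        Let us define an experiment from P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:
                                  Actual value p          Actual value n
        Prediction outcome p      True positive (TP)      False positive (FP)
        Prediction outcome n      False negative (FN)     True negative (TN)
        Total                     P = TP + FN             N = FP + TN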
ROC space
        Several evaluation "metrics" can be derived from the contingency table. To draw a ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed (as functions of some classifier parameter). The TPR is the fraction of all positive samples available during the test that are correctly classified as positive, TPR = TP / (TP + FN). The FPR, on the other hand, is the fraction of all negative samples available during the test that are incorrectly classified as positive, FPR = FP / (FP + TN).
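        The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the small label lists are hypothetical, and the labels "p" and "n" follow the notation used in this section.

```python
def confusion_counts(actual, predicted, positive="p"):
    """Count TP, FP, TN, FN for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted p, actually p
            else:
                fp += 1  # predicted p, actually n
        else:
            if a == positive:
                fn += 1  # predicted n, actually p
            else:
                tn += 1  # predicted n, actually n
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (TPR, FPR); assumes at least one positive and one negative instance."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Hypothetical data: 4 positive and 4 negative instances.
actual = ["p", "p", "p", "n", "n", "n", "n", "p"]
predicted = ["p", "n", "p", "p", "n", "n", "n", "p"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
tpr, fpr = rates(tp, fp, tn, fn)
print(tp, fp, tn, fn, tpr, fpr)
```

        Each such (FPR, TPR) pair is one point in the ROC space.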
        A ROC space is defined by FPR and TPR as x and y axes respectively, which depicts relative trade-offs between true positive (benefits) and false positive (costs). Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs (1 − specificity) plot. Each prediction result or instance of a confusion matrix represents one point in the ROC space.
        The best possible prediction method would yield a point in the upper left corner or coordinate (0,1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The (0,1) point is also called a perfect classification. A completely random guess would give a point along a diagonal line (the so-called line of no-discrimination) from the left bottom to the top right corners (regardless of the positive and negative base rates). An intuitive example of random guessing is a decision by flipping coins (heads or tails). As the size of the sample increases, a random classifier's ROC point tends towards the diagonal line; for a fair coin, it tends towards (0.5,0.5).
        The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random), points below the line poor results (worse than random). Note that the output of a consistently poor predictor could simply be inverted to obtain a good predictor.
Let us look into four prediction results from 100 positive and 100 negative instances:
        Plots of the four results above in the ROC space are given in the figure. The result of method A clearly shows the best predictive power among A, B, and C. The result of B lies on the random guess line (the diagonal line), and it can be seen in the table that the accuracy of B is 50%. However, when C is mirrored across the center point (0.5,0.5), the resulting method C′ is even better than A. This mirrored method simply reverses the predictions of whatever method or test produced the C contingency table. Although the original C method has negative predictive power, simply reversing its decisions leads to a new predictive method C′ which has positive predictive power. When the C method predicts p or n, the C′ method would predict n or p, respectively. In this manner, the C′ test would perform the best. The closer a result from a contingency table is to the upper left corner, the better it predicts, but the distance from the random guess line in either direction is the best indicator of how much predictive power a method has. If the result is below the line (i.e. the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the random guess line.
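        The effect of reversing every prediction can be checked numerically. The sketch below uses hypothetical counts (they are not the values of methods A, B, or C discussed above) and hypothetical function names; it shows that inverting the decisions of a below-diagonal classifier mirrors its ROC point across (0.5,0.5).

```python
def roc_point(tp, fp, tn, fn):
    """Return the (FPR, TPR) coordinates of one contingency table."""
    return fp / (fp + tn), tp / (tp + fn)


def invert(tp, fp, tn, fn):
    """Reverse every prediction: predicted p becomes n and vice versa.

    Old true positives become false negatives, old false positives become
    true negatives, and so on. Returns the new (tp, fp, tn, fn).
    """
    return fn, tn, fp, tp


# Hypothetical worse-than-random result on 100 positives and 100 negatives.
worse = (25, 75, 25, 75)  # tp, fp, tn, fn
fpr, tpr = roc_point(*worse)

mirrored = invert(*worse)
fpr_m, tpr_m = roc_point(*mirrored)

print((fpr, tpr), "->", (fpr_m, tpr_m))
```

        With these counts the original point lies below the diagonal and the inverted classifier lands at (1 − FPR, 1 − TPR), above it.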
Curves in ROC space
        Classifications are often based on a continuous random variable. Write the probability density for belonging to the class as a function of a decision/threshold parameter T as P_1(T) and the probability density of not belonging to the class as P_0(T). The false positive rate FPR is given by FPR(T) = \int_T^\infty P_0(T') \, dT' and the true positive rate is TPR(T) = \int_T^\infty P_1(T') \, dT'. The ROC curve plots parametrically TPR(T) versus FPR(T) with T as the varying parameter.
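        For a finite sample, the integrals are replaced by counts. The following sketch is an empirical analogue under stated assumptions: the scores and labels are hypothetical, an instance is predicted positive when its score is at least the threshold T, and the function name is illustrative only.

```python
def roc_curve_points(scores, labels):
    """Sweep the threshold T from high to low and collect (FPR, TPR) points.

    labels: 1 for positive, 0 for negative. An instance is predicted
    positive when score >= T. Assumes both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_curve_points(scores, labels)
for fpr, tpr in points:
    print(fpr, tpr)
```

        Lowering T can only add predicted positives, so both coordinates are non-decreasing along the curve, which runs from (0,0) to (1,1).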
Further interpretations
Sometimes the ROC is used to generate a summary statistic, such as the area under the curve or the sensitivity index d', both discussed below.
        However, any attempt to summarize the ROC curve into a single number loses information about the pattern of tradeoffs of the particular discriminator algorithm.
Area under the curve
        When using normalized units, the area under the curve (often referred to as simply the AUC, or AUROC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').[5] This can be seen as follows: the area under the curve is given by (the integral boundaries are reversed as large T has a lower value on the x-axis)
        A = \int_{\infty}^{-\infty} TPR(T) \, FPR'(T) \, dT = \int_{-\infty}^{\infty} TPR(T) \, P_0(T) \, dT = \langle TPR \rangle,
        where P_0(T) is the probability density of the score for negative instances, so that FPR'(T) = -P_0(T). The angular brackets denote the average over the distribution of negative samples.
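        It can further be shown that the AUC is closely related to the Mann–Whitney U,[6][7] which tests whether positives are ranked higher than negatives. It is also equivalent to the Wilcoxon test of ranks.[7] The AUC is related to the Gini coefficient G_1 by the formula G_1 = 2 AUC - 1, where the AUC of an empirical curve through the points (X_k, Y_k) = (FPR_k, TPR_k), k = 0, ..., n, ordered by increasing FPR, is approximated by
        AUC \approx \frac{1}{2} \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1})
        In this way, it is possible to calculate the AUC by using an average of a number of trapezoidal approximations.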
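        The two views of the AUC, as a trapezoidal area and as a ranking probability, can be compared on a small example. This is a sketch under stated assumptions: the scores, labels, and ROC points are hypothetical, ties between a positive and a negative score are counted as one half, and the function names are illustrative.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve.

    points: (FPR, TPR) pairs ordered by increasing FPR, from (0,0) to (1,1).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y1 + y0) / 2.0
    return area


def auc_pairwise(scores, labels):
    """Fraction of positive-negative pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: ties count one half
    return wins / (len(pos) * len(neg))


# Hypothetical scores and labels, and the ROC points they produce when the
# threshold is swept from high to low.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = [(0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75),
          (0.5, 0.75), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0)]

area = auc_trapezoid(points)
rank_probability = auc_pairwise(scores, labels)
gini = 2 * area - 1
print(area, rank_probability, gini)
```

        On this data both computations agree, which illustrates the probabilistic interpretation stated above.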
        It is also common to calculate the Area Under the ROC Convex Hull (ROC AUCH = ROCH AUC) as any point on the line segment between two prediction results can be achieved by randomly using one or other system with probabilities proportional to the relative length of the opposite component of the segment.[9] Interestingly, it is also possible to invert concavities – just as in the figure the worse solution can be reflected to become a better solution; concavities can be reflected in any line segment, but this more extreme form of fusion is much more likely to overfit the data.[10]
        The machine learning community most often uses the ROC AUC statistic for model comparison.[11] However, this practice has recently been questioned based upon new machine learning research that shows that the AUC is quite noisy as a classification measure[12] and has some other significant problems in model comparison.[13][14] A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example. However, the critical research[12][13] suggests frequent failures in obtaining reliable and valid AUC estimates. Thus, the practical value of the AUC measure has been called into question,[14] raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution. Nonetheless, the coherence of AUC as a measure of aggregated classification performance has been vindicated, in terms of a uniform rate distribution,[15] and AUC has been linked to a number of other performance metrics such as the Brier score.[16]
        One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system, as well as ignoring the possibility of concavity repair, so that related alternative measures such as Informedness[17] or DeltaP are recommended.[18] These measures are essentially equivalent to the Gini for a single prediction point with DeltaP' = Informedness = 2AUC-1, whilst DeltaP = Markedness represents the dual (viz. predicting the prediction from the real class) and their geometric mean is the Matthews correlation coefficient.[17]
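        The single-point identity Informedness = 2AUC-1 can be verified directly. The sketch below assumes the usual definition Informedness = TPR − FPR (sensitivity + specificity − 1), which this section does not spell out, and takes the "AUC" of a single prediction point to be the area under the two line segments joining (0,0), (FPR, TPR) and (1,1). The function names and input values are hypothetical.

```python
def informedness(tpr, fpr):
    """Informedness of a single prediction point (assumed: TPR - FPR)."""
    return tpr - fpr


def single_point_auc(tpr, fpr):
    """Area under the segments (0,0) -> (fpr, tpr) -> (1,1)."""
    left = fpr * tpr / 2.0               # triangle below the first segment
    right = (1.0 - fpr) * (tpr + 1.0) / 2.0  # trapezoid below the second segment
    return left + right


tpr, fpr = 0.75, 0.25  # hypothetical operating point
auc = single_point_auc(tpr, fpr)
print(auc, 2 * auc - 1, informedness(tpr, fpr))
```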
Other measures
        In engineering, the area between the ROC curve and the no-discrimination line is sometimes preferred (equivalent to subtracting 0.5 from the AUC), and referred to as the discrimination.[citation needed] In psychophysics, the Sensitivity Index d' (d-prime), ΔP' or DeltaP' is the most commonly used measure[19] and is equivalent to twice the discrimination, being equal also to Informedness, deskewed WRAcc and Gini Coefficient in the single point case (single parameterization or single system).[17] These measures all have the advantage that 0 represents chance performance whilst 1 represents perfect performance, and -1 represents the "perverse" case of full informedness used to always give the wrong response.[20]
        These varying choices of scale are fairly arbitrary since chance performance always has a fixed value: for AUC it is 0.5, but these alternative scales bring chance performance to 0 and allow them to be interpreted as Kappa statistics. Informedness has been shown to have desirable characteristics for Machine Learning versus other common definitions of Kappa such as Cohen Kappa and Fleiss Kappa.[17][21]
        Sometimes it can be more useful to look at a specific region of the ROC Curve rather than at the whole curve. It is possible to compute partial AUC.[22] For example, one could focus on the region of the curve with low false positive rate, which is often of prime interest for population screening tests.[23] Another common approach for classification problems in which P ≪ N (common in bioinformatics applications) is to use a logarithmic scale for the x-axis.[24]
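        A partial AUC restricted to low false positive rates can be sketched with the same trapezoidal rule, cutting the curve off at a chosen maximum FPR. This is only an illustration: the ROC points and the function name are hypothetical, the result is left unnormalized, and the curve is linearly interpolated at the cutoff.

```python
def partial_auc(points, max_fpr):
    """Area under the ROC curve for FPR in [0, max_fpr].

    points: (FPR, TPR) pairs ordered by increasing FPR, starting at (0,0).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break  # the rest of the curve lies beyond the cutoff
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical empirical ROC curve.
points = [(0.0, 0.0), (0.0, 0.5), (0.25, 0.5), (0.25, 0.75), (0.5, 0.75), (1.0, 1.0)]

low_fpr_area = partial_auc(points, 0.375)
full_area = partial_auc(points, 1.0)
print(low_fpr_area, full_area)
```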
Detection error tradeoff graph
        An alternative to the ROC curve is the detection error tradeoff (DET) graph, which plots the false negative rate (missed detections) vs. the false positive rate (false alarms) on non-linearly transformed x- and y-axes. The transformation function is the quantile function of the normal distribution, i.e., the inverse of the cumulative normal distribution. It is, in fact, the same transformation as zROC, below, except that the complement of the hit rate, the miss rate or false negative rate, is used. This alternative spends more graph area on the region of interest. Most of the ROC area is of little interest; one primarily cares about the region tight against the y-axis and the top left corner, which, because the miss rate is used instead of its complement, the hit rate, is the lower left corner in a DET plot. The DET plot is used extensively in the automatic speaker recognition community, where the name DET was first used. The analysis of the ROC performance in graphs with this warping of the axes was used by psychologists in perception studies in the mid-20th century, where this was dubbed "double probability paper".[citation needed]
        If a standard score is applied to the ROC curve, the curve will be transformed into a straight line.[25] This z-score is based on a normal distribution with a mean of zero and a standard deviation of one. In memory strength theory, one must assume that the zROC is not only linear, but has a slope of 1.0. The normal distributions of targets (studied objects that the subjects need to recall) and lures (non-studied objects that the subjects attempt to recall) are the factor causing the zROC to be linear.
        The linearity of the zROC curve depends on the standard deviations of the target and lure strength distributions. If the standard deviations are equal, the slope will be 1.0. If the standard deviation of the target strength distribution is larger than the standard deviation of the lure strength distribution, then the slope will be smaller than 1.0. In most studies, it has been found that the zROC curve slopes consistently fall below 1, usually between 0.5 and 0.9.[26] Many experiments yielded a zROC slope of 0.8. A slope of 0.8 implies that the variability of the target strength distribution is 25% larger than the variability of the lure strength distribution.[27]
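        The axis transformation used by the DET graph can be sketched with the normal quantile function from Python's standard library. The function name and the sample rates are hypothetical; rates of exactly 0 or 1 have no finite normal deviate and are assumed to be excluded.

```python
from statistics import NormalDist


def det_coordinates(fpr, fnr):
    """Map (FPR, FNR) to the normal-deviate scale of a DET plot.

    Both rates must lie strictly between 0 and 1.
    """
    probit = NormalDist().inv_cdf  # quantile function of the standard normal
    return probit(fpr), probit(fnr)


# Hypothetical operating points given as (false alarm rate, miss rate).
operating_points = [(0.01, 0.40), (0.05, 0.20), (0.10, 0.10), (0.50, 0.50)]
det_points = [det_coordinates(fpr, fnr) for fpr, fnr in operating_points]
for x, y in det_points:
    print(round(x, 3), round(y, 3))
```

        Small error rates are stretched out by the transformation, which is why the region near the lower left corner of a DET plot is easier to read than the corresponding corner of a ROC plot.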
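        The relation between the zROC slope and the two standard deviations can be checked numerically. The sketch assumes normally distributed strengths with hypothetical parameters (lures with mean 0 and standard deviation 1, targets with mean 1 and standard deviation 1.25) and a "yes" response whenever strength exceeds a criterion; the function names are illustrative.

```python
from statistics import NormalDist


def zroc_points(target, lure, criteria):
    """Return (z(false alarm rate), z(hit rate)) for each decision criterion."""
    z = NormalDist().inv_cdf
    points = []
    for c in criteria:
        hit_rate = 1.0 - target.cdf(c)        # targets with strength above c
        false_alarm_rate = 1.0 - lure.cdf(c)  # lures with strength above c
        points.append((z(false_alarm_rate), z(hit_rate)))
    return points


def slope(points):
    """Slope of the line through the first and last zROC points."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (y1 - y0) / (x1 - x0)


lure = NormalDist(mu=0.0, sigma=1.0)
target = NormalDist(mu=1.0, sigma=1.25)  # 25% more variable than the lures

points = zroc_points(target, lure, [-0.5, 0.0, 0.5, 1.0, 1.5])
zroc_slope = slope(points)
print(round(zroc_slope, 4))
```

        Under these assumptions the slope equals the ratio of the lure to the target standard deviation, 1 / 1.25 = 0.8, in line with the 25% figure quoted above.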
        Another variable used is d' (d prime) (discussed above in "Other measures"), which can easily be expressed in terms of z-values. Although d' is a commonly used parameter, it must be recognized that it is only relevant when strictly adhering to the very strong assumptions of strength theory made above.[28]
        The z-score of an ROC curve is always linear, as assumed, except in special situations. The Yonelinas familiarity-recollection model is a two-dimensional account of recognition memory. Instead of the subject simply answering yes or no to a specific input, the subject gives the input a feeling of familiarity, which operates like the original ROC curve. What changes, though, is a parameter for Recollection (R). Recollection is assumed to be all-or-none, and it trumps familiarity. If there were no recollection component, zROC would have a predicted slope of 1. However, when adding the recollection component, the zROC curve will be concave up, with a decreased slope. This difference in shape and slope results from an added element of variability due to some items being recollected. Patients with anterograde amnesia are unable to recollect, so their Yonelinas zROC curve would have a slope close to 1.0.[29]
ROC curves beyond binary classification
        The extension of ROC curves for classification problems with more than two classes has always been cumbersome, as the degrees of freedom increase quadratically with the number of classes, and the ROC space has c(c-1) dimensions, where c is the number of classes.[35] Some approaches have been made for the particular case with three classes (three-way ROC).[36] The calculation of the volume under the ROC surface (VUS) has been analyzed and studied as a performance metric for multi-class problems.[37] However, because of the complexity of approximating the true VUS, some other approaches [38] based on an extension of AUC are more popular as an evaluation metric.
        Given the success of ROC curves for the assessment of classification models, the extension of ROC curves for other supervised tasks has also been investigated. Notable proposals for regression problems are the so-called regression error characteristic (REC) Curves [39] and the Regression ROC (RROC) curves.[40] In the latter, RROC curves become extremely similar to ROC curves for classification, with the notions of asymmetry, dominance and convex hull. Also, the area under RROC curves is proportional to the error variance of the regression model.
        The ROC curve is related to the lift and uplift curves,[41][42] which are used in uplift modelling. The ROC curve itself has also been used as the optimization metric in uplift modelling.[43][44]
  1. Swets, John A.; Signal detection theory and ROC analysis in psychology and diagnostics : collected papers, Lawrence Erlbaum Associates, Mahwah, NJ, 1996
  2. Fawcett, Tom (2006). "An Introduction to ROC Analysis". Pattern Recognition Letters 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010.
  3. Fogarty, James; Baker, Ryan S.; Hudson, Scott E. (2005). "Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction". ACM International Conference Proceeding Series, Proceedings of Graphics Interface 2005. Waterloo, ON: Canadian Human-Computer Communications Society.
  4. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.).
  5. Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
  6. Hanley, James A.; McNeil, Barbara J. (1982). "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve". Radiology 143 (1): 29–36. doi:10.1148/radiology.143.1.7063747. PMID 7063747.
  7. Mason, Simon J.; Graham, Nicholas E. (2002). "Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation". Quarterly Journal of the Royal Meteorological Society (128): 2145–2166.
  8. Hand, David J.; and Till, Robert J. (2001); A simple generalization of the area under the ROC curve for multiple class classification problems, Machine Learning, 45, 171–186.
  9. Provost, F.; Fawcett, T. (2001). "Robust classification for imprecise environments". Machine Learning 44: 203–231.
  10. Flach, P.A.; Wu, S. (2005). "Repairing concavities in ROC curves". 19th International Joint Conference on Artificial Intelligence (IJCAI'05). pp. 702–707.
  11. Hanley, James A.; McNeil, Barbara J. (1983-09-01). "A method of comparing the areas under receiver operating characteristic curves derived from the same cases". Radiology 148 (3): 839–43. doi:10.1148/radiology.148.3.6878708. PMID 6878708. Retrieved 2008-12-03.
  12. Hanczar, Blaise; Hua, Jianping; Sima, Chao; Weinstein, John; Bittner, Michael; and Dougherty, Edward R. (2010); Small-sample precision of ROC-related estimates, Bioinformatics 26 (6): 822–830
  13. Lobo, Jorge M.; Jiménez-Valverde, Alberto; and Real, Raimundo (2008), AUC: a misleading measure of the performance of predictive distribution models, Global Ecology and Biogeography, 17: 145–151
  14. Hand, David J. (2009); Measuring classifier performance: A coherent alternative to the area under the ROC curve, Machine Learning, 77: 103–123
  15. Flach, P.A.; Hernandez-Orallo, J.; Ferri, C. (2011). "A coherent interpretation of AUC as a measure of aggregated classification performance". Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 657–664.
  16. Hernandez-Orallo, J.; Flach, P.A.; Ferri, C. (2012). "A unified view of performance metrics: translating threshold choice into expected classification loss". Journal of Machine Learning Research 13: 2813–2869.
  17. Powers, David M W (2007/2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". Journal of Machine Learning Technologies 2 (1): 37–63.
  18. Powers, David M.W. (2012). "The Problem of Area Under the Curve". International Conference on Information Science and Technology.
  19. Perruchet, P.; Peereman, R. (2004). "The exploitation of distributional information in syllable processing". J. Neurolinguistics 17: 97–119. doi:10.1016/s0911-6044(03)00059-9.
  20. Powers, David M. W. (2003). "Recall and Precision versus the Bookmaker". Proceedings of the International Conference on Cognitive Science (ICSC- 2003), Sydney Australia, 2003, pp.529-534.
  21. Powers, David M. W. (2012). "The Problem with Kappa". Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop.
  22. McClish, Donna Katzman (1989-08-01). "Analyzing a Portion of the ROC Curve". Medical Decision Making 9 (3): 190–195. doi:10.1177/0272989X8900900307. PMID 2668680. Retrieved 2008-09-29.
  23. Dodd, Lori E.; Pepe, Margaret S. (2003). "Partial AUC Estimation and Regression". Biometrics 59 (3): 614–623. doi:10.1111/1541-0420.00071. PMID 14601762. Retrieved 2007-12-18.
  24. Karplus, Kevin (2011); Better than Chance: the importance of null models, University of California, Santa Cruz, in Proceedings of the First International Workshop on Pattern Recognition in Proteomics, Structural Biology and Bioinformatics (PR PS BB 2011)
  25. MacMillan, Neil A.; Creelman, C. Douglas (2005). Detection Theory: A User's Guide (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates. ISBN 1-4106-1114-0.
  26. Glanzer, Murray; Kim, Kisok; Hilford, Andy; Adams, John K. (1999). "Slope of the receiver-operating characteristic in recognition memory". Journal of Experimental Psychology: Learning, Memory, and Cognition 25 (2): 500–513. doi:10.1037/0278-7393.25.2.500.
  27. Ratcliff, Roger; McKoon, Gail; Tindall, Michael (1994). "Empirical generality of data from recognition memory ROC functions and implications for GMMs". Journal of Experimental Psychology: Learning, Memory, and Cognition 20: 763–785. doi:10.1037/0278-7393.20.4.763.
  28. Zhang, Jun; Mueller, Shane T. (2005). "A note on ROC analysis and non-parametric estimate of sensitivity". Psychometrika 70: 203–212.
  29. Yonelinas, Andrew P.; Kroll, Neal E. A.; Dobbins, Ian G.; Lazzara, Michele; Knight, Robert T. (1998). "Recollection and familiarity deficits in amnesia: Convergence of remember-know, process dissociation, and receiver operating characteristic data". Neuropsychology 12: 323–339. doi:10.1037/0894-4105.12.3.323.
  30. Green, David M.; Swets, John A. (1966). Signal detection theory and psychophysics. New York, NY: John Wiley and Sons Inc. ISBN 0-471-32420-5.
  31. Zweig, Mark H.; Campbell, Gregory (1993). "Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine" (PDF). Clinical Chemistry 39 (8): 561–577. PMID 8472349.
  32. Pepe, Margaret S. (2003). The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford. ISBN 0-19-856582-8.
  33. Obuchowski, Nancy A. (2003). "Receiver operating characteristic curves and their use in radiology". Radiology 229 (1): 3–8. doi:10.1148/radiol.2291010898. PMID 14519861.
  34. Spackman, Kent A. (1989). "Signal detection theory: Valuable tools for evaluating inductive learning". Proceedings of the Sixth International Workshop on Machine Learning. San Mateo, CA: Morgan Kaufmann. pp. 160–163.
  35. Srinivasan, A. (1999). "Note on the Location of Optimal Classifiers in N-dimensional ROC Space". Technical Report PRG-TR-2-99, Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford.
  36. Mossman, D. (1999). "Three-way ROCs". Medical Decision Making 19: 78–89. doi:10.1177/0272989x9901900110.
  37. Ferri, C.; Hernandez-Orallo, J.; Salido, M.A. (2003). "Volume under the ROC Surface for Multi-class Problems". Machine Learning: ECML 2003. pp. 108–120.
  38. Hand, D.J.; Till, R.J. (2001). "A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems". Machine Learning 45: 171–186. doi:10.1023/A:1010920819831.
  39. Bi, J.; Bennett, K.P. (2003). "Regression error characteristic curves". Twentieth International Conference on Machine Learning (ICML-2003). Washington, DC.
  40. Hernandez-Orallo, J. (2013). "ROC curves for regression". Pattern Recognition 46 (12): 3395–3411. doi:10.1016/j.patcog.2013.06.014.
  41. Tufféry, Stéphane (2011); Data Mining and Statistics for Decision Making, Chichester, GB: John Wiley & Sons, translated from the French Data Mining et statistique décisionnelle (Éditions Technip, 2008)
  42. Kuusisto, Finn; Santos Costa, Vitor; Nassif, Houssam; Burnside, Elizabeth; Page, David; Shavlik, Jude (2014). "Support Vector Machines for Differential Prediction". European Conference on Machine Learning (ECML'14) (Nancy, France): 50–65.
  43. Nassif, Houssam; Kuusisto, Finn; Burnside, Elizabeth; Shavlik, Jude (2013). "Uplift Modeling with ROC: An SRL Case Study" (PDF). International Conference on Inductive Logic Programming (Rio de Janeiro, Brazil): 40–45 Late Breaking Papers.
  44. Nassif, Houssam; Wu, Yirong; Page, David; Burnside, Elizabeth (2012). "Logical Differential Prediction Bayes Net, Improving Breast Cancer Diagnosis for Older Women". American Medical Informatics Association Symposium (AMIA'12) (Chicago): 1330–1339. Retrieved 18 July 2014.
Matthews correlation coefficient
        The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975.[1] It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation. The statistic is also known as the phi coefficient. MCC is related to the chi-square statistic for a 2×2 contingency table
        |MCC| = \sqrt{\frac{\chi^2}{n}}
where n is the total number of observations.
        While there is no perfect way of describing the confusion matrix of true and false positives and negatives by a single number, the Matthews correlation coefficient is generally regarded as being one of the best such measures[citation needed]. Other measures, such as the proportion of correct predictions (also termed accuracy), are not useful when the two classes are of very different sizes. For example, assigning every object to the larger set achieves a high proportion of correct predictions, but is not generally a useful classification.
The MCC can be calculated directly from the confusion matrix using the formula:
        MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
        In this equation, TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. If any of the four sums in the denominator is zero, the denominator can be arbitrarily set to one; this results in a Matthews correlation coefficient of zero, which can be shown to be the correct limiting value.
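        The calculation from the four confusion-matrix counts, including the zero-denominator convention just described, can be sketched as follows. The function name and the example counts are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # One of the four sums is zero: set the denominator to one,
        # which yields an MCC of zero.
        denominator = 1.0
    return numerator / denominator


perfect = mcc(tp=50, tn=50, fp=0, fn=0)
reversed_labels = mcc(tp=0, tn=0, fp=50, fn=50)
# Assigning every object to the larger class: 90% accuracy, but MCC of zero.
majority_only = mcc(tp=0, tn=90, fp=0, fn=10)
print(perfect, reversed_labels, majority_only)
```

        The last case corresponds to the situation described earlier, in which a high proportion of correct predictions does not indicate a useful classification.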
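The original formula as given by Matthews was:[1]
        N = TN + TP + FN + FP
        S = \frac{TP + FN}{N}
        P = \frac{TP + FP}{N}
        MCC = \frac{TP / N - S \times P}{\sqrt{P S (1 - S)(1 - P)}}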
        This is equal to the formula given above. As a correlation coefficient, the Matthews correlation coefficient is the geometric mean of the regression coefficients of the problem and its dual. The component regression coefficients of the Matthews correlation coefficient are markedness (deltap) and informedness (deltap').[2][3]
  1. Matthews, B. W. (1975). "Comparison of the predicted and observed secondary structure of T4 phage lysozyme". Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442–451. doi:10.1016/0005-2795(75)90109-9.
  2. Perruchet, P.; Peereman, R. (2004). "The exploitation of distributional information in syllable processing". J. Neurolinguistics 17: 97–119. doi:10.1016/s0911-6044(03)00059-9.
  3. Powers, David M W (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation" . Journal of Machine Learning Technologies 2 (1): 37–63.
  4. Fawcett, Tom (2006). "An Introduction to ROC Analysis". Pattern Recognition Letters 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010.