Reference data for assessing classification accuracy of a satellite image


Earlier we wrote about assessing classification accuracy. That post dealt with building a confusion matrix, but it did not exhaust the topic. When assessing the accuracy of automated classification results, one has to answer two questions:

  • which assessment method should be selected?
  • where and how to get the reference data for assessment?

Here we address the second question. We look at its practical side using ENVI software as an example, although the approach is the same in any similar software.

There are four approaches to creating reference data (in no particular order):

  1. use a map covering the whole classified area as the reference; it can be created by manually digitizing the satellite image or from existing vector layers;
  2. use the training sample of the supervised classification;
  3. use a map covering only fragments of the classified area;
  4. use points whose actual class is determined by fieldwork or by visual interpretation.

These four approaches differ in the labour they require and in the reliability of the resulting assessment. Figure 1 shows how they relate to one another.




Fig. 1. The system of approaches to creating a reference set for classification


The first approach gives the most representative data because it covers the whole classified area. It is useful when a small fragment of an image is being classified. For a large image, however, it requires a huge amount of digitizing, and the effort makes little sense: if there is enough time to digitize the whole image, there is no need to classify it. Moreover, in most cases manual digitizing is more accurate than automated classification.

The second, third and fourth approaches avoid the main shortcoming of the first one, its extremely high labour cost. They do so by building the reference only for separate fragments of the study area. The price is a new problem: the sample has to be representative.

The second approach is the simplest and the least laborious. It needs no additional work to create a sample for checking accuracy: the training sample prepared before supervised classification is reused for the assessment. However, sufficient representativeness of that sample cannot be guaranteed.

We need to assess accuracy for the entire classified map, but with the second approach we can assess it only for the fragments that spatially coincide with the training sample. If these fragments are representative, the assessment is reliable. In practice they may not be: a user draws them manually, and that process is subjective to some extent. In addition, this approach tends to overestimate classification accuracy, because accuracy is naturally higher where training pixels were present than where they were absent. So an assessment made this way may be unreliable.

The third and fourth approaches avoid this subjectivity because the sample used for the assessment is generated automatically. Such a sample can be either random or systematic. A systematic sample places sample locations at a constant interval. It can be used to create the reference when classification errors are known to be randomly distributed in space. If the errors may be arranged regularly (periodically) in space, a random sample should be used instead; otherwise a systematic sample could keep hitting either only the locations that contain errors or only the locations that are free of them.
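The difference between the two sampling schemes can be sketched in a few lines of Python with NumPy. This is only an illustration under assumptions: the function names, the 200 × 200 map size and the artificial error pattern (an error in every 10th column) are hypothetical and do not come from the ENVI workflow.

```python
import numpy as np

def systematic_sample(n_rows, n_cols, step, offset=0):
    """Return (row, col) positions on a regular grid with a fixed interval."""
    rows = np.arange(offset, n_rows, step)
    cols = np.arange(offset, n_cols, step)
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.column_stack([rr.ravel(), cc.ravel()])

def random_sample(n_rows, n_cols, n_points, seed=0):
    """Return n_points distinct (row, col) positions drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    flat = rng.choice(n_rows * n_cols, size=n_points, replace=False)
    return np.column_stack(np.unravel_index(flat, (n_rows, n_cols)))

# Hypothetical error map: every 10th column is misclassified (true error rate 10%).
n_rows, n_cols = 200, 200
errors = np.zeros((n_rows, n_cols), dtype=bool)
errors[:, ::10] = True

# A grid step of 20 is a multiple of the error period, so every grid point
# falls on an erroneous column.
sys_pts = systematic_sample(n_rows, n_cols, step=20)
rnd_pts = random_sample(n_rows, n_cols, n_points=len(sys_pts))

sys_error_rate = errors[sys_pts[:, 0], sys_pts[:, 1]].mean()
rnd_error_rate = errors[rnd_pts[:, 0], rnd_pts[:, 1]].mean()
print(f"systematic: {sys_error_rate:.2f}, random: {rnd_error_rate:.2f}")
```

In this contrived case the systematic sample reports that every checked pixel is wrong, while the random sample of the same size is not tied to the period of the errors.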

In the third approach, the fragments for which a reference is created are outlined around the sample points. This can be done by manually digitizing the image, or in the field, where class boundaries are mapped with surveying instruments. The more points there are in the sample and the larger the outlined fragments, the more representative the reference will be. One therefore has to decide how to raise representativeness while keeping labour down: by increasing the fragment size or by increasing the number of fragments. The optimal choice is a compromise between the two.

The fourth approach is a further modification of the third one: the fragments used for accuracy assessment shrink to the size of a single pixel. Making such a reference takes the least labour. The class of each sample point can be determined by visual interpretation of satellite images, but ideally it should be checked in a field survey. To do so, the point coordinates are uploaded to a GPS receiver, the points are located in the field and their actual classes are recorded. The more points there are in the sample, the more representative it is and the more reliable the accuracy assessment will be.

In the fourth approach, a random sample can be generated in two ways. Assessment points can be chosen for the entire area at once, ignoring class boundaries and class areas (a simple random sample), or they can be chosen separately for each class (a stratified sample). The second way is used when class areas differ significantly.

In a stratified sample, the ratio of assessment points between classes may strictly follow the ratio of class areas. Such a sample is called proportionally stratified. If it does not follow that ratio, the sample is called disproportionally stratified. This variant is used when class areas differ very strongly: it is then recommended to increase the number of points for classes with extremely small areas and to decrease it for classes with extremely large areas. The extreme case of a disproportionally stratified sample is one where the same number of random points is generated for every class.
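The two ways of distributing points between classes can be sketched as follows. This is a minimal illustration, not the procedure used in ENVI: the helper names and the synthetic 100 × 100 class map are assumptions made for the example.

```python
import numpy as np

def proportional_allocation(class_areas, n_total):
    """Split n_total points between classes in proportion to class areas."""
    areas = np.asarray(list(class_areas.values()), dtype=float)
    raw = n_total * areas / areas.sum()
    counts = np.floor(raw).astype(int)
    # Hand out the points lost to rounding, largest remainder first.
    leftover = n_total - counts.sum()
    order = np.argsort(raw - counts)[::-1]
    counts[order[:leftover]] += 1
    return dict(zip(class_areas, counts.tolist()))

def equal_allocation(class_areas, n_total):
    """Extreme disproportional case: the same number of points per class."""
    per_class = n_total // len(class_areas)
    return {label: per_class for label in class_areas}

def stratified_sample(class_map, allocation, seed=0):
    """Draw random pixel positions separately inside each class of class_map."""
    rng = np.random.default_rng(seed)
    sample = {}
    for label, n in allocation.items():
        rows, cols = np.nonzero(class_map == label)
        picked = rng.choice(len(rows), size=min(n, len(rows)), replace=False)
        sample[label] = np.column_stack([rows[picked], cols[picked]])
    return sample

# Hypothetical class map with very unequal class areas (500 to 5000 pixels).
class_map = np.empty((100, 100), dtype=int)
class_map[:5] = 1
class_map[5:20] = 2
class_map[20:50] = 3
class_map[50:] = 4
class_areas = {label: int((class_map == label).sum()) for label in (1, 2, 3, 4)}

proportional = proportional_allocation(class_areas, 100)
equal = equal_allocation(class_areas, 100)
points = stratified_sample(class_map, proportional)
print(proportional, equal)
```

With proportional allocation the smallest class receives only 5 of the 100 points, which is why the number of points for small classes is usually raised.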


Let us illustrate how the assessment of classification accuracy depends on the way the reference is obtained. We use the example from the previous post. The classification map and the reference are shown in Figure 2. There are four classes: water bodies (blue), deciduous forests (green), coniferous forests (dark green) and other (grey). The map was produced by the parallelepiped algorithm, and the reference by manual digitizing.

Comparing the classification map with the reference for the entire area gives an overall accuracy of 94.30%. Let us assume that the reference map is true. Under that assumption, this value is the most correct one, and we use it as the benchmark for the accuracy estimates obtained with the other approaches.
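Overall accuracy is simply the share of pixels whose class on the classification map matches the reference. A minimal sketch with NumPy is given below; the function name, the optional `mask` argument and the tiny 3 × 4 arrays are illustrative assumptions, not the rasters from Figure 2.

```python
import numpy as np

def overall_accuracy(classified, reference, mask=None):
    """Share of pixels where the classified map agrees with the reference.

    mask (optional boolean array) restricts the comparison to the pixels
    for which reference data exist, e.g. fragments or sample points.
    """
    classified = np.asarray(classified)
    reference = np.asarray(reference)
    if mask is None:
        mask = np.ones(classified.shape, dtype=bool)
    return float((classified[mask] == reference[mask]).mean())

# Toy rasters with class codes 1..4 (hypothetical values).
classified = np.array([[1, 1, 2, 2],
                       [1, 3, 2, 2],
                       [4, 4, 4, 2]])
reference = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [4, 4, 4, 4]])

full = overall_accuracy(classified, reference)           # whole area: 10 of 12 pixels agree
top_rows = np.zeros(classified.shape, dtype=bool)
top_rows[:2] = True
partial = overall_accuracy(classified, reference, top_rows)  # first two rows: 7 of 8 agree
print(round(full, 4), partial)
```

The same function covers all four approaches: only the mask that marks where the reference is available changes.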


Fig. 2. The classification result (on the left) and the reference for accuracy assessment created by applying the first approach (on the right)


Now let us see what accuracy we get when the map is compared with the reference created by the second approach. In this case the overall accuracy is 99.49%, which is an overestimate. Still, the check is not useless: a low value would be a very bad sign, because it would mean that the classification map corresponds poorly even to the training sample, and so would correspond to the real land cover even worse.


Fig. 3. The reference created by applying the second approach (on the left) and the classified map with boundaries of reference fragments (outlined with a red line, on the right)


The third and fourth approaches are based on random samples. Randomness may seem worrying, since there is always some probability of drawing an unlucky sample and thus getting a wrong accuracy estimate. To show that the risk is not that great, we ran an experiment. For our classification map we generated a set of random samples: 50 samples by the third approach and 50 samples by the fourth.

For the third approach, each sample consisted of 8 circular fragments with a radius of 13 pixels. An example is shown in Figure 4.
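A fragment sample of this kind can be represented as a boolean mask over the image. The sketch below is an assumption about how such a mask might be built with NumPy, not the tool used for the experiment; the function name and the 400 × 400 image size are hypothetical.

```python
import numpy as np

def circular_fragments_mask(shape, n_fragments=8, radius=13, seed=0):
    """Boolean mask marking pixels inside randomly placed circular fragments."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = shape
    rr, cc = np.ogrid[:n_rows, :n_cols]
    mask = np.zeros(shape, dtype=bool)
    # Keep centres at least one radius from the edge so no circle is clipped.
    centre_rows = rng.integers(radius, n_rows - radius, size=n_fragments)
    centre_cols = rng.integers(radius, n_cols - radius, size=n_fragments)
    for r0, c0 in zip(centre_rows, centre_cols):
        mask |= (rr - r0) ** 2 + (cc - c0) ** 2 <= radius ** 2
    return mask

mask = circular_fragments_mask((400, 400))
single = circular_fragments_mask((100, 100), n_fragments=1)
print(single.sum(), mask.sum())
```

With this rasterisation rule one fragment covers 529 pixels, so 8 fragments cover at most 4232 pixels, and fewer if they overlap. The exact pixel count depends on how the circle is rasterised.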


Fig. 4. A reference set created by applying the third approach (on the left) and the classified map with boundaries of reference fragments (outlined with a red line, on the right)


Figure 5 shows the distribution of accuracy estimates obtained by the third approach. They range from 90.01% to 97.92%, with a mean of 94.33%; half of the values lie between 93.82% and 95.13%.


Fig. 5. Histogram of estimated accuracy values obtained by the third approach


For the fourth approach, we generated 50 stratified random samples of 100 points each. Sample points were distributed disproportionally between classes: 10 points for water bodies, 20 for deciduous forests, 30 for coniferous forests and 40 for other. An example is shown in Figure 6.
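When points are distributed disproportionally, the plain share of correctly classified points and an area-weighted estimate can differ. The sketch below shows the standard stratified estimator, in which the accuracy within each class is weighted by the share of the map that class occupies. The area shares and the counts of correct points are hypothetical; only the numbers of points per class are taken from the experiment, and the post does not state which of the two estimates was used.

```python
def stratified_overall_accuracy(area_share, n_points, n_correct):
    """Area-weighted overall accuracy: sum over classes of W_c * correct_c / n_c."""
    total = sum(area_share.values())
    return sum(
        area_share[c] / total * n_correct[c] / n_points[c] for c in area_share
    )

# Points per class as in the experiment; shares and correct counts are made up.
n_points = {"water": 10, "deciduous": 20, "coniferous": 30, "other": 40}
n_correct = {"water": 10, "deciduous": 18, "coniferous": 28, "other": 38}
area_share = {"water": 0.10, "deciduous": 0.40, "coniferous": 0.30, "other": 0.20}

unweighted = sum(n_correct.values()) / sum(n_points.values())
weighted = stratified_overall_accuracy(area_share, n_points, n_correct)
print(round(unweighted, 4), round(weighted, 4))
```

The two values coincide only when points are allocated in proportion to class areas.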


Fig. 6. A reference created by applying the fourth approach (on the left) and the classified map with reference points (red dots, on the right)


Figure 7 shows the distribution of accuracy estimates obtained by the fourth approach. They range from 89.00% to 99.00%, with a mean of 93.62%; half of the values lie between 92.00% and 95.00%.


Fig. 7. Histogram of estimated accuracy values obtained by the fourth approach


This raises the next question: which approach is better, the third or the fourth? To answer it, we compare the accuracy estimates obtained with each approach. Figure 8 shows box-and-whisker plots of these estimates, one for the point samples and one for the fragment samples. Both approaches give almost the same result on average, very close to the true accuracy of 94.30%. The spread of the estimates differs, however: it is wider for sample points than for sample fragments. The third approach therefore gives a greater probability of obtaining an accuracy estimate close to the true value.

Why is this so? An accuracy estimate depends on the sample size: the bigger the sample, the higher the probability of getting an estimate close to the true value. In our example the fourth approach gave a sample of 100 points, and the third approach a sample of 4238 pixels. If we had used not 8 but, say, 30 fragments with the same total area, the estimate would be more reliable still.
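A rough feel for the effect of sample size comes from the textbook standard error of a proportion, sqrt(p(1 − p) / n). The sketch below applies it to the two sample sizes from the example. It assumes independent, simple random draws, which is only an approximation here: the point samples were stratified, and pixels inside one fragment are neighbours and not independent of each other.

```python
import math

def accuracy_standard_error(p, n):
    """Standard error of an accuracy estimate p from n independent sample units."""
    return math.sqrt(p * (1.0 - p) / n)

true_accuracy = 0.943  # overall accuracy from the full reference map

se_points = accuracy_standard_error(true_accuracy, 100)    # point sample
se_pixels = accuracy_standard_error(true_accuracy, 4238)   # fragment sample, idealised
print(f"100 points: {se_points:.4f}, 4238 pixels: {se_pixels:.4f}")
```

Under these assumptions the expected spread is about 2.3 percentage points for 100 points and about 0.36 for 4238 independent pixels. The spread actually observed for fragments in Figure 5 is wider than this idealised figure, which is consistent with neighbouring pixels carrying less information than independent ones, and with the remark that many small fragments serve better than a few large ones.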


Fig. 8. Box-and-whisker plots of accuracy estimates obtained by the third and the fourth approaches


To illustrate how the number of elements in a sample affects the accuracy estimate, we ran another experiment. For our classification example, we generated random reference samples of different sizes (20, 40, 60, 80, 100 and 120 pixels), 50 samples for each size. Points in each sample were distributed equally between classes, and a classification accuracy estimate was computed from every sample. The left panel of Figure 9 shows that the mean estimated accuracy is always close to the true value. The variance of the estimates, however, differs: it decreases markedly as the sample size grows (Figure 9, right).
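An experiment of this kind is easy to imitate on synthetic data. The sketch below is not a reproduction of the experiment above: it assumes a hypothetical map of 250,000 pixels in which each pixel is correct with probability 0.943, and it uses simple random sampling rather than an equal number of points per class.

```python
import numpy as np

rng = np.random.default_rng(42)
true_accuracy = 0.943

# Hypothetical agreement map: True where classification matches the reference.
agreement = rng.random(250_000) < true_accuracy

sample_sizes = [20, 40, 60, 80, 100, 120]
n_repeats = 50  # samples per size, as in the experiment

results = {}
for n in sample_sizes:
    estimates = [
        rng.choice(agreement, size=n, replace=False).mean()
        for _ in range(n_repeats)
    ]
    results[n] = (float(np.mean(estimates)), float(np.std(estimates)))

for n, (mean, std) in results.items():
    print(f"n={n:3d}  mean={mean:.3f}  std={std:.3f}")
```

The mean of the estimates stays near the true value for every sample size, while their standard deviation shrinks as the sample grows.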


Fig. 9. Dependence of differences between real and calculated classification accuracy on the sample size


To conclude, the most effective reference set for assessing classification accuracy is a random sample. The bigger the sample, the higher the probability of getting an accuracy estimate close to the true value. In practice, choosing the sample size and type (points vs. fragments) is a compromise between maximising the quality of the assessment and minimising labour. Ideally, the classes assigned to the sample objects should additionally be checked in a field survey.