CroScalar Tutorial: Multi-Scale Attribute Uncertainty (LOCAL VERSION)

Introduction

This tutorial focuses on how to represent and visualize attribute uncertainty, using wetlands as a case study. We will use two resolutions of attribute uncertainty (referred to as 'fine' and 'coarse') in wetland classification and create visualizations to assist in analyzing and interpreting the spatial patterns of uncertainty across multiple scales. We will use confusion matrices to identify areas and types of misclassification across the two attribute resolutions: the coarse resolution identifies a Boolean distinction between the presence and absence of wetlands, and the fine resolution identifies agreement and misclassification across specific sub-classes (riverine, palustrine, and lacustrine wetlands). Using the confusion matrices, we will construct multi-scale data pyramids to help determine where attribute uncertainty is extreme or highly variable. Although each pyramid contains multi-scale uncertainty in a single unified framework, we can "slice" the pyramid to view individual focal scales and compute statistical summaries of spatial patterns. The ultimate goal is to provide uncertainty metrics and display tools that help researchers and practitioners identify specific areas and scales where potential gaps in data reliability may exist and may therefore be prioritized for additional attention or more frequent field checks. While this tutorial focuses on wetlands, the methodology can be used to assess attribute uncertainty in other categorical data such as land use and land cover.

Prerequisite Knowledge

We assume you have a basic working knowledge of GIS, Python, and MATLAB.

Setup

Anaconda

This tutorial requires the use of Anaconda. If you do not already have Anaconda, please download and install it now. We recommend following the preset settings during installation.

Note: When choosing the install location for Anaconda, ensure that the path does not contain spaces (e.g., "C:/Users/K Tyler/Anaconda3"), as this may prevent Anaconda from running successfully.
Anaconda Environment

In Anaconda, we will set up an environment specific to this tutorial. Open the Anaconda Navigator. On the left hand side of the window, click on the Environments tab. Here you can view and control your environments. If you just installed Anaconda, you should currently only have a base environment. We will create a new environment specific to this tutorial:

  1. Towards the bottom left of the window, click 'Create'.
  2. A small window will pop up:
    • Give your environment a name (Ex: "tutorial_env")
    • By "Packages" leave Python as the selection
    • Click the dropdown next to Python and select 3.7
    • Click 'Create' to create your new environment
  3. Your new environment is now created with a few default packages (pip, setuptools, etc.). We will now install additional packages needed to successfully run through this tutorial:
    • Right above the list of existing packages in your environment, click the dropdown that currently reads 'Installed' and select 'All'.
    • In the search bar, search for the following packages and select the checkbox to the left of each name:
      • numpy
      • pandas
      • scipy
      • xlsxwriter
      • gdal
      • spyder
    • Once you've selected all of the above, click 'Apply'. After a moment, you will get a notification that a number of packages will be installed (including the ones you've selected plus additional dependent packages); go ahead and click 'Apply'. It may take a few minutes for all packages to be installed to your environment. Once the installation is complete, you can close the Anaconda Navigator.
    • There is just one more package to install which we will install through the Anaconda Prompt. Open up the Anaconda Prompt. Activate your tutorial environment by typing conda activate [tutorial environment name] and hitting enter. Then type pip install lidar and hit enter.
Note: You may receive an error the first time you try to install lidar. If you receive an error stating that "Microsoft Visual C++ 14.0 is required", you will need to download the Microsoft C++ Build Tools, install the "Desktop Development with C++" workload, restart your machine, and then rerun the pip command to install lidar.
Tip: You want to make sure you are always activating your environment in the Anaconda Prompt before installing anything to your tutorial environment or opening Spyder to run the tutorial code.

MATLAB

This tutorial also requires the use of MATLAB. If you do not already have MATLAB, please download and install it now. Note that MATLAB is proprietary software, but you may have free access through your university or employer. Follow the preset settings during installation.

Once downloaded and installed, open MATLAB. At the top of the window, you should see that you are on the 'Home' tab. Click on the 'Add-Ons' button to open the Add-On Explorer window. In the search bar, search for 'image processing toolbox' and select the 'Image Processing Toolbox' result that comes up. Click 'Sign In to Install', sign in, then click 'Install'. Follow the installation steps.

This tutorial was created in June 2021, working with Anaconda 4.10.1, Python 3.7.10, and MATLAB R2021a. Python package versions used: lidar 0.6.1, numpy 1.20.2, pandas 1.2.4, scipy 1.6.2, XlsxWriter 1.3.8, GDAL 3.0.2

We recommend that you have one folder that contains all of the tutorial materials. Within your tutorial folder you should have Part 1, Part 2, and Part 3 folders as well as all code files.

Part 0: Data Pre-Processing

Some data pre-processing and cleaning was done for the Part 1 inputs. If you are interested in adapting this tutorial to a different area, here are the pre-processing steps you can take.

Part 1

Part 1A: Data Input

Open up Spyder from the Anaconda Prompt. You can do this by activating your tutorial environment, typing spyder in the prompt, and hitting enter. Then, load the first Python script: Part1_HGM_NWI_Compare.py.

Tip: The easiest way to load scripts in Spyder is to simply drag and drop them into Spyder from your file browser.

You need to change the path in the script to point to where you've stored the Part 1 folder. Scroll to line 43 and update the set_path variable to point to your Part 1 folder, as in the example below.
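
A minimal example of what the updated line might look like (the set_path variable name comes from the script; the folder path shown is only a placeholder for wherever you stored the tutorial materials):

  # Line 43 of Part1_HGM_NWI_Compare.py: point set_path at your own Part 1 folder.
  # The path below is only a placeholder; avoid spaces in the path (see the Setup note above).
  set_path = r"C:/Tutorials/CroScalar/Part1"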

Once you've updated the set_path variable, you can run the script. This will input the data, generate the HGM, and create the confusion matrices all in one script.

Part 1B: HGM Generation

Understanding Wetlands Classification Systems

Cowardin System

The Cowardin classification system includes five classes: riverine, lacustrine, palustrine, marine, and estuarine (FGDC, 2013). NWI is the largest database of geospatial wetlands data within the United States and uses the Cowardin classification. This tutorial uses NWI as the data source for the Cowardin classed wetlands. Using NWI, three wetland types are found within the study area: 566,902 acres of palustrine (44.3%), 217,458 acres of riverine (17.0%), and 68,720 acres of lacustrine (5.4%) wetlands. Wetlands dominate the study area, covering about 67% of the landscape.

Note: As part of data pre-processing, the NWI data was modified to remove open water from the classification in order to make the data comparable to the HGM system, which does not account for open water.

HGM System

The HGM classification system includes seven classes reflecting landscape functionality: riverine, depressional, slope, mineral soil flats, organic soil flats, tidal fringe, and lacustrine fringe (Smith et al., 2013). In order to be comparable with the Cowardin system, the HGM classification was modified into three classes matching the Cowardin classes present within the study area (riverine, lacustrine, and palustrine). Riverine and lacustrine classes exist within both classification systems; however, the palustrine class is only in the Cowardin system. To better align the two systems a “palustrine” class was created for the HGM by combining depressional, slope wetlands, mineral soil flats, and organic soil flats definitions. The HGM palustrine class describes wetlands whose main water sources are groundwater or precipitation rather than flows from rivers or lakes. No geospatial layers using the HGM exist for this study area, so a wetlands database was created using ancillary data and the modified HGM classification system. This process is described below, and there are published examples of research that used GIS data to create an HGM dataset (Cedfeldt et al., 2000; Adamus et al., 2010; Van Deventer et al., 2016; Rivers-Moore et al., 2020).

In addition to inputting the data, the script you just ran also generates the HGM dataset. To do so, the code uses ancillary data including hydrography, hydric soils, and elevation layers. All rivers and lakes larger than eight hectares are given a 30m buffer, and areas within this buffer with hydric, predominantly hydric, or partially hydric soils are classed as riverine. Areas adjacent to lakes with hydric, predominantly hydric, or partially hydric soils are classed as lacustrine. Elevation is used to locate topographic depressions and slopes greater than two percent; where these have hydric, predominantly hydric, or partially hydric soils, they are classed as palustrine, along with areas that have completely hydric soils.
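
To make this classification logic concrete, below is a minimal sketch using numpy boolean masks. It assumes the ancillary layers have already been rasterized to aligned arrays; the array names, class codes, and the order in which classes overwrite one another are illustrative and are not taken directly from the tutorial script.

  import numpy as np

  # Hypothetical input masks (True where the condition holds for a pixel), all on the same grid.
  river_buffer = np.zeros((200, 200), dtype=bool)       # within the 30 m buffer of a river or lake > 8 ha
  lake_adjacent = np.zeros((200, 200), dtype=bool)      # adjacent to a lake
  hydric_soils = np.zeros((200, 200), dtype=bool)       # hydric, predominantly hydric, or partially hydric soils
  fully_hydric = np.zeros((200, 200), dtype=bool)       # completely hydric soils
  depress_or_slope = np.zeros((200, 200), dtype=bool)   # topographic depression or slope greater than two percent

  hgm = np.zeros((200, 200), dtype=np.uint8)            # 0 = non-wetland

  # Palustrine: depressions or steep slopes on hydric-ish soils, plus completely hydric soils.
  hgm[(depress_or_slope & hydric_soils) | fully_hydric] = 3
  # Lacustrine: hydric-ish soils adjacent to lakes (overwrites palustrine where both apply).
  hgm[lake_adjacent & hydric_soils] = 2
  # Riverine: hydric-ish soils inside the 30 m river/lake buffer.
  hgm[river_buffer & hydric_soils] = 1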

The HGM dataset was intended for comparability with NWI. It was not created with the goal to be a more accurate wetlands model. Several assumptions were made in generating the HGM dataset that may not be true in every geographic case. For example, it is not correct to assume that soils classed as hydric are always wetlands, or that wetlands adjacent to rivers and lakes are always riverine and lacustrine respectively. Wetland classification is more nuanced and requires on-the-ground validation for the highest accuracy. However, the purpose of this analysis is to demonstrate a method and data framework within which to explore multi-scale patterns of spatial uncertainty, rather than to establish a spatially precise model of wetlands within the study area.

Part 1C: Confusion Matrices

Understanding Confusion Matrices

Here we are using confusion matrices to compare the NWI and HGM wetlands classes. We are using the NWI data as the validation dataset and the HGM as the test dataset; this is not to imply that the NWI dataset is more or less reliable than the HGM dataset. Both wetland datasets incur uncertainties resulting from their compilation, processing, and temporal resolution. The confusion matrices simply allow us to identify the agreement and disagreement between the two datasets. Where the two wetland classification systems agree, we have higher certainty that the classification of the wetland presence or type is reliable. Conversely, where there is disagreement, there is higher uncertainty.

Understanding Evaluation Metrics

Within the confusion matrices, there are three evaluation metrics used to quantify uncertainty: Recall, Precision, and F1.
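
All three are standard metrics computed from confusion-matrix counts (Powers, 2011). As a quick reference, here is a minimal sketch of how they are derived from true positive, false positive, and false negative counts; the counts shown are placeholders, not values from the tutorial data.

  # Evaluation metrics from confusion-matrix counts (placeholder pixel counts).
  tp, fp, fn = 120, 44, 236

  precision = tp / (tp + fp)                            # share of test-dataset wetland pixels that the validation dataset agrees with
  recall = tp / (tp + fn)                               # share of validation-dataset wetland pixels that the test dataset agrees with
  f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
  print(round(precision, 3), round(recall, 3), round(f1, 3))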

Generating and Mapping Confusion Matrices

To check your outputs, navigate into your Part 1 Output folder and open Full_Keys.xlsx, which contains both your coarse and fine confusion matrices. The first tab in the workbook contains your coarse confusion matrix and should look like this:

This coarse resolution assesses the presence versus absence of wetlands. The matrix shows that we have 12.04% true positives (green) in our study area and 59.94% true negatives (white) - both refer to areas where NWI and HGM agree on the presence or absence of wetlands. We also have 4.41% false positives (red) and 23.62% false negatives (blue) - both referring to areas where NWI and HGM disagree. Since these matrices use NWI as the validation dataset, false positives are areas where the HGM classifies a wetland while NWI does not, and false negatives are areas where the HGM classifies no wetlands while NWI classifies the presence of wetlands.
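
If you prefer to inspect the matrices programmatically rather than in Excel, the workbook can also be read with pandas. This is a minimal sketch; the relative path is a placeholder, and the sheet layout may require adjusting the header or index arguments.

  import pandas as pd

  # Read both tabs of the Part 1 output workbook; adjust the path to your Part 1 Output folder.
  workbook = "Part1/Output/Full_Keys.xlsx"
  coarse = pd.read_excel(workbook, sheet_name=0)
  fine = pd.read_excel(workbook, sheet_name=1)
  print(coarse)
  print(fine)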

The second tab in the workbook contains the fine resolution matrix and should look like this:

This fine resolution confusion matrix compares not only the presence and absence of wetlands, but also the agreement of the specific wetlands type (riverine, lacustrine, or palustrine). The first column shows false negatives (wetlands in NWI but not in HGM) for all categories while the top row shows false positives (wetlands in HGM but not in NWI). The diagonal shows cells that agree on the presence/absence of wetlands in both data sets as well as on their specific type. The six gray cells indicate a misclassification at the finer level, namely that both classification systems agree that wetlands are present but disagree on the type. The percentage values in the matrix indicate the proportion of the pixels in each of the sixteen cases for the entire study area. The cells are color-coded, with hue (blue, green, purple) referring to wetland type. Saturation and value are used to distinguish false negatives and false positives.
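
If you would like to reproduce this kind of tally yourself, a cross-tabulation of two aligned class rasters yields the same sixteen-case breakdown. The sketch below uses random placeholder rasters and illustrative class codes (0 = non-wetland, 1-3 = the three wetland classes); it is not the tutorial's own implementation.

  import numpy as np
  import pandas as pd

  # Placeholder class rasters standing in for the NWI (validation) and HGM (test) surfaces.
  nwi = np.random.randint(0, 4, size=(200, 200))
  hgm = np.random.randint(0, 4, size=(200, 200))

  # Percentage of pixels falling in each of the sixteen NWI/HGM class combinations.
  matrix = pd.crosstab(nwi.ravel(), hgm.ravel(), rownames=["NWI"], colnames=["HGM"], normalize=True) * 100
  print(matrix.round(2))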

Other outputs from Part 1 include:

Mapping the Confusion Matrices in ArcGIS
  1. Open ArcMap and load in coarse_matrix.tif
  2. Open up the properties of the coarse_matrix layer and go to the Symbology tab.
  3. On the left side of the Properties window, click 'Unique Values'.
  4. A message will pop up saying 'Raster attribute table doesn't exist. Do you want to build attribute table?'
  5. Click 'Yes'.
  6. Towards the bottom left of the Properties window, click on the 'Colormap' button. A dropdown will appear; click on 'Import a colormap...'
  7. In the file explorer window that pops up, navigate to Part1>Output>Colormaps. Within the Colormaps folder, double click on Coarse Surface colors.clr.
  8. Click 'OK' to apply your changes and close the Properties window. You'll notice that the colors used correspond to the colors of your confusion matrix in Full_Keys.xlsx.
  9. To map at the fine resolution repeat the steps above, using fine_matrix.tif and Fine Surface Colors.clr.

Note: These instructions were made using ArcMap 10.8

Mapping the Confusion Matrices in QGIS
  1. Open up QGIS and load in coarse_matrix.tif
  2. Open up the layer properties of coarse_matrix and go to symbology.
  3. Under 'Render type', change the dropdown selection to 'Paletted/Unique values', then click the 'Classify' button. You should see the values '0', '1', '10', and '11' pop up.
  4. Change the colors to match the confusion matrix. Double click each color chip and change the HTML notation in the 'Select color' window that pops up. Then click 'OK' to move on to the next color.
    • Change the color chip for '0' to white (#FFFFFF).
    • Change the color chip for '1' to blue (#116DF7).
    • Change the color chip for '10' to red (#F71111).
    • Change the color chip for '11' to green (#11F737).
  5. Once you've changed all the colors, click 'Apply' to apply your changes and 'OK' to close the layer properties window.
  6. You have now mapped the coarse confusion matrix in QGIS!
  7. To map the fine matrix follow the steps above using the fine_matrix.tif and the following colors:
    • '0' - white (#FFFFFF)
    • '1' - dark blue (#0084A8)
    • '2' - dark purple (#8400A8)
    • '3' - dark green (#2D8700)
    • '10' - light blue (#BEE8FF)
    • '11' - medium blue (#00C3FF)
    • '12', '13', '21', '23', '31', and '32' - all gray (#9C9C9C)
    • '20' - light purple (#E8BEFF)
    • '22' - medium purple (#C300FF)
    • '30' - light green (#D3FFBE)
    • '33' - medium green (#4CE600)

Note: These instructions were made using QGIS 3.16

Part 2: Building Pyramids

Understanding the Pyramid Data Framework

After creating the confusion matrix surfaces, a progressive focal window analysis is used to transform the coarse and fine uncertainty surfaces into a pyramid data framework. The area used for the scale analysis is a 200-pixel by 200-pixel section of the False River subset. Note that you can use a larger or smaller area to build the pyramids, but it must be a square area. Starting with a three-by-three-pixel window, the focal window is moved across the entire surface and the desired metric is calculated for the pixels falling within it. For each window position, the calculated metric is stored in the center pixel of the window. After the entire surface has been processed, a single layer of the pyramid is complete. The analysis then iterates with increasingly larger focal window sizes, and the resulting calculations are stored in new layers of the pyramid data framework. As the focal window size increases, more of the edge areas are excluded and the number of pixels containing data decreases. This results in a pyramid shape with the full data layer on the bottom and a single cell at the top that encompasses the entire study area.
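
The sketch below illustrates the progressive focal-window procedure on a small square surface. It is a simplified stand-in rather than the tutorial's Part 2 code: the metric computed in each window here is just the window mean, whereas the tutorial scripts compute Recall, Precision, and F1 from the confusion-matrix surfaces.

  import numpy as np

  def build_pyramid(surface, metric=np.mean):
      """Apply progressively larger focal windows and stack the results into a pyramid."""
      n = surface.shape[0]                        # the surface must be square
      layers = []
      for w in range(3, n + 1, 2):                # odd window sizes: 3, 5, 7, ...
          half = w // 2
          layer = np.full(surface.shape, np.nan)  # edge pixels excluded by the window stay NaN
          for i in range(half, n - half):
              for j in range(half, n - half):
                  window = surface[i - half:i + half + 1, j - half:j + half + 1]
                  layer[i, j] = metric(window)    # store the metric in the window's center pixel
          layers.append(layer)
      return np.stack(layers)                     # shape: (number of scales, n, n)

  pyramid = build_pyramid(np.random.rand(21, 21))
  print(pyramid.shape)

A single focal scale can then be pulled out as one slice of the stacked array (for example, pyramid[0] for the three-by-three windows) and summarized with ordinary statistics such as np.nanmean.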

The pyramid data framework allows multi-scale summary statistics to be stored in a single unified framework that can then be visualized, summarized, and sliced. Animation is possible to slide vertically through the pyramid. Summary statistics can be calculated for each slice allowing you to easily compare the data at different scales. This allows for easy viewing of misclassification at multiple scales and determination of uncertainty patterns within the data across spatial and attribute resolutions.

Running Code to Build the Pyramids

In this section of the tutorial we will run three separate Python scripts. We'll start with "Part2_HGM_NWI_Compare_subset.py". Go ahead and open this script in Spyder via Anaconda.

Before we run this script, we need to update the paths. On line 25, update the path to point to your Part 2 folder.

Once you've updated the set_path variable, go ahead and run the script. This will clip the study area to the square analysis area for creating the pyramids.

The next script we'll run is "Part2B_focal_stats_coarse_res.py". In this script, change the set_path variable on line 206 to point to your Part 2 folder.

Once you've updated the path, run the script. This will create the pyramid at the coarse attribute uncertainty level.

The last script we'll run for Part 2 is "Part2C_focal_stats_fine_res.py". Similar to the last script, you just need to change the set_path variable on line 300 to point to your Part 2 folder. Once you've done that, you can run the script to create the fine pyramids. This script may take a while to run; depending on your computer speed, it could take around 20 minutes.

Congratulations! You are now done with the first two parts of the tutorial, and you have completed running all of the Python scripts. For part 3, we will turn to MATLAB.

Part 3: Visualization

In this final part of the tutorial, we will use MATLAB to visualize the pyramids. You may notice that all of the .mat files that were output in Part 2 serve as our inputs for Part 3.

Using your file browser, double click on "Part3_Pyramid_model.m" to open the script in MATLAB. You shouldn't need to make any edits to the code for it to run successfully. However, you will need to add a folder connection to the Part 3 folder. You can do this in the left panel ("Current folder") by navigating to your Part 3 folder, then right click on the folder and click Add to Path > Selected Folder and Subfolders.

In your output folder, you should now have 35 files. These outputs include visualizations from the coarse attribute F1 pyramid, coarse attribute Precision pyramid, coarse attribute Recall pyramid, a fine attribute pyramid showing F1 scores for the palustrine class, and another fine attribute pyramid showing F1 scores for the non-wetland class. For each of these you have 7 files: 5 .fig files that show a slice of the pyramid (these can be viewed in MATLAB and exported as jpgs, pngs, pdfs, etc.), one .avi file that shows an animation of sliding through multiple slices of the pyramid, and one Excel sheet providing more detail about the F1, Recall, or Precision metrics. Note that files ending in "_75.fig" are scratch outputs and not meant for meaningful interpretation.

Below is an example of what your F1_slice5 should look like. Note that in the MATLAB interface you can click and drag to manipulate the perspective.

Congratulations! You have now completed the Multi-Scale Attribute Uncertainty tutorial. Keep checking [website; need link] for more CroScalar tutorials and for a deeper dive into this content check out [link to Kate's paper when it's published?]

Cited References

Adamus, P., Christy, J., Jones, A., McCune, M., and Bauer, J. (2010). A Geodatabase and Digital Characterization of Wetlands Mapped in the Willamette Valley With Particular Reference to Prediction of Their Hydrogeomorphic (HGM) Class [Report to USEPA Region 10].

Cedfeldt, P. T., Watzin, M. C., and Richardson, B. D. (2000). Using GIS to Identify Functionally Significant Wetlands in the Northeastern United States. Environmental Management, 26(1), 13–24. https://doi.org/10.1007/s002670010067

FGDC (Federal Geographic Data Committee). (2013). Classification of Wetlands and Deepwater Habitats of the United States (FGDC-STD-004-2013, Second Edition). Wetlands Subcommittee, Federal Geographic Data Committee and U.S. Fish and Wildlife Service.

Powers, D.M.W. (2011). Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness and Correlation. International Journal of Machine Learning Technology, 2(1), 37–63.

Rivers-Moore, N. A., Kotze, D. C., Job, N., and Mohanlal, S. (2020). Prediction of Wetland Hydrogeomorphic Type Using Morphometrics and Landscape Characteristics. Frontiers in Environmental Science, 8. https://doi.org/10.3389/fenvs.2020.00058

Smith, R. D., Noble, C. V., and Berkowitz, J. F. (2013). Hydrogeomorphic (HGM) Approach to Assessing Wetland Functions: Guidelines for Developing Guidebooks (Version 2). Defense Technical Information Center.

Van Deventer, H., Nel, J., Mbona, N., Job, N., Ewart-Smith, J., Snaddon, K., and Maherry, A. (2016). Desktop classification of inland wetlands for systematic conservation planning in data-scarce countries: Mapping wetland ecosystem types, disturbance indices and threatened species associations at country-wide scale. Aquatic Conservation: Marine and Freshwater Ecosystems, 26(1), 57–75. https://doi.org/10.1002/aqc.2605