CroScalar Tutorial: Multi-Scale Attribute Uncertainty

Introduction

This tutorial focuses on how to represent and visualize attribute uncertainty, using wetlands as a case study. We will use two resolutions of attribute uncertainty (referred to as 'fine' and 'coarse') in wetland classification and create visualizations to assist in analyzing and interpreting the spatial patterns of uncertainty across multiple scales. We will use confusion matrices to identify areas and types of misclassification across the two attribute resolutions: the coarse resolution captures a Boolean distinction between the presence and absence of wetlands, and the fine resolution captures agreement and misclassification across specific sub-classes (riverine, palustrine, and lacustrine wetlands). Using the confusion matrices, we will construct multi-scale data pyramids to aid in determining where attribute uncertainty is extreme or highly variable. Even though the pyramid contains multi-scale uncertainty in a single unified framework, we can "slice" the pyramid to view individual focal scales and compute statistical summaries of spatial patterns. The ultimate goal is to provide uncertainty metrics and display tools that help researchers and practitioners identify specific areas and scales where potential gaps in data reliability may exist and may therefore be prioritized for additional attention or more frequent field checks. While this tutorial focuses on wetlands, the methodology can be used to assess attribute uncertainty in other categorical data such as land use and land cover.

Prerequisite Knowledge

We assume you have a basic working knowledge of GIS, Python, and MATLAB.

Setup

Google Colaboratory

We will use Google Colaboratory (Colab) to run the Python portions of the tutorial (Parts 1 and 2). To get set up with Colab, you simply need access to Google Drive via a Google account. You may need to connect the Colab app to your Google Drive. To do this, navigate to your Drive, click 'New' in the top left corner, then hover over 'More'. If you see Colab as an option, you are all set; if not, click 'Connect more apps'. Search for 'Colaboratory', click on the Colaboratory tile, and then click 'Install'.

To run the Python scripts in Colab, download the Colab Tutorial Parts 1 and 2 folder to your computer, unzip the folder, and then upload the unzipped folder to your Google Drive. Google will then open the .ipynb files with Colab.

MATLAB

This tutorial (Part 3) also requires the use of MATLAB. If you do not already have MATLAB, please download and install it now. Note that MATLAB is proprietary software, but you may have free access through your university or employer. Accept the default settings during installation.

Once downloaded and installed, open MATLAB. At the top of the window, you should see that you are on the 'Home' tab. Click on the 'Add-Ons' button to open the Add-On Explorer window. In the search bar, search for 'image processing toolbox' and select the 'Image Processing Toolbox' result that comes up. Click 'Sign In to Install', sign in, then click 'Install'. Follow the installation steps.

To run Part 3 of the tutorial, download the Tutorial Part 3 folder to your computer and unzip it. There is no need to upload this portion to Google Drive since MATLAB will run locally.

This tutorial was created in June 2021, working with MATLAB R2021a.

Part 0: Data Pre-Processing

Some data pre-processing and cleaning was done for the Part 1 inputs. If you are interested in adapting this tutorial to a different area, here are the pre-processing steps you can take.

Part 1

Part 1A: Data Input

Open the "Part1_HGM_NWI_Compare.ipynb" file in your Google Drive. Run the first block of code, which will load the necessary Python packages.

You will also need to mount your Google Drive. The code to do so is in the second block. When you run the code, a link will pop up to verify that you want to mount your Drive. Click the link, copy the verification code, return to the Colab notebook, input the verification code, and hit enter.
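For reference, mounting Drive in Colab uses the standard two-line snippet below; the notebook's second block should contain something equivalent.

    # Standard Colab call to mount Google Drive into the runtime;
    # Colab will prompt for authorization when it runs.
    from google.colab import drive
    drive.mount('/content/drive')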

Next, double-check that the paths in the next block of code point to your tutorial folder in your Drive. You likely will not need to make any changes.

Note: If you need to change the path, the easiest way is to open the Files panel on the left by clicking the folder icon. Navigate through drive > MyDrive and find where you stored your tutorial folder. When you find it, right-click and select 'Copy path', then paste the path inside the quotes passed to the os.chdir function.
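For example, if you stored the tutorial folder directly in MyDrive, the block would end up looking something like the sketch below (the folder name is hypothetical; paste in your own copied path).

    import os

    # Hypothetical path: replace with the path copied from the Files panel.
    os.chdir('/content/drive/MyDrive/Colab_Tutorial_Parts_1_and_2')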

Run the next, large block of code. This will input the data, generate the HGM dataset, and create the confusion matrices, all in one script.

Part 1B: HGM Generation

Understanding Wetlands Classification Systems

Cowardin System

The Cowardin classification system includes five classes: riverine, lacustrine, palustrine, marine, and estuarine (FGDC, 2013). The National Wetlands Inventory (NWI) is the largest database of geospatial wetlands data within the United States and uses the Cowardin classification. This tutorial uses NWI as the data source for the Cowardin-classed wetlands. Using NWI, three wetland types are found within the study area: 566,902 acres of palustrine (44.3%), 217,458 acres of riverine (17.0%), and 68,720 acres of lacustrine (5.4%) wetlands. Wetlands dominate the study area, covering about 67% of the landscape.

Note: As part of data pre-processing, the NWI data was modified to remove open water from the classification in order to make the data comparable to the HGM system, which does not account for open water.

HGM System

The hydrogeomorphic (HGM) classification system includes seven classes reflecting landscape functionality: riverine, depressional, slope, mineral soil flats, organic soil flats, tidal fringe, and lacustrine fringe (Smith et al., 2013). In order to be comparable with the Cowardin system, the HGM classification was modified into three classes matching the Cowardin classes present within the study area (riverine, lacustrine, and palustrine). Riverine and lacustrine classes exist within both classification systems; however, the palustrine class appears only in the Cowardin system. To better align the two systems, a "palustrine" class was created for the HGM by combining the depressional, slope, mineral soil flats, and organic soil flats definitions. The HGM palustrine class describes wetlands whose main water sources are groundwater or precipitation rather than flows from rivers or lakes. No geospatial layers using the HGM exist for this study area, so a wetlands database was created using ancillary data and the modified HGM classification system. This process is described below, and there are published examples of research that used GIS data to create an HGM dataset (Cedfeldt et al., 2000; Adamus et al., 2010; Van Deventer et al., 2016; Rivers-Moore et al., 2020).
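To summarize the merging just described, the correspondence between the two systems can be written as a simple lookup (a sketch; the exact class labels used in the tutorial script may differ).

    # Mapping from HGM classes to the three comparable Cowardin-style classes.
    # Labels are illustrative, not the identifiers used in the tutorial script.
    HGM_TO_COMPARABLE = {
        'riverine': 'riverine',
        'lacustrine fringe': 'lacustrine',
        'depressional': 'palustrine',
        'slope': 'palustrine',
        'mineral soil flats': 'palustrine',
        'organic soil flats': 'palustrine',
        # 'tidal fringe' has no Cowardin counterpart within this study area
    }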

In addition to inputting the data, the script you just ran also generates the HGM dataset. To do so, the code uses ancillary data including hydrography, hydric soils, and elevation layers. All rivers and lakes larger than eight hectares are given a 30 m buffer, and areas within this buffer with hydric, predominantly hydric, or partially hydric soils are classed as riverine. Areas adjacent to lakes with hydric, predominantly hydric, or partially hydric soils are classed as lacustrine. Elevation is used to locate topographic depressions and slopes greater than two percent; where these coincide with hydric, predominantly hydric, or partially hydric soils, they are classed as palustrine, along with all areas that have completely hydric soils.
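A minimal sketch of those rules is shown below, assuming boolean rasters on a common grid. All names are hypothetical, and the tutorial script may resolve overlapping rules in a different order.

    import numpy as np

    def classify_hgm(any_hydric, fully_hydric, water_buffer_30m,
                     lake_adjacent, depression_or_steep_slope):
        """Return an HGM class raster: 0 = none, 1 = riverine,
        2 = lacustrine, 3 = palustrine. All inputs are boolean arrays."""
        hgm = np.zeros(any_hydric.shape, dtype=np.uint8)
        # Palustrine: depressions or slopes > 2% on (at least partially)
        # hydric soils, plus all completely hydric areas.
        hgm[(depression_or_steep_slope & any_hydric) | fully_hydric] = 3
        # Lacustrine: areas adjacent to lakes with hydric soils.
        hgm[lake_adjacent & any_hydric] = 2
        # Riverine: hydric soils within the 30 m river/lake buffer.
        hgm[water_buffer_30m & any_hydric] = 1
        return hgm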

The HGM dataset was intended for comparability with NWI. It was not created with the goal to be a more accurate wetlands model. Several assumptions were made in generating the HGM dataset that may not be true in every geographic case. For example, it is not correct to assume that soils classed as hydric are always wetlands, or that wetlands adjacent to rivers and lakes are always riverine and lacustrine respectively. Wetland classification is more nuanced and requires on-the-ground validation for the highest accuracy. However, the purpose of this analysis is to demonstrate a method and data framework within which to explore multi-scale patterns of spatial uncertainty, rather than to establish a spatially precise model of wetlands within the study area.

Part 1C: Confusion Matrices

Understanding Confusion Matrices

Here we are using confusion matrices to compare the NWI and HGM wetlands classes. We are using the NWI data as the validation dataset and the HGM as the test dataset; this is not to imply that the NWI dataset is more or less reliable than the HGM dataset. Both wetland datasets incur uncertainties resulting from their compilation, processing, and temporal resolution. The confusion matrices simply allow us to identify the agreement and disagreement between the two datasets. Where the two wetland classification systems agree, we have higher certainty that the classification of the wetland presence or type is reliable. Conversely, where there is disagreement, there is higher uncertainty.
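As a sketch of how such a comparison can be encoded per pixel, the coarse case reduces to two boolean rasters combined into a two-digit code; this matches the raster values (0, 1, 10, 11) you will map later, although which digit encodes which dataset is an assumption here.

    import numpy as np

    # Hedged sketch: combine two presence/absence rasters into one coded
    # surface. 0 and 11 mark agreement (absence/presence); 1 and 10 mark
    # the two disagreement cases. The digit order is assumed, not taken
    # from the tutorial script.
    def coarse_confusion_surface(nwi_wetland, hgm_wetland):
        return 10 * nwi_wetland.astype(np.uint8) + hgm_wetland.astype(np.uint8)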

Understanding Evaluation Metrics

Within the confusion matrices, three evaluation metrics are used to quantify uncertainty: Recall, Precision, and F1 (Powers, 2011).
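These follow the standard definitions: Recall = TP / (TP + FN), Precision = TP / (TP + FP), and F1 is the harmonic mean of the two. A small illustrative function:

    def evaluation_metrics(tp, fp, fn):
        # tp, fp, fn: true positive, false positive, and false negative counts.
        recall = tp / (tp + fn)        # share of validation wetlands the test data found
        precision = tp / (tp + fp)     # share of test wetlands confirmed by validation data
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
        return recall, precision, f1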

Generating and Mapping Confusion Matrices

To check your outputs, navigate to your Part 1 Output folder and open Full_Keys.xlsx, which contains both your coarse and fine confusion matrices. The first tab in the Excel file contains your coarse confusion matrix and should look like this:

This coarse resolution assesses the presence versus absence of wetlands. The matrix shows that we have 12.04% true positives (green) in our study area and 59.94% true negatives (white); both refer to areas where NWI and HGM agree on the presence or absence of wetlands. We also have 4.41% false positives (red) and 23.62% false negatives (blue), both referring to areas where NWI and HGM disagree. Since these matrices use NWI as the validation dataset, false positives are areas where the HGM classifies a wetland while NWI does not, and false negatives are areas where the NWI classifies a wetland while HGM does not.

The second tab in your Excel file contains the fine resolution matrix and should look like this:

This fine resolution confusion matrix compares not only the presence and absence of wetlands, but also the agreement on the specific wetland type (riverine, lacustrine, or palustrine). The first column shows false negatives (wetlands in NWI but not in HGM) for all categories while the top row shows false positives (wetlands in HGM but not in NWI). The diagonal shows cells that agree on the presence/absence of wetlands in both datasets as well as on their specific type. The six gray cells indicate a misclassification at the finer level, namely that both classification systems agree that wetlands are present but disagree on the type. The percentage values in the matrix indicate the proportion of the pixels in each of the sixteen cases for the entire study area. The cells are color-coded, with hue (blue, green, purple) referring to wetland type. Saturation and value are used to distinguish false negatives and false positives.

Other outputs from Part 1 include:

Mapping the Confusion Matrices in ArcGIS
  1. Download the coarse_matrix.tif from your Google Drive to your computer. Open ArcMap and load in coarse_matrix.tif
  2. Open up the properties of the coarse_matrix layer and go to the Symbology tab.
  3. On the left side of the Properties window, click 'Unique Values'.
  4. A message will pop up saying 'Raster attribute table doesn't exist. Do you want to build attribute table?'
  5. Click 'Yes'.
  6. Towards the bottom left of the Properties window, click on the 'Colormap' button. A dropdown will appear; click on 'Import a colormap...'.
  7. In the file explorer window that pops up, navigate to Part1>Output>Colormaps. Within the Colormaps folder, double-click on Coarse Surface colors.clr.
  8. Click 'OK' to apply your changes and close the Properties window. You'll notice that the colors used correspond to the colors of your confusion matrix in Full_Keys.xlsx.
  9. To map at the fine resolution, repeat the steps above using fine_matrix.tif and Fine Surface Colors.clr.

Note: These instructions were made using ArcMap 10.8

Mapping the Confusion Matrices in QGIS
  1. Download the coarse_matrix.tif from your Google Drive to your computer. Open up QGIS and load in coarse_matrix.tif.
  2. Open up the layer properties of coarse_matrix and go to symbology.
  3. Under 'Render type', change the dropdown selection to 'Paletted/Unique values', then click the 'Classify' button. You should see the values '0', '1', '10', and '11' pop up.
  4. Change the colors to match the confusion matrix. Double click each color chip and change the HTML notation in the 'Select color' window that pops up. Then click 'OK' to move on to the next color.
    • Change the color chip for '0' to white (#FFFFFF).
    • Change the color chip for '1' to blue (#116DF7).
    • Change the color chip for '10' to red (#F71111).
    • Change the color chip for '11' to green (#11F737).
  5. Once you've changed all the colors, click 'Apply' to apply your changes and 'OK' to close the layer properties window.
  6. You have now mapped the coarse confusion matrix in QGIS!
  7. To map the fine matrix follow the steps above using the fine_matrix.tif and the following colors:
    • '0' - white (#FFFFFF)
    • '1' - dark blue (#0084A8)
    • '2' - dark purple (#8400A8)
    • '3' - dark green (#2D8700)
    • '10' - light blue (#BEE8FF)
    • '11' - medium blue (#00C3FF)
    • '13', '21', '23', '31', and '32' - all gray (#9C9C9C)
    • '20' - light purple (#E8BEFF)
    • '22' - medium purple (#C300FF)
    • '30' - light green (#D3FFBE)
    • '33' - medium green (#4CE600)

Note: These instructions were made using QGIS 3.16

Part 2: Building Pyramids

Understanding the Pyramid Data Framework

After creating the confusion matrix surfaces, a progressive focal window analysis is used to transform the coarse and fine uncertainty surfaces into a pyramid data framework. The area used for the scale analysis is a 200-pixel by 200-pixel section of the False River subset. Note that you can use a larger or smaller area to build the pyramids, but it must be a square area. Starting with a three-by-three-pixel window, focal windows are moved across the entire surface, and the desired metric is calculated for the pixels falling within each window. For each window, the calculated metric is stored in the center pixel of the window. After the entire surface has been processed, a single layer of the pyramid is complete. The analysis then iterates with increasingly larger focal window sizes, and the resulting calculations are stored in new layers of the pyramid data framework. As the focal window size increases, more of the edge areas are excluded and the number of pixels containing data decreases. This results in a pyramid shape with the full data layer on the bottom and a single cell at the top that encompasses the entire study area.
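A minimal sketch of this procedure, assuming a square numpy surface and an arbitrary per-window metric function, might look like the following. (The tutorial scripts likely use faster, vectorized routines; this naive double loop is for illustration only.)

    import numpy as np

    def build_pyramid(surface, metric):
        n = surface.shape[0]                        # assumes a square surface
        max_window = n if n % 2 == 1 else n - 1     # largest odd window that fits
        layers = []
        for w in range(3, max_window + 1, 2):       # 3x3, 5x5, 7x7, ...
            half = w // 2
            layer = np.full(surface.shape, np.nan)  # excluded edge pixels stay empty
            for i in range(half, n - half):
                for j in range(half, n - half):
                    window = surface[i - half:i + half + 1, j - half:j + half + 1]
                    layer[i, j] = metric(window)    # result stored at the center pixel
            layers.append(layer)
        return np.stack(layers)                     # layer k holds window size 2k + 3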

The pyramid data framework allows multi-scale summary statistics to be stored in a single unified structure that can then be visualized, summarized, and sliced. The pyramid can also be animated by sliding vertically through its layers. Summary statistics can be calculated for each slice, allowing you to easily compare the data at different scales. This allows for easy viewing of misclassification at multiple scales and determination of uncertainty patterns within the data across spatial and attribute resolutions.
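For instance, continuing the sketch above, a single focal scale can be pulled out and summarized like this (the surface and layer index are hypothetical stand-ins):

    surface = np.random.rand(200, 200)                # stand-in for a confusion surface
    pyramid = build_pyramid(surface, metric=np.mean)  # any per-window metric works
    slice10 = pyramid[10]                             # layer 10 = 23x23-pixel focal window
    print(np.nanmean(slice10))                        # scale-specific summary, edges ignored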

Running Code to Build the Pyramids

In this section of the tutorial we will run three separate Python scripts. We'll start with "Part2_HGM_NWI_Compare_subset.ipynb". Go ahead and open this script in your Google Drive.

Run through the blocks of code in the Colab notebook.

Note: You may need to mount your Drive again in the second code chunk and double check the paths in the third code chunk. Refer back to part 1 and follow the same steps.

Once you've run all blocks of code, you have clipped the study area to the square analysis area for creating the pyramids.

The next script we'll run is "Part2B_focal_stats_coarse_res.ipynb". Again, you may need to mount your Drive and double check the paths. Once you've done so, run all code chunks. This will create the pyramid at the coarse attribute uncertainty level.

The last script we'll run for Part 2 is "Part2C_focal_stats_fine_res.ipynb". As with the last script, you may need to mount your Drive and double-check the paths. Once you've done that, you can run the script to create the fine pyramids. This script may take a while to run; depending on your computer's speed, it could take around 20 minutes.

Congratulations! You are now done with the first two parts of the tutorial, and you have completed running all of the Python scripts. For part 3, we will turn to MATLAB.

Part 3: Visualization

In this final part of the tutorial, we will use MATLAB to visualize the pyramids. For this part we will turn from the cloud to your local files.

Using your file browser, navigate to where you stored the Part 3 folder and double-click on "Part3_Pyramid_model.m" to open the script in MATLAB. You may notice that all of the .mat files that were output in Part 2 have been duplicated into the Part 3 input folder and are serving as the inputs for this code. You shouldn't need to make any edits to the code for it to run successfully. However, you will need to add a folder connection to the Part 3 folder. You can do this in the left panel ("Current Folder") by navigating to your Part 3 folder, then right-clicking on the folder and clicking Add to Path > Selected Folders and Subfolders.

In your output folder, you should now have 35 files. These outputs include visualizations from the coarse attribute F1 pyramid, the coarse attribute Precision pyramid, the coarse attribute Recall pyramid, a fine attribute pyramid showing F1 scores for the palustrine class, and another fine attribute pyramid showing F1 scores for the non-wetland class. For each of these five pyramids you have 7 files: 5 .fig files that show slices of the pyramid (these can be viewed in MATLAB and exported as jpgs, pngs, pdfs, etc.), one .avi file that shows an animation of sliding through multiple slices of the pyramid, and one Excel sheet providing more detail about the F1, Recall, or Precision metrics. Note that files ending in "_75.fig" are scratch outputs and not meant for meaningful interpretation.

Below is an example of what your F1_slice5 should look like. Note that in the MATLAB interface you can click and drag to manipulate the perspective.

Congratulations! You have now completed the Multi-Scale Attribute Uncertainty tutorial.

Cited References

Adamus, P., Christy, J., Jones, A., McCune, M., and Bauer, J. (2010). A Geodatabase and Digital Characterization of Wetlands Mapped in the Willamette Valley With Particular Reference to Prediction of Their Hydrogeomorphic (HGM) Class [Report to USEPA Region 10].

Cedfeldt, P. T., Watzin, M. C., and Richardson, B. D. (2000). Using GIS to Identify Functionally Significant Wetlands in the Northeastern United States. Environmental Management, 26(1), 13–24. https://doi.org/10.1007/s002670010067

FGDC (Federal Geographic Data Committee). (2013). Classification of Wetlands and Deepwater Habitats of the United States (FGDC-STD-004-2013, 2nd ed.). Wetlands Subcommittee, Federal Geographic Data Committee and U.S. Fish and Wildlife Service.

Powers, D.M.W. (2011). Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness and Correlation. International Journal of Machine Learning Technology, 2(1), 37–63.

Rivers-Moore, N. A., Kotze, D. C., Job, N., and Mohanlal, S. (2020). Prediction of Wetland Hydrogeomorphic Type Using Morphometrics and Landscape Characteristics. Frontiers in Environmental Science, 8. https://doi.org/10.3389/fenvs.2020.00058

Smith, R. D., Noble, C. V., and Berkowitz, J. F. (2013). Hydrogeomorphic (HGM) Approach to Assessing Wetland Functions: Guidelines for Developing Guidebooks (Version 2). Defense Technical Information Center.

Van Deventer, H., Nel, J., Mbona, N., Job, N., Ewart-Smith, J., Snaddon, K., and Maherry, A. (2016). Desktop classification of inland wetlands for systematic conservation planning in data-scarce countries: Mapping wetland ecosystem types, disturbance indices and threatened species associations at country-wide scale. Aquatic Conservation: Marine and Freshwater Ecosystems, 26(1), 57–75. https://doi.org/10.1002/aqc.2605