Introduction

During one of my many procrastinating moments, I went down a rabbit hole that somehow landed me in Svalbard, the Arctic archipelago that belongs to Norway. I then decided to check the degree of urbanization of the archipelago according to Copernicus' Global Human Settlement Layer (GHSL), and to my surprise I found that the vast majority of settlements were missing, most notably Longyearbyen, the capital and largest settled area of Svalbard, where the famous Global Seed Vault is located.

[Figure: GHSL degree of urbanization over Svalbard, with the settlements missing]

I was frankly disappointed, but didn't think much more of it until I had to do the OBIA practical exam. Then, I had an idea! What if I created a classification model trained on other Norwegian settlements to assess whether or not Longyearbyen was in fact undetectable?

I opened the GHSL website and found that the "featured content" was none other than a newly released GHSL-Arctic dataset! I couldn't believe it! Someone had noticed the poor coverage of the Arctic and had fixed it! I had just lost my project... Or so I thought...

As I checked the new dataset, I was happy to discover that Svalbard was still missing! In fact, the dataset doesn't include any human settlements north of 80° latitude!

The following code is my adaptation of the 03_classification Jupyter Notebook to this specific case.

Setup

Importing the necessary libraries

In [1]:
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import rasterio
from rasterio.features import rasterize as feat_rasterize
from shapely.geometry import mapping

from scipy import stats
from skimage.color import label2rgb
from sklearn.ensemble import RandomForestClassifier
from skimage.feature import graycomatrix, graycoprops
from skimage.measure import regionprops_table
from skimage.segmentation import mark_boundaries, slic
from skimage.util import map_array
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

Data Preparation

The GHSL website provides a zipfile with several shapefiles, each containing the cells corresponding to one of the 4 degrees of urbanization.
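To double-check which layers the archive actually contains before picking one, the bundled layer names can be listed directly from the zip (a minimal sketch, assuming fiona is installed; the filename is the same one used below):

import fiona

# List the layers bundled in the GHSL zip (one per settlement typology).
# The "zip://" prefix lets fiona read directly from the archive.
ghsl_zip = "GHS_WUP_DEGURBA_E2025_GLOBE_R2025A_54009_1000_V1_0.zip"
for layer_name in fiona.listlayers(f"zip://{ghsl_zip}"):
    print(layer_name)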

Considering the size of Longyearbyen, I focused on the smallest degree, Rural Clusters (RCs), which by definition are settlements with at least 300 inhabitants per km² of permanent land and at least 500 people in total, criteria which I think Longyearbyen meets.
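As a rough back-of-the-envelope check of those criteria (the population and area figures below are my own assumptions for illustration, not GHSL values): Longyearbyen has on the order of 2,400 inhabitants concentrated in a settled area of only a few km².

# Back-of-the-envelope check of the RC criteria for Longyearbyen.
# Both figures are rough assumptions for illustration, not GHSL measurements.
population_est   = 2400   # approximate number of inhabitants
settled_area_km2 = 4.0    # assumed settled area in km2

density = population_est / settled_area_km2
meets_rc = density >= 300 and population_est >= 500
print(f"~{density:.0f} inhabitants/km2 -> meets RC criteria: {meets_rc}")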

In [2]:
rc_shape = gpd.read_file("GHS_WUP_DEGURBA_E2025_GLOBE_R2025A_54009_1000_V1_0.zip",
                         layer='GHS_WUP_DEGURBA_E2025_GLOBE_R2025A_54009_1000_RC_V1_0')
In [3]:
rc_shape
Out[3]:
ID_RC_G0 UNLocID AREA_km2 POP_2025 BU_km2_2025 PWCentroidX PWCentroidY Unique_ID geometry
0 1 876 7 1888.455901 0.272887 -1.736350e+07 -1.644222e+06 RC_1 POLYGON ((-17362000 -1645000, -17363000 -16450...
1 2 772 5 950.774888 0.012179 -1.717143e+07 -1.054307e+06 RC_2 POLYGON ((-17170000 -1055000, -17171000 -10550...
2 3 882 8 1780.333170 0.213592 -1.700950e+07 -1.665500e+06 RC_3 MULTIPOLYGON (((-17005000 -1664000, -17006000 ...
3 4 776 1 628.119092 0.035781 -1.686850e+07 -2.284500e+06 RC_4 POLYGON ((-16868000 -2285000, -16869000 -22850...
4 5 16 2 711.384614 0.075298 -1.678502e+07 -1.762500e+06 RC_5 POLYGON ((-16784000 -1763000, -16786000 -17630...
... ... ... ... ... ... ... ... ... ...
344316 344317 356 8 2914.931664 0.259803 9.193372e+06 1.483242e+06 RC_344317 POLYGON ((9194000 1484000, 9195000 1484000, 91...
344317 344318 356 1 657.597469 0.030337 9.219500e+06 1.137500e+06 RC_344318 POLYGON ((9220000 1137000, 9219000 1137000, 92...
344318 344319 356 8 4368.646821 0.210701 9.225500e+06 1.138500e+06 RC_344319 POLYGON ((9225000 1140000, 9226000 1140000, 92...
344319 344320 356 2 815.620894 0.020981 9.272658e+06 1.025500e+06 RC_344320 POLYGON ((9274000 1025000, 9272000 1025000, 92...
344320 344321 356 8 3051.952244 0.077766 9.318277e+06 9.926245e+05 RC_344321 POLYGON ((9318000 994000, 9319000 994000, 9319...

344321 rows × 9 columns

The following map shows the only settlement identified in Svalbard by the GHSL project.

In [4]:
rc_shape[rc_shape['UNLocID']==744].explore()
Out[4]:
[Interactive map output not shown]

The following map shows all the Rural Clusters in Norway identified by the GHSL project. I explored this map to locate areas with several rural clusters (and no larger clusters) on which to train my model.

In [5]:
rc_shape[rc_shape['UNLocID']==578].explore()
Out[5]:
[Interactive map output not shown]

Using the Copernicus Browser, I downloaded the GeoTIFF files for 3 training locations, all of them from close to the beginning of spring, which I believe is when the landscape looks the closest to the Svalbard summer.

In [6]:
# Training image A
B02_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B02_(Raw).tiff"  # Blue
B03_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B03_(Raw).tiff"  # Green
B04_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B04_(Raw).tiff"  # Red
B08_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B08_(Raw).tiff"  # NIR

# Training image B
B02_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B02_(Raw).tiff"
B03_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B03_(Raw).tiff"
B04_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B04_(Raw).tiff"
B08_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B08_(Raw).tiff"

# Training image C
B02_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B02_(Raw).tiff"
B03_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B03_(Raw).tiff"
B04_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B04_(Raw).tiff"
B08_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B08_(Raw).tiff"

# Inference image
B02_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B02_(Raw).tiff"
B03_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B03_(Raw).tiff"
B04_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B04_(Raw).tiff"
B08_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B08_(Raw).tiff"
In [7]:
def read_and_stack(b02, b03, b04, b08):
    """
    Reads four Sentinel-2 single-band GeoTIFFs (B02 Blue, B03 Green, B04 Red, B08 NIR).
    Returns blue, green, red, nir, rgb, rgb_mean, ndvi, full_stack, profile.
    """
    def read_band(path):
        with rasterio.open(path) as src:
            return src.read(1).astype(float), src.profile

    red,   profile = read_band(b04)
    blue,  _       = read_band(b02)
    green, _       = read_band(b03)
    nir,   _       = read_band(b08)

    rgb        = np.dstack([red, green, blue])
    rgb_mean   = np.mean(rgb, axis=2)
    ndvi       = (nir - red) / (nir + red + 1e-6)
    full_stack = np.dstack([red, green, blue, nir, rgb_mean, ndvi])

    return blue, green, red, nir, rgb, rgb_mean, ndvi, full_stack, profile
In [8]:
NDVI_THRS = 0.1

def plot_rgb_overview(rgb, ndvi, title_suffix=""):
    """Plot RGB, mean reflectance, and negative-NDVI."""
    rgb_disp = np.clip(rgb * 3, 0, 1)
    rgb_d_mean = np.mean(rgb_disp, axis=2)
    
    fig, axs = plt.subplots(ncols=3, figsize=(15, 5), constrained_layout=True)
    
    axs[0].imshow(rgb_disp)
    axs[0].set_title(f"Original RGB {title_suffix}")
    
    axs[1].imshow(rgb_d_mean, cmap='gray')
    axs[1].set_title("RGB Mean Reflectance")
    
    axs[2].imshow(np.where(ndvi < NDVI_THRS, ndvi, np.nan), cmap="plasma")
    axs[2].set_title("NDVI (negative only)")
    
    for ax in axs:
        ax.set_axis_off()
        #OR ax.axis("off")
    # plt.tight_layout() not needed because of "constrained_layout=True"
    plt.show()
In [9]:
blue, green, red, nir, rgb, rgb_mean, ndvi, full_stack, profile = read_and_stack(
    B02_train, B03_train, B04_train, B08_train
)
plot_rgb_overview(rgb, ndvi, title_suffix="(March 2025) - Location A")


blue_b, green_b, red_b, nir_b, rgb_b, rgb_mean_b, ndvi_b, full_stack_b, profile_b = read_and_stack(
    B02_train_b, B03_train_b, B04_train_b, B08_train_b
)
plot_rgb_overview(rgb_b, ndvi_b, title_suffix="(April 2025 - Location B)")


blue_c, green_c, red_c, nir_c, rgb_c, rgb_mean_c, ndvi_c, full_stack_c, profile_c = read_and_stack(
    B02_train_c, B03_train_c, B04_train_c, B08_train_c
)
plot_rgb_overview(rgb_c, ndvi_c, title_suffix="(April 2025 - Location C)")
[Figures: original RGB, RGB mean reflectance, and thresholded NDVI for training Locations A, B, and C]

Image Segmentation

For the image segmentation, a relatively low number of segments was chosen with the goal of getting segments of approximately the same size as the GHSL raster cells, thus preventing non-urban segments from being taken into account in the training.
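A quick order-of-magnitude check of that choice (a sketch assuming the exported bands are at the native ~10 m Sentinel-2 resolution): with 200 segments, the average segment area can be compared against the roughly 1 km² footprint of a GHSL cell.

# Rough check that SLIC segments are comparable in size to a 1 km GHSL cell.
# Assumes the bands were exported at the native ~10 m Sentinel-2 resolution.
h, w = full_stack.shape[:2]
pixels_per_segment = (h * w) / 200            # 200 = N_SEGMENTS chosen below
pixels_per_ghsl_cell = (1000 / 10) ** 2       # 1 km cell at 10 m/pixel
print(f"Average pixels per segment : {pixels_per_segment:,.0f}")
print(f"Pixels per 1 km GHSL cell  : {pixels_per_ghsl_cell:,.0f}")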

In [10]:
COMPACTNESS = 0.2
N_SEGMENTS  = 200

segments = slic(
    full_stack[:, :, :4],   # R, G, B, NIR
    n_segments=N_SEGMENTS,
    compactness=COMPACTNESS,
    start_label=1,
    channel_axis=-1,
)

segments_b = slic(
    full_stack_b[:, :, :4],
    n_segments=N_SEGMENTS, compactness=COMPACTNESS,
    start_label=1, channel_axis=-1,
)

segments_c = slic(
    full_stack_c[:, :, :4],
    n_segments=N_SEGMENTS, compactness=COMPACTNESS,
    start_label=1, channel_axis=-1,
)
In [11]:
def plot_segmentation(rgb, segments, title_suffix=""):
    """Plot RGB, SLIC boundaries, and mean-colour segments."""
    fig, axs = plt.subplots(ncols=3, figsize=(15, 5), constrained_layout=True)
    rgb_disp = np.clip(rgb * 3, 0, 1)
    axs[0].imshow(rgb_disp)
    axs[0].set_title(f"Original RGB {title_suffix}")
    
    axs[1].imshow(mark_boundaries(rgb_disp, segments, color=(1, 0, 0), mode="thick"))
    axs[1].set_title(f"SLIC Boundaries ({N_SEGMENTS} segments)")
    
    axs[2].imshow(label2rgb(segments, rgb_disp, kind='avg'))
    axs[2].set_title("Mean RGB per Segment")
    
    for ax in axs:
        ax.set_axis_off()
    plt.show()
In [12]:
plot_segmentation(rgb,   segments,   title_suffix="(Location A)")
plot_segmentation(rgb_b, segments_b, title_suffix="(Location B)")
plot_segmentation(rgb_c, segments_c, title_suffix="(Location C)")
[Figures: original RGB, SLIC boundaries, and mean RGB per segment for Locations A, B, and C]

Feature Extraction

In the original notebook, there were Spectral Features, Shape Features and Textural Features.

That being said, I don't think shape features are relevant in this situation, so I deleted them. On the other hand, I think texture is paramount for detecting settlements, so I wanted to expand on it. Thus, inspired by eCognition's texture features based on gray-level co-occurrence matrices (GLCM), I used AI to create the function calc_glcm_feats, which calculates the GLCM and some texture-related features from it. Afterwards, I simply updated the function calc_all_feats.
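Before the full function, here is a tiny standalone example of what graycomatrix and graycoprops compute, on a toy 4×4 patch with 4 grey levels (unrelated to the actual imagery):

# Toy GLCM example: contrast and homogeneity of a small blocky patch.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

toy_patch = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [2, 2, 3, 3],
                      [2, 2, 3, 3]], dtype=np.uint8)

toy_glcm = graycomatrix(toy_patch, distances=[1], angles=[0],
                        levels=4, symmetric=True, normed=True)
print("contrast   :", graycoprops(toy_glcm, "contrast")[0, 0])
print("homogeneity:", graycoprops(toy_glcm, "homogeneity")[0, 0])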

In [13]:
# Define custom features
def std(regionmask, intensity_img):
    vals = intensity_img[regionmask]
    std = np.std(vals)
    return std

def entropy_ndvi(regionmask, intensity_img):
    vals = intensity_img[regionmask]
    arr = stats.relfreq(vals, 100, defaultreallimits=(-1,1))[0]
    return stats.entropy(arr)

def calc_glcm_feats(seg_arr, band, n_levels=32):
    """
    Compute GLCM texture features per segment from a single float band.
    The band is quantized to `n_levels` grey levels before GLCM computation.
    Features computed (averaged over 4 angles):
      contrast, dissimilarity, homogeneity, energy, correlation.
    Returns a DataFrame with one row per segment, indexed by label.
    """
    # Quantize float band to [0, n_levels-1] integers
    band_min, band_max = np.nanmin(band), np.nanmax(band)
    band_q = ((band - band_min) / (band_max - band_min + 1e-6) * (n_levels - 1)).astype(np.uint8)

    PROPS    = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation"]
    ANGLES   = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    records  = []

    for lbl in np.unique(seg_arr):
        mask = seg_arr == lbl
        rows, cols = np.where(mask)
        # Extract the bounding-box patch and re-apply the segment mask
        r0, r1 = rows.min(), rows.max() + 1
        c0, c1 = cols.min(), cols.max() + 1
        patch      = band_q[r0:r1, c0:c1]
        patch_mask = mask[r0:r1, c0:c1]
        # Zero-out pixels outside the segment so they don't influence the GLCM
        patch_masked = patch * patch_mask

        glcm = graycomatrix(
            patch_masked,
            distances=[1],
            angles=ANGLES,
            levels=n_levels,
            symmetric=True,
            normed=True,
        )
        row = {"label": lbl}
        for prop in PROPS:
            # Average across the 4 directions
            row[f"glcm_{prop}"] = graycoprops(glcm, prop).mean()
        records.append(row)

    return pd.DataFrame(records).set_index("label")

# Calculate set of various features for each segment
def calc_all_feats(seg_arr, img_arr):
    spectral_feats = pd.DataFrame(
        regionprops_table(
            label_image = seg_arr,
            intensity_image = img_arr,
            properties = ["label", "intensity_mean"],
            extra_properties=(std,)
        )
    )

    glcm_ndvi = calc_glcm_feats(seg_arr, img_arr[:, :, 5])  # ndvi
    #glcm_ndvi = glcm_ndvi.add_prefix("ndvi_")

    all_feats = pd.concat([spectral_feats, glcm_ndvi.reset_index(drop=True)], axis=1)

    return all_feats

all_feats_df = calc_all_feats(segments, full_stack).rename(columns={"label": "indx"})
display(all_feats_df)

all_feats_dfb = calc_all_feats(segments_b, full_stack_b).rename(columns={"label": "indx"})
display(all_feats_dfb)

all_feats_dfc = calc_all_feats(segments_c, full_stack_c).rename(columns={"label": "indx"})
display(all_feats_dfc)
indx intensity_mean-0 intensity_mean-1 intensity_mean-2 intensity_mean-3 intensity_mean-4 intensity_mean-5 std-0 std-1 std-2 std-3 std-4 std-5 glcm_contrast glcm_dissimilarity glcm_homogeneity glcm_energy glcm_correlation
0 1 0.071640 0.052787 0.029106 0.241990 0.051178 0.566610 0.037230 0.030309 0.021343 0.070871 0.029284 0.136021 5.451725 0.737579 0.750174 0.262824 0.901578
1 2 0.064042 0.043543 0.024263 0.221678 0.043949 0.553098 0.011181 0.008178 0.006266 0.025226 0.008236 0.045069 7.840038 0.588701 0.855321 0.457301 0.901150
2 3 0.048379 0.033870 0.017354 0.184263 0.033201 0.536546 0.023256 0.015747 0.010597 0.072429 0.016319 0.277504 14.174282 1.078429 0.780053 0.307278 0.900750
3 4 0.074841 0.058074 0.031441 0.253538 0.054785 0.560902 0.038124 0.032227 0.022139 0.089632 0.030469 0.148008 11.069375 0.880109 0.779393 0.274161 0.938989
4 5 0.042044 0.037472 0.022710 0.188383 0.034075 0.705444 0.043971 0.043945 0.044740 0.106697 0.042923 0.208738 11.699771 0.963576 0.772969 0.305575 0.956395
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183 184 0.032865 0.029869 0.014336 0.185172 0.025690 0.733661 0.031528 0.028062 0.024633 0.094294 0.027387 0.138322 14.107051 1.008230 0.767309 0.279035 0.944915
184 185 0.061046 0.053822 0.032451 0.213946 0.049106 0.591433 0.045730 0.037407 0.033315 0.076217 0.038274 0.240977 13.421409 1.154545 0.727196 0.304916 0.948953
185 186 0.197120 0.200688 0.199302 0.285928 0.199036 0.382574 0.236477 0.244044 0.254861 0.205644 0.244339 0.300363 6.950121 0.719906 0.820526 0.544054 0.970572
186 187 0.149632 0.143388 0.131838 0.302950 0.141619 0.384600 0.121546 0.124361 0.134049 0.157620 0.124565 0.194233 14.324812 1.296180 0.689059 0.270736 0.921721
187 188 0.139780 0.137266 0.130437 0.283210 0.135828 0.419480 0.116515 0.116290 0.123110 0.101638 0.117860 0.235025 14.828511 1.244614 0.696907 0.271513 0.925740

188 rows × 18 columns

indx intensity_mean-0 intensity_mean-1 intensity_mean-2 intensity_mean-3 intensity_mean-4 intensity_mean-5 std-0 std-1 std-2 std-3 std-4 std-5 glcm_contrast glcm_dissimilarity glcm_homogeneity glcm_energy glcm_correlation
0 1 0.040683 0.040316 0.029939 0.140780 0.036979 0.303392 0.045674 0.037080 0.025620 0.112640 0.035834 0.613515 13.924187 1.102956 0.765986 0.318090 0.946186
1 2 0.056307 0.052449 0.036187 0.205401 0.048315 0.599618 0.042175 0.033301 0.024456 0.082319 0.032858 0.149936 10.380624 0.786067 0.808533 0.321097 0.958390
2 3 0.049596 0.044437 0.030381 0.183224 0.041471 0.603113 0.035681 0.028795 0.023388 0.060694 0.029011 0.171437 10.029002 0.808563 0.798883 0.349245 0.963234
3 4 0.009939 0.011591 0.011218 0.031605 0.010916 -0.484441 0.018631 0.015999 0.012486 0.063544 0.015461 0.680750 10.976221 0.949948 0.820357 0.672592 0.937595
4 5 0.032757 0.036561 0.030094 0.119933 0.033137 0.055771 0.060384 0.059971 0.058504 0.117896 0.059246 0.757562 19.805024 1.324068 0.799356 0.543871 0.930963
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
127 128 0.116532 0.120229 0.114426 0.211685 0.117063 0.391008 0.145178 0.149220 0.156098 0.119935 0.149592 0.258354 6.093505 0.737889 0.798760 0.487208 0.974406
128 129 0.091154 0.073181 0.054508 0.204344 0.072948 0.477542 0.097200 0.097248 0.097600 0.080541 0.096933 0.211019 8.554255 0.959591 0.744550 0.292816 0.960137
129 130 0.028163 0.030655 0.020437 0.192273 0.026419 0.761813 0.025943 0.021942 0.021197 0.047839 0.022599 0.147332 8.180814 0.844616 0.740158 0.244156 0.920213
130 131 0.868136 0.885828 0.887493 0.746162 0.880486 -0.073232 0.145006 0.149048 0.151669 0.108680 0.147934 0.025732 6.395790 0.488429 0.946890 0.647035 0.934632
131 132 0.088343 0.065940 0.045279 0.215221 0.066520 0.481556 0.067666 0.063336 0.061399 0.064789 0.063427 0.202095 8.994555 1.106779 0.696177 0.215149 0.942035

132 rows × 18 columns

indx intensity_mean-0 intensity_mean-1 intensity_mean-2 intensity_mean-3 intensity_mean-4 intensity_mean-5 std-0 std-1 std-2 std-3 std-4 std-5 glcm_contrast glcm_dissimilarity glcm_homogeneity glcm_energy glcm_correlation
0 1 0.016637 0.025278 0.027147 0.056022 0.023020 0.108206 0.019296 0.012977 0.009075 0.076907 0.013254 0.409721 11.793564 1.364605 0.635602 0.215959 0.923380
1 2 0.069913 0.057683 0.047698 0.181016 0.058431 0.453877 0.022906 0.016113 0.015306 0.029184 0.017907 0.121274 10.815736 0.935726 0.758733 0.281217 0.915856
2 3 0.080284 0.070108 0.057321 0.204387 0.069238 0.442502 0.029267 0.024818 0.025919 0.044272 0.025723 0.146209 10.065181 0.862597 0.777439 0.282701 0.935144
3 4 0.007666 0.021657 0.028699 0.006080 0.019340 -0.326074 0.013616 0.012121 0.011215 0.018798 0.012094 0.284224 7.476368 1.563027 0.532214 0.183244 0.869041
4 5 0.007986 0.022393 0.029691 0.006350 0.020023 -0.179116 0.011343 0.011281 0.011388 0.013526 0.011128 0.114820 3.622412 1.203186 0.557541 0.194046 0.845637
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
194 195 0.037315 0.042579 0.032033 0.213036 0.037309 0.705778 0.016590 0.014480 0.009608 0.054728 0.013154 0.090292 10.182391 0.688629 0.839204 0.385915 0.965705
195 196 0.006306 0.017024 0.025197 0.005850 0.016176 -0.038284 0.000813 0.001002 0.001289 0.000833 0.000765 0.072839 4.129788 1.043011 0.621122 0.249697 0.891336
196 197 0.011095 0.023204 0.030226 0.009686 0.021508 -0.068352 0.001608 0.001895 0.002004 0.001479 0.001692 0.046489 3.220775 0.746875 0.718577 0.386312 0.674464
197 198 0.006460 0.017780 0.025871 0.005701 0.016704 -0.063079 0.000816 0.001018 0.001321 0.000799 0.000778 0.074421 3.952067 1.030822 0.624372 0.249532 0.908164
198 199 0.030258 0.030650 0.025012 0.138290 0.028640 0.673509 0.023077 0.017548 0.013081 0.071794 0.017708 0.137887 14.792166 1.087593 0.736914 0.241177 0.920199

199 rows × 18 columns

Image Classification

Labeling

With the help of AI, I created the functions rasterize_rc and label_segments to turn the RC polygons into a raster that could then be used in the labeling process.

Furthermore, I completely excluded from the training any low-NDVI segments that had been identified as Rural Clusters (to, hopefully, drop water segments that were erroneously labeled), as well as segments that intersected the RC polygons but had less than 50% of their area inside them (since these would, in my opinion, introduce ambiguity).
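A tiny toy example of the overlap rule (hypothetical 3×3 arrays, unrelated to the real rasters): since the rasterized RC layer is binary, averaging it over a segment mask gives exactly the fraction of that segment covered by RC polygons.

# Toy illustration: mean of a binary RC raster over a segment mask = overlap fraction.
import numpy as np

toy_rc_raster = np.array([[1, 1, 0],
                          [1, 0, 0],
                          [0, 0, 0]], dtype=np.uint8)
toy_segments  = np.array([[1, 1, 2],
                          [1, 1, 2],
                          [2, 2, 2]])

for lbl in np.unique(toy_segments):
    frac = toy_rc_raster[toy_segments == lbl].mean()
    print(f"segment {lbl}: {frac:.0%} of its pixels fall inside RC polygons")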

In [14]:
def rasterize_rc(rc_shapefile, profile, un_loc_id):
    """
    Burn a GeoDataFrame of RC polygons onto the raster grid defined by `profile`.
    Returns a binary uint8 array: 1 inside an RC polygon, 0 elsewhere.
    """
    rc_filtered = rc_shapefile[rc_shapefile['UNLocID'] == un_loc_id].copy().to_crs(profile['crs'])
    burn_shapes = [
        (mapping(geom), 1)
        for geom in rc_filtered.geometry
        if geom is not None and not geom.is_empty
    ]
    return feat_rasterize(
        burn_shapes,
        out_shape=(profile['height'], profile['width']),
        transform=profile['transform'],
        fill=0,
        dtype='uint8',
    )

def label_segments(segments, rc_raster, ndvi, rc_threshold=0.5, ndvi_min=NDVI_THRS):
    """
    Label segments for training with strict overlap rules to avoid ambiguity:
      - class 1 (RC)     : >= `rc_threshold` of pixels inside an RC polygon
                           AND mean NDVI >= `ndvi_min`
      - class 0 (non-RC) : exactly 0% of pixels inside an RC polygon
      - excluded (NaN)   : anything in between (ambiguous border segments)
    Returns a dict {segment_label: class}, with ambiguous segments omitted.
    """
    result = {}
    for lbl in np.unique(segments):
        mask     = (segments == lbl)
        rc_frac  = rc_raster[mask].mean()
        if rc_frac >= rc_threshold and ndvi[mask].mean() >= ndvi_min:
            result[lbl] = 1
        elif rc_frac == 0.0:
            result[lbl] = 0
        # else: ambiguous β€” omit from training
    return result

rc_raster = rasterize_rc(rc_shape, profile, 578)
rc_raster_b = rasterize_rc(rc_shape, profile_b, 578)
rc_raster_c = rasterize_rc(rc_shape, profile_c, 578)

seg_class_map = label_segments(segments, rc_raster, ndvi)
all_feats_df["class"] = all_feats_df["indx"].map(seg_class_map)

seg_class_map_b = label_segments(segments_b, rc_raster_b, ndvi_b)
all_feats_dfb["class"] = all_feats_dfb["indx"].map(seg_class_map_b)

seg_class_map_c = label_segments(segments_c, rc_raster_c, ndvi_c)
all_feats_dfc["class"] = all_feats_dfc["indx"].map(seg_class_map_c)
In [15]:
print(f"  Rural cluster segments : {(all_feats_df['class'] == 1).sum()}")
print(f"  Other segments         : {(all_feats_df['class'] == 0).sum()}")
print(f"  Excluded (ambiguous)   : {(np.isnan(all_feats_df['class'])).sum()}")
print(f"  Missing                : {len(set(np.unique(segments)) - set(all_feats_df["indx"]))}")

print('\n')

print(f"  Rural cluster segments : {(all_feats_dfb['class'] == 1).sum()}")
print(f"  Other segments         : {(all_feats_dfb['class'] == 0).sum()}")
print(f"  Excluded (ambiguous)   : {(np.isnan(all_feats_df['class'])).sum()}")
print(f"  Missing                : {len(set(np.unique(segments_b)) - set(all_feats_dfb["indx"]))}")

print('\n')

print(f"  Rural cluster segments : {(all_feats_dfc['class'] == 1).sum()}")
print(f"  Other segments         : {(all_feats_dfc['class'] == 0).sum()}")
print(f"  Excluded (ambiguous)   : {(np.isnan(all_feats_df['class'])).sum()}")
print(f"  Missing                : {len(set(np.unique(segments_c)) - set(all_feats_dfc["indx"]))}")
  Rural cluster segments : 7
  Other segments         : 166
  Excluded (ambiguous)   : 15
  Missing                : 0


  Rural cluster segments : 6
  Other segments         : 114
  Excluded (ambiguous)   : 15
  Missing                : 0


  Rural cluster segments : 5
  Other segments         : 169
  Excluded (ambiguous)   : 15
  Missing                : 0
In [16]:
binary_cmap = mcolors.ListedColormap(['#7f8282', '#00008b', '#00FF00'])

def plot_labels(rgb, rc_raster, segments, feats_df, title_suffix=""):
    """Plot RGB, RC Raster, and segment labels."""
    rgb_disp = np.clip(rgb * 3, 0, 1)
    mapped = map_array(
        segments,
        np.array(feats_df["indx"]),
        np.array(feats_df["class"]),
    )
    
    fig, axs = plt.subplots(1, 3, figsize=(18, 6), constrained_layout=True)
    axs[0].imshow(rgb_disp)
    axs[0].set_title(f"RGB {title_suffix}")
    
    axs[1].imshow(rc_raster, cmap="Reds")
    axs[1].set_title("RC raster (GHS)")
    
    axs[2].imshow(mapped, cmap=binary_cmap, interpolation="nearest")
    axs[2].set_title("Segment labels  green=RC  grey=other")
    
    for ax in axs:
        ax.set_axis_off()
    plt.show()
    return mapped

mapped_labels   = plot_labels(rgb,   rc_raster, segments,   all_feats_df,  title_suffix="(March 2025)")
mapped_labels_b = plot_labels(rgb_b, rc_raster_b, segments_b, all_feats_dfb, title_suffix="(April 2025)")
mapped_labels_c = plot_labels(rgb_c, rc_raster_c, segments_c, all_feats_dfc, title_suffix="(April 2025)")
[Figures: RGB, GHS RC raster, and segment labels (green = RC, grey = other) for the three training locations]
In [17]:
all_feats_final = pd.concat([all_feats_df.drop(columns="indx"),
                             all_feats_dfb.drop(columns="indx"),
                             all_feats_dfc.drop(columns="indx")]).dropna(subset=["class"]).copy()

all_feats_final["class"] = all_feats_final["class"].astype(int)

Train Random Forest Classifier

In [18]:
feature_cols = [c for c in all_feats_final.columns if c !="class"]
X = all_feats_final[feature_cols]
y = all_feats_final["class"]

# Train the Random Forest Classifier using the entire dataset
rfc_clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rfc_clf.fit(X, y)

# Evaluate the model using OOB score
print("OOB Score:", rfc_clf.oob_score_)
OOB Score: 0.9635974304068522
In [19]:
feat_imp = pd.Series(rfc_clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("Top features:\n", feat_imp.head(5).to_string())
Top features:
 std-2                 0.090977
glcm_dissimilarity    0.087306
intensity_mean-1      0.083919
intensity_mean-4      0.076454
glcm_homogeneity      0.075908
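
For a fuller picture than just the top 5, the whole ranking could also be plotted (a minimal optional sketch reusing the feat_imp series computed above):

# Optional: visualise the full feature-importance ranking.
fig, ax = plt.subplots(figsize=(6, 5), constrained_layout=True)
feat_imp.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Random Forest feature importance")
plt.show()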

I was happy to see that two of my GLCM features were among the top 5 most important for the model. That being said, the actual results weren't as encouraging...

In [20]:
blue_i, green_i, red_i, nir_i, rgb_i, rgb_mean_i, ndvi_i, full_stack_i, profile_i = read_and_stack(
    B02_infer, B03_infer, B04_infer, B08_infer
)
plot_rgb_overview(rgb_i, ndvi_i)
[Figure: original RGB, RGB mean reflectance, and thresholded NDVI for the inference image]
In [21]:
print("Running SLIC…")
segments_i = slic(
    full_stack_i[:, :, :4],
    n_segments=N_SEGMENTS, compactness=COMPACTNESS,
    start_label=1, channel_axis=-1,
)

print("Extracting features…")
feats_i = calc_all_feats(segments_i, full_stack_i).rename(columns={"label": "indx"})
feats_i["rfc_pred"] = rfc_clf.predict(feats_i[feature_cols])

seg_labels  = np.array(feats_i["indx"])

mapped_infer = map_array(segments_i, seg_labels, np.array(feats_i["rfc_pred"]))

fig, ax = plt.subplots(figsize=(6, 4), constrained_layout=True)

ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
ax.imshow(mapped_infer, cmap=binary_cmap, alpha=0.5, interpolation="nearest")
ax.set_title("RFC prediction  green=RC")

ax.set_axis_off()
plt.show()
Running SLIC…
Extracting features…
[Figure: RFC prediction over the inference image (green = RC)]
In [22]:
feats_i["rfc_prob"] = rfc_clf.predict_proba(feats_i[feature_cols])[:, 1]

mapped_prob = map_array(segments_i, seg_labels, np.array(feats_i["rfc_prob"]))

vmax = feats_i["rfc_prob"].max()

fig, ax = plt.subplots(figsize=(7, 5), constrained_layout=True)
ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
prob_plot = ax.imshow(mapped_prob, cmap="plasma", alpha=0.5, vmin=0, vmax=vmax, interpolation="nearest")
plt.colorbar(prob_plot, ax=ax)
ax.set_title("RC probability")
ax.axis("off")
plt.show()
[Figure: RC probability map from the Random Forest classifier]

This probability heat map was especially worrying, considering the actual settlement segments aren't the ones with the highest probability!
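One way to dig into that would be to list the segments the model is most confident about and compare their features with the labeled RC segments (a small diagnostic sketch using the feats_i table from above):

# Inspect the segments with the highest predicted RC probability.
top_segments = feats_i.sort_values("rfc_prob", ascending=False).head(5)
print(top_segments[["indx", "rfc_prob", "intensity_mean-5", "glcm_dissimilarity"]])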

Train Support Vector Machines

Once again, taking inspiration from eCognition, I decided to try and implement a classifier based on Support Vector Machines (SVMs), which have always yielded better results on my other assignments. The largest difference compared to setting up the Random Forest Classifier is that SVMs don't «perform well when the input numerical attributes have very different scales», which is why I had to apply some feature scaling.
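To see just how different those scales are before fitting, the raw feature ranges can be inspected (a quick sketch using the training matrix X built earlier):

# Compare raw feature ranges: reflectance means stay within [0, 1] or so,
# while GLCM contrast reaches values above 10, hence the need for scaling.
feature_ranges = pd.DataFrame({"min": X.min(), "max": X.max()})
print(feature_ranges.sort_values("max", ascending=False).head(8))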

In [23]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X).copy()

# Instantiate the classifier model
svm_classifier = SVC(probability=True, class_weight="balanced", random_state=42)
# Fit the model with the training data
svm_classifier.fit(X_scaled, y)
Out[23]:
SVC(class_weight='balanced', probability=True, random_state=42)
In [24]:
# Scale new data the same way
X_new_scaled = scaler.transform(feats_i[feature_cols])

# Predict
feats_i["svc_pred"] = svm_classifier.predict(X_new_scaled)
feats_i["svc_prob"] = svm_classifier.predict_proba(X_new_scaled)[:, 1]
In [25]:
mapped_infer = map_array(segments_i, seg_labels, np.array(feats_i["svc_pred"]))

fig, ax = plt.subplots(figsize=(6, 4), constrained_layout=True)

ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
ax.imshow(mapped_infer, cmap=binary_cmap, alpha=0.5, interpolation="nearest")
ax.set_title("SVM prediction  green=RC")

ax.set_axis_off()
plt.show()
[Figure: SVM prediction over the inference image (green = RC)]
In [26]:
mapped_prob = map_array(segments_i, seg_labels, np.array(feats_i["svc_prob"]))

vmax = feats_i["svc_prob"].max()

fig, ax = plt.subplots(figsize=(7, 5), constrained_layout=True)
ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
prob_plot = ax.imshow(mapped_prob, cmap="plasma", alpha=0.7, vmin=0, vmax=vmax, interpolation="nearest")
plt.colorbar(prob_plot, ax=ax)
ax.set_title("RC probability")
ax.axis("off")
plt.show()
[Figure: RC probability map from the SVM classifier]
In [27]:
feats_i["svc_pred2"] = (feats_i["svc_prob"] > 0.2).astype(int)

mapped_infer2 = map_array(segments_i, seg_labels, np.array(feats_i["svc_pred2"]))

fig, ax = plt.subplots(figsize=(6, 4), constrained_layout=True)

ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
ax.imshow(mapped_infer2, cmap=binary_cmap, alpha=0.5, interpolation="nearest")
ax.set_title("SVM prediction  green=RC")

ax.set_axis_off()
plt.show()
[Figure: SVM prediction with the lowered 0.2 probability threshold (green = RC)]

Conclusion

Even with the SVM, the results are not perfect, considering there's a completely unrelated false positive in the top-right corner of the image. Still, there are plenty of tweaks that could still be made to improve the results, such as increasing the number of training samples, calculating the GLCM texture features for other layers besides NDVI, or exploring the non-linear kernels of the sklearn.svm library, something which honestly seems very promising. Nonetheless, this should serve as proof that it is possible to detect settlements in Svalbard with a simple algorithm such as this one, and that Longyearbyen deserves better than being forgotten in a dataset whose main purpose is charting the Arctic!
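
As a pointer for that last tweak, here is a hedged sketch of what a kernel search could look like with scikit-learn's GridSearchCV (the parameter grid is just an illustrative guess, not a tuned configuration):

# Sketch: cross-validated search over SVC kernels and regularisation strength.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = {
    "svc__kernel": ["rbf", "poly", "sigmoid"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", "auto"],
}
pipe = make_pipeline(StandardScaler(),
                     SVC(probability=True, class_weight="balanced", random_state=42))
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV F1     :", search.best_score_)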