Introduction¶
During one of my many procrastinating moments, I went down a rabbit hole that somehow landed me in Svalbard, the Arctic archipelago that belongs to Norway. I then decided to check the degree of urbanization of the archipelago according to Copernicus' Global Human Settlement Layer (GHSL), and to my surprise I found that the vast majority of settlements were missing, most notably Longyearbyen, the capital and largest settled area of Svalbard, home to the famous Global Seed Vault.
I was frankly disappointed, but didn't think much more of it until I had to do OBIA's practical exam. Then I had an idea: what if I trained a classification model on other Norwegian settlements to assess whether or not Longyearbyen was in fact undetectable?
I opened the GHSL website and found that the "featured content" was none other than a newly released GHSL-Arctic dataset! I couldn't believe it! Someone had noticed the poor coverage of the Arctic and had fixed it! I had just lost my project... Or so I thought...
As I checked the new dataset, I was happy to discover that Svalbard was still missing! In fact, the dataset doesn't include any human settlements north of 80° of latitude!
The following code is my adaptation of the 03_classification Jupyter Notebook to this specific case.
Setup¶
Importing the necessary libraries
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import rasterio
from rasterio.features import rasterize as feat_rasterize
from shapely.geometry import mapping
from scipy import stats
from skimage.color import label2rgb
from sklearn.ensemble import RandomForestClassifier
from skimage.feature import graycomatrix, graycoprops
from skimage.measure import regionprops_table
from skimage.segmentation import mark_boundaries, slic
from skimage.util import map_array
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
Data Preparation¶
The GHSL website provides a zipfile with several shapefiles, each containing the cells corresponding to one of the 4 degrees of urbanization.
Considering the size of Longyearbyen, I focused on the smallest degree, Rural Clusters (RCs), which by definition are settlements with at least 300 inhabitants per km2 of permanent land and at least 500 people in total, characteristics which I think Longyearbyen meets.
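As a quick, illustrative sanity check on that definition (the numbers below are rough assumptions for Longyearbyen, not GHSL values):

```python
# Illustrative check of the RC definition. Population and settled area are
# rough assumptions (Longyearbyen has roughly 2,400 inhabitants), not GHSL data.
population = 2400          # approx. inhabitants
settled_area_km2 = 4.0     # assumed km2 of permanent settled land

density = population / settled_area_km2
qualifies = density >= 300 and population >= 500
print(f"{density:.0f} inhabitants/km2, qualifies as RC: {qualifies}")
```

Under these assumptions the density comfortably clears both thresholds, which is why the settlement's absence from the dataset is surprising.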
rc_shape = gpd.read_file("GHS_WUP_DEGURBA_E2025_GLOBE_R2025A_54009_1000_V1_0.zip",
layer='GHS_WUP_DEGURBA_E2025_GLOBE_R2025A_54009_1000_RC_V1_0')
rc_shape
| ID_RC_G0 | UNLocID | AREA_km2 | POP_2025 | BU_km2_2025 | PWCentroidX | PWCentroidY | Unique_ID | geometry | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 876 | 7 | 1888.455901 | 0.272887 | -1.736350e+07 | -1.644222e+06 | RC_1 | POLYGON ((-17362000 -1645000, -17363000 -16450... |
| 1 | 2 | 772 | 5 | 950.774888 | 0.012179 | -1.717143e+07 | -1.054307e+06 | RC_2 | POLYGON ((-17170000 -1055000, -17171000 -10550... |
| 2 | 3 | 882 | 8 | 1780.333170 | 0.213592 | -1.700950e+07 | -1.665500e+06 | RC_3 | MULTIPOLYGON (((-17005000 -1664000, -17006000 ... |
| 3 | 4 | 776 | 1 | 628.119092 | 0.035781 | -1.686850e+07 | -2.284500e+06 | RC_4 | POLYGON ((-16868000 -2285000, -16869000 -22850... |
| 4 | 5 | 16 | 2 | 711.384614 | 0.075298 | -1.678502e+07 | -1.762500e+06 | RC_5 | POLYGON ((-16784000 -1763000, -16786000 -17630... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344316 | 344317 | 356 | 8 | 2914.931664 | 0.259803 | 9.193372e+06 | 1.483242e+06 | RC_344317 | POLYGON ((9194000 1484000, 9195000 1484000, 91... |
| 344317 | 344318 | 356 | 1 | 657.597469 | 0.030337 | 9.219500e+06 | 1.137500e+06 | RC_344318 | POLYGON ((9220000 1137000, 9219000 1137000, 92... |
| 344318 | 344319 | 356 | 8 | 4368.646821 | 0.210701 | 9.225500e+06 | 1.138500e+06 | RC_344319 | POLYGON ((9225000 1140000, 9226000 1140000, 92... |
| 344319 | 344320 | 356 | 2 | 815.620894 | 0.020981 | 9.272658e+06 | 1.025500e+06 | RC_344320 | POLYGON ((9274000 1025000, 9272000 1025000, 92... |
| 344320 | 344321 | 356 | 8 | 3051.952244 | 0.077766 | 9.318277e+06 | 9.926245e+05 | RC_344321 | POLYGON ((9318000 994000, 9319000 994000, 9319... |
344321 rows × 9 columns
The following map shows the only settlement identified in Svalbard by the GHSL project.
rc_shape[rc_shape['UNLocID']==744].explore()
The following map shows all the Rural Clusters in Norway identified by the GHSL project. I explored this map to locate areas with several rural clusters (and no larger clusters) to train my model.
rc_shape[rc_shape['UNLocID']==578].explore()
Using the Copernicus Browser, I downloaded the TIFF files for 3 training locations, all of them from around the beginning of spring, which I believe is when the landscape most closely resembles the Svalbard summer.
# Training image A
B02_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B02_(Raw).tiff" # Blue
B03_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B03_(Raw).tiff" # Green
B04_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B04_(Raw).tiff" # Red
B08_train = "2025-03-23-00_00_2025-03-23-23_59_Sentinel-2_L2A_B08_(Raw).tiff" # NIR
# Training image B
B02_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B02_(Raw).tiff"
B03_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B03_(Raw).tiff"
B04_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B04_(Raw).tiff"
B08_train_b = "2025-04-02-00_00_2025-04-02-23_59_Sentinel-2_L2A_B08_(Raw).tiff"
# Training image C
B02_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B02_(Raw).tiff"
B03_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B03_(Raw).tiff"
B04_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B04_(Raw).tiff"
B08_train_c = "2025-04-25-00_00_2025-04-25-23_59_Sentinel-2_L2A_B08_(Raw).tiff"
# Inference image
B02_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B02_(Raw).tiff"
B03_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B03_(Raw).tiff"
B04_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B04_(Raw).tiff"
B08_infer = "2024-08-01-00_00_2024-08-01-23_59_Sentinel-2_L2A_B08_(Raw).tiff"
def read_and_stack(b02, b03, b04, b08):
"""
Reads four Sentinel-2 single-band GeoTIFFs (B02 Blue, B03 Green, B04 Red, B08 NIR).
Returns blue, green, red, nir, rgb, rgb_mean, ndvi, full_stack, profile.
"""
def read_band(path):
with rasterio.open(path) as src:
return src.read(1).astype(float), src.profile
red, profile = read_band(b04)
blue, _ = read_band(b02)
green, _ = read_band(b03)
nir, _ = read_band(b08)
rgb = np.dstack([red, green, blue])
rgb_mean = np.mean(rgb, axis=2)
ndvi = (nir - red) / (nir + red + 1e-6)
full_stack = np.dstack([red, green, blue, nir, rgb_mean, ndvi])
return blue, green, red, nir, rgb, rgb_mean, ndvi, full_stack, profile
NDVI_THRS = 0.1
def plot_rgb_overview(rgb, ndvi, title_suffix=""):
"""Plot RGB, mean reflectance, and negative-NDVI."""
rgb_disp = np.clip(rgb * 3, 0, 1)
rgb_d_mean = np.mean(rgb_disp, axis=2)
fig, axs = plt.subplots(ncols=3, figsize=(15, 5), constrained_layout=True)
axs[0].imshow(rgb_disp)
axs[0].set_title(f"Original RGB {title_suffix}")
axs[1].imshow(rgb_d_mean, cmap='gray')
axs[1].set_title("RGB Mean Reflectance")
    axs[2].imshow(np.where(ndvi < NDVI_THRS, ndvi, np.nan), cmap="plasma")
    axs[2].set_title(f"NDVI (< {NDVI_THRS} only)")
for ax in axs:
ax.set_axis_off()
#OR ax.axis("off")
# plt.tight_layout() not needed because of "constrained_layout=True"
plt.show()
blue, green, red, nir, rgb, rgb_mean, ndvi, full_stack, profile = read_and_stack(
B02_train, B03_train, B04_train, B08_train
)
plot_rgb_overview(rgb, ndvi, title_suffix="(March 2025) - Location A")
blue_b, green_b, red_b, nir_b, rgb_b, rgb_mean_b, ndvi_b, full_stack_b, profile_b = read_and_stack(
B02_train_b, B03_train_b, B04_train_b, B08_train_b
)
plot_rgb_overview(rgb_b, ndvi_b, title_suffix="(April 2025 - Location B)")
blue_c, green_c, red_c, nir_c, rgb_c, rgb_mean_c, ndvi_c, full_stack_c, profile_c = read_and_stack(
B02_train_c, B03_train_c, B04_train_c, B08_train_c
)
plot_rgb_overview(rgb_c, ndvi_c, title_suffix="(April 2025 - Location C)")
Image Segmentation¶
For the image segmentation, a relatively low number of segments was chosen, with the goal of getting segments of roughly the same size as the GHSL raster cells, thus preventing non-urban segments from being taken into account in the training.
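To see why around 200 segments gives cells comparable to the 1000 m GHSL grid, a back-of-the-envelope sketch (the image dimensions below are assumptions, not the actual tile size):

```python
import numpy as np

# Assumed tile size; the real training tiles differ.
height, width = 800, 1200      # pixels at 10 m Sentinel-2 resolution
n_segments = 200

px_per_seg = height * width / n_segments   # average pixels per segment
side_m = np.sqrt(px_per_seg) * 10          # side length in metres, if roughly square
print(f"{px_per_seg:.0f} px/segment, ~{side_m:.0f} m across")
```

For a tile of this size, each segment would average roughly 700 m across, on the same order as the 1 km GHSL cells.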
COMPACTNESS = 0.2
N_SEGMENTS = 200
segments = slic(
full_stack[:, :, :4], # R, G, B, NIR
n_segments=N_SEGMENTS,
compactness=COMPACTNESS,
start_label=1,
channel_axis=-1,
)
segments_b = slic(
full_stack_b[:, :, :4],
n_segments=N_SEGMENTS, compactness=COMPACTNESS,
start_label=1, channel_axis=-1,
)
segments_c = slic(
full_stack_c[:, :, :4],
n_segments=N_SEGMENTS, compactness=COMPACTNESS,
start_label=1, channel_axis=-1,
)
def plot_segmentation(rgb, segments, title_suffix=""):
"""Plot RGB, SLIC boundaries, and mean-colour segments."""
fig, axs = plt.subplots(ncols=3, figsize=(15, 5), constrained_layout=True)
rgb_disp = np.clip(rgb * 3, 0, 1)
axs[0].imshow(rgb_disp)
axs[0].set_title(f"Original RGB {title_suffix}")
axs[1].imshow(mark_boundaries(rgb_disp, segments, color=(1, 0, 0), mode="thick"))
axs[1].set_title(f"SLIC Boundaries ({N_SEGMENTS} segments)")
axs[2].imshow(label2rgb(segments, rgb_disp, kind='avg'))
axs[2].set_title("Mean RGB per Segment")
for ax in axs:
ax.set_axis_off()
plt.show()
plot_segmentation(rgb, segments, title_suffix="(Location A)")
plot_segmentation(rgb_b, segments_b, title_suffix="(Location B)")
plot_segmentation(rgb_c, segments_c, title_suffix="(Location C)")
Feature Extraction¶
In the original notebook, there were Spectral Features, Shape Features and Textural Features.
That being said, I don't think shape features are relevant in this situation, so I deleted them. On the other hand, I think texture is paramount for detecting settlements, so I wanted to expand it. Thus, inspired by eCognition's texture features based on gray-level co-occurrence matrices (GLCM), I used AI to create the function calc_glcm_feats, which calculates the GLCM and derives some texture features from it. Afterwards, I simply updated the function calc_all_feats.
# Define custom features
def std(regionmask, intensity_img):
vals = intensity_img[regionmask]
std = np.std(vals)
return std
def entropy_ndvi(regionmask, intensity_img):
vals = intensity_img[regionmask]
arr = stats.relfreq(vals, 100, defaultreallimits=(-1,1))[0]
return stats.entropy(arr)
def calc_glcm_feats(seg_arr, band, n_levels=32):
"""
Compute GLCM texture features per segment from a single float band.
The band is quantized to `n_levels` grey levels before GLCM computation.
Features computed (averaged over 4 angles):
contrast, dissimilarity, homogeneity, energy, correlation.
Returns a DataFrame with one row per segment, indexed by label.
"""
# Quantize float band to [0, n_levels-1] integers
band_min, band_max = np.nanmin(band), np.nanmax(band)
band_q = ((band - band_min) / (band_max - band_min + 1e-6) * (n_levels - 1)).astype(np.uint8)
PROPS = ["contrast", "dissimilarity", "homogeneity", "energy", "correlation"]
ANGLES = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
records = []
for lbl in np.unique(seg_arr):
mask = seg_arr == lbl
rows, cols = np.where(mask)
# Extract the bounding-box patch and re-apply the segment mask
r0, r1 = rows.min(), rows.max() + 1
c0, c1 = cols.min(), cols.max() + 1
patch = band_q[r0:r1, c0:c1]
patch_mask = mask[r0:r1, c0:c1]
        # Zero-out pixels outside the segment (note: these zeros still enter
        # the GLCM as grey level 0, but they keep the patch rectangular)
        patch_masked = patch * patch_mask
glcm = graycomatrix(
patch_masked,
distances=[1],
angles=ANGLES,
levels=n_levels,
symmetric=True,
normed=True,
)
row = {"label": lbl}
for prop in PROPS:
# Average across the 4 directions
row[f"glcm_{prop}"] = graycoprops(glcm, prop).mean()
records.append(row)
return pd.DataFrame(records).set_index("label")
# Calculate set of various features for each segment
def calc_all_feats(seg_arr, img_arr):
spectral_feats = pd.DataFrame(
regionprops_table(
label_image = seg_arr,
intensity_image = img_arr,
properties = ["label", "intensity_mean"],
extra_properties=(std,)
)
)
glcm_ndvi = calc_glcm_feats(seg_arr, img_arr[:, :, 5]) # ndvi
#glcm_ndvi = glcm_ndvi.add_prefix("ndvi_")
all_feats = pd.concat([spectral_feats, glcm_ndvi.reset_index(drop=True)], axis=1)
return all_feats
all_feats_df = calc_all_feats(segments, full_stack).rename(columns={"label": "indx"})
display(all_feats_df)
all_feats_dfb = calc_all_feats(segments_b, full_stack_b).rename(columns={"label": "indx"})
display(all_feats_dfb)
all_feats_dfc = calc_all_feats(segments_c, full_stack_c).rename(columns={"label": "indx"})
display(all_feats_dfc)
| indx | intensity_mean-0 | intensity_mean-1 | intensity_mean-2 | intensity_mean-3 | intensity_mean-4 | intensity_mean-5 | std-0 | std-1 | std-2 | std-3 | std-4 | std-5 | glcm_contrast | glcm_dissimilarity | glcm_homogeneity | glcm_energy | glcm_correlation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.071640 | 0.052787 | 0.029106 | 0.241990 | 0.051178 | 0.566610 | 0.037230 | 0.030309 | 0.021343 | 0.070871 | 0.029284 | 0.136021 | 5.451725 | 0.737579 | 0.750174 | 0.262824 | 0.901578 |
| 1 | 2 | 0.064042 | 0.043543 | 0.024263 | 0.221678 | 0.043949 | 0.553098 | 0.011181 | 0.008178 | 0.006266 | 0.025226 | 0.008236 | 0.045069 | 7.840038 | 0.588701 | 0.855321 | 0.457301 | 0.901150 |
| 2 | 3 | 0.048379 | 0.033870 | 0.017354 | 0.184263 | 0.033201 | 0.536546 | 0.023256 | 0.015747 | 0.010597 | 0.072429 | 0.016319 | 0.277504 | 14.174282 | 1.078429 | 0.780053 | 0.307278 | 0.900750 |
| 3 | 4 | 0.074841 | 0.058074 | 0.031441 | 0.253538 | 0.054785 | 0.560902 | 0.038124 | 0.032227 | 0.022139 | 0.089632 | 0.030469 | 0.148008 | 11.069375 | 0.880109 | 0.779393 | 0.274161 | 0.938989 |
| 4 | 5 | 0.042044 | 0.037472 | 0.022710 | 0.188383 | 0.034075 | 0.705444 | 0.043971 | 0.043945 | 0.044740 | 0.106697 | 0.042923 | 0.208738 | 11.699771 | 0.963576 | 0.772969 | 0.305575 | 0.956395 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183 | 184 | 0.032865 | 0.029869 | 0.014336 | 0.185172 | 0.025690 | 0.733661 | 0.031528 | 0.028062 | 0.024633 | 0.094294 | 0.027387 | 0.138322 | 14.107051 | 1.008230 | 0.767309 | 0.279035 | 0.944915 |
| 184 | 185 | 0.061046 | 0.053822 | 0.032451 | 0.213946 | 0.049106 | 0.591433 | 0.045730 | 0.037407 | 0.033315 | 0.076217 | 0.038274 | 0.240977 | 13.421409 | 1.154545 | 0.727196 | 0.304916 | 0.948953 |
| 185 | 186 | 0.197120 | 0.200688 | 0.199302 | 0.285928 | 0.199036 | 0.382574 | 0.236477 | 0.244044 | 0.254861 | 0.205644 | 0.244339 | 0.300363 | 6.950121 | 0.719906 | 0.820526 | 0.544054 | 0.970572 |
| 186 | 187 | 0.149632 | 0.143388 | 0.131838 | 0.302950 | 0.141619 | 0.384600 | 0.121546 | 0.124361 | 0.134049 | 0.157620 | 0.124565 | 0.194233 | 14.324812 | 1.296180 | 0.689059 | 0.270736 | 0.921721 |
| 187 | 188 | 0.139780 | 0.137266 | 0.130437 | 0.283210 | 0.135828 | 0.419480 | 0.116515 | 0.116290 | 0.123110 | 0.101638 | 0.117860 | 0.235025 | 14.828511 | 1.244614 | 0.696907 | 0.271513 | 0.925740 |
188 rows Γ 18 columns
| indx | intensity_mean-0 | intensity_mean-1 | intensity_mean-2 | intensity_mean-3 | intensity_mean-4 | intensity_mean-5 | std-0 | std-1 | std-2 | std-3 | std-4 | std-5 | glcm_contrast | glcm_dissimilarity | glcm_homogeneity | glcm_energy | glcm_correlation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.040683 | 0.040316 | 0.029939 | 0.140780 | 0.036979 | 0.303392 | 0.045674 | 0.037080 | 0.025620 | 0.112640 | 0.035834 | 0.613515 | 13.924187 | 1.102956 | 0.765986 | 0.318090 | 0.946186 |
| 1 | 2 | 0.056307 | 0.052449 | 0.036187 | 0.205401 | 0.048315 | 0.599618 | 0.042175 | 0.033301 | 0.024456 | 0.082319 | 0.032858 | 0.149936 | 10.380624 | 0.786067 | 0.808533 | 0.321097 | 0.958390 |
| 2 | 3 | 0.049596 | 0.044437 | 0.030381 | 0.183224 | 0.041471 | 0.603113 | 0.035681 | 0.028795 | 0.023388 | 0.060694 | 0.029011 | 0.171437 | 10.029002 | 0.808563 | 0.798883 | 0.349245 | 0.963234 |
| 3 | 4 | 0.009939 | 0.011591 | 0.011218 | 0.031605 | 0.010916 | -0.484441 | 0.018631 | 0.015999 | 0.012486 | 0.063544 | 0.015461 | 0.680750 | 10.976221 | 0.949948 | 0.820357 | 0.672592 | 0.937595 |
| 4 | 5 | 0.032757 | 0.036561 | 0.030094 | 0.119933 | 0.033137 | 0.055771 | 0.060384 | 0.059971 | 0.058504 | 0.117896 | 0.059246 | 0.757562 | 19.805024 | 1.324068 | 0.799356 | 0.543871 | 0.930963 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 127 | 128 | 0.116532 | 0.120229 | 0.114426 | 0.211685 | 0.117063 | 0.391008 | 0.145178 | 0.149220 | 0.156098 | 0.119935 | 0.149592 | 0.258354 | 6.093505 | 0.737889 | 0.798760 | 0.487208 | 0.974406 |
| 128 | 129 | 0.091154 | 0.073181 | 0.054508 | 0.204344 | 0.072948 | 0.477542 | 0.097200 | 0.097248 | 0.097600 | 0.080541 | 0.096933 | 0.211019 | 8.554255 | 0.959591 | 0.744550 | 0.292816 | 0.960137 |
| 129 | 130 | 0.028163 | 0.030655 | 0.020437 | 0.192273 | 0.026419 | 0.761813 | 0.025943 | 0.021942 | 0.021197 | 0.047839 | 0.022599 | 0.147332 | 8.180814 | 0.844616 | 0.740158 | 0.244156 | 0.920213 |
| 130 | 131 | 0.868136 | 0.885828 | 0.887493 | 0.746162 | 0.880486 | -0.073232 | 0.145006 | 0.149048 | 0.151669 | 0.108680 | 0.147934 | 0.025732 | 6.395790 | 0.488429 | 0.946890 | 0.647035 | 0.934632 |
| 131 | 132 | 0.088343 | 0.065940 | 0.045279 | 0.215221 | 0.066520 | 0.481556 | 0.067666 | 0.063336 | 0.061399 | 0.064789 | 0.063427 | 0.202095 | 8.994555 | 1.106779 | 0.696177 | 0.215149 | 0.942035 |
132 rows Γ 18 columns
| indx | intensity_mean-0 | intensity_mean-1 | intensity_mean-2 | intensity_mean-3 | intensity_mean-4 | intensity_mean-5 | std-0 | std-1 | std-2 | std-3 | std-4 | std-5 | glcm_contrast | glcm_dissimilarity | glcm_homogeneity | glcm_energy | glcm_correlation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.016637 | 0.025278 | 0.027147 | 0.056022 | 0.023020 | 0.108206 | 0.019296 | 0.012977 | 0.009075 | 0.076907 | 0.013254 | 0.409721 | 11.793564 | 1.364605 | 0.635602 | 0.215959 | 0.923380 |
| 1 | 2 | 0.069913 | 0.057683 | 0.047698 | 0.181016 | 0.058431 | 0.453877 | 0.022906 | 0.016113 | 0.015306 | 0.029184 | 0.017907 | 0.121274 | 10.815736 | 0.935726 | 0.758733 | 0.281217 | 0.915856 |
| 2 | 3 | 0.080284 | 0.070108 | 0.057321 | 0.204387 | 0.069238 | 0.442502 | 0.029267 | 0.024818 | 0.025919 | 0.044272 | 0.025723 | 0.146209 | 10.065181 | 0.862597 | 0.777439 | 0.282701 | 0.935144 |
| 3 | 4 | 0.007666 | 0.021657 | 0.028699 | 0.006080 | 0.019340 | -0.326074 | 0.013616 | 0.012121 | 0.011215 | 0.018798 | 0.012094 | 0.284224 | 7.476368 | 1.563027 | 0.532214 | 0.183244 | 0.869041 |
| 4 | 5 | 0.007986 | 0.022393 | 0.029691 | 0.006350 | 0.020023 | -0.179116 | 0.011343 | 0.011281 | 0.011388 | 0.013526 | 0.011128 | 0.114820 | 3.622412 | 1.203186 | 0.557541 | 0.194046 | 0.845637 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194 | 195 | 0.037315 | 0.042579 | 0.032033 | 0.213036 | 0.037309 | 0.705778 | 0.016590 | 0.014480 | 0.009608 | 0.054728 | 0.013154 | 0.090292 | 10.182391 | 0.688629 | 0.839204 | 0.385915 | 0.965705 |
| 195 | 196 | 0.006306 | 0.017024 | 0.025197 | 0.005850 | 0.016176 | -0.038284 | 0.000813 | 0.001002 | 0.001289 | 0.000833 | 0.000765 | 0.072839 | 4.129788 | 1.043011 | 0.621122 | 0.249697 | 0.891336 |
| 196 | 197 | 0.011095 | 0.023204 | 0.030226 | 0.009686 | 0.021508 | -0.068352 | 0.001608 | 0.001895 | 0.002004 | 0.001479 | 0.001692 | 0.046489 | 3.220775 | 0.746875 | 0.718577 | 0.386312 | 0.674464 |
| 197 | 198 | 0.006460 | 0.017780 | 0.025871 | 0.005701 | 0.016704 | -0.063079 | 0.000816 | 0.001018 | 0.001321 | 0.000799 | 0.000778 | 0.074421 | 3.952067 | 1.030822 | 0.624372 | 0.249532 | 0.908164 |
| 198 | 199 | 0.030258 | 0.030650 | 0.025012 | 0.138290 | 0.028640 | 0.673509 | 0.023077 | 0.017548 | 0.013081 | 0.071794 | 0.017708 | 0.137887 | 14.792166 | 1.087593 | 0.736914 | 0.241177 | 0.920199 |
199 rows Γ 18 columns
Image Classification¶
Labeling¶
With the help of AI, I created the functions rasterize_rc and label_segments to turn the RC polygons into a raster that could then be used in the labeling process.
Furthermore, I completely excluded from the training any low-NDVI segments that had been identified as Rural Clusters (to, hopefully, exclude water segments erroneously labeled as such), as well as segments that intersected the RC polygons but had less than 50% of their area inside them (since these would, in my opinion, introduce ambiguity).
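The overlap rule can be illustrated on a toy grid (illustrative numbers only): the fraction of a segment's pixels falling inside the RC mask decides its label.

```python
import numpy as np

# 4x4 toy scene with three segments; the RC mask covers part of column 0.
segments_toy = np.array([[1, 1, 2, 2],
                         [1, 1, 2, 2],
                         [3, 3, 2, 2],
                         [3, 3, 2, 2]])
rc_toy = np.array([[1, 0, 0, 0],
                   [1, 0, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 0]])

for lbl in (1, 2, 3):
    frac = rc_toy[segments_toy == lbl].mean()
    print(lbl, frac)
# segment 1: 0.5  -> RC (meets the 50% threshold, NDVI permitting)
# segment 2: 0.0  -> non-RC
# segment 3: 0.25 -> ambiguous, excluded from training
```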
def rasterize_rc(rc_shapefile, profile, un_loc_id):
"""
Burn a GeoDataFrame of RC polygons onto the raster grid defined by `profile`.
Returns a binary uint8 array: 1 inside an RC polygon, 0 elsewhere.
"""
rc_filtered = rc_shapefile[rc_shapefile['UNLocID'] == un_loc_id].copy().to_crs(profile['crs'])
burn_shapes = [
(mapping(geom), 1)
for geom in rc_filtered.geometry
if geom is not None and not geom.is_empty
]
return feat_rasterize(
burn_shapes,
out_shape=(profile['height'], profile['width']),
transform=profile['transform'],
fill=0,
dtype='uint8',
)
def label_segments(segments, rc_raster, ndvi, rc_threshold=0.5, ndvi_min=NDVI_THRS):
"""
Label segments for training with strict overlap rules to avoid ambiguity:
- class 1 (RC) : >= `rc_threshold` of pixels inside an RC polygon
AND mean NDVI >= `ndvi_min`
- class 0 (non-RC) : exactly 0% of pixels inside an RC polygon
- excluded (NaN) : anything in between (ambiguous border segments)
Returns a dict {segment_label: class}, with ambiguous segments omitted.
"""
result = {}
for lbl in np.unique(segments):
mask = (segments == lbl)
rc_frac = rc_raster[mask].mean()
if rc_frac >= rc_threshold and ndvi[mask].mean() >= ndvi_min:
result[lbl] = 1
elif rc_frac == 0.0:
result[lbl] = 0
        # else: ambiguous -> omit from training
return result
rc_raster = rasterize_rc(rc_shape, profile, 578)
rc_raster_b = rasterize_rc(rc_shape, profile_b, 578)
rc_raster_c = rasterize_rc(rc_shape, profile_c, 578)
seg_class_map = label_segments(segments, rc_raster, ndvi)
all_feats_df["class"] = all_feats_df["indx"].map(seg_class_map)
seg_class_map_b = label_segments(segments_b, rc_raster_b, ndvi_b)
all_feats_dfb["class"] = all_feats_dfb["indx"].map(seg_class_map_b)
seg_class_map_c = label_segments(segments_c, rc_raster_c, ndvi_c)
all_feats_dfc["class"] = all_feats_dfc["indx"].map(seg_class_map_c)
print(f" Rural cluster segments : {(all_feats_df['class'] == 1).sum()}")
print(f" Other segments : {(all_feats_df['class'] == 0).sum()}")
print(f" Excluded (ambiguous) : {(np.isnan(all_feats_df['class'])).sum()}")
print(f" Missing : {len(set(np.unique(segments)) - set(all_feats_df["indx"]))}")
print('\n')
print(f" Rural cluster segments : {(all_feats_dfb['class'] == 1).sum()}")
print(f" Other segments : {(all_feats_dfb['class'] == 0).sum()}")
print(f" Excluded (ambiguous) : {(np.isnan(all_feats_df['class'])).sum()}")
print(f" Missing : {len(set(np.unique(segments_b)) - set(all_feats_dfb["indx"]))}")
print('\n')
print(f" Rural cluster segments : {(all_feats_dfc['class'] == 1).sum()}")
print(f" Other segments : {(all_feats_dfc['class'] == 0).sum()}")
print(f" Excluded (ambiguous) : {(np.isnan(all_feats_df['class'])).sum()}")
print(f" Missing : {len(set(np.unique(segments_c)) - set(all_feats_dfc["indx"]))}")
 Rural cluster segments : 7
 Other segments : 166
 Excluded (ambiguous) : 15
 Missing : 0

 Rural cluster segments : 6
 Other segments : 114
 Excluded (ambiguous) : 12
 Missing : 0

 Rural cluster segments : 5
 Other segments : 169
 Excluded (ambiguous) : 25
 Missing : 0
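As the printouts show, the training set is heavily imbalanced: 18 RC segments against 449 others across the three locations. The `class_weight="balanced"` option used later for the SVM compensates by weighting classes inversely to their frequency; a sketch with these counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Segment counts taken from the labeling printouts: 18 RC vs 449 other.
y_demo = np.array([1] * 18 + [0] * 449)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_demo)
# "balanced" weight = n_samples / (n_classes * count_per_class)
print(dict(zip([0, 1], weights.round(2))))  # {0: 0.52, 1: 12.97}
```

Each rare RC sample counts roughly 25 times as much as a common non-RC sample during training.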
binary_cmap = mcolors.ListedColormap(['#7f8282', '#00008b', '#00FF00'])
def plot_labels(rgb, rc_raster, segments, feats_df, title_suffix=""):
"""Plot RGB, RC Raster, and segment labels."""
rgb_disp = np.clip(rgb * 3, 0, 1)
mapped = map_array(
segments,
np.array(feats_df["indx"]),
np.array(feats_df["class"]),
)
fig, axs = plt.subplots(1, 3, figsize=(18, 6), constrained_layout=True)
axs[0].imshow(rgb_disp)
axs[0].set_title(f"RGB {title_suffix}")
axs[1].imshow(rc_raster, cmap="Reds")
axs[1].set_title("RC raster (GHS)")
axs[2].imshow(mapped, cmap=binary_cmap, interpolation="nearest")
axs[2].set_title("Segment labels green=RC grey=other")
for ax in axs:
ax.set_axis_off()
plt.show()
return mapped
mapped_labels = plot_labels(rgb, rc_raster, segments, all_feats_df, title_suffix="(March 2025)")
mapped_labels_b = plot_labels(rgb_b, rc_raster_b, segments_b, all_feats_dfb, title_suffix="(April 2025)")
mapped_labels_c = plot_labels(rgb_c, rc_raster_c, segments_c, all_feats_dfc, title_suffix="(April 2025)")
all_feats_final = pd.concat([all_feats_df.drop(columns="indx"),
all_feats_dfb.drop(columns="indx"),
all_feats_dfc.drop(columns="indx")]).dropna(subset=["class"]).copy()
all_feats_final["class"] = all_feats_final["class"].astype(int)
Train Random Forest Classifier¶
feature_cols = [c for c in all_feats_final.columns if c !="class"]
X = all_feats_final[feature_cols]
y = all_feats_final["class"]
# Train the Random Forest Classifier using the entire dataset
rfc_clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rfc_clf.fit(X, y)
# Evaluate the model using OOB score
print("OOB Score:", rfc_clf.oob_score_)
OOB Score: 0.9635974304068522
feat_imp = pd.Series(rfc_clf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("Top features:\n", feat_imp.head(5).to_string())
Top features:
 std-2                 0.090977
 glcm_dissimilarity    0.087306
 intensity_mean-1      0.083919
 intensity_mean-4      0.076454
 glcm_homogeneity      0.075908
I was happy to see that two of my GLCM features were among the Top 5 most important for the model. That being said, the actual results weren't as motivating...
blue_i, green_i, red_i, nir_i, rgb_i, rgb_mean_i, ndvi_i, full_stack_i, profile_i = read_and_stack(
B02_infer, B03_infer, B04_infer, B08_infer
)
plot_rgb_overview(rgb_i, ndvi_i)
print("Running SLICβ¦")
segments_i = slic(
full_stack_i[:, :, :4],
n_segments=N_SEGMENTS, compactness=COMPACTNESS,
start_label=1, channel_axis=-1,
)
print("Extracting featuresβ¦")
feats_i = calc_all_feats(segments_i, full_stack_i).rename(columns={"label": "indx"})
feats_i["rfc_pred"] = rfc_clf.predict(feats_i[feature_cols])
seg_labels = np.array(feats_i["indx"])
mapped_infer = map_array(segments_i, seg_labels, np.array(feats_i["rfc_pred"]))
fig, ax = plt.subplots(figsize=(6, 4), constrained_layout=True)
ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
ax.imshow(mapped_infer, cmap=binary_cmap, alpha=0.5, interpolation="nearest")
ax.set_title("RFC prediction green=RC")
ax.set_axis_off()
plt.show()
Running SLIC... Extracting features...
feats_i["rfc_prob"] = rfc_clf.predict_proba(feats_i[feature_cols])[:, 1]
mapped_prob = map_array(segments_i, seg_labels, np.array(feats_i["rfc_prob"]))
vmax = feats_i["rfc_prob"].max()
fig, ax = plt.subplots(figsize=(7, 5), constrained_layout=True)
ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
prob_plot = ax.imshow(mapped_prob, cmap="plasma", alpha=0.5, vmin=0, vmax=vmax, interpolation="nearest")
plt.colorbar(prob_plot, ax=ax)
ax.set_title("RC probability")
ax.axis("off")
plt.show()
This probability heat map was especially worrying, considering the actual settlement segments aren't the ones with the highest probability!
Train Support Vector Machines¶
Once again taking inspiration from eCognition, I decided to try and implement a classifier based on Support Vector Machines (SVMs), which always yielded better results on my other assignments. The largest difference compared to setting up the Random Forest Classifier is that SVMs don't "perform well when the input numerical attributes have very different scales", which is why I had to apply some feature scaling.
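A minimal illustration of what StandardScaler does (toy numbers, not the actual features): each column ends up with zero mean and unit variance, so no single feature dominates the SVM's distance computations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. a reflectance vs a GLCM contrast.
X_toy = np.array([[0.05, 12.0],
                  [0.07,  8.0],
                  [0.03, 15.0]])
X_std = StandardScaler().fit_transform(X_toy)
print(X_std.mean(axis=0).round(6))  # each column ~0
print(X_std.std(axis=0).round(6))   # each column ~1
```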
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X).copy()
# Instantiate the classifier model
svm_classifier = SVC(probability=True, class_weight="balanced", random_state=42)
# Fit the model with the training data
svm_classifier.fit(X_scaled, y)
SVC(class_weight='balanced', probability=True, random_state=42)
Parameters

| Parameter | Value |
|---|---|
| C | 1.0 |
| kernel | 'rbf' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | True |
| tol | 0.001 |
| cache_size | 200 |
| class_weight | 'balanced' |
| verbose | False |
| max_iter | -1 |
| decision_function_shape | 'ovr' |
| break_ties | False |
| random_state | 42 |
# Scale new data the same way
X_new_scaled = scaler.transform(feats_i[feature_cols])
# Predict
feats_i["svc_pred"] = svm_classifier.predict(X_new_scaled)
feats_i["svc_prob"] = svm_classifier.predict_proba(X_new_scaled)[:, 1]
mapped_infer = map_array(segments_i, seg_labels, np.array(feats_i["svc_pred"]))
fig, ax = plt.subplots(figsize=(6, 4), constrained_layout=True)
ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
ax.imshow(mapped_infer, cmap=binary_cmap, alpha=0.5, interpolation="nearest")
ax.set_title("SVM prediction green=RC")
ax.set_axis_off()
plt.show()
mapped_prob = map_array(segments_i, seg_labels, np.array(feats_i["svc_prob"]))
vmax = feats_i["svc_prob"].max()
fig, ax = plt.subplots(figsize=(7, 5), constrained_layout=True)
ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
prob_plot = ax.imshow(mapped_prob, cmap="plasma", alpha=0.7, vmin=0, vmax=vmax, interpolation="nearest")
plt.colorbar(prob_plot, ax=ax)
ax.set_title("RC probability")
ax.axis("off")
plt.show()
feats_i["svc_pred2"] = (feats_i["svc_prob"] > 0.2).astype(int)
mapped_infer2 = map_array(segments_i, seg_labels, np.array(feats_i["svc_pred2"]))
fig, ax = plt.subplots(figsize=(6, 4), constrained_layout=True)
ax.imshow(mark_boundaries(rgb_i, segments_i, color=(1, 0, 0), mode="thick"))
ax.imshow(mapped_infer2, cmap=binary_cmap, alpha=0.5, interpolation="nearest")
ax.set_title("SVM prediction green=RC")
ax.set_axis_off()
plt.show()
Conclusion¶
Even with the SVM, the results are not perfect, considering there's a completely unrelated false positive in the top-right corner of the image. Still, there are plenty of tweaks that could improve the results, such as increasing the number of training samples, calculating the GLCM texture features for other layers besides NDVI, or exploring the other kernels of the sklearn.svm library, something which honestly seems very promising. Nonetheless, this should come as proof that it is possible to detect settlements in Svalbard with a simple algorithm such as this one, and that Longyearbyen deserves better than being forgotten in a dataset whose main purpose is charting the Arctic!
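As a sketch of the kernel exploration suggested above (run here on stand-in data from make_classification; in the notebook, X and y would be the feature matrix and labels built earlier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced stand-in data mimicking the segment features (17 columns).
X_demo, y_demo = make_classification(n_samples=200, n_features=17,
                                     weights=[0.95], random_state=0)

# Scaling inside a pipeline so each CV fold is scaled on its own training split.
pipe = make_pipeline(StandardScaler(),
                     SVC(class_weight="balanced", random_state=0))
grid = GridSearchCV(
    pipe,
    {"svc__kernel": ["linear", "rbf", "poly"], "svc__C": [0.1, 1, 10]},
    cv=3, scoring="f1",
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Wrapping the scaler and classifier in one pipeline also avoids leaking scaling statistics from validation folds into training, something worth doing if this tuning is applied to the real data.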