Supplementary materials

DEFINING SOUTHERN ETRURIA FINAL BRONZE AGE SETTLEMENT MODELS USING AN INTEGRATED GIS AND MACHINE LEARNING APPROACH.

This Jupyter Notebook contains the analyses carried out for the article ‘DEFINING SOUTHERN ETRURIA FINAL BRONZE AGE SETTLEMENT MODELS USING AN INTEGRATED GIS AND MACHINE LEARNING APPROACH’. The file includes the commented procedure, the definition of the parameters used, and numerous references to the methods employed, in the form of hyperlinks to the official documentation of the modules used.

Hardware and software

All the analyses in this study were performed on the area described above using SAGA GIS, version 8.4.1. General GIS data management, as well as geoprocessing and map creation, were carried out using QGIS v. 3.28.0 Firenze. Statistical and ML-related analyses were carried out in the Python 3.9 environment using the Scikit-learn 1.0.2 package, while Matplotlib 3.6.2 and Seaborn 0.12.2 were used to create the graphs. All analyses were performed on a PC running the Windows 11 operating system, with an Intel i7-12700k processor and 32 GB of DDR4 RAM. Supplementary material for the Python-based analysis is provided in HTML format via the Quarto scientific publishing platform.

Author: Lorenzo Cardarelli

Email: lorenzo.cardarelli@uniba.it

Website: https://lrncrd.github.io/

Python implementation: CPython
Python version       : 3.9.16
IPython version      : 8.8.0

matplotlib: 3.6.2
seaborn   : 0.12.2
pandas    : 1.4.4
numpy     : 1.23.5
sklearn   : 1.0.2

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
CPU cores   : 20
Architecture: 64bit

Preliminary procedures

The preliminary procedures concern the import of the libraries required for the analysis, the definition of a support function for the correlation filter and, finally, the setting of a specific random seed for the reproducibility of the results.

Importing modules

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

Define custom function

def high_correlation_filter(data, method = "pearson", correlation_limit = 0.90):
    '''
    Compute the correlation matrix of a DataFrame and return it together with the set of
    features whose absolute correlation with an earlier column exceeds correlation_limit
    (the later column of each highly correlated pair is flagged for removal).
    Adapted from: https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on
    '''
    correlation_matrix = data.corr(method = method)
    correlated_features = set()
    for i in range(len(correlation_matrix.columns)):
        for j in range(i):
            if abs(correlation_matrix.iloc[i, j]) > correlation_limit:
                colname = correlation_matrix.columns[i]
                correlated_features.add(colname)
    print(f"Highly correlated features:  {correlated_features}")
    return correlation_matrix, correlated_features
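
A minimal toy example (not part of the dataset) illustrates the behaviour of the function: the later column of each pair whose absolute correlation exceeds the threshold is flagged for removal, while the first column of the pair is kept.

toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],   # perfectly correlated with "a"
                    "c": [5, 3, 6, 1, 2]})
_, flagged = high_correlation_filter(toy, correlation_limit=0.9)
# prints: Highly correlated features:  {'b'}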

Set a seed for reproducibility

my_seed = 1
np.random.seed(my_seed)

Analysis

During this analysis, the following steps are performed:

  • The data are imported (Load Data)

  • The data are preprocessed (Prepare data for Feature Selection)

  • Correlated features are removed (Remove correlated features (correlation filter))

  • Train and evaluate the random forest model (Train Test Split, Evaluate the model)

  • Calculate feature importance (Define feature importance)

Load Data

The .xlsx file is loaded and the data are converted into a pandas DataFrame.

df = pd.read_excel('data_real_and_random_site.xlsx')
id name type Visible Sky Average View Distance Sky View Factor Positive Opennes Negative Opennes Standardized Height Slope ... Normalized height Mid-Slope position TRI Aspect DEM Channel Network Distance Valley Depth TWI TPI Landforms Geomorphon
0 15 Sorano-Castelvecchio real 89.29539 348.75 0.73322 1.50728 1.16619 315.28928 0.81019 ... 0.87201 0.74401 7.30492 5.63866 361.56799 54.38055 54.14105 2.15133 9 3
1 10 Pitigliano real 99.38343 4892.50 0.99938 1.56573 1.29106 303.11221 0.04135 ... 0.97375 0.94751 0.31347 5.48290 311.28189 97.14888 4.43402 5.25300 9 3
2 13 Monte Rosso real 90.25592 5008.75 0.96228 1.49072 1.41067 297.42883 0.37903 ... 0.76303 0.52607 2.51952 2.71155 389.79739 101.40732 12.95129 5.52274 7 6
3 16 Sovana real 98.91448 4495.00 0.99931 1.56004 1.44790 275.29483 0.02647 ... 0.94545 0.89091 0.17408 1.90709 291.17770 54.11824 3.39746 6.32303 9 3
4 4 Il Gaggio real 93.62825 708.75 0.98670 1.51222 1.56403 11.69717 0.01479 ... 0.07786 0.84428 0.09401 2.70805 150.23061 1.23438 68.94232 6.93691 4 10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
327 0 random random 99.35566 3550.00 0.99983 1.56151 1.55374 176.11900 0.01875 ... 0.86778 0.73556 0.11861 1.43253 202.95320 27.58247 2.00354 9.04514 5 6
328 0 random random 98.14186 277.50 0.99868 1.54953 1.54731 71.57639 0.04738 ... 0.34295 0.31410 0.30070 4.27997 208.70860 5.50351 4.38770 7.53747 5 6
329 0 random random 95.48648 226.25 0.99368 1.53070 1.53000 41.15727 0.11969 ... 0.26152 0.47696 0.76223 3.76224 157.37650 14.26704 12.41028 7.00283 6 6
330 0 random random 96.62697 478.75 0.99703 1.52996 1.54719 21.26929 0.06701 ... 0.09951 0.80097 0.42888 4.54766 213.73441 0.62704 45.66541 6.64814 5 6
331 0 random random 99.22222 4878.75 0.99978 1.56011 1.56970 0.02071 0.00911 ... 0.08170 0.83660 0.05905 4.29407 0.25349 0.00000 42.20563 13.45998 5 1

332 rows × 21 columns

Prepare data for Feature Selection

We can isolate the columns by macro-classes, first deleting the columns we are not interested in as predictors (id, name and type, the latter being the target variable).

df_data = df[df.columns[3:]]
df_data.columns
Index(['Visible Sky', 'Average View Distance', 'Sky View Factor',
       'Positive Opennes', 'Negative Opennes', 'Standardized Height', 'Slope',
       'Relative Slope', 'Normalized height', 'Mid-Slope position', 'TRI',
       'Aspect', 'DEM', 'Channel Network Distance', 'Valley Depth', 'TWI',
       'TPI Landforms', 'Geomorphon'],
      dtype='object')

We divide the columns into the various macro-classes and print the results for verification purposes.

visibility = df_data.columns[:5]
morphometry = df_data.columns[5:13]
hydrology = df_data.columns[13:16]
terrain_classification = df_data.columns[16:]

print(visibility)
print(morphometry)
print(hydrology)
print(terrain_classification)
Index(['Visible Sky', 'Average View Distance', 'Sky View Factor',
       'Positive Opennes', 'Negative Opennes'],
      dtype='object')
Index(['Standardized Height', 'Slope', 'Relative Slope', 'Normalized height',
       'Mid-Slope position', 'TRI', 'Aspect', 'DEM'],
      dtype='object')
Index(['Channel Network Distance', 'Valley Depth', 'TWI'], dtype='object')
Index(['TPI Landforms', 'Geomorphon'], dtype='object')

Visualise correlation

Through the pairplot function of the seaborn module, it is possible to visualise the correlation between variables. pairplot creates a grid of plots displaying the pairwise relationships between the variables in a dataset: the joint distribution of each pair of variables is shown off the diagonal, and the marginal distribution of each variable is shown on the diagonal of the grid.

A pairplot is produced for each macro-class of variables, starting with those relating to the ‘Visibility’ macro-class.

Visibility

g = sns.pairplot(df_data[visibility], height=2, aspect=1, corner=True, plot_kws={'s': 5})
g.fig.suptitle('Visibility', y=1.05);

Morphometry

g = sns.pairplot(df_data[morphometry], height=2, aspect=1, corner=True, plot_kws={'s': 5})
g.fig.suptitle('Morphometry', y=1.05);

Hydrology

g = sns.pairplot(df_data[hydrology], height=2, aspect=1, corner=True, plot_kws={'s': 5})
g.fig.suptitle('Hydrology', y=1.05);

Remove correlated features (correlation filter)

The correlation filter is applied to the data, removing the features that are highly correlated with each other. The correlation threshold is set to 0.8. The function high_correlation_filter is defined in the Preliminary procedures section. We use Spearman’s ρ so that monotonic non-linear relationships are also identified.
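
To illustrate why Spearman’s ρ is preferred here, consider a small synthetic example (not part of the dataset): for a monotonic but non-linear relationship, the Spearman coefficient equals 1, while the Pearson coefficient underestimates the association.

demo = pd.DataFrame({"x": np.arange(1, 21)})
demo["y"] = demo["x"] ** 3                       # monotonic, non-linear relationship
print(demo.corr(method="pearson").iloc[0, 1])    # < 1
print(demo.corr(method="spearman").iloc[0, 1])   # exactly 1.0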

Correlation matrices are visualised using seaborn’s heatmap function, which creates a colour-coded matrix plot of the pairwise relationships between the numerical variables in a dataset and is particularly useful for exploring correlations.

df_pipeline, correlated_visibility = high_correlation_filter(df_data[visibility], method ="spearman", correlation_limit=0.8)
Highly correlated features:  {'Positive Opennes', 'Sky View Factor'}
sns.heatmap(df_pipeline, annot=True)
plt.tight_layout()
plt.show()

The correlation matrix is also returned as a pandas DataFrame, which stores the results of the correlation filter.

Visible Sky Average View Distance Sky View Factor Positive Opennes Negative Opennes
Visible Sky 1.000000 0.530236 0.945212 0.961367 0.175468
Average View Distance 0.530236 1.000000 0.434645 0.514931 -0.236392
Sky View Factor 0.945212 0.434645 1.000000 0.877936 0.360549
Positive Opennes 0.961367 0.514931 0.877936 1.000000 0.056801
Negative Opennes 0.175468 -0.236392 0.360549 0.056801 1.000000
df_pipeline, _ = high_correlation_filter(df_data[morphometry], method ="spearman", correlation_limit=0.8)
Highly correlated features:  {'TRI', 'Normalized height'}
sns.heatmap(df_pipeline, annot=True)
plt.tight_layout()
plt.show()

Standardized Height Slope Relative Slope Normalized height Mid-Slope position TRI Aspect DEM
Standardized Height 1.000000 0.210445 0.748746 0.789282 0.207232 0.240328 -0.130692 0.689501
Slope 0.210445 1.000000 0.113977 0.060915 -0.225587 0.962062 -0.090196 0.183387
Relative Slope 0.748746 0.113977 1.000000 0.844248 0.334070 0.139885 -0.115993 0.310928
Normalized height 0.789282 0.060915 0.844248 1.000000 0.400225 0.116716 -0.016870 0.208332
Mid-Slope position 0.207232 -0.225587 0.334070 0.400225 1.000000 -0.133857 -0.046874 0.200714
TRI 0.240328 0.962062 0.139885 0.116716 -0.133857 1.000000 -0.079369 0.205405
Aspect -0.130692 -0.090196 -0.115993 -0.016870 -0.046874 -0.079369 1.000000 -0.210233
DEM 0.689501 0.183387 0.310928 0.208332 0.200714 0.205405 -0.210233 1.000000
correlated_morphometry = {'TRI', 'Normalized height'}
df_pipeline, correlated_hydrology = high_correlation_filter(df_data[hydrology], method ="spearman", correlation_limit=0.8)
Highly correlated features:  set()
sns.heatmap(df_pipeline, annot=True)
plt.tight_layout()
plt.show()

Channel Network Distance Valley Depth TWI
Channel Network Distance 1.000000 -0.547383 -0.591980
Valley Depth -0.547383 1.000000 0.261798
TWI -0.591980 0.261798 1.000000
correlated_features = correlated_visibility.union(correlated_morphometry, correlated_hydrology)

Prepare data for Random Forest

The data are prepared for the Random Forest analysis by dividing them into quantitative and categorical variables. The quantitative variables are normalized and the categorical variables are converted into dummy variables. A random variable is also added to each group and acts as an importance threshold: features whose importance falls below that of the random variable can be regarded as uninformative.

df_data_quantitative = df_data.drop(terrain_classification, axis=1)
### Removing correlated features
df_data_quantitative.drop(correlated_features, axis=1, inplace=True)
df_data_quantitative["random"] = np.random.randint(0, 100, df.shape[0], dtype=int)
df_qualitative = pd.DataFrame(df_data[terrain_classification])
df_qualitative["random"] = np.random.randint(2, size =df_qualitative.shape[0])
df_qualitative = pd.get_dummies(df_qualitative, columns=terrain_classification)

Quantitative features

Quantitative features are normalized using the MinMaxScaler class of the sklearn module, according to the formula below.

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

In this formula, x is the original value, while min(x) and max(x) are the minimum and maximum values of the feature. The resulting x_norm value is a number between 0 and 1, representing the normalized version of x.

mms = MinMaxScaler().fit_transform(df_data_quantitative)
mms_df = pd.DataFrame(mms, columns=df_data_quantitative.columns)
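
As a sanity check, the same result can be reproduced column-wise with the formula above; a minimal sketch using the Slope column:

# Manual min-max normalization of a single column matches the MinMaxScaler output
x = df_data_quantitative["Slope"]
manual = (x - x.min()) / (x.max() - x.min())
print(np.allclose(manual.values, mms_df["Slope"].values))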

This is a visualisation of the normalised quantitative variables.

Visible Sky Average View Distance Negative Opennes Standardized Height Slope Relative Slope Mid-Slope position Aspect DEM Channel Network Distance Valley Depth TWI random
0 0.542949 0.032934 0.185453 0.331296 0.985391 0.495643 0.748028 0.902676 0.348977 0.174549 0.230378 0.000000 0.373737
1 0.975338 0.551183 0.434352 0.318500 0.050176 0.945918 0.953010 0.877719 0.300408 0.311824 0.025303 0.222245 0.121212
2 0.584119 0.564443 0.672766 0.312527 0.460929 0.877078 0.528501 0.433672 0.376242 0.325493 0.060443 0.241573 0.727273
3 0.955238 0.505845 0.746975 0.289268 0.032076 0.930666 0.895998 0.304775 0.280990 0.173707 0.021027 0.298916 0.090909
4 0.728662 0.073995 0.978453 0.012270 0.017869 0.017418 0.849028 0.433111 0.144856 0.003964 0.291444 0.342902 0.757576
... ... ... ... ... ... ... ... ... ... ... ... ... ...
327 0.974148 0.398061 0.957942 0.185051 0.022686 0.922111 0.739517 0.228738 0.195778 0.088535 0.015276 0.493964 0.848485
328 0.922122 0.024808 0.945125 0.075193 0.057511 0.550339 0.314987 0.684976 0.201337 0.017667 0.025112 0.385934 0.585859
329 0.808309 0.018962 0.910622 0.043228 0.145469 0.528975 0.479033 0.602022 0.151758 0.045796 0.058211 0.347626 0.858586
330 0.857192 0.047762 0.944886 0.022329 0.081389 0.013422 0.805403 0.727868 0.206191 0.002015 0.195411 0.322211 0.292929
331 0.968428 0.549615 0.989755 0.000000 0.010960 0.000020 0.841293 0.687236 0.000000 0.000002 0.181137 0.810302 0.444444

332 rows × 13 columns

Train Test Split

The data are divided into Train and Test sets using the train_test_split function of the sklearn module.

  1. It takes the dataset (usually a Pandas DataFrame) and splits it randomly into two subsets: the training set and the testing set.

  2. It assigns a specified proportion (usually between 70-80% for training and 20-30% for testing) of the dataset to the training set and the remaining proportion to the testing set.

  3. It ensures that the random split is reproducible by setting a random seed value (which can be specified by the user).

  4. It returns four arrays: X_train, X_test, y_train, y_test, where X represents the features and y represents the target variable.

In this case test_size is not specified, so scikit-learn’s default split is used: 75% of the data for training and 25% for testing, stratified by site type.

X_train, X_test, y_train, y_test = train_test_split(mms_df, df["type"], stratify=df["type"], random_state=my_seed)
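
Since stratify=df["type"] is passed, the real/random proportions are preserved in both subsets. A quick check of the split (a sketch, not part of the original analysis):

# Class proportions are preserved by the stratified split; sizes follow the default 75/25 ratio
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
print(len(X_train), len(X_test))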

The parameters of the Random Forest model (sklearn’s RandomForestClassifier) are defined. We use 1000 Decision Trees, with a maximum depth of 5 and a minimum number of samples per leaf of 1.

feature_names = mms_df.columns
forest = RandomForestClassifier(random_state=my_seed, n_estimators=1000, n_jobs=-1, max_depth=5)
forest.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, n_estimators=1000, n_jobs=-1,
                       random_state=1)

Evaluate the model

Here, we can print the accuracy of the model on the training set first

forest.score(X_train, y_train)
0.9437751004016064

… and then on the test set.

forest.score(X_test, y_test)
0.6506024096385542
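
The gap between training and test accuracy points to some overfitting of the training data. As a complementary check (a sketch beyond the original analysis), per-class precision and recall on the test set can be inspected with scikit-learn’s classification_report:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 score on the held-out test set
print(classification_report(y_test, forest.predict(X_test)))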

Define feature importance

Now we can compute the feature importances of the model and store them in a DataFrame.

importances = forest.feature_importances_

forest_importances = pd.DataFrame(importances, index=feature_names, columns=["importance"])

forest_importances=forest_importances.sort_values(by="importance", ascending=False)

and visualise the results.

importance
Negative Opennes 0.153278
Mid-Slope position 0.117242
Channel Network Distance 0.106651
TWI 0.106404
Visible Sky 0.082909
Valley Depth 0.063663
Slope 0.061621
Average View Distance 0.060225
Relative Slope 0.060151
random 0.052507
DEM 0.049658
Standardized Height 0.044897
Aspect 0.040795

We can use a bar plot from the matplotlib module to visualize the results, with features ordered by importance. These are impurity-based (MDI) importances taken directly from the fitted forest, so no error bars are shown; a permutation-based alternative, which does yield standard deviations over repetitions, is sketched after the plot code below.

fig, ax = plt.subplots()
forest_importances["importance"].plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.savefig("feature_importance_permutation.png", dpi=600)
plt.show()
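
As noted above, standard deviations over repetitions are a feature of permutation-based importance. A minimal sketch of that alternative (assuming the fitted forest and the train/test split defined above; not the exact procedure used to produce the saved figure) uses scikit-learn’s permutation_importance on the test set:

from sklearn.inspection import permutation_importance

# Permutation importance on the held-out test set (10 shuffles per feature)
perm = permutation_importance(forest, X_test, y_test, n_repeats=10,
                              random_state=my_seed, n_jobs=-1)
perm_df = pd.DataFrame({"importance": perm.importances_mean,
                        "std": perm.importances_std}, index=feature_names)
perm_df = perm_df.sort_values(by="importance", ascending=False)

fig, ax = plt.subplots()
perm_df["importance"].plot.bar(yerr=perm_df["std"], ax=ax)   # error bars = standard deviation
ax.set_title("Feature importances using permutation")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()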

We select the most important features, store them in a DataFrame, and visualize the comparison between real and random sites using seaborn’s boxplot function.

quantitative_important = df_data_quantitative[["Negative Opennes", "Mid-Slope position", "TWI", "Channel Network Distance", "Visible Sky", "DEM"]].copy()
quantitative_important["type"] = df["type"]
fig, ax = plt.subplots(2, 3, figsize=(10, 5), sharex = True)
for i, feature in enumerate(quantitative_important.columns[:-1]):
    sns.boxplot(x="type", y=feature, data=quantitative_important, ax=ax.flatten()[i])
    ax.flatten()[i].set_xlabel(i+1)
fig.tight_layout()
plt.savefig("quantitative_box.png", dpi=600)

Categorical features

Moving to the categorical features, we previously used the get_dummies function of the pandas module to convert the categorical variables into dummy/indicator variables. get_dummies creates a new DataFrame with one column for each unique category of the original categorical variable, and a binary value of 1 or 0 indicating whether or not that category is present in each row.
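
For instance, applied to a minimal toy column (not part of the dataset):

toy = pd.DataFrame({"Geomorphon": [3, 6, 3]})
print(pd.get_dummies(toy, columns=["Geomorphon"]))
#    Geomorphon_3  Geomorphon_6
# 0             1             0
# 1             0             1
# 2             1             0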

This is a visualization of the dummy variables obtained for the dataset.

random TPI Landforms_1 TPI Landforms_2 TPI Landforms_4 TPI Landforms_5 TPI Landforms_6 TPI Landforms_7 TPI Landforms_9 TPI Landforms_10 Geomorphon_1 Geomorphon_2 Geomorphon_3 Geomorphon_4 Geomorphon_5 Geomorphon_6 Geomorphon_7 Geomorphon_8 Geomorphon_9 Geomorphon_10
0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
327 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
328 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
329 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
330 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
331 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0

332 rows × 19 columns

Train Test Split

As before, the data are divided into training and test sets using the train_test_split function of the sklearn module (see above), with the default stratified 75/25 split: the training set is used to train the model, while the test set is used to evaluate it.

X_train, X_test, y_train, y_test = train_test_split(df_qualitative, df["type"], stratify=df["type"], random_state=my_seed)

As before, we define the parameters of the RandomForestClassifier model. We use 1000 Decision Trees, with a maximum depth of 5 and a minimum number of samples per leaf of 1.

feature_names = df_qualitative.columns
forest = RandomForestClassifier(random_state=my_seed, n_estimators=1000, n_jobs=-1, max_depth=5)
forest.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, n_estimators=1000, n_jobs=-1,
                       random_state=1)

Evaluate the model

We can print the accuracy of the model on the training and test set.

print(f"Train score:{forest.score(X_train, y_train):.3f}")
print(f"Test score:{forest.score(X_test, y_test):.3f}")
Train score:0.707
Test score:0.711

Define feature importance

As before, we can compute the feature importances of the model, store them in a DataFrame, and visualize the results.

importances = forest.feature_importances_
forest_importances = pd.DataFrame(importances, index=feature_names, columns=["importance"])

forest_importances=forest_importances.sort_values(by="importance", ascending=False)
importance
TPI Landforms_9 0.185137
TPI Landforms_5 0.147591
TPI Landforms_10 0.137261
Geomorphon_3 0.124809
random 0.084466
Geomorphon_2 0.058302
Geomorphon_7 0.047523
TPI Landforms_2 0.029439
TPI Landforms_6 0.028992
Geomorphon_6 0.025247
Geomorphon_9 0.023483
Geomorphon_1 0.021574
TPI Landforms_7 0.016498
Geomorphon_5 0.016276
TPI Landforms_4 0.015436
Geomorphon_8 0.013741
Geomorphon_4 0.010895
TPI Landforms_1 0.009214
Geomorphon_10 0.004116
fig, ax = plt.subplots()
forest_importances["importance"].plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.savefig("feature_importance_categorical_permutation.png", dpi=600)
plt.show()

In conclusion, we can use seaborn’s countplot function to visualize the number of sites by the most important categorical features.

fig, ax = plt.subplots(1, 4, figsize=(10, 5), sharey = True)
sns.countplot(x="type", data=df.loc[df["TPI Landforms"] == 9], ax=ax.flatten()[0])
ax.flatten()[0].set_title("TPI Landforms = 9")
sns.countplot(x="type", data=df.loc[df["TPI Landforms"] == 5], ax=ax.flatten()[1])
ax.flatten()[1].set_title("TPI Landforms = 5")
sns.countplot(x="type", data=df.loc[df["TPI Landforms"] == 10], ax=ax.flatten()[2])
ax.flatten()[2].set_title("TPI Landforms = 10")
sns.countplot(x="type", data=df.loc[df["Geomorphon"] == 3], ax=ax.flatten()[3])
ax.flatten()[3].set_title("Geomorphon = 3")

fig.tight_layout()
plt.savefig("categorical_important_count.jpg", dpi=600)