Author: Lorenzo Cardarelli
Email: lorenzo.cardarelli@uniba.it
Website: https://lrncrd.github.io/
Python implementation: CPython
Python version : 3.9.16
IPython version : 8.8.0
matplotlib: 3.6.2
seaborn : 0.12.2
pandas : 1.4.4
numpy : 1.23.5
sklearn : 1.0.2
Compiler : MSC v.1916 64 bit (AMD64)
OS : Windows
Release : 10
Machine : AMD64
Processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
CPU cores : 20
Architecture: 64bit
Supplementary materials
DEFINING SOUTHERN ETRURIA FINAL BRONZE AGE SETTLEMENT MODELS USING AN INTEGRATED GIS AND MACHINE LEARNING APPROACH.
This Jupyter Notebook contains the analyses carried out for the article ‘DEFINING SOUTHERN ETRURIA FINAL BRONZE AGE SETTLEMENT MODELS USING AN INTEGRATED GIS AND MACHINE LEARNING APPROACH’. The file includes the commented procedure, with the definition of the parameters used, and numerous references to the methods used in the form of hyperlinks to the official documentation of the modules used.
Hardware and software
All the analyses in this study were performed on the area described above using SAGA GIS, version 8.4.1. General GIS data management, as well as geoprocessing, data handling and map creation operations, were carried out using QGIS v. 3.28.0 Firenze. Statistical and ML-related analyses were carried out in the Python 3.9 environment using the Scikit-learn 1.0.2 package; Matplotlib 3.6.2 and Seaborn 0.12.2 were used to create the graphs. All analyses were performed on a PC with the Windows 11 operating system, an Intel i7-12700k processor and 32 GB of DDR4 RAM. Supplementary material for the Python-based analysis is provided in HTML format via the Quarto scientific publishing platform.
Preliminary procedures
The preliminary procedures within this analysis concern the import of the libraries required for the analysis, the definition of a support function for the correlation filter and finally the definition of a specific random seed for the reproducibility of the results.
Importing modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Define custom function
def high_correlation_filter(data, method="pearson", correlation_limit=0.90):
    '''
    This function takes a dataframe and returns a correlation matrix and a set of correlated features.
    https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on
    '''
    correlation_matrix = data.corr(method=method)
    correlated_features = set()
    for i in range(len(correlation_matrix.columns)):
        for j in range(i):
            if abs(correlation_matrix.iloc[i, j]) > correlation_limit:
                colname = correlation_matrix.columns[i]
                correlated_features.add(colname)
    print(f"Highly correlated features: {correlated_features}")
    return correlation_matrix, correlated_features
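For illustration, here is a minimal, hypothetical usage of the filter on a toy DataFrame (not part of the analysis): two of the three columns are almost perfectly correlated, so one of them is flagged and can be dropped.

# Hypothetical example: column "b" is almost a multiple of "a", so it is flagged
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],   # strongly correlated with "a"
    "c": [5, 1, 4, 2, 3],    # weakly correlated with the others
})
toy_corr, toy_correlated = high_correlation_filter(toy, correlation_limit=0.90)
toy.drop(toy_correlated, axis=1)  # keeps "a" and "c"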
Set a seed for reproducibility
my_seed = 1
np.random.seed(my_seed)
Analysis
During this analysis, the following steps are performed:
The data are imported (Load Data)
The data are preprocessed (Prepare data for Feature Selection)
Correlated features are removed (Remove correlated features (correlation filter))
The random forest model is trained and evaluated (Train Test Split, Evaluate the model)
Feature importance is calculated (Define feature importance)
Load Data
The .xlsx file is loaded and the data are converted into a pandas DataFrame.
df = pd.read_excel('data_real_and_random_site.xlsx')
id | name | type | Visible Sky | Average View Distance | Sky View Factor | Positive Opennes | Negative Opennes | Standardized Height | Slope | ... | Normalized height | Mid-Slope position | TRI | Aspect | DEM | Channel Network Distance | Valley Depth | TWI | TPI Landforms | Geomorphon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 15 | Sorano-Castelvecchio | real | 89.29539 | 348.75 | 0.73322 | 1.50728 | 1.16619 | 315.28928 | 0.81019 | ... | 0.87201 | 0.74401 | 7.30492 | 5.63866 | 361.56799 | 54.38055 | 54.14105 | 2.15133 | 9 | 3 |
1 | 10 | Pitigliano | real | 99.38343 | 4892.50 | 0.99938 | 1.56573 | 1.29106 | 303.11221 | 0.04135 | ... | 0.97375 | 0.94751 | 0.31347 | 5.48290 | 311.28189 | 97.14888 | 4.43402 | 5.25300 | 9 | 3 |
2 | 13 | Monte Rosso | real | 90.25592 | 5008.75 | 0.96228 | 1.49072 | 1.41067 | 297.42883 | 0.37903 | ... | 0.76303 | 0.52607 | 2.51952 | 2.71155 | 389.79739 | 101.40732 | 12.95129 | 5.52274 | 7 | 6 |
3 | 16 | Sovana | real | 98.91448 | 4495.00 | 0.99931 | 1.56004 | 1.44790 | 275.29483 | 0.02647 | ... | 0.94545 | 0.89091 | 0.17408 | 1.90709 | 291.17770 | 54.11824 | 3.39746 | 6.32303 | 9 | 3 |
4 | 4 | Il Gaggio | real | 93.62825 | 708.75 | 0.98670 | 1.51222 | 1.56403 | 11.69717 | 0.01479 | ... | 0.07786 | 0.84428 | 0.09401 | 2.70805 | 150.23061 | 1.23438 | 68.94232 | 6.93691 | 4 | 10 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
327 | 0 | random | random | 99.35566 | 3550.00 | 0.99983 | 1.56151 | 1.55374 | 176.11900 | 0.01875 | ... | 0.86778 | 0.73556 | 0.11861 | 1.43253 | 202.95320 | 27.58247 | 2.00354 | 9.04514 | 5 | 6 |
328 | 0 | random | random | 98.14186 | 277.50 | 0.99868 | 1.54953 | 1.54731 | 71.57639 | 0.04738 | ... | 0.34295 | 0.31410 | 0.30070 | 4.27997 | 208.70860 | 5.50351 | 4.38770 | 7.53747 | 5 | 6 |
329 | 0 | random | random | 95.48648 | 226.25 | 0.99368 | 1.53070 | 1.53000 | 41.15727 | 0.11969 | ... | 0.26152 | 0.47696 | 0.76223 | 3.76224 | 157.37650 | 14.26704 | 12.41028 | 7.00283 | 6 | 6 |
330 | 0 | random | random | 96.62697 | 478.75 | 0.99703 | 1.52996 | 1.54719 | 21.26929 | 0.06701 | ... | 0.09951 | 0.80097 | 0.42888 | 4.54766 | 213.73441 | 0.62704 | 45.66541 | 6.64814 | 5 | 6 |
331 | 0 | random | random | 99.22222 | 4878.75 | 0.99978 | 1.56011 | 1.56970 | 0.02071 | 0.00911 | ... | 0.08170 | 0.83660 | 0.05905 | 4.29407 | 0.25349 | 0.00000 | 42.20563 | 13.45998 | 5 | 1 |
332 rows × 21 columns
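As a quick supplementary check (not in the original notebook), the balance between real and random sites can be inspected from the type column before modelling:

# Number of rows per class in the target column
df["type"].value_counts()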
Prepare data for Feature Selection
We can isolate the columns by macro-classes, first deleting the columns we are not interested in as features (id, name and type).
df_data = df[df.columns[3:]]
df_data.columns
Index(['Visible Sky', 'Average View Distance', 'Sky View Factor',
'Positive Opennes', 'Negative Opennes', 'Standardized Height', 'Slope',
'Relative Slope', 'Normalized height', 'Mid-Slope position', 'TRI',
'Aspect', 'DEM', 'Channel Network Distance', 'Valley Depth', 'TWI',
'TPI Landforms', 'Geomorphon'],
dtype='object')
We divide the columns into the various macro-classes and print the results for verification purposes.
visibility = df_data.columns[:5]
morphometry = df_data.columns[5:13]
hydrology = df_data.columns[13:16]
terrain_classification = df_data.columns[16:]
print(visibility)
print(morphometry)
print(hydrology)
print(terrain_classification)
Index(['Visible Sky', 'Average View Distance', 'Sky View Factor',
'Positive Opennes', 'Negative Opennes'],
dtype='object')
Index(['Standardized Height', 'Slope', 'Relative Slope', 'Normalized height',
'Mid-Slope position', 'TRI', 'Aspect', 'DEM'],
dtype='object')
Index(['Channel Network Distance', 'Valley Depth', 'TWI'], dtype='object')
Index(['TPI Landforms', 'Geomorphon'], dtype='object')
Visualise correlation
Through the pairplot function of the seaborn module, it is possible to visualise the correlation between variables. Seaborn’s pairplot function creates a grid of plots that displays pairwise relationships between variables in a dataset. It allows you to quickly visualise the joint distribution of each pair of variables, as well as the marginal distribution of each variable on the diagonal of the grid.
A pairplot is produced for each macro-class of variables, starting with those relating to the ‘Visibility’ macro-class.
Visibility
g = sns.pairplot(df_data[visibility], height=2, aspect=1, corner=True, plot_kws={'s': 5})
g.fig.suptitle('Visibility', y=1.05);
Morphometry
g = sns.pairplot(df_data[morphometry], height=2, aspect=1, corner=True, plot_kws={'s': 5})
g.fig.suptitle('Morphometry', y=1.05);
Hydrology
g = sns.pairplot(df_data[hydrology], height=2, aspect=1, corner=True, plot_kws={'s': 5})
g.fig.suptitle('Hydrology', y=1.05);
Prepare data for Random Forest
The data are prepared for the Random Forest analysis. They are divided into quantitative and categorical variables: the quantitative variables are normalized and the categorical variables are converted into dummy variables. A random variable, used as an importance threshold, is also added to each group.
df_data_quantitative = df_data.drop(terrain_classification, axis=1)

### Removing correlated features
# Apply the correlation filter defined above (the default 0.90 threshold is assumed)
correlation_matrix, correlated_features = high_correlation_filter(df_data_quantitative)
df_data_quantitative.drop(correlated_features, axis=1, inplace=True)
df_data_quantitative["random"] = np.random.randint(0, 100, df.shape[0], dtype=int)
df_qualitative = pd.DataFrame(df_data[terrain_classification])
df_qualitative["random"] = np.random.randint(2, size=df_qualitative.shape[0])
df_qualitative = pd.get_dummies(df_qualitative, columns=terrain_classification)
Quantitative features
Quantitative features are normalized using the MinMaxScaler function of the sklearn module (see above). In this formula, each feature is rescaled to the [0, 1] range as (x - min) / (max - min).
mms = MinMaxScaler().fit_transform(df_data_quantitative)
mms_df = pd.DataFrame(mms, columns=df_data_quantitative.columns)
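As a sanity check (a supplementary sketch, not part of the original notebook), the min-max formula above can be reproduced manually for one column and compared with the scaler output:

# Manual min-max scaling of a single column, compared with the MinMaxScaler result
col = df_data_quantitative["Slope"]
manual = (col - col.min()) / (col.max() - col.min())
np.allclose(manual.values, mms_df["Slope"].values)  # expected: True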
This is a visualisation of the normalised quantitative variables.
Visible Sky | Average View Distance | Negative Opennes | Standardized Height | Slope | Relative Slope | Mid-Slope position | Aspect | DEM | Channel Network Distance | Valley Depth | TWI | random | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.542949 | 0.032934 | 0.185453 | 0.331296 | 0.985391 | 0.495643 | 0.748028 | 0.902676 | 0.348977 | 0.174549 | 0.230378 | 0.000000 | 0.373737 |
1 | 0.975338 | 0.551183 | 0.434352 | 0.318500 | 0.050176 | 0.945918 | 0.953010 | 0.877719 | 0.300408 | 0.311824 | 0.025303 | 0.222245 | 0.121212 |
2 | 0.584119 | 0.564443 | 0.672766 | 0.312527 | 0.460929 | 0.877078 | 0.528501 | 0.433672 | 0.376242 | 0.325493 | 0.060443 | 0.241573 | 0.727273 |
3 | 0.955238 | 0.505845 | 0.746975 | 0.289268 | 0.032076 | 0.930666 | 0.895998 | 0.304775 | 0.280990 | 0.173707 | 0.021027 | 0.298916 | 0.090909 |
4 | 0.728662 | 0.073995 | 0.978453 | 0.012270 | 0.017869 | 0.017418 | 0.849028 | 0.433111 | 0.144856 | 0.003964 | 0.291444 | 0.342902 | 0.757576 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
327 | 0.974148 | 0.398061 | 0.957942 | 0.185051 | 0.022686 | 0.922111 | 0.739517 | 0.228738 | 0.195778 | 0.088535 | 0.015276 | 0.493964 | 0.848485 |
328 | 0.922122 | 0.024808 | 0.945125 | 0.075193 | 0.057511 | 0.550339 | 0.314987 | 0.684976 | 0.201337 | 0.017667 | 0.025112 | 0.385934 | 0.585859 |
329 | 0.808309 | 0.018962 | 0.910622 | 0.043228 | 0.145469 | 0.528975 | 0.479033 | 0.602022 | 0.151758 | 0.045796 | 0.058211 | 0.347626 | 0.858586 |
330 | 0.857192 | 0.047762 | 0.944886 | 0.022329 | 0.081389 | 0.013422 | 0.805403 | 0.727868 | 0.206191 | 0.002015 | 0.195411 | 0.322211 | 0.292929 |
331 | 0.968428 | 0.549615 | 0.989755 | 0.000000 | 0.010960 | 0.000020 | 0.841293 | 0.687236 | 0.000000 | 0.000002 | 0.181137 | 0.810302 | 0.444444 |
332 rows × 13 columns
Train Test Split
The data are divided into train and test sets using the train_test_split function of the sklearn module. It takes the dataset (usually a pandas DataFrame) and splits it randomly into two subsets: the training set and the testing set. It assigns a specified proportion of the dataset (usually 70-80% for training and 20-30% for testing) to the training set and the remaining proportion to the testing set. It ensures that the random split is reproducible by setting a random seed value (which can be specified by the user). It returns four arrays: X_train, X_test, y_train, y_test, where X represents the features and y represents the target variable.
In this case the scikit-learn default proportions are used (75% for training and 25% for testing), stratified on the type column.
X_train, X_test, y_train, y_test = train_test_split(mms_df, df["type"], stratify=df["type"], random_state=my_seed)
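As a small supplementary check (not part of the original notebook), the sizes of the two subsets and the class proportions preserved by the stratify argument can be inspected:

# Subset sizes and class balance preserved by the stratified split
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))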
The parameters of the Random Forest model (sklearn’s RandomForestClassifier) are defined. We use 1000 Decision Trees, with a maximum depth of 5 and a minimum number of samples per leaf of 1.
feature_names = mms_df.columns
forest = RandomForestClassifier(random_state=my_seed, n_estimators=1000, n_jobs=-1, max_depth=5)
forest.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, n_estimators=1000, n_jobs=-1,
random_state=1)
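The minimum number of samples per leaf is not set explicitly, so the scikit-learn default of 1 applies; if desired, this can be confirmed from the estimator (a small check, not in the original notebook):

# min_samples_leaf is left at its default value
forest.get_params()["min_samples_leaf"]  # expected: 1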
Evaluate the model
Here, we can print the accuracy of the model on the training set first…
forest.score(X_train, y_train)
0.9437751004016064
… and then on the test set.
forest.score(X_test, y_test)
0.6506024096385542
Define feature importance
Now we can compute the feature importances of the model and store them in a DataFrame
importances = forest.feature_importances_
forest_importances = pd.DataFrame(importances, index=feature_names, columns=["importance"])
forest_importances = forest_importances.sort_values(by="importance", ascending=False)
and visualise the results.
importance | |
---|---|
Negative Opennes | 0.153278 |
Mid-Slope position | 0.117242 |
Channel Network Distance | 0.106651 |
TWI | 0.106404 |
Visible Sky | 0.082909 |
Valley Depth | 0.063663 |
Slope | 0.061621 |
Average View Distance | 0.060225 |
Relative Slope | 0.060151 |
random | 0.052507 |
DEM | 0.049658 |
Standardized Height | 0.044897 |
Aspect | 0.040795 |
We can use a bar plot from the matplotlib module to visualize the results. Features are ordered by importance (mean decrease in impurity, MDI).
fig, ax = plt.subplots()
forest_importances["importance"].plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.savefig("feature_importance_permutation.png", dpi=600)
plt.show()
We select the most important features, store them in a DataFrame, and visualize the comparison between real and random sites using seaborn’s boxplot.
quantitative_important = df_data_quantitative[["Negative Opennes", "Mid-Slope position", "TWI", "Channel Network Distance", "Visible Sky", "DEM"]].copy()  # .copy() avoids a SettingWithCopyWarning
quantitative_important["type"] = df["type"]
fig, ax = plt.subplots(2, 3, figsize=(10, 5), sharex=True)
for i, feature in enumerate(quantitative_important.columns[:-1]):
    sns.boxplot(x="type", y=feature, data=quantitative_important, ax=ax.flatten()[i])
    #ax.flatten()[i].set_title(feature)
    ax.flatten()[i].set_xlabel(i+1)
fig.tight_layout()
plt.savefig("quantitative_box.png", dpi=600)
Categorical features
Moving to the categorical features, we previously used the get_dummies function of the pandas module to convert the categorical variables into dummy variables. The get_dummies() function in pandas is used to convert categorical variables into dummy/indicator variables. It creates a new DataFrame with a column for each unique category in the original categorical variable, and a binary value of 1 or 0 indicating whether or not the category is present for each row.
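A minimal, hypothetical illustration of this behaviour (not part of the analysis):

# get_dummies turns each category into its own 0/1 indicator column
toy_cat = pd.DataFrame({"Geomorphon": [3, 6, 3]})
pd.get_dummies(toy_cat, columns=["Geomorphon"])
#    Geomorphon_3  Geomorphon_6
# 0             1             0
# 1             0             1
# 2             1             0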
This is a visualization of the dummy variables obtained from the dataset.
random | TPI Landforms_1 | TPI Landforms_2 | TPI Landforms_4 | TPI Landforms_5 | TPI Landforms_6 | TPI Landforms_7 | TPI Landforms_9 | TPI Landforms_10 | Geomorphon_1 | Geomorphon_2 | Geomorphon_3 | Geomorphon_4 | Geomorphon_5 | Geomorphon_6 | Geomorphon_7 | Geomorphon_8 | Geomorphon_9 | Geomorphon_10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
327 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
328 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
329 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
330 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
331 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
332 rows × 19 columns
Train Test Split
As before, the data are divided into training and test sets using the train_test_split function of the sklearn module (see above). The training set (75% of the whole dataset, the scikit-learn default) is used to train the model, while the test set (25%) is used to evaluate it.
X_train, X_test, y_train, y_test = train_test_split(df_qualitative, df["type"], stratify=df["type"], random_state=my_seed)
As before, we define the parameters of the RandomForestClassifier model. We use 1000 Decision Trees, with a maximum depth of 5 and a minimum number of samples per leaf of 1.
feature_names = df_qualitative.columns
forest = RandomForestClassifier(random_state=my_seed, n_estimators=1000, n_jobs=-1, max_depth=5)
forest.fit(X_train, y_train)
RandomForestClassifier(max_depth=5, n_estimators=1000, n_jobs=-1,
random_state=1)
Evaluate the model
We can print the accuracy of the model on the training and test set.
print(f"Train score:{forest.score(X_train, y_train):.3f}")
print(f"Test score:{forest.score(X_test, y_test):.3f}")
Train score:0.707
Test score:0.711
Define feature importance
As before, we can compute the feature importances of the model, store them in a DataFrame, and visualize the results.
importances = forest.feature_importances_
forest_importances = pd.DataFrame(importances, index=feature_names, columns=["importance"])
forest_importances = forest_importances.sort_values(by="importance", ascending=False)
importance | |
---|---|
TPI Landforms_9 | 0.185137 |
TPI Landforms_5 | 0.147591 |
TPI Landforms_10 | 0.137261 |
Geomorphon_3 | 0.124809 |
random | 0.084466 |
Geomorphon_2 | 0.058302 |
Geomorphon_7 | 0.047523 |
TPI Landforms_2 | 0.029439 |
TPI Landforms_6 | 0.028992 |
Geomorphon_6 | 0.025247 |
Geomorphon_9 | 0.023483 |
Geomorphon_1 | 0.021574 |
TPI Landforms_7 | 0.016498 |
Geomorphon_5 | 0.016276 |
TPI Landforms_4 | 0.015436 |
Geomorphon_8 | 0.013741 |
Geomorphon_4 | 0.010895 |
TPI Landforms_1 | 0.009214 |
Geomorphon_10 | 0.004116 |
fig, ax = plt.subplots()
forest_importances["importance"].plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.savefig("feature_importance_categorical_permutation.png", dpi=600)
plt.show()
In conclusion, we can use seaborn’s countplot (a bar plot of counts) to visualize the number of sites for the most important categorical features.
fig, ax = plt.subplots(1, 4, figsize=(10, 5), sharey=True)
sns.countplot(x="type", data=df.loc[df["TPI Landforms"] == 9], ax=ax.flatten()[0])
ax.flatten()[0].set_title("TPI Landforms = 9")
sns.countplot(x="type", data=df.loc[df["TPI Landforms"] == 5], ax=ax.flatten()[1])
ax.flatten()[1].set_title("TPI Landforms = 5")
sns.countplot(x="type", data=df.loc[df["TPI Landforms"] == 10], ax=ax.flatten()[2])
ax.flatten()[2].set_title("TPI Landforms = 10")
sns.countplot(x="type", data=df.loc[df["Geomorphon"] == 3], ax=ax.flatten()[3])
ax.flatten()[3].set_title("Geomorphon = 3")
fig.tight_layout()
plt.savefig("categorical_important_count.jpg", dpi=600)