Mapping patterns of multiple socio-economic deprivation in Edinburgh using self-organizing maps

Individual Coursework Assignment

Date: 2021.02

Brief Description

The Scottish Index of Multiple Deprivation (SIMD) is a set of variables for identifying places where people experience disadvantage across several different measures. Past studies have demonstrated the effectiveness of unsupervised learning algorithms on similar indices: Kang et al. (2018) applied the classical K-means algorithm to cluster health and social care data in Glasgow; Pikoula et al. (2019) identified a subtype of disease by applying both K-means and hierarchical clustering to health records. While K-means is a popular clustering method, it has a limitation: it does not preserve topology, especially for higher-dimensional data (Joshi & Nalwade 2013). The Self-Organising Map (SOM), also known as the Kohonen map, is an unsupervised method of clustering data for visualisation and analysis (Kohonen 1982). SOM not only clusters multidimensional data effectively but also preserves the topology and non-linear structure of the input dataset (Maghawry et al. 2020). This is because, during SOM training, data points that are close in the input space are mapped to neurons that are close in the output space. SOM has already been widely applied to the study of multiple deprivation (Arcagni et al. 2019, Lucchini & Assi 2013).

Past studies have often examined social inequalities through a single socio-economic indicator. Methodological arguments arose when it came to deciding which dimensions or indicators to adopt (Lucchini & Assi 2013). The traditional hierarchical weighting approach may introduce errors and information loss when policymakers identify deprived groups. In particular, when the overlap between indicators is small, an indicator with a lower weight might be ignored in the system (Gan et al. 2017). SOM is a powerful tool for mapping high-dimensional indicators onto a low-dimensional space. To overcome the limitations of other techniques, this study uses SOM to propose multidimensional socio-economic measures of deprivation in Edinburgh, taken from the SIMD dataset. It aims to identify a certain number of clusters of data zones based on deprivation variables.

The data used are drawn from the 2020 SIMD, which covers a wide range of socio-economic measures. Since SOM training requires clean numeric data, and normally distributed and representative columns are preferred, we calculated the mean and standard values of the normalised indicators (see equation 1). In the exploratory analysis we also tested variables such as crime, but most data zones have a relatively low crime rate, with several high-crime outliers, which heavily skewed the SOM clustering. Therefore, in this study we focus on 597 data zones of the City of Edinburgh measured on 6 indicators referring to the following dimensions: welfare, mental health, education, accessibility, income and employment (see Table 1). Fig 1 shows the geographical distribution of deprivation levels across the city. The SOM process was automated using R, and visualisation was carried out in R and ESRI ArcGIS.
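Equation 1 is not reproduced in this text. As a minimal sketch, assuming the normalisation is a z-score (subtract the column mean, then divide by the standard deviation), one indicator could be standardised as below. The study itself used R; Python is used here purely for illustration, and the input values are hypothetical, not SIMD data.

```python
import statistics

def standardise(column):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]

# Hypothetical indicator values for five data zones (not SIMD data).
income_rate = [0.05, 0.10, 0.20, 0.15, 0.00]
z = standardise(income_rate)
print([round(v, 2) for v in z])
```

After this step every indicator has mean 0 and standard deviation 1, so no single variable dominates the Euclidean distances used by the SOM.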

Methodology

The starting point is a 597 by 6 matrix (597 data zones and 6 variables). Considering the size and similarity of the clusters and the processing time, we use a 10 by 10 grid, in which each hexagon is a neuron. This means that about 6 data points compete for each neuron on average. Each neuron has:

  • A fixed position on the SOM grid;
  • A weight vector with the same variables as the original data points;
  • The samples from the original data that it represents; one neuron can represent multiple samples.

The process of one SOM iteration can be summarised as follows (Lynn 2015):

  • Initialise neurons randomly;
  • Choose a random data point from the training data and present it to the SOM;
  • Find the Best Matching Unit (BMU) in the map – the most similar neuron by Euclidean distance;
  • Determine the neighbouring neurons of the BMU;
  • Adjust the weights of nodes in the BMU neighbourhood towards the chosen data point.
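The steps listed above can be sketched in a few lines of plain Python. This is an illustrative toy, not the R code used in the study: the linear decay of the learning rate and neighbourhood radius, the starting rate `alpha`, and the toy data are all assumptions made for the example.

```python
import math
import random

def hex_positions(xdim, ydim):
    """Fixed grid coordinates of each neuron on a hexagonal layout."""
    return [(col + 0.5 * (row % 2), row * math.sqrt(3) / 2)
            for row in range(ydim) for col in range(xdim)]

def best_matching_unit(weights, point):
    """Index of the neuron whose weight vector is closest (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], point))

def train_som(data, xdim, ydim, iterations, alpha=0.05, seed=0):
    rng = random.Random(seed)
    positions = hex_positions(xdim, ydim)
    # Initialise neurons randomly (here: copies of random data points).
    weights = [list(rng.choice(data)) for _ in positions]
    max_radius = max(xdim, ydim) / 2
    for t in range(iterations):
        frac = 1 - t / iterations        # assumed linear decay from 1 towards 0
        rate = alpha * frac
        radius = max_radius * frac
        point = rng.choice(data)         # choose a random data point
        bmu = best_matching_unit(weights, point)
        for i, pos in enumerate(positions):
            # Neurons within the radius of the BMU on the grid are its neighbours.
            if math.dist(pos, positions[bmu]) <= radius:
                weights[i] = [w + rate * (x - w)
                              for w, x in zip(weights[i], point)]
    return positions, weights

# Hypothetical toy data: two groups of 2-D points (the study used 597 x 6).
data = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
positions, weights = train_som(data, xdim=3, ydim=3, iterations=500)
print(len(weights), "neurons, each with", len(weights[0]), "weights")
```

Because each update moves a weight vector part of the way towards a data point, the weights always stay inside the range of the training data.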
In this study, we extended the regression analysis with GWR. GWR allows the parameters to be derived for each location separately based on geographic context.

    Results and Discussion
    Visualisation of SOM

    Fig 2 shows that the neurons stabilised after 420 iterations: the SOM evolves rapidly at the beginning and, once it reaches its optimum shape, the curve of changes flattens out.

    Fig 3 gives an overview of the number of data points mapped to each neuron. The darker the red, the more data points the neuron represents. Most neurons hold between 4 and 10 samples, which suits the chosen grid size. The absence of empty neurons, with only a few neurons exceeding 10 samples, also indicates that the SOM grid is suitable for further analysis.
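The counts behind such a plot can be reproduced by assigning every data point to its BMU and tallying the result. The sketch below is illustrative Python with hypothetical weights and data, not the study's R code; the final line only restates the average load implied by 597 data zones on a 10 by 10 grid.

```python
import math
from collections import Counter

def best_matching_unit(weights, point):
    """Index of the neuron whose weight vector is closest (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], point))

def node_counts(weights, data):
    """Number of data points mapped to each neuron."""
    tally = Counter(best_matching_unit(weights, p) for p in data)
    return [tally.get(i, 0) for i in range(len(weights))]

# Hypothetical neuron weights and data points.
weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
data = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [0.2, -0.1]]

counts = node_counts(weights, data)
empty = [i for i, c in enumerate(counts) if c == 0]
print(counts, "empty neurons:", empty)

# Average load implied by the study's setup.
average_load = 597 / (10 * 10)
```

A long list of empty neurons would suggest the grid is too large for the data, while very high counts would suggest it is too small.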

    SOM quality is measured by the average distance between each data point and its BMU, known as the quantisation error (Vesanto et al. 2000). Fig 4 shows a small quantisation error for most of the neurons, which suggests that the map size is appropriate.
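As a minimal sketch of this measure, the function below averages the distance from each data point to its BMU, separately for every neuron. It is illustrative Python with hypothetical inputs; how the R plotting routine aggregates the distances is not stated in the text, so the per-neuron mean is an assumption.

```python
import math

def per_neuron_error(weights, data):
    """Mean distance between each neuron and the data points it represents."""
    totals = [0.0] * len(weights)
    counts = [0] * len(weights)
    for point in data:
        distances = [math.dist(w, point) for w in weights]
        bmu = distances.index(min(distances))
        totals[bmu] += distances[bmu]
        counts[bmu] += 1
    # Neurons with no data points have no defined error.
    return [t / c if c else None for t, c in zip(totals, counts)]

# Hypothetical neuron weights and data points.
weights = [[0.0, 0.0], [3.0, 4.0]]
data = [[0.0, 1.0], [3.0, 2.0], [0.0, 0.0]]

errors = per_neuron_error(weights, data)
overall = sum(min(math.dist(w, p) for w in weights) for p in data) / len(data)
print(errors, overall)
```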

    The SOM neighbour distance map, also known as the U-Matrix, shows the distance between each neuron and its neighbours (see Fig 5). Where the average distance is low, a darker colour is assigned to the location, indicating that the neuron is more similar to its neighbours. We can also see a divide separating a smaller top-left cluster from a much bigger bottom-right cluster.
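The quantity plotted in a U-Matrix can be sketched as follows, assuming a hexagonal layout in which adjacent neurons sit one grid unit apart. This is illustrative Python with hypothetical one-dimensional weights, not the study's R code.

```python
import math

def hex_positions(xdim, ydim):
    """Fixed grid coordinates of each neuron on a hexagonal layout."""
    return [(col + 0.5 * (row % 2), row * math.sqrt(3) / 2)
            for row in range(ydim) for col in range(xdim)]

def neighbour_distances(positions, weights):
    """Mean weight-space distance from each neuron to its adjacent neurons."""
    result = []
    for i, pos in enumerate(positions):
        # Adjacent neurons on a hexagonal grid are exactly one unit away.
        neighbours = [j for j, other in enumerate(positions)
                      if j != i and math.dist(pos, other) < 1.01]
        total = sum(math.dist(weights[i], weights[j]) for j in neighbours)
        result.append(total / len(neighbours) if neighbours else 0.0)
    return result

# Hypothetical 2 x 2 map in which the last neuron differs from the rest.
positions = hex_positions(2, 2)
weights = [[0.0], [0.0], [0.0], [10.0]]
u_matrix = neighbour_distances(positions, weights)
print([round(v, 2) for v in u_matrix])
```

High values mark neurons that differ from their surroundings, which is how a boundary between clusters shows up on the map.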

    The codes map shows each neuron’s weight vector as a fan diagram, in which the size of each segment represents the magnitude of the corresponding variable (see Fig 6).

    Since this is a high-dimensional SOM model with 6 variables, we use heat maps to show the characteristics of the variables and the relationships between them across the SOM grid (see Fig 7). Welfare, income and employment show similar patterns across the grid, and all three have an inverse relationship with education. The division between clusters is less distinct for mental health and accessibility, but clearer for welfare, education, income and employment.
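A visual comparison of heat maps can be backed up with a simple number. The sketch below extracts the values of each variable across all neurons and computes the Pearson correlation between two of them. It is illustrative Python; the weight vectors are hypothetical and were chosen only to show a positive and an inverse relationship.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def component_plane(weights, index):
    """Values of one variable across all neurons (the data of one heat map)."""
    return [w[index] for w in weights]

# Hypothetical weight vectors for four neurons: (income, employment, education).
weights = [[1.0, 2.0, 4.0], [2.0, 4.0, 3.0], [3.0, 6.0, 2.0], [4.0, 8.0, 1.0]]
income = component_plane(weights, 0)
employment = component_plane(weights, 1)
education = component_plane(weights, 2)

r_income_employment = pearson(income, employment)
r_income_education = pearson(income, education)
print(round(r_income_employment, 2), round(r_income_education, 2))
```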

    Clustering and Segmentation

    According to Fig 8, the knee of the curve is at around 7 clusters. Hence, 7 clusters are used in the subsequent clustering.
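The text does not state what Fig 8 plots. A common choice, assumed here, is the within-cluster sum of squares (WCSS) of the neuron weight vectors for an increasing number of clusters, with the knee read off the curve. The sketch below uses a small hand-written K-means in Python on hypothetical weight vectors. It is not the study's R code, and the clustering method actually used may differ.

```python
import math
import random

def kmeans_wcss(points, k, seed=0, rounds=20):
    """Run a basic K-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(centres[c], p))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return sum(min(math.dist(c, p) for c in centres) ** 2 for p in points)

# Hypothetical neuron weight vectors forming two tight groups.
codes = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
wcss = {k: kmeans_wcss(codes, k) for k in range(1, 4)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

In this toy case the WCSS drops sharply from one to two clusters and barely changes afterwards, so the knee is at 2; the same reading applied to Fig 8 gives 7.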

    The clusters are found to be contiguous in Fig 9, except for one outlying point at the bottom middle. From Fig 9, the characteristics of the clusters can be summarised as:

  • Cluster 01 consists of data zones with a high level of accessibility deprivation;
  • Cluster 02 consists of data zones with relatively uniform, low levels of deprivation across all variables;
  • Cluster 03 consists of data zones with a high level of education deprivation;
  • Cluster 04 consists of data zones with a low level of education deprivation, but higher than others;
  • Cluster 05 consists of data zones with a high level of mental health deprivation but relatively uniform levels for the other variables;
  • Cluster 06 consists of data zones with high levels of deprivation in welfare, mental health, employment and income;
  • Cluster 07 consists of data zones with very similar levels of deprivation in welfare, mental health, employment and income.

    Mapping clusters to the geographical map

    Fig 10 shows the spatial distribution of the clustered data zones. To better visualise the spatial distribution of the different clusters, we split the map into seven maps, one per cluster, and compare them with the urban and rural areas of the City of Edinburgh (see Fig 11). The dual urban/rural classification is sourced from the 2-fold classification (Scottish Government 2016).
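Moving from the SOM grid back to the map requires a lookup from data zone to BMU to cluster. A minimal sketch of that chain is given below in illustrative Python. The zone identifiers, weights and cluster labels are hypothetical placeholders, and the resulting table would still need to be joined to the data zone boundaries in GIS software.

```python
import math

def assign_clusters(zone_ids, data, weights, neuron_cluster):
    """Map each data zone to the cluster of its best matching unit."""
    result = {}
    for zone, point in zip(zone_ids, data):
        bmu = min(range(len(weights)),
                  key=lambda i: math.dist(weights[i], point))
        result[zone] = neuron_cluster[bmu]
    return result

# Hypothetical identifiers and values (not real SIMD data zones).
zone_ids = ["zone_a", "zone_b", "zone_c"]
data = [[0.1, 0.2], [0.9, 1.1], [0.0, 0.0]]
weights = [[0.0, 0.0], [1.0, 1.0]]   # one weight vector per neuron
neuron_cluster = [1, 2]              # cluster label of each neuron

zone_cluster = assign_clusters(zone_ids, data, weights, neuron_cluster)
print(zone_cluster)
```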

    Discussion

    SOM offers an opportunity to explore multidimensional deprivation across 6 socio-economic variables. According to the SOM results, there is no obvious difference in education deprivation between urban and rural areas. However, for the clusters deprived in welfare, mental health, employment and income, urban areas appear to perform worse. Areas deprived in accessibility to retail stores are distributed relatively uniformly across the city. With the help of the clustering results, policymakers can also make targeted plans for individual variables or groups of variables, such as increasing education investment in areas near Balerno, or focusing on the areas near Leith, Granton and Moredun, where deprivation in multiple variables is evident.

    Different factors may work differently in different data zones, and their correlation with the final deprivation measure varies. The SOM technique offers a new perspective on SIMD. However, it was difficult to visualise many variables on a two-dimensional plane, especially in the fan diagram. We also acknowledge that this study ignores many other SIMD variables for simplicity. To develop the study further, more variables could be introduced to test how different combinations perform in SOM.