############################################################################################
##########################  Population Genetic Structure
############################################################################################

## Load the fSNPS object

load('fSNPS.robj')

## We will study the population genetic structure using Principal Component Analysis (PCA).
## PCA requires that no missing values are present in the investigated matrix.
## This can be problematic, since genotype matrices often have missing values.
## There are two approaches that can be employed to tackle this issue:
## 1) Remove all the SNPs that have missing values. The drawback is that if there are a lot of missing values, the genotype matrix can become very small.
## 2) Impute missing values from the ones that are known. The drawback here is that this procedure gives "more weight" to frequent genotypes.
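
## A minimal sketch of the two options on a toy matrix (illustration only, not part of the analysis).
## Assumptions: 'toy_G', 'toy_complete', 'toy_imputed' and 'impute_mode' are hypothetical names;
## genotypes are coded 0/1/2, and option 2 is sketched as a simple replacement of each missing
## value by the most frequent genotype of that SNP (other imputation methods exist).

toy_G <- matrix(c(0, 1,  2, 1,
                  1, NA, 1, 1,
                  2, 2, NA, 2),
                nrow = 4, dimnames = list(NULL, c("snp1", "snp2", "snp3"))) # 4 individuals x 3 SNPs

# Option 1: keep only the SNPs (columns) without missing values
toy_complete <- toy_G[, colSums(is.na(toy_G)) == 0, drop = FALSE]

# Option 2: replace missing values by the most frequent genotype of the SNP
impute_mode <- function(x) {
  counts <- table(x) # table() ignores NA by default
  x[is.na(x)] <- as.numeric(names(counts)[which.max(counts)])
  x
}
toy_imputed <- apply(toy_G, 2, impute_mode)

dim(toy_complete) # fewer SNPs remain after option 1
anyNA(toy_imputed) # no missing values remain after option 2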

## Our dataset contains very few missing values, so we will use option 1). 
# REMEMBER: this procedure is required ONLY for this analysis. For the adaptation study we will use the fSNPS matrix containing missing values, since SamBada can handle them.

X <- fSNPS[, colSums(is.na(fSNPS)) == 0] # keep only the SNPs (columns) without missing values

dim(X) # The population structure study will involve 27778 loci.

PCA <- prcomp(X, scale. = TRUE) # run the PCA on centered and scaled genotypes
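
## A hedged sanity check, sketched on toy data (hypothetical names: 'toy_X', 'toy_var', 'toy_PCA').
## prcomp() with scale. = TRUE stops with an error if a column is constant (e.g. a monomorphic SNP),
## because such a column cannot be rescaled to unit variance. This sketch assumes that such loci,
## if present, should simply be dropped before the PCA.

toy_X <- cbind(snp1 = c(0, 1, 2, 1),
               snp2 = c(1, 1, 1, 1), # constant column: zero variance
               snp3 = c(2, 0, 1, 2))
toy_var <- apply(toy_X, 2, var) # variance of each SNP
toy_PCA <- prcomp(toy_X[, toy_var > 0, drop = FALSE], scale. = TRUE) # PCA on variable SNPs only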

## The same principles of the PCA used to describe the environment apply here.

## We can have a look at how the first principal components describe our samples.

par(mfrow=c(2,2))
plot(PCA$x[,1:2], xlab='PC1', ylab='PC2')
plot(PCA$x[,3:4], xlab='PC3', ylab='PC4') 
plot(PCA$x[,5:6], xlab='PC5', ylab='PC6') 
plot(PCA$x[,7:8], xlab='PC7', ylab='PC8')

# RQ7) Comment on this graph. Do you think that there is genetic structure?

# The first PC shows a differentiation that involves many samples, which might reflect population structure. We can visualize this on the map:

# We can reload the plot_map_gradient function.
load('plot_map_gradient.Rfun')
par(mfrow=c(1,1))
plot_map_gradient(shp=env, gradient=PCA$x[,1], legendpos = 'bottomright')

# Apparently, the major variation in the genetic data distinguishes individuals on the west side from those on the east side of Morocco.
# The question is now to understand how strong this differentiation is:

plot(100*PCA$sdev^2/sum(PCA$sdev^2), ylab='% of Variance Explained', xlab='PCs') # the variance of each PC is sdev^2
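
## A short sketch of how the proportion of variance is obtained from a prcomp object, on simulated data.
## Assumptions: 'toy_M', 'toy_pca' and 'toy_pve' are hypothetical names, and the values are random
## numbers, not real results. The variance of each PC is the square of its standard deviation (sdev).

set.seed(1)
toy_M <- matrix(rnorm(200), nrow = 20) # 20 samples x 10 variables
toy_pca <- prcomp(toy_M, scale. = TRUE)
toy_pve <- toy_pca$sdev^2 / sum(toy_pca$sdev^2) # proportion of variance explained by each PC
cumsum(toy_pve) # cumulative proportion, reaching 1 at the last PC
summary(toy_pca)$importance # reports the same proportions (rounded)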

# RQ8) What can you say about the strength of the population structure? How can this information influence your adaptation study? 

## We export the first principal component and the individual heterozygosity to visualize them later in QGIS.

load('HE_ind.Robj')

out_env <- cbind(env[,1:4], HE_ind, PCA$x[,1])
names(out_env)[(ncol(out_env)-1):(ncol(out_env))] <- c("Obs_Het", "PC1")

shapefile(out_env, 'MOOA_env_Gen_Str.shp')

