Principal Component Analysis to analyse the number of Covid19 infections?
Data is extracted from data.public.lu.
# Load the packages used below
library(tidyverse)   # as_tibble, rename, select, mutate, add_column, ggplot
library(FactoMineR)  # PCA

# Data import
covid_data = read.csv2("https://data.public.lu/fr/datasets/r/767f8091-0591-4b04-9a6f-a9d60cd57159",
                       sep=",", na.strings="-", fileEncoding = "Latin1")

# Replace NA values by 0
covid_data[is.na(covid_data)] = 0

# Save data.frame as tibble
covid = as_tibble(covid_data)

# Change variables names
covid <- covid %>%
  rename(date = Date,
         hospital = Soins.normaux,
         int_care = Soins.intensifs,
         deaths = X.1.NbMorts.,
         left_hospital = Sorties.hôpital,
         inf_cum = Nb.de.positifs.cumulé,
         infections = Nb.de.positifs,
         tests = Nb.de.tests.effectués,
         tests_cum = Nb.de.tests.effectués.cumulés) %>%
  select(-Soins.intensifs.1) %>%
  # Rebuild the date column as consecutive days counted from the origin
  mutate(date = as.Date(1:length(date), origin="2020-02-23"))

# Add the month name as a qualitative variable
covid <- covid %>% add_column(month=months(covid$date))
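The date column above is rebuilt from the row index, which assumes one row per consecutive day. The following minimal sketch shows what that construction yields on a made-up row count (`n_rows` is a hypothetical stand-in for the data). Two details are worth knowing: the offsets start at 1, so the first row is mapped to the day after the origin, and `months()` returns month names in the session's locale.

```r
# Hypothetical stand-in for the number of rows in the data
n_rows <- 3

# Same construction as above: day offsets 1..n added to the origin date
dates <- as.Date(1:n_rows, origin = "2020-02-23")

# Offsets start at 1, so the first date is the day after the origin
first_date <- format(dates[1])

# months() depends on the locale; a numeric month code does not
month_num <- format(dates, "%m")

print(first_date)
print(month_num)
```

If the month labels need to be reproducible across machines, the numeric code from `format(dates, "%m")` could be used as the supplementary variable instead of the month name.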
Similarities and differences between different months?
We start by performing a PCA on the Luxembourgish Covid19 data. We choose the month as a qualitative supplementary variable.
PCA_1 = PCA(covid[c("deaths","infections","hospital","int_care","tests","month")], scale.unit=TRUE, quali.sup = 6)
The PCA reflects the observed course of the pandemic: it started in February with low numbers of infections, deaths and hospitalisations. From February to April the number of hospitalisations increased, then decreased after April and stayed low during the summer months. After the summer, the numbers of infections, deaths and hospitalisations increased rapidly and reached their maximum in November and December. After these two months, infection numbers remained high but the mortality rate clearly decreased.
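A qualitative supplementary variable does not influence the principal axes; each of its categories is placed at the barycentre (mean position) of the individuals belonging to it. The sketch below illustrates this idea with base R's `prcomp` on made-up numbers. The data frame `toy` and its values are assumptions for illustration only, and the coordinates differ from FactoMineR's by a scaling convention.

```r
# Toy data (hypothetical values, not the Luxembourg figures)
toy <- data.frame(
  month      = c("Feb", "Feb", "Nov", "Nov", "Nov"),
  infections = c(1, 3, 40, 60, 50),
  deaths     = c(0, 1, 5, 7, 6)
)

# PCA on the standardised quantitative variables only
pca <- prcomp(toy[, c("infections", "deaths")], center = TRUE, scale. = TRUE)

# Position of each month = mean of the scores of its days
month_coord <- aggregate(as.data.frame(pca$x),
                         by = list(month = toy$month),
                         FUN = mean)
print(month_coord)
```

Months whose days have similar profiles end up close together on the factor map, which is what the interpretation above relies on.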
Data is extracted from www.ourworldindata.org and is originally sourced from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
# Data import
covid_data <- read.csv2("covid.csv", sep=",", na.strings="")

# Replace NA values with 0
covid_data[is.na(covid_data)] = 0

# Transform data frame to tibble
covid <- as_tibble(covid_data)

# Select variables of interest
covid <- covid %>%
  select(location, new_cases_per_million, new_deaths_per_million,
         icu_patients_per_million, hosp_patients_per_million, positive_rate)

# Convert the selected measures to numeric
covid[,2:6] = sapply(covid[,2:6], as.numeric)
Similarities and differences between countries?
PCA for daily new identified infection cases
The following code performs a Principal Component Analysis to find the countries most similar to Luxembourg in terms of daily Covid19 infection cases.
# PCA on daily new cases, with the country as qualitative supplementary variable
PCA_2 = PCA(covid[,c(1,2)], scale.unit=TRUE, quali.sup = 1, graph=FALSE)

# Coordinate of Luxembourg
lux_coord = PCA_2$quali.sup$coord[rownames(PCA_2$quali.sup$coord)=="Luxembourg"]

# Order the countries by absolute distance to Luxembourg
sort_index = sort(abs(PCA_2$quali.sup$coord - lux_coord), index.return=TRUE)$ix

# Skip the first entry (Luxembourg itself) and keep the next five
rownames(PCA_2$quali.sup$coord)[sort_index][2:6]
## [1] "San Marino" "Panama" "Czechia" "Slovenia"
## [5] "United States"
We find that San Marino, Panama and the Czech Republic have the infection numbers most similar to those of Luxembourg.
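The ranking above compares coordinates on a single axis. The same idea can be written as a small helper that ranks categories by Euclidean distance to a reference category, and so works for one or several dimensions. This is only a sketch: the function name `nearest_categories` and the toy coordinates are hypothetical, and the input is assumed to be a coordinate matrix with category names as row names, as `PCA_2$quali.sup$coord` is.

```r
# Rank categories by Euclidean distance to a reference category.
# `coord`: numeric matrix of coordinates with category names as row names.
# The helper name is hypothetical, not part of FactoMineR.
nearest_categories <- function(coord, ref, k = 5) {
  coord <- as.matrix(coord)
  # Difference of every row to the reference row
  diffs <- sweep(coord, 2, coord[ref, ], "-")
  d <- sqrt(rowSums(diffs^2))
  d <- d[names(d) != ref]  # drop the reference itself
  names(sort(d))[seq_len(min(k, length(d)))]
}

# Toy one-dimensional coordinates (made-up values)
toy_coord <- matrix(c(0.10, 0.12, 0.50, -0.50),
                    ncol = 1,
                    dimnames = list(c("Luxembourg", "A", "B", "C"), "Dim.1"))

closest <- nearest_categories(toy_coord, "Luxembourg", k = 2)
print(closest)
```

Dropping the reference by name, rather than by position in the sorted result, avoids relying on the reference always sorting first when another category has an identical coordinate.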
PCA for multiple variables
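We now perform a PCA on four variables: new cases, new deaths, ICU patients and hospital patients, each per million inhabitants.
# PCA on the four variables, with the country as qualitative supplementary variable
PCA_6 = PCA(covid[,c(1:5)], scale.unit=TRUE, quali.sup = 1, graph=FALSE)

# Plot the variables on the first two dimensions
plot(PCA_6$var$coord)
text(PCA_6$var$coord, labels=rownames(PCA_6$var$coord), cex=0.8)

# Plot the countries on the first two dimensions
plot(PCA_6$quali.sup$coord)
text(PCA_6$quali.sup$coord, labels=rownames(PCA_6$quali.sup$coord), cex=0.8)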
We can already observe that Slovenia, the Czech Republic and the United States appear to have the Covid19-related numbers most similar to those of Luxembourg.
We use hierarchical clustering to determine 5 clusters of states with similar Covid19-related infections, deaths and hospitalisations.
# Hierarchical Clustering
clusters <- hclust(dist(PCA_6$quali.sup$coord))

# Cut the tree into 5 clusters
clusterCut <- cutree(clusters, 5)

# Plot the countries on the first two dimensions, coloured by cluster
clusterdata = data.frame(PCA_6$quali.sup$coord[,1:2], cluster=clusterCut)
ggplot(clusterdata, aes(x=Dim.1, y=Dim.2, col=factor(cluster), label=rownames(clusterdata))) +
  geom_point() +
  geom_text()
The 5 clusters can be summarized as follows:
- Cluster 1: Rest of the world
- Cluster 2: European microstates
- Cluster 3: Western & Southwestern European states (+ Romania)
- Cluster 4: Central & Eastern European states (+ United States of America)
- Cluster 5: Baltic States & Eastern European states
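Cluster membership can also be listed directly instead of being read off the plot. The following minimal sketch uses made-up two-dimensional coordinates (the names `A` to `E` and all values are hypothetical) and the defaults of `hclust`, that is complete linkage on Euclidean distances.

```r
# Toy 2-D coordinates (made-up values, not taken from the analysis)
toy_coord <- matrix(
  c( 0.0, 0.1,
     0.1, 0.0,
     5.0, 5.1,
     5.1, 5.0,
    10.0, 0.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D", "E"), c("Dim.1", "Dim.2"))
)

# Complete-linkage clustering on Euclidean distances (hclust defaults)
tree <- hclust(dist(toy_coord))
groups <- cutree(tree, k = 3)

# Members of each cluster
members <- split(names(groups), groups)
print(members)

# States that share a cluster with a given state
same_as_A <- names(groups)[groups == groups["A"]]
print(same_as_A)
```

Applied to `clusterCut`, the same two lines would list the states in each of the 5 clusters and those grouped with Luxembourg.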