Principal Component Analysis to analyse number of Covid19 infections?

Image credit: www.pixabay.com

Data import

Data is extracted from data.public.lu.

library(tidyverse)   # dplyr verbs, as_tibble(), add_column(), ggplot2
library(FactoMineR)  # PCA()

# Data import
covid_data = read.csv2("https://data.public.lu/fr/datasets/r/767f8091-0591-4b04-9a6f-a9d60cd57159",
                       sep=",",na.strings="-", fileEncoding = "Latin1")

Data preparation

# Replace NA values by 0
covid_data[is.na(covid_data)] = 0
# Save data.frame as tibble
covid = as_tibble(covid_data)

# Rename variables
covid <- covid %>% 
  rename(date = Date,
         hospital = Soins.normaux,
         int_care = Soins.intensifs,
         deaths = X.1.NbMorts.,
         left_hospital = Sorties.hôpital,
         inf_cum = Nb.de.positifs.cumulé,
         infections = Nb.de.positifs,
         tests = Nb.de.tests.effectués,
         tests_cum = Nb.de.tests.effectués.cumulés) %>%
  select(-Soins.intensifs.1) %>%
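  # Rebuild the dates as consecutive days (assumes one row per day; the first row becomes 2020-02-24)
  mutate(date = as.Date(1:length(date), origin="2020-02-23")) 

# Add the month name as a column (used below as supplementary variable)
covid <- covid %>% add_column(month=months(covid$date))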

Similarities and differences between different months?

We start by performing a PCA on the Luxembourgish Covid19 data, using the month as a qualitative supplementary variable.

PCA_1 = PCA(covid[c("deaths","infections","hospital","int_care","tests","month")],
            scale.unit=TRUE, quali.sup = 6)

The PCA reflects the course of the pandemic: it started in February with low numbers of infections, deaths and hospitalisations. From February to April the number of hospitalisations rose, then fell again after April and stayed low during the summer months. After the summer, the numbers of infections, deaths and hospitalisations increased rapidly and reached their maximum in November and December. After these two months, infection numbers remained high but the mortality rate clearly decreased.
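How the months end up on the factor map can be sketched in isolation. A qualitative supplementary variable takes no part in building the axes; each category is placed at the mean position of its observations. The sketch below uses base R's prcomp() rather than FactoMineR's PCA(), and the data frame `toy` with its values is invented for illustration (it is not the Luxembourg data). Coordinates obtained this way may differ from FactoMineR's by sign and by a scaling convention.

```r
# Synthetic daily figures for three months (illustration only, not real data)
toy <- data.frame(
  month      = rep(c("March", "July", "November"), each = 4),
  infections = c(20, 35, 50, 40,   5,  8,  6, 10,  300, 450, 600, 520),
  hospital   = c(10, 30, 60, 55,   4,  5,  3,  6,  120, 180, 220, 210),
  deaths     = c( 0,  1,  3,  2,   0,  0,  0,  1,    4,   6,   9,   8)
)

# PCA on the standardised quantitative variables only; month is left out
pca <- prcomp(toy[, c("infections", "hospital", "deaths")], scale. = TRUE)

# Supplementary categories: average the individuals' scores within each month
month_coord <- aggregate(pca$x[, 1:2], by = list(month = toy$month), FUN = mean)
print(month_coord)
```

Months with similar daily figures land close together, while a month with much higher figures sits on the opposite side of the first axis.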

Data import

Data is extracted from www.ourworldindata.org and is originally sourced from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.

# Data import
covid_data <- read.csv2("covid.csv", sep=",",na.strings="")
# Replace NA values with 0
covid_data[is.na(covid_data)] = 0
# Convert the data frame to a tibble
covid <- as_tibble(covid_data)

# Select variables of interest
covid <- covid %>% select(location, 
                          new_cases_per_million, 
                          new_deaths_per_million, 
                          icu_patients_per_million,
                          hosp_patients_per_million,
                          positive_rate)

# Convert the five selected measures from text to numeric
covid[,2:6] = sapply(covid[,2:6], as.numeric)
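The conversion to numeric is needed because read.csv2() assumes a decimal comma (dec = ","), so values written with a decimal point arrive as text even when sep = "," is set. A minimal sketch with an inline CSV shows the effect and one possible alternative, read.csv(), which assumes a decimal point. The values and the names `csv_text`, `raw` and `fixed` are made up for illustration.

```r
# Inline CSV with decimal points (made-up values for illustration)
csv_text <- "location,new_cases_per_million\nA,1.5\nB,2.25"

# read.csv2() expects decimal commas, so "1.5" is not parsed as a number
raw <- read.csv2(text = csv_text, sep = ",", stringsAsFactors = FALSE)
print(class(raw$new_cases_per_million))

# read.csv() expects decimal points and parses the column directly
fixed <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(class(fixed$new_cases_per_million))
```

With read.csv() the explicit as.numeric() step would no longer be required, at the cost of changing the import call used here.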

Similarities and differences between countries?

PCA for daily new identified infection cases

The following code performs a Principal Component Analysis to find the countries most similar to Luxembourg in terms of daily Covid19 infection cases.

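# PCA on daily new cases, with the country (column 1) as supplementary qualitative variable
PCA_2 = PCA(covid[,c(1,2)], scale.unit=TRUE, quali.sup = 1, graph=FALSE)
# Coordinate of Luxembourg on the principal axis
lux_coord = PCA_2$quali.sup$coord[rownames(PCA_2$quali.sup$coord)=="Luxembourg"]
# Rank all countries by their distance to Luxembourg
sort_index = sort(abs(PCA_2$quali.sup$coord - lux_coord), index.return=TRUE)$ix
# The first entry is Luxembourg itself, so show entries 2 to 6
rownames(PCA_2$quali.sup$coord)[sort_index][2:6]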
## [1] "San Marino"    "Panama"        "Czechia"       "Slovenia"     
## [5] "United States"

We find that San Marino, Panama and the Czech Republic have the infection numbers most similar to Luxembourg's.
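The ranking step can be sketched on its own. Assuming a named vector of first-axis coordinates (the vector `coord`, its country labels and its values are invented for illustration), order() sorts the countries by their absolute distance to Luxembourg; the first entry is Luxembourg itself and is dropped.

```r
# Hypothetical first-axis coordinates, one per country (not real results)
coord <- c(Luxembourg = 0.80, Alpha = 0.75, Beta = -0.40, Gamma = 1.10, Delta = 0.20)

# Absolute distance of every country to Luxembourg on that axis
dist_to_lux <- abs(coord - coord["Luxembourg"])

# Country names ordered from nearest to farthest; drop Luxembourg (distance 0)
nearest <- names(dist_to_lux)[order(dist_to_lux)][-1]
print(nearest)
```

Working with a named vector and order() avoids matching row names by hand, but it is only one of several ways to write this step.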

PCA for multiple variables

We now perform a PCA on multiple variables, namely the four variables in columns 2 to 5 of the data: new cases, new deaths, ICU patients and hospital patients, each per million.

PCA_6 = PCA(covid[,c(1:5)], scale.unit=TRUE, quali.sup = 1, graph=FALSE)

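# Variables on the first two principal axes
plot(PCA_6$var$coord)
text(PCA_6$var$coord, labels=rownames(PCA_6$var$coord),cex=0.8)

# Countries (supplementary categories) on the first two principal axes
plot(PCA_6$quali.sup$coord)
text(PCA_6$quali.sup$coord, labels=rownames(PCA_6$quali.sup$coord),cex=0.8)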

We can already observe that Slovenia, the Czech Republic and the United States seem to have the Covid19-related numbers most similar to Luxembourg's.
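The option scale.unit=TRUE used in the PCA calls standardises every variable to unit variance before the analysis. Why this matters when variables live on different scales can be illustrated with base R's prcomp(), used here in place of FactoMineR's PCA(), on synthetic data (the data frame `toy` and its distribution parameters are assumptions for illustration only).

```r
# Synthetic data: two variables on very different scales (illustration only)
set.seed(1)
toy <- data.frame(
  cases_per_million  = rnorm(50, mean = 200, sd = 80),
  deaths_per_million = rnorm(50, mean = 2,   sd = 1)
)

# Without standardisation the high-variance variable dominates the first axis
unscaled <- prcomp(toy, scale. = FALSE)

# With standardisation both variables carry the same weight
scaled <- prcomp(toy, scale. = TRUE)

# Absolute loadings of each variable on the first principal axis
print(round(abs(unscaled$rotation[, 1]), 3))
print(round(abs(scaled$rotation[, 1]), 3))
```

In the unscaled fit the first axis is almost entirely the case counts, whereas after standardisation both variables load equally on it.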

Hierarchical Clustering

We use hierarchical clustering on the countries' PCA coordinates to determine 5 clusters of states with similar Covid19-related infections, deaths and hospitalisations.

# Hierarchical Clustering
clusters <- hclust(dist(PCA_6$quali.sup$coord))
clusterCut <- cutree(clusters, 5)
clusterdata = data.frame(PCA_6$quali.sup$coord[,1:2], cluster=clusterCut)
ggplot(clusterdata, aes(x=Dim.1,y=Dim.2 ,col=factor(cluster), label=rownames(clusterdata))) +
  geom_point() +
  geom_text()

The 5 clusters can be summarized as follows:

  • Cluster 1: Rest of the world
  • Cluster 2: European microstates
  • Cluster 3: Western & Southwestern European states (+ Romania)
  • Cluster 4: Central & Eastern European states (+ United States of America)
  • Cluster 5: Baltic States & Eastern European states
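The clustering step itself can be tried on a small example. The sketch below applies hclust() and cutree() to hypothetical two-dimensional coordinates (the matrix `coords` and its values are invented for illustration); hclust() uses complete linkage unless another method is requested.

```r
# Hypothetical 2-D coordinates forming three visually separate groups
coords <- rbind(
  A = c(0.0, 0.1), B = c(0.2, 0.0), C = c(0.1, 0.2),  # group near the origin
  D = c(5.0, 5.1), E = c(5.2, 4.9),                   # group near (5, 5)
  F = c(10.0, 0.0)                                    # isolated point
)

# Agglomerative clustering on Euclidean distances (complete linkage by default)
tree <- hclust(dist(coords))

# Cut the dendrogram so that exactly three clusters remain
groups <- cutree(tree, k = 3)
print(split(names(groups), groups))
```

Calling plot(tree) draws the dendrogram, which can help to judge whether the chosen number of clusters is reasonable.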

Tool to compare Covid19 data of different countries

Fred Philippy
Master’s degree student in Statistics

I am a 2nd-year Master’s degree student in Statistics with a particular interest in Machine Learning, Computational Statistics, Data Analysis and High-Dimensional Statistics.
