CITO is an institute in the Netherlands that supports governments and schools so that they can develop world-class testing and monitoring systems to complete their educational programs. It has a lot of data on test scores, and it could be interesting to combine this data with public data. For example, are the test scores of children living in deprived areas worse than average?
Exploratory Analysis
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often with data visualization methods. The main purpose of EDA is to look at the data before making any assumptions. For me it’s one of the nicest parts of data science, since you don’t know yet what’s in the data and there will always be surprises. It’s like being on holiday and exploring an area that you are seeing for the first time :-)
For example, is a certain variable in the data normally distributed or not? Are there any missing or duplicated values? In my experience the answer is usually yes: most data sets contain missing and duplicated data. We need to fix these issues before we can do the real analysis. This phase is called data cleaning, which you might have heard about before.
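The checks described above can be sketched in a few lines. This is a minimal illustration, not the code from the repository (which uses R): the records, the `pupil_id`/`score` layout and the function names are all made up for the example, and skewness is used only as a rough symmetry check, not as a formal normality test.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical records: (pupil_id, score); None marks a missing score.
records = [
    ("p1", 531), ("p2", 540), ("p3", None), ("p4", 528),
    ("p2", 540), ("p5", 545), ("p6", None), ("p7", 536),
]

def summarize_quality(rows):
    """Count rows with a missing score and exact duplicate rows."""
    missing = sum(1 for _, score in rows if score is None)
    counts = Counter(rows)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}

def clean(rows):
    """Drop rows with a missing score, then drop exact duplicates (keep the first)."""
    seen = set()
    cleaned = []
    for row in rows:
        if row[1] is None or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

def skewness(values):
    """Population skewness; a value near 0 suggests a roughly symmetric distribution."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

quality = summarize_quality(records)
cleaned = clean(records)
scores = [score for _, score in cleaned]
print(quality)
print(len(cleaned), round(skewness(scores), 2))
```

Whether to drop or impute missing scores depends on the data set; dropping is simply the easiest choice to show here.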
Representativeness Analyses
In general, a representative sample is a group or set chosen from a larger statistical population that adequately replicates the larger group according to whatever characteristic or quality is under study. In the case of CITO, we would like to know whether the sample data set has more or less the same score characteristics as the total data set. For example, are the average and standard deviation of the sample data set close to those of the total data set? To compare them, I plotted the distributions of the two data sets in a single chart. The subtitle of each chart shows the average, standard deviation and median.
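The numeric side of this comparison could look roughly like the sketch below. It is an illustration in Python rather than the R code from the repository; the scores are invented, and `describe`, `compare` and `subtitle` are hypothetical helper names.

```python
from statistics import mean, median, stdev

# Hypothetical scores, invented for illustration only.
total_scores = [520, 525, 528, 530, 531, 533, 535, 536, 538, 540, 542, 545]
sample_scores = [525, 530, 533, 536, 540, 542]

def describe(values):
    """Average, sample standard deviation and median of a list of scores."""
    return {"mean": mean(values), "sd": stdev(values), "median": median(values)}

def compare(sample, total):
    """Difference (sample minus total) for each summary statistic."""
    s, t = describe(sample), describe(total)
    return {key: s[key] - t[key] for key in s}

def subtitle(label, values):
    """Format the statistics the way a chart subtitle might show them."""
    d = describe(values)
    return f"{label}: mean={d['mean']:.1f}, sd={d['sd']:.1f}, median={d['median']:.1f}"

diff = compare(sample_scores, total_scores)
print(subtitle("sample", sample_scores))
print(subtitle("total", total_scores))
print({key: round(value, 2) for key, value in diff.items()})
```

Small differences suggest that the sample resembles the total data set on these statistics. How small is small enough is a judgment call, which is why plotting both distributions in one chart remains useful.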
Below you can find some of the charts I made for both the EDA and the representativeness analyses. The code is available in a public repository on GitHub. It can be run in a Docker container, using R with renv for library management.