The study population consists of the citizens of all listed countries.
The analysis is mostly based on the total population of each country, although some variables are restricted to a certain age group (e.g. alcohol consumption refers only to citizens aged 15 and older).
The data set contains information for all 192 UN members, with the data for Serbia and Montenegro aggregated. It also includes data for 21 other areas, giving a total of 213 areas and therefore 213 rows of data. For each of these areas, up to 15 variables are recorded, each in its own column. Depending on the source, the data date from different years in the current century.
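To check the stated dimensions (213 rows, up to 15 variables) against the actual file, a small sketch like the following could be used. It assumes the data are available as CSV text with the area name in the first column; the function name `table_shape` and the sample table are my own placeholders, not part of the Gapminder material.

```python
import csv
import io


def table_shape(csv_text):
    """Return (number of data rows, number of variable columns).

    Assumption: the first column holds the area name and is therefore
    not counted as a variable.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Skip completely blank lines so they are not counted as areas.
    n_rows = sum(1 for row in reader if row)
    return n_rows, len(header) - 1


# Placeholder table with made-up values, NOT real Gapminder figures.
# Empty cells stand for variables that are missing for an area.
SAMPLE = """country,var_a,var_b
Area 1,1.0,2.0
Area 2,,3.5
Area 3,4.2,
"""

n_rows, n_vars = table_shape(SAMPLE)
print(n_rows, n_vars)
```

Applied to the real file, the first number should be 213 and the second should match the number of variable columns.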
The sample is provided by Gapminder, a non-profit venture promoting sustainable global development and the achievement of the United Nations Millennium Development Goals. In this sample, Gapminder visualizes social, economic, and environmental development at the country level. For example, economic development is indicated by income per person and employment by the overall or female employment rate, alongside other variables. The values are quantitative summary measures and are either continuous or discrete.
a) Report the study design that generated that data (for example: data reporting, surveys, observation, experiment).
The data are collected from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division and the World Bank. Only data reporting makes sense as the study design in this case, because Gapminder could not run its own surveys or observations covering all citizens of the 213 areas.
The original purpose of the data collection is to showcase the current state of economic, social, environmental and health development of the overall population, to analyze potential correlations and to predict future development.
The data that Gapminder collected from its declared sources, most of which are officially recognized, were themselves based on data reporting from external sources. For more information, please see the answer to question E below.
The historical data, which are mainly used for the analysis, date from between 2002 and 2011, which indicates that they were collected during that same period, depending on the exact year.
The values for different indicators are taken from different sources, mainly organizations such as the WHO, UNAIDS or the World Bank, as well as other official (research) organizations and their databases. Some sources share the same data. For example, the International Labor Organization (ILO) has the duty to provide high-quality data to the UN, and Gapminder takes data not only from the UN but sometimes also from the ILO directly. For different countries or time periods, Gapminder may also need to use different sources because data are missing.
Alcohol consumption per adult measures the recorded and estimated average consumption of pure alcohol, in liters, by citizens aged 15 and older. I treat this as my first explanatory variable. My second explanatory variable is income per person, which describes the 2010 Gross Domestic Product per capita in constant 2000 US$. This figure accounts for inflation but not for differences in the cost of living between countries. The first explanatory variable is sourced from the WHO, the second from the World Bank.
My response variable is life expectancy at birth, which describes the average number of years a newborn child would live if current mortality patterns were to stay the same. These data come from different databases, such as the Human Mortality Database or the Human Lifetable Database.
All three variables are quantitative. Alcohol consumption is continuous, as is average life expectancy in years (with decimal places), because both are physical quantities measured in liters and years. Although average income per person could be treated as discrete, I still consider it a continuous variable, because in the data table it is not restricted to the smallest currency unit (the cent) but has more than 15 decimal places.
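The decimal-places argument above can be checked mechanically. The following is a minimal sketch that assumes the raw values are read as text (for example straight from a CSV cell) and are written in plain decimal notation, not scientific notation. The function names and the threshold of two decimal places for a cent-level amount are my own assumptions.

```python
def decimal_places(text):
    """Count the digits after the decimal point in a plain numeric string."""
    text = text.strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def looks_continuous(raw_values, max_currency_places=2):
    """Return True if any non-empty value has more decimal places than
    a cent-level amount would need (assumed here to be two)."""
    return any(
        decimal_places(value) > max_currency_places
        for value in raw_values
        if value.strip()  # empty cells are missing values and are skipped
    )


# Placeholder strings, NOT real Gapminder values.
print(looks_continuous(["12.50", "3.14159", ""]))
print(looks_continuous(["12.50", "7", ""]))
```

This only inspects how the values are stored. Whether a variable is continuous in principle remains a judgment about the quantity itself.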
I selected these as explanatory and response variables based on my own intuition that there might be a relationship between them. Focusing on these self-picked variables lets me reduce the complexity and concentrate on one part of the data set.
If you have any questions about this assignment, feel free to ask. I am also looking forward to any suggestions for improvement. 🙂 Thank you a lot!
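One possible first look at whether such a relationship exists is the Pearson correlation between an explanatory variable and the response variable. This is only a sketch of one option, not an analysis that has been carried out here. The function name `pearson_r`, the use of `None` for missing values, and the sample numbers are my own assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists.

    Pairs in which either value is None (missing) are dropped first.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers, NOT real Gapminder values.
explanatory = [1.0, 2.0, 3.0, None]
response = [2.0, 4.0, 6.0, 5.0]
print(pearson_r(explanatory, response))
```

A value close to +1 or -1 would point to a strong linear relationship and a value near 0 to none. With two explanatory variables, the function would be called once for each of them.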