Assignment I for Biostats Course VHM 801 at AVC - Fall semester 2016

The assignment is worth 10% of the final course mark.

The data for the assignment are from a study on women's back pain during pregnancy. It was carried out at a single hospital during four summer months several years ago, and comprised all women giving birth in that period. The data included here for 174 women contain information about the perceived back pain during pregnancy, as well as other characteristics, as summarized in the list below:

All variables were retrieved from questionnaires the women were asked to fill out, assisted by a physiotherapist, within 24 hours of delivery; the response rate was 100%. The dataset is available in Minitab format and as a comma-separated file, for import into Stata and other statistical software.
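For readers working outside Minitab or Stata, the sketch below shows one way the comma-separated file might be read using only the Python standard library. The inline sample text and the column names (`id`, `age`, `height_cm`, `weight_start_kg`) are hypothetical stand-ins and are not taken from the actual dataset; blank cells are treated as missing values, which is an assumption about how the file encodes them.

```python
import csv
import io

# Hypothetical stand-in for the course file; the real column names will differ.
SAMPLE_CSV = """id,age,height_cm,weight_start_kg
1,27,165,58.0
2,31,,63.5
3,24,172,70.2
"""


def to_number(text):
    """Convert a CSV cell to float, returning None for a blank cell."""
    text = text.strip()
    return float(text) if text else None


def load_rows(handle):
    """Read every row as a dict keyed by the header line, with numeric cells."""
    reader = csv.DictReader(handle)
    return [{name: to_number(cell) for name, cell in row.items()} for row in reader]


rows = load_rows(io.StringIO(SAMPLE_CSV))

# List the (hypothetical) ids of rows whose height is missing.
missing_height = [int(r["id"]) for r in rows if r["height_cm"] is None]
print(len(rows), "rows; missing height for ids", missing_height)
```

With the real file, `load_rows` would be given a file handle opened with `newline=""` in place of the `io.StringIO` object. This sketch assumes every column is numeric; a text-coded categorical column would need to be read without the conversion.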

The home assignment has five questions, all of which should be answered.

  1. Use the above description of the data collection to characterize the study type as well as the selection of subjects for the study. Next, discuss whether the data collection procedures used imply any restrictions on the group of women (pregnancies) these data may be considered representative of. Alternatively, you may simply describe the group of women (pregnancies) that, in your view, these data could be considered representative of.

  2. Select four variables in the dataset: two quantitative variables, one categorical variable (with more than two categories), and one dichotomous (or binary) variable. Apart from this restriction on the variable types you are free to select the variables as you like. Carry out a descriptive analysis of your four selected variables including both a graphical representation and descriptive statistics. Choose the graphical representation and the statistics you find most useful to show each distribution, in consideration of the variable's type and range of values. Where appropriate, comment specifically on the distribution's center, spread and shape. Discussion of 'outliers' may be deferred to the next question.

  3. Find and discuss at least one observation which you think is an outlier (in the sense of a 'real outlier', as opposed to a 'potential outlier' detected by automated data screening procedures). Make it clear why you think the observation(s) could be truly different from the others.

  4. Find and discuss at least two (different) instances of errors or inconsistencies in the data. Describe carefully why you think an observation is most likely an error, or why, in your view, a set of observations is inconsistent.

  5. Select a continuous variable (possibly one of the variables previously described) and examine whether it seems reasonable to assume that its values are normally distributed. Describe carefully the tools you use for this, and how you arrive at your conclusions. If you conclude that the variable is not normally distributed, describe how its distribution seems to differ from a normal distribution. Additionally, compute each woman's weight gain during pregnancy, and carry out a similar analysis for this variable; make sure to include any supplementary analysis that would seem relevant to support your interpretation of the results.
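
As an illustration of the kind of tools question 5 refers to, the following Python sketch computes a derived weight gain variable and three simple normality diagnostics: skewness, excess kurtosis, and the points of a normal quantile (Q-Q) plot together with their correlation. It is a minimal sketch under stated assumptions, not a prescribed method: the field names `weight_start` and `weight_end` and the six records are invented for illustration and are not values from the dataset, and the plotting positions (i - 0.5)/n are one common convention among several.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def skew_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both near 0 for normal data)."""
    n = len(values)
    m = mean(values)
    s = pstdev(values)  # population SD; zero if all values are identical
    z = [(v - m) / s for v in values]
    skew = sum(t ** 3 for t in z) / n
    excess_kurtosis = sum(t ** 4 for t in z) / n - 3.0
    return skew, excess_kurtosis


def qq_points(values):
    """Pair each sorted observation with the matching standard normal quantile."""
    n = len(values)
    std_normal = NormalDist()
    # Positions (i - 0.5) / n keep every probability strictly between 0 and 1.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(sorted(values), start=1)
    ]


def qq_correlation(values):
    """Pearson correlation of the Q-Q points; values near 1 suggest normality."""
    points = qq_points(values)
    qs = [q for q, _ in points]
    xs = [x for _, x in points]
    mq, mx = mean(qs), mean(xs)
    cov = sum((q - mq) * (x - mx) for q, x in zip(qs, xs))
    var_q = sum((q - mq) ** 2 for q in qs)
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / sqrt(var_q * var_x)


# Invented example records; field names and values are hypothetical.
women = [
    {"weight_start": 58.0, "weight_end": 70.5},
    {"weight_start": 63.5, "weight_end": 75.0},
    {"weight_start": 70.2, "weight_end": 84.9},
    {"weight_start": 55.4, "weight_end": 66.1},
    {"weight_start": 81.0, "weight_end": 90.2},
    {"weight_start": 66.3, "weight_end": 80.0},
]

# Weight gain is the difference between the two recorded weights.
gains = [w["weight_end"] - w["weight_start"] for w in women]

skew, kurt = skew_and_excess_kurtosis(gains)
print("skewness:", round(skew, 3), "excess kurtosis:", round(kurt, 3))
print("Q-Q correlation:", round(qq_correlation(gains), 3))
```

Plotting the pairs returned by `qq_points` gives the usual normal quantile plot; systematic curvature indicates skewness or heavy tails, while isolated points far from the line deserve a closer look. These numerical summaries support a graphical assessment but do not replace it, and with only a few observations they are unstable.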

Henrik Stryhn (hstryhn@upei.ca) 2016-09-28