Assignment I for Biostats Course VHM 801 at AVC - Winter Semester 2026

The assignment is worth 10% of the final course mark. Please be aware that by handing in the home assignment you implicitly acknowledge that you have read and accepted the instructions for home assignments as described on the VHM 801 homepage.

This assignment is based on data collected during 2013 in a study on Bovine leukemia virus (BLV) in the Maritime provinces of Canada. The research was carried out by the Maritime Quality Milk (MQM) research centre at AVC, and it formed part of the PhD project of Omid Nekouei at AVC. One of the research objectives was to develop a predictive tool for a herd's BLV infection status (infected or not) and for the infection level (prevalence) within the herd. It was also of interest to explore whether BLV infection shows any links (associations) with milk production parameters.

For this assignment we consider data for 70 herds with Holstein cows from the Maritime region (note: some information has been modified from the original data to protect the confidentiality of the herds and their records). The following variables are included in the data, with each row in the data corresponding to one herd:

The milk ELISA test quantifies the amount of BLV antibodies in the milk, but we will for simplicity (although not quite correctly) refer to it as a test for BLV infection. The test gives optical density values expressed as a percentage relative to a positive control (and that percentage can fall outside the range 0-100%). A high ELISA test value is therefore indicative of an "infected" sample. The test was run at the MQM diagnostic laboratory. The dataset is available in Minitab format and as a comma-separated file, for import into Stata and other statistical software.

The home assignment has five questions, (a)-(e), listed below as items 1-5; all of them should be answered.

  1. Characterize the study type (e.g., experimental or another type) and the type of each of the variables in the dataset (e.g., categorical or quantitative discrete or quantitative continuous).

  2. Select three variables in the dataset: two quantitative variables and one categorical variable; apart from these restrictions you are free to select the variables as you want. Carry out a descriptive analysis of each of your three selected variables, including both a graphical representation and descriptive statistics. Choose the graphical representation and the statistics you find most useful to show each distribution, in consideration of the variable's type and range of values. Where appropriate, comment specifically on the distribution's center, spread and shape. If your descriptive analysis identifies any "suspected outliers", discuss whether these should be considered as truly outlying observations, in the sense that they don't really belong to the distribution, or whether they should be considered as part of the distribution.

  3. Determine one variable in the dataset with a distribution you think is approximated well by a normal distribution and another variable for which you think this is not the case; it is allowed (but not required) to use variable(s) from (b). Describe carefully how you quantitatively assess the agreement of a variable with a normal distribution. Additionally, describe for the variable not approximated well by a normal distribution how its distribution seems to differ from a normal distribution.

  4. The ELISA test can be used to assign herd infection status by a simple classification rule whereby herds with a value above a certain threshold (or cut-off) value are considered as infected, and herds with a value below the threshold are considered as non-infected. Using an ELISA test threshold value of 5, carry out such a classification of herds: what is the proportion of infected herds? Use descriptive graphical and numerical tools to compare the distributions for the two quantitative variables you worked with in (b) between infected and non-infected herds. Describe your findings and try to draw conclusions. Note that you are not expected to compute any statistical tests to compare the distributions.

  5. This last part of the assignment discusses the selection of herds to participate in a part of the study that involved more extensive data collection. The 70 herds we are working with here were all included in the extensive data collection part of the study. Within each province, ELISA bulk tank values were used to divide the herds into 5 categories, ranging from lowest to highest infection levels. For example, 142 herds in New Brunswick could be divided into 5 categories as follows:

    Category            1    2    3    4    5   Total
    Number of herds    23   15   22   47   35     142

    Describe and demonstrate how to randomly select 6 herds from each of the 5 categories for the extensive data collection part of the study. You may use either statistical software or a table of random numbers, but make sure to describe clearly how you arrived at the randomly selected herds. Do you think the 30 selected herds accurately represent the entire population of herds? Explain your answer.
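
As an illustration of the software route for the random selection in the last question, the following is a minimal Python sketch, not a required or official method. It uses the category sizes from the New Brunswick example. The numbering of herds as 1, 2, ... within each category, the function name select_herds and the seed value 801 are assumptions made for this sketch only.

```python
import random

# Herd counts per infection category, taken from the New Brunswick example.
herds_per_category = {1: 23, 2: 15, 3: 22, 4: 47, 5: 35}


def select_herds(herds_per_category, n_per_category=6, seed=801):
    """Draw a simple random sample, without replacement, within each category."""
    # Fixed seed (an arbitrary choice) so that the same draw can be reproduced.
    rng = random.Random(seed)
    selection = {}
    for category, n_herds in sorted(herds_per_category.items()):
        # Assumed labelling: herds in a category are numbered 1..n_herds.
        labels = range(1, n_herds + 1)
        selection[category] = sorted(rng.sample(labels, n_per_category))
    return selection


selection = select_herds(herds_per_category)
for category, herds in selection.items():
    print(f"Category {category}: herds {herds}")
```

Reporting the seed together with the herd numbering lets a reader repeat exactly the same draw. With a table of random numbers the idea is the same: number the herds within each category, then read off numbers until 6 distinct herds have been obtained.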

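The threshold classification described in question 4 can be sketched in the same way. The ELISA values below are made up for illustration and are not taken from the dataset, and the names elisa_values and classify are hypothetical. The sketch also assumes that a value exactly equal to the threshold counts as non-infected, since the text only defines "above" and "below".

```python
# Hypothetical ELISA values (percent of positive control) for eight herds.
# These numbers are invented for illustration and are NOT from the dataset.
elisa_values = [1.2, 0.4, 37.5, 5.0, 88.1, 3.9, 12.6, 64.0]

THRESHOLD = 5  # cut-off value given in the assignment text


def classify(values, threshold=THRESHOLD):
    """Return True for each herd whose ELISA value is above the threshold."""
    # Assumption: a value exactly at the threshold is treated as non-infected.
    return [value > threshold for value in values]


infected = classify(elisa_values)
proportion_infected = sum(infected) / len(infected)
print(f"{sum(infected)} of {len(infected)} herds classified as infected "
      f"({proportion_infected:.1%})")
```

The resulting list of True/False values can then serve as a grouping variable when comparing the distributions of other variables between the two groups of herds.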

Henrik Stryhn (hstryhn@upei.ca) 2016-02-04