Assignment I for Biostats Course VHM 801 at AVC - Winter Semester 2026
The assignment is worth 10% of the final course mark. Please be aware that by handing
in the home assignment you implicitly acknowledge that you have read and accepted
the instructions for home assignments as described
on the VHM 801 homepage.
This assignment is based on data collected during 2013 in a study on
Bovine leukemia virus (BLV) in the Maritime provinces of Canada. The research
was carried out by the Maritime Quality Milk (MQM) research centre at AVC, and it formed part
of the PhD project of Omid Nekouei at AVC. One of the research
objectives was to develop a predictive tool for a herd's BLV infection status
(infected or not) and for the infection level (prevalence) within the herd.
It was also of interest to explore whether BLV infection shows any links (associations)
with milk production parameters.
For this assignment we consider data for 70 herds with Holstein cows
from the Maritime region (note: some information has been modified from the original data to
protect the confidentiality of the herds and their records). The following variables are included in the
data, with each row corresponding to one herd:
- herd: herd number (an identifier with no intrinsic meaning),
- province: the province of the herd (1=New Brunswick, 2=Nova
Scotia, 3=Prince Edward Island),
- herdsize: the number of lactating cows from the herd included
in the study,
- elisa: ELISA test result for a bulk tank milk sample (%),
- milk305: the average lactation (305-day) milk production for the cows in the herd (kg),
- fat305: the average lactation (305-day) fat yield for the cows in the herd (kg),
- prot305: the average lactation (305-day) protein yield for the cows in the herd (kg),
- parity: the average parity (lactation number) for the cows in the herd.
The milk ELISA test quantifies the amount of BLV antibodies
in the milk, but for simplicity (although not quite correctly) we will refer to it as a test for BLV infection.
The test gives optical density
values expressed as a percentage relative to a positive control (and
that percentage can fall outside the 0-100% range). A high ELISA test value is therefore indicative
of an "infected" sample. The test was run at
the MQM diagnostic laboratory.
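As a minimal illustration of how an ELISA percentage might be turned into an infection status, the sketch below applies a simple threshold rule of the kind used in question (d). The values are hypothetical and not taken from the assignment data, the names `classify` and `THRESHOLD` are made up for this example, and treating a value exactly equal to the threshold as non-infected is an assumption, since the rule only covers values above or below the threshold.

```python
# Hypothetical ELISA values (%), NOT taken from the assignment data.
elisa_values = [0.8, 3.2, 5.0, 12.5, 47.0, 103.4]

THRESHOLD = 5  # cut-off value used in question (d)


def classify(value, threshold=THRESHOLD):
    """Label a herd 'infected' if its ELISA value is above the threshold.

    Assumption: a value exactly equal to the threshold counts as non-infected.
    """
    return "infected" if value > threshold else "non-infected"


statuses = [classify(v) for v in elisa_values]

# Proportion of herds classified as infected.
proportion_infected = statuses.count("infected") / len(statuses)

for value, status in zip(elisa_values, statuses):
    print(f"ELISA {value:6.1f}% -> {status}")
print(f"Proportion infected: {proportion_infected:.2f}")
```

The same rule can be applied in Minitab or Stata by creating a new indicator variable from elisa.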
The dataset is available in Minitab format and as a comma-separated file, for
import into Stata and other statistical software.
The home assignment has five questions (a)-(e) which should all be answered.
- Characterize the study type (e.g., experimental or another type)
and the type of each of the variables in the dataset (e.g., categorical
or quantitative discrete or quantitative continuous).
- Select three variables in the dataset: two quantitative variables and one categorical variable;
apart from these restrictions
you are free to select the variables as you want. Carry out a
descriptive analysis of each of
your three selected variables, including both a graphical representation and
descriptive statistics. Choose the graphical representation and the statistics you find most
useful to show each distribution,
in consideration of the variable's type and range of values. Where appropriate,
comment specifically on the distribution's center, spread and shape.
If your descriptive analysis identifies any "suspected outliers", discuss
whether these should be considered as truly outlying observations, in the sense that they
don't really belong to the distribution, or whether they should be considered as part
of the distribution.
- Identify one variable in the dataset whose distribution you think is approximated well by a normal
distribution and another variable for which you think this is not the case; you may (but are not required to)
use variable(s) from (b). Describe carefully how you quantitatively assess
the agreement of a variable with a normal distribution. Additionally, describe for the variable not approximated well by a normal
distribution how its distribution seems to differ from a normal
distribution.
- The ELISA test can be used to assign herd infection status by a simple classification rule
whereby herds with a value above a certain threshold (or cut-off) value are considered as infected, and herds
with a value below the threshold are considered as non-infected. Using an ELISA test threshold value of 5,
carry out such a classification of herds: what is the proportion of
infected herds? Use descriptive graphical and numerical tools to compare the
distributions for the two quantitative variables you worked with in (b)
between infected and non-infected herds. Describe your findings and try to draw conclusions.
Note that you are not expected to
carry out any statistical tests to compare the distributions.
- This last part of the assignment discusses the selection of herds to participate in a part of the study
that involved more extensive data collection. The 70 herds we are working
with here were all included in the extensive data collection part of the study. Within each province, ELISA bulk tank values
were used to divide the herds into 5 categories, ranging from lowest to highest infection levels. For example,
142 herds in New Brunswick could be divided into 5 categories as
follows:
| Category | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|
| Number of herds | 23 | 15 | 22 | 47 | 35 | 142 |
Describe and demonstrate how to randomly select 6 herds from each of the
5 categories for the extensive data collection part of the
study. You may use either statistical software or a table of random
numbers, but make sure to describe clearly how you arrived at the randomly
selected herds. Do you think the 30 selected herds accurately represent the entire
population of herds? Explain your answer.
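One possible way to carry out such a selection in software is sketched below in Python, using the New Brunswick category sizes from the table. It assumes that the herds in each category have been numbered 1 to n in some fixed list order; that numbering, the function name `select_herds` and the seed value are illustrative choices, not part of the study protocol.

```python
import random

# Category sizes from the New Brunswick example table.
category_sizes = {1: 23, 2: 15, 3: 22, 4: 47, 5: 35}
HERDS_PER_CATEGORY = 6


def select_herds(sizes, k, seed):
    """Draw k herds without replacement from each category.

    Assumption: herds within a category are labelled 1..n (hypothetical labels).
    """
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    selection = {}
    for category, n in sizes.items():
        selection[category] = sorted(rng.sample(range(1, n + 1), k))
    return selection


selected = select_herds(category_sizes, HERDS_PER_CATEGORY, seed=801)

for category, herds in selected.items():
    print(f"Category {category}: herds {herds}")
```

Reporting the seed together with the numbered herd lists lets a reader reproduce the selection exactly; with a table of random numbers, the starting position and reading rule serve the same purpose.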
Henrik Stryhn
(hstryhn@upei.ca) 2016-02-04