The aim of this section is to introduce the data used within this paper and to discuss the statistical analysis procedures. The dataset is derived from a large-scale survey of Canadians. It is scaled up using weights to be representative of the Canadian population. The survey was designed and conducted in collaboration between Dr. Andersen, Industry Canada, and Decima Research in 2006. Data are analyzed using single equation regression methods.
This section is structured in the following way. The first sub-section introduces the survey, including sampling and interviewing techniques. The following sub-section discusses the dependent and independent variables developed to test our hypotheses. This section builds upon Section 2 where the variables were selected based on the theoretical approach in this paper. Finally, the last sub-section discusses the specific regression estimations used, including advantages and disadvantages of the methods vis-à-vis alternative techniques.
This research paper adds to the discussion on the extent and effects of music downloading and P2P file-sharing by using microeconomic survey data and by extending the analysis to account for a wider range of relevant variables/factors underlying music purchasing.
Most previous studies on P2P file-sharing have tended to analyze aggregated (e.g. macroeconomic) data. Thus, the analyses using those data are merely indirectly measuring the statistical relationships on which micro-assumptions and conclusions are based.
The analysis in this paper is based on direct answers (or micro-data) provided by 2,100 Canadian respondents. For example, respondents were asked about how many CDs and paid electronically-delivered tracks they purchased and the average prices paid. There are advantages from using measures of the respondents' recalled purchases and experienced average prices. A key issue here is that markets can take many forms (on-line, brick and mortar shop, second-hand, etc.) so no official music industry recorded price will capture the true demand and the true price which consumers are facing.
Moreover, our analysis is wider than previous studies, which tend to focus on P2P downloads only, as it considers a comprehensive range of ways in which music can be acquired. These are: purchasing CDs; ripping CDs and copying them onto computers; buying music tracks from online pay-sites such as iTunes or Archambault; downloading free music from P2P file-sharing networks such as Kazaa, LimeWire, eDonkey, BearShare or Gnutella; downloading free music from promotional websites; downloading music from people's private Internet websites; and copying MP3s from friends.
The demographic information in the survey, too, is very detailed, including information on gender, age, income, region in which they live, degree of music interest, Internet skills, occupation and educational level. See discussion of sampling technique below as well as Table 3.3 for an overview of such data.
The sampling technique used was quota-based random sampling, stratified by age (participants were 15 years or older), gender, geographical region and downloading status. This was done because a purely random sampling strategy would not have produced sufficient sample sizes for key segments of interest to this and other studies; e.g. youth, Francophones and P2P downloaders (i.e. persons engaged in P2P file-sharing). Therefore, stratification was introduced to allow for sufficiently robust analysis within these segments. The total number of survey responses was 2,100. For a detailed discussion on the sampling and interviewing techniques, see Decima Research (2006).
The resulting stratification across the four key demographic dimensions is detailed in Table 3.1, which reports both the unweighted and the weighted numbers of observations. Sampling weights were constructed in order to scale the number of observations to match the actual Canadian population according to Statistics Canada 2001 Census data. As the actual proportion of downloaders in the population was unknown prior to conducting the survey, weights in relation to downloaders vs. non-downloaders reflect how the distribution occurred naturally or randomly during the survey prior to quota constraints being reached. In terms of the actual sample, the data contain 1,005 respondents who declared that they were P2P downloaders and 1,095 who declared that they had not engaged in P2P downloading. With respect to the weighted data, the downloaders account for around 30 percent of the population and the non-downloaders for 70 percent. The weight attached to each survey response is the proportion of the respondent's segment in the population divided by the proportion of that segment in the sample. For instance, if the true proportion of female downloaders under the age of 25 living in Quebec is 1.1 percent of the population, and the sample proportion is 4.5 percent, then the weight applied to this segment is 1.1/4.5 ≈ 0.244.
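The weighting example above can be reproduced with a short calculation. The sketch below assumes that the relative weight of a segment is its population share divided by its sample share, which is what the 0.244 figure implies; the function name `relative_weight` is hypothetical and not part of the survey documentation.

```python
def relative_weight(population_share: float, sample_share: float) -> float:
    """Relative weight of a segment: population share over sample share."""
    if sample_share <= 0:
        raise ValueError("sample_share must be positive")
    return population_share / sample_share


# Figures from the text: female downloaders under 25 living in Quebec make up
# 1.1 percent of the population but 4.5 percent of the sample.
w = relative_weight(0.011, 0.045)
print(round(w, 3))
```

A weight below one indicates a segment that was over-sampled relative to its share of the population, as intended by the quota design.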
The first two columns in Table 3.1 give the number of observations in the survey, and, in relation to this, the final two columns are the weighted observations that are scaled up to match the Canadian population. In total there are 2,100 observations in the sample that represents a population of around 24 million. All following analyses will use weighted data to be representative of gender, age and regional distributions with respect to the Canadian population.
The remainder of this section explores different patterns of how people acquire music, e.g. buying CD albums and various ways of downloading tracks through websites. This is done to assess the extent to which various phenomena occur compared to other means of acquiring music.
Table 3.2 suggests that the dominant way of acquiring music is through purchasing CD albums. The survey data indicate that around 77.2 percent of the Canadian population purchased a CD album in 2005. This is over twice as common as alternative means of music acquisitions. 29.0 percent downloaded music through P2P networks and 29.2 percent ripped songs from CDs. 20.5 percent used friends to copy MP3s and 8.5 percent downloaded music from free music websites. 13.6 percent bought music tracks from pay-sites. 23.2 percent downloaded music for free from promotional websites. Appendix 1 explores different patterns of acquiring music depending on gender, age and region.
Table 3.3 provides an overview of the variables used in our analysis.
Our dependent variables are designed to capture purchasing of music, either in relation to CD markets or in relation to paid electronically-delivered music markets. Our first variable is the number of CD albums that respondents estimated they had purchased in 2005. In addition to the actual count data, we use two transformations of these data in our estimations. The variable capturing the number of CD albums bought in 2005 exhibits a positive skew, with relatively more participants reporting low numbers of CD album purchases. To address this, we use two common types of data transformation in the OLS estimations: (i) taking the square root of the values of the dependent variable and (ii) taking the natural log. Because the log of zero is not defined, we add one to the reported number of purchased CD albums prior to taking the natural log. Adding one, rather than any other value, is common practice in economics and management studies (Tabachnick and Fidell, 2006). It is done because the log of one equals zero, so the transformation does not shift the lower bound of the distribution, i.e. both the untransformed and the transformed data take zero as the smallest value.
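The two transformations can be illustrated as follows. This is a minimal sketch; the function names and the list `cd_counts` are hypothetical illustration values, not survey data.

```python
import math


def sqrt_transform(count: int) -> float:
    """Square-root transformation of a non-negative count."""
    return math.sqrt(count)


def log_plus_one(count: int) -> float:
    """Natural log of (count + 1), so that a count of zero maps to zero."""
    return math.log1p(count)


# Hypothetical numbers of CD albums purchased (not taken from the survey).
cd_counts = [0, 1, 4, 10, 30]
roots = [sqrt_transform(c) for c in cd_counts]
logs = [log_plus_one(c) for c in cd_counts]

for c, r, l in zip(cd_counts, roots, logs):
    print(f"count={c:2d}  sqrt={r:.3f}  ln(count+1)={l:.3f}")
```

Both transformations leave zero at zero and compress large values, which reduces the positive skew described above.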
Our second set of dependent variables relates to the number of paid electronically-delivered music tracks respondents estimated they purchased in 2005. First, we use the count data. Second, we use the same data transformations as for the number of CD albums, i.e. we compute the square root and the natural log of the number of tracks purchased. Moreover, in the case of paid electronically-delivered tracks only, we also use a binary dependent variable which is coded zero if respondents purchased none and coded one if any tracks were purchased. The reason for including binary information in relation to MP3 purchases is that 85 percent (or 1,750) of responses were zeroes in this specific variable.
To test Hypothesis 1, which states that the price of music (CD albums) is negatively associated with the purchase of music (CD albums), we use a variable which reflects the price of CD albums purchased in 2005 as estimated by the participants; it is thus the perceived price of CDs. The variable is continuous and measured in Canadian dollars. It follows an approximately normal distribution. Hypothesis 1 also suggests that the price of paid electronically-delivered music tracks is negatively associated with the magnitude of purchases. However, only 166 participants in the whole sample and 16 participants among the P2P file-sharers gave information on the estimated price of paid tracks in 2005, so we omitted this variable from the regressions, as including it would have resulted in a large drop in the number of observations. Furthermore, when analyzing the sub-sample of P2P file-sharers, we use a variable called 'album too expensive'. This variable captures the percentage of P2P files that were downloaded because participants felt that the price of a music CD was too high. It takes values between zero and 100.
In relation to Hypothesis 2a, which states that there is a positive relationship between the price of CDs and number of songs downloaded from P2P networks, we regress the number of reported purchases of paid electronically-delivered music tracks, rather than CD purchases, on the price of CD albums (this is an indirect way of measuring the cross-price elasticity of the two music markets).
The questionnaire contains two questions related to the number of P2P downloads. The first is a binary variable, the second is a quantitative variable giving an estimate of the number of P2P downloads in an average month in 2005. 246 respondents answered yes to being a downloader and estimated the number of downloads to be zero or did not report an answer. Decima Research (2005) states, "Normally, it is expected that 1%-3% of respondents arrive at this section, and then give a 'don't know' or non-behavioural response (i.e. zero downloads). In this instance, 246 of our 1,000 respondents gave a 'zero' response or answered 'don't know'. Given the magnitude of this proportion of respondents, additional analysis was warranted to better understand true downloading behaviour. Post-hoc analyses were conducted to determine if these individuals should be categorized as downloaders or non-downloaders." Based on their analysis, Decima Research concluded that the 246 respondents in question should be treated as downloaders. Responses for the year 2005 regarding their number of downloads were imputed, where imputed values corresponded to mean values of downloaders, based on age and gender. This variable is used in Appendices 4 and 5.
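The imputation step described above can be sketched as a group-mean fill. The exact procedure used by Decima Research is not documented here, so the code below is only an assumption: it replaces missing or zero download counts with the mean of the valid responses in the same age-and-gender cell. The field names and example rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def impute_group_means(rows):
    """Fill missing or zero 'downloads' with the mean of the respondent's
    (age_group, gender) cell, computed from valid (non-zero) responses."""
    valid = defaultdict(list)
    for r in rows:
        if r["downloads"]:  # None and 0 are both treated as missing here
            valid[(r["age_group"], r["gender"])].append(r["downloads"])
    cell_means = {cell: mean(values) for cell, values in valid.items()}

    completed = []
    for r in rows:
        filled = dict(r)
        filled["imputed"] = not r["downloads"]
        if filled["imputed"]:
            # Stays None if the cell has no valid responses at all.
            filled["downloads"] = cell_means.get((r["age_group"], r["gender"]))
        completed.append(filled)
    return completed


# Hypothetical respondents (not survey data).
rows = [
    {"age_group": "15-19", "gender": "F", "downloads": 20},
    {"age_group": "15-19", "gender": "F", "downloads": 40},
    {"age_group": "15-19", "gender": "F", "downloads": 0},
    {"age_group": "25-34", "gender": "M", "downloads": 10},
    {"age_group": "25-34", "gender": "M", "downloads": None},
]
completed = impute_group_means(rows)
```

Keeping an `imputed` flag makes it straightforward to exclude these respondents later, as is done for the sub-sample estimations.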
Further sources of free music included in this paper measure activities of ripping songs from CDs, downloading songs from promotional websites, downloading songs from private websites and copying MP3s. In the case of estimations based on the whole population, we use binary information for all these variables, e.g. whether or not an individual downloaded P2P files (yes is coded one and no coded zero). Although count data on the number of songs downloaded through P2P networks, ripped from CDs and files downloaded from promotional websites are available, a large proportion of the population did not engage in such activities. As a result there are very few observations different from zero and this causes problems in relation to the estimation when using count data. Thus, binary variables are presented and commented on in the paper. Results based on the relevant count data (using the natural log of the count data) are included in Appendices 4 and 5.
In the case of estimations based on the subset of P2P file-sharers, we use the natural log of the number of free songs, e.g. natural log of the number of songs ripped from a CD plus one to account for any zero observations in the variable. This is done because in the case of this particular sub-set of the data (P2P file-sharers) the proportion of zero answers is considerably lower. The equations for estimations are introduced in the last sub-section of Section 3. Furthermore, the 246 individuals who initially declared that they were P2P downloaders but subsequently did not provide a non-zero response when asked about the volume of their file-sharing were omitted from the estimations using the sub-sample of P2P downloaders.
Hypothesis 2b states that people who sample music (for example have the possibility to listen to music before purchasing) buy more CDs and paid electronically-delivered music tracks than those who do not sample music. This hypothesis is directly tested using the sub-sample of P2P file-sharers. The relevant variable is called 'hear before buying'. This variable is the percentage of P2P files that were downloaded due to the fact that people wished to hear a song prior to making a purchasing decision.
Hypothesis 2c states that people who download music and purchase paid electronically-delivered tracks tend to purchase fewer CD albums. In order to examine purchases of paid electronically-delivered tracks and their effect on the purchase of CD albums, the former measure was used as an independent variable in the estimations predicting CD purchases. In the results discussed in this paper we use the binary variable for purchases of paid electronically-delivered tracks when examining the whole sample (although estimations on the natural log of the related count data are included in Appendix 4) and the natural log of the count data plus one in the case of the sub-sample of P2P file-sharers for reasons discussed above.
Furthermore, in the case of all estimations based on the sub-sample of P2P downloaders, we use two variables labelled, 'not whole album' (capturing a respondent's decision to engage in P2P file-sharing because of an unwillingness to purchase an entire album) and 'not elsewhere available' (capturing a respondent's decision to engage in P2P file-sharing because the music being sought was not available for purchase). These variables give the percentage of P2P downloads due to these two factors and are measured on a scale from zero to 100.
Hypothesis 2d links the purchase of alternative entertainment goods to the purchase of music. We use several variables to test for a negative relationship between purchases of alternative entertainment goods and the purchase of music. These are the number of DVDs purchased, the number of videogames purchased, the number of cinema tickets and the number of concert tickets bought. For the purpose of the regressions we take the natural log of the number of DVDs, videogames and tickets purchased (after adding one to account for zeros in the variables). As discussed while developing Hypothesis 2d, the number of purchased entertainment goods (rather than their prices) is an appropriate measure for many reasons, including the fact that previous studies show that a 'time element' or 'lifestyle' choice is more important than the impact of price. (See Section 2 for elaboration of this argument.) Also, the response rate in relation to the price of goods within the survey was generally low; for example, only 583 participants gave an estimate for the price of video games.
Furthermore, we include a variable that distinguishes between people who downloaded music onto their MP3 player and those who did not. We call this variable 'MP3 player ownership'. We believe that a variable capturing 'yes' responses to the question on whether the respondent stored P2P downloads on an MP3 player is a better proxy for the analysis of complementary goods in music markets than the direct measure of MP3 player ownership. This is mainly because MP3 players are still a new technology and many people who own MP3 players have received them as gifts but have never used them. The variable is binary, coded one if participants declared that they downloaded onto an MP3 player and zero if not.
In order to examine Hypothesis 3, which states that the level of income is positively associated with the magnitude of music purchases, there are five dummy variables representing five income bands. The first dummy is an estimated household income below 10K. This forms our base group against which the effects of all other income bands are compared. The remaining income groups are: 10 to 20K; 20 to 40K; 40 to 60K; and 60K and above. The income variable refers to household rather than individual income of participants. Moreover, household income data were also imputed to overcome a high rate of non-response and, thus, our findings in relation to this variable should be treated with some caution.
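Dummy coding against a base group, as used for the income bands here and for the other categorical variables, can be sketched as follows. The helper `dummy_code` and the label strings are hypothetical; the base group receives no indicator of its own, so its effect is absorbed into the intercept.

```python
INCOME_BANDS = ["below 10K", "10 to 20K", "20 to 40K", "40 to 60K", "60K and above"]


def dummy_code(value, categories, base, prefix):
    """Return 0/1 indicators for every category except the base group."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    if base not in categories:
        raise ValueError(f"unknown base category: {base!r}")
    return {f"{prefix}: {c}": int(value == c) for c in categories if c != base}


# A respondent in the 20 to 40K band gets a single indicator set to one.
row = dummy_code("20 to 40K", INCOME_BANDS, base="below 10K", prefix="income")

# A respondent in the base group gets zeros on all four indicators.
base_row = dummy_code("below 10K", INCOME_BANDS, base="below 10K", prefix="income")
```

Each estimated coefficient on an indicator is then read as the difference relative to households with income below 10K.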
Two types of variables are used to look at Hypothesis 4, which suggests that the level of music taste matters. They are designed to capture music interest and the perception of music quality. Firstly, we use five dummies which group individuals according to their self-reported level of music interest categorized under: interest very strong, somewhat strong, moderate, somewhat low and very low. The individuals who have very low music interest form our base group against which the effects of the other categories are compared. Secondly, we use a questionnaire item that asked respondents whether they perceived an increase or a decrease in the quality of music over the last year, or whether they felt the quality of music remained unchanged. The resulting variables are three dummy variables. The base group is the dummy coded as one if a participant perceived no change in the quality of music.
Finally, Hypothesis 5 suggests that people with higher Internet skills are more likely to purchase paid electronically-delivered music. To examine this relationship we use five dummies which are the following categories of Internet skills self-ratings: very skilled, skilled, somewhat skilled, not very skilled and not at all skilled. The last category (people who reported that they were not at all skilled in the use of the Internet) is the base group.
We also test for a number of demographic factors in the regression models. First, we include seven age categories. These are 15 to 19, 20 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64 and 65 and above. The last group, people who are 65 or older, is our comparison group. We also control for gender, coded as one for women and zero for men. Finally, we control for region (Quebec is coded as one and the rest of Canada is coded as zero).
It should be noted that the survey output does include demographic data on 'occupation' and 'education'. However, we found these data highly correlated with the other independent variables, so they were omitted from our digital divide estimations in order to avoid problems of multicollinearity.
In order to examine the impact on music purchased of our independent variables we use single equation regression methods. Weighted data are used throughout the analyses. The following equations are estimated.
Equation [1]: based on the whole sample
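$$
\begin{aligned}
y_i ={} & \alpha + \beta_1 \text{Price of CDs}_i + \beta_2 \text{P2P}_i + \beta_3 \text{Rip CD}_i + \beta_4 \text{Promotional website}_i \\
& + \beta_5 \text{Private website}_i + \beta_6 \text{Copy MP3}_i + \beta_7 \text{Purchase MP3}_i + \beta_8 \text{Number of DVDs}_i \\
& + \beta_9 \text{Number of videogames}_i + \beta_{10} \text{Number of cinema tickets}_i + \beta_{11} \text{Number of concert tickets}_i \\
& + \beta_{12} \text{Income}_i + \beta_{13} \text{Change in quality of music}_i + \beta_{14} \text{Interest in music}_i \\
& + \beta_{15} \text{Internet skills}_i + \beta_{16} \text{Age}_i + \beta_{17} \text{Gender}_i + \beta_{18} \text{Region}_i + \varepsilon_i
\end{aligned}
$$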
where $y_i$ is an indicator of music purchased, based on the number of CD albums purchased in 2005 as self-reported by the survey participants and described previously.
Equation [2]: based on the whole sample
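$$
\begin{aligned}
y_i ={} & \alpha + \beta_1 \text{Price of CDs}_i + \beta_2 \text{P2P}_i + \beta_3 \text{Rip CD}_i + \beta_4 \text{Promotional website}_i \\
& + \beta_5 \text{Private website}_i + \beta_6 \text{Copy MP3}_i + \beta_7 \text{Number of DVDs}_i + \beta_8 \text{Number of videogames}_i \\
& + \beta_9 \text{Number of cinema tickets}_i + \beta_{10} \text{Number of concert tickets}_i + \beta_{11} \text{Income}_i \\
& + \beta_{12} \text{Change in quality of music}_i + \beta_{13} \text{Interest in music}_i + \beta_{14} \text{Internet skills}_i \\
& + \beta_{15} \text{Age}_i + \beta_{16} \text{Gender}_i + \beta_{17} \text{Region}_i + \varepsilon_i
\end{aligned}
$$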
where $y_i$ is an indicator based on the number of paid electronically-delivered music tracks purchased in an average month in 2005 as self-reported by the survey participants. With respect to the independent variables, the Purchase MP3 term, whose coefficient $\beta_7$ measures the effect of MP3 purchases on CD album purchases in Equation [1], is excluded from Equation [2] because MP3 purchases form the dependent variable in this equation.
We compute a second set of estimations based on the sub-sample of P2P file-sharers. This is done because some variables that we analyze are only applicable for this particular group; e.g. what is the percentage of P2P files that people downloaded due to the fact that they wanted to listen to a song before buying. (For an overview, see review of variables feeding into the various hypotheses in Sub-section 'Variables'). The 246 participants who declared that they were P2P downloaders but subsequently did not give the number of downloads or responded that they had downloaded zero tracks from P2P networks were omitted from the analyses because their responses are not reliable. The following equation is estimated both on CD albums and MP3s.
Equation [3]: based on the sub-sample of P2P downloaders
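$$
\begin{aligned}
y_i ={} & \alpha + \beta_1 \text{Price of CDs}_i + \beta_2 \text{Album too expensive}_i + \beta_3 \text{Number of P2P}_i + \beta_4 \text{Number of CDs ripped}_i \\
& + \beta_5 \text{Number promotional website}_i + \beta_6 \text{Number private website}_i + \beta_7 \text{Number copy MP3}_i \\
& + \beta_8 \text{Number of MP3s purchased}_i + \beta_9 \text{Number of DVDs}_i + \beta_{10} \text{Number of videogames}_i \\
& + \beta_{11} \text{Number of cinema tickets}_i + \beta_{12} \text{Number of concert tickets}_i + \beta_{13} \text{Not elsewhere available}_i \\
& + \beta_{14} \text{Not whole album}_i + \beta_{15} \text{MP3 player ownership}_i + \beta_{16} \text{Hear before buying}_i + \beta_{17} \text{Income}_i \\
& + \beta_{18} \text{Change in quality of music}_i + \beta_{19} \text{Interest in music}_i + \beta_{20} \text{Internet skills}_i \\
& + \beta_{21} \text{Age}_i + \beta_{22} \text{Gender}_i + \beta_{23} \text{Region}_i + \varepsilon_i
\end{aligned}
$$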
where $y_i$ measures CD album purchases, as discussed before.
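Equation [4]: based on the sub-sample of P2P downloaders
$$
\begin{aligned}
y_i ={} & \alpha + \beta_1 \text{Price of CDs}_i + \beta_2 \text{Album too expensive}_i + \beta_3 \text{Number of P2P}_i + \beta_4 \text{Number of CDs ripped}_i \\
& + \beta_5 \text{Number promotional website}_i + \beta_6 \text{Number private website}_i + \beta_7 \text{Number copy MP3}_i \\
& + \beta_8 \text{Number of DVDs}_i + \beta_9 \text{Number of videogames}_i + \beta_{10} \text{Number of cinema tickets}_i \\
& + \beta_{11} \text{Number of concert tickets}_i + \beta_{12} \text{Not elsewhere available}_i + \beta_{13} \text{Not whole album}_i \\
& + \beta_{14} \text{MP3 player ownership}_i + \beta_{15} \text{Hear before buying}_i + \beta_{16} \text{Income}_i \\
& + \beta_{17} \text{Change in quality of music}_i + \beta_{18} \text{Interest in music}_i + \beta_{19} \text{Internet skills}_i \\
& + \beta_{20} \text{Age}_i + \beta_{21} \text{Gender}_i + \beta_{22} \text{Region}_i + \varepsilon_i
\end{aligned}
$$
where $y_i$ is an indicator of electronically-delivered music tracks purchased.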
Regressions are sensitive to misspecification of models. Such misspecifications are an issue that will almost always apply when statistical tests are carried out and they are difficult to address (e.g., Kennedy, 2003). One possibility that is adopted in this paper is to estimate and compare a number of alternative or competing models. These estimation models are described below.
The dependent variables, number of CD albums purchased and number of paid electronically-delivered music tracks purchased in 2005, represent count data, i.e. the dependent variables take non-negative integer values only. The most commonly used model to analyze count data is the Poisson model. The probability of observing $y$ events is $P(Y = y) = e^{-\lambda} \lambda^{y} / y!$, where $\lambda$ is both the mean and the variance of the distribution. Although the Poisson model is perhaps the most frequently used estimation technique to predict count data, the assumptions it makes are often not met by data. In particular, Poisson regressions assume that the variance of occurrences is equal to the mean of occurrences (Greene, 2003 and Kennedy, 2003). The assumption of equal mean and variance is unlikely to hold, and, in our case, the variance of the number of CD purchases is greater than the mean, i.e. our data are overdispersed, which has an adverse impact on our regression estimates. If the dependent variable is overdispersed, the most commonly used model is the negative binomial model, where the mean is $\lambda$, the variance is $\lambda + \alpha^{-1} \lambda^{2}$ and $\alpha$ is the parameter of the gamma distribution (Kennedy, 2003). For the purpose of our analyses we compare the results of both the Poisson and the negative binomial models.
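The Poisson probability and a simple check for overdispersion can be sketched as follows. The list `counts` is a hypothetical illustration, not survey data, and the negative binomial variance follows the parameterization given in the text (variance equal to the mean plus the squared mean divided by the gamma parameter).

```python
import math
from statistics import mean, variance


def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def negbin_variance(lam: float, alpha: float) -> float:
    """Negative binomial variance: lam + lam^2 / alpha."""
    return lam + lam ** 2 / alpha


# Hypothetical counts of CD albums purchased (not survey data).
counts = [0, 0, 1, 2, 2, 3, 5, 10, 20, 30]
sample_mean = mean(counts)
sample_var = variance(counts)

# Under a Poisson model the two moments should be roughly equal;
# a variance well above the mean indicates overdispersion.
overdispersed = sample_var > sample_mean
print(sample_mean, round(sample_var, 2), overdispersed)
```

A comparison of raw moments like this ignores the regressors, so it is only a first indication; a formal assessment would compare the fitted Poisson and negative binomial models.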
Furthermore, we compare the estimates of the Poisson and negative binomial models with OLS estimators. This is done because OLS estimations often compare rather favourably with the results of more complicated models. This is because the classic linear model is less prone to problems caused for example by errors in variables. In our case, errors in variables may arise in relation to all those variables where participants were asked to report on the number of albums or files acquired in a specific year (or month). We examined frequencies of such variables and found that respondents were likely to give approximations of the number of music purchases rounded to values of 10, 20, 30, and so forth.
OLS requires that the dependent variable is approximately normally distributed. The variable 'number of CD albums bought in 2005' exhibits a positive skew, with relatively more participants reporting low numbers of CD album purchases. To address this, we use two common types of data transformation in the OLS model: (i) taking the square root of the values of the dependent variable, and (ii) taking the natural log. As mentioned before, the log of zero is not defined, so we add one to the variable CD albums prior to taking the natural log. Adding one, as opposed to any other value, is common practice in economics and management studies (Tabachnick and Fidell, 2006). This is done because the log of one equals zero, so the transformation does not shift the lower bound of the distribution, i.e. both the untransformed and the transformed data take zero as the smallest value. Three separate OLS regressions (on the untransformed, the square-root and the log-transformed data) are performed for CD albums and for purchased electronic music tracks.
The variable 'number of MP3s purchased in an average month in 2005' exhibits an even stronger positive skew than the number of CD albums purchased. A large number of survey participants (1,750 out of the total of 2,100) declared that they purchased no electronic music tracks; thus there are 1,750 zero observations in this variable. As a result, neither the Poisson model nor the negative binomial model converged.
For the purpose of OLS estimations we use the actual data, the square root and the natural log of 'number of MP3s'. In the analysis of purchases of electronically-delivered music tracks, we also use a binary variable (coded one for persons who reported purchasing in 2005 and zero otherwise). On the basis of this variable, we estimate Logit and Probit models. Logit estimations are based on a logistic function and Probit models are based on a cumulative normal distribution; both follow a similar S-shape and produce highly similar results. For historical reasons the Logit is perhaps more frequently applied: before advanced statistical software packages became available, it was easier to calculate.
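The similarity of the two S-shaped functions can be checked numerically. The sketch below compares the standard normal cumulative distribution with a rescaled logistic function; the scaling constant 1.702 is a commonly used approximation and an assumption of this example, not a value taken from the paper.

```python
import math


def logistic_cdf(z: float) -> float:
    """Logistic function underlying the Logit model."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution underlying the Probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


SCALE = 1.702  # assumed rescaling so that the two curves nearly coincide

# Largest gap between the two curves on a grid from -4 to 4.
grid = [i / 10.0 for i in range(-40, 41)]
max_gap = max(abs(normal_cdf(z) - logistic_cdf(SCALE * z)) for z in grid)
print(round(max_gap, 4))
```

Because the curves differ mainly by a scale factor, Logit and Probit coefficients are not directly comparable in size, even though the predicted probabilities are usually very close.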
In the case of four variables we test the linear hypotheses of equal parameters after the negative binomial model in case of number of CD albums and the Probit model in the case of electronically-delivered music purchases. The difference in coefficients we are specifically interested in refers to the variables 'album too expensive', 'hear before buying', 'not elsewhere available' and 'not whole album' because these variables relate to sampling effects and market segmentation versus market substitution effects.
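A test of equal parameters for a pair of coefficients can be sketched as a Wald-type z-test. The paper does not report which test statistic was used, so this is an assumption; the function name and the numeric inputs below are hypothetical and do not come from the estimations.

```python
import math


def wald_equal_coefficients(b1, b2, var1, var2, cov12):
    """Two-sided z-test of H0: b1 == b2, given the estimated variances and
    covariance of the two coefficients. Returns (z, p_value)."""
    var_diff = var1 + var2 - 2.0 * cov12
    if var_diff <= 0:
        raise ValueError("variance of the difference must be positive")
    z = (b1 - b2) / math.sqrt(var_diff)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical coefficients on 'hear before buying' and 'album too expensive'.
z, p = wald_equal_coefficients(b1=0.010, b2=0.004, var1=4e-6, var2=1e-6, cov12=0.0)
print(round(z, 3), round(p, 4))
```

The covariance term matters: ignoring a positive covariance between the two estimates would overstate the variance of their difference and make the test too conservative.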
We now discuss potential issues with respect to the regressions carried out in the paper. These relate to problems of inferring causality from cross-sectional data, issues of endogeneity and omitted variables, heteroskedasticity and errors in variables.
Firstly, regressions based on cross-sectional data cannot prove causality; they only show an association between variables. Thus, with respect to this paper, causality may only be inferred on the basis of the theoretical reasoning carried out in previous sections. In this respect, estimations based on panel data, as used for example in Liebowitz (2004, 2005), are advantageous. The clear disadvantage, however, is that panel data with an equivalent richness of information and the same level of disaggregation (i.e. individual responses) as our dataset are not available.
Secondly, single equation estimations assume that all independent variables are exogenous and that all important variables are included in the estimation. If, however, any of the independent variables are influenced by the dependent variable and/or by other independent variables, or if important independent variables are omitted, then the included independent variables tend to be correlated with the error term, leading to inconsistent estimates (Kennedy, 2003). Problems of endogeneity are likely to affect our results. For example, the use of P2P downloads may be determined by CD purchases or, in fact, by other independent variables. Techniques designed to address issues of endogeneity include systems of simultaneous equations (e.g., Wooldridge, 2000). These are based on the use of instrumental variables to predict the endogenous regressors. Observed values of the endogenous variable are replaced with the predicted values in the ultimate equation, with the predicted values being uncorrelated with the error term.
Unfortunately, useful instruments are inherently difficult to find and this is why we decided not to use instrumental variable techniques. Simultaneous equations produce consistent estimates only if the instruments are uncorrelated with the error terms, i.e. they are truly exogenous to the system, and when the instruments are highly correlated with the endogenous variable. In reality almost every variable carries some degree of endogeneity. Moreover, Monte Carlo studies suggest that estimators of single equation regressions are less sensitive to the presence of other estimation problems, such as errors in variables or misspecifications of equations (Greene, 2003).
Thirdly, regression methods assume that the variances of the disturbances, in other words the errors of prediction, are approximately constant. A violation of this assumption is called heteroskedasticity (e.g., Kennedy, 2003). Heteroskedasticity may for example occur if the variation in music purchases is greater for people on a high income compared to people on low incomes. It may also arise when a variable is skewed or when a variable is correlated with an omitted variable. We tested for and found the presence of heteroskedasticity using a White test. To account for this, our regressions are based on robust standard errors.
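The difference between classical and heteroskedasticity-robust standard errors can be illustrated for a regression with a single regressor. The sketch below uses the basic White (HC0) estimator; which robust variant was used in the paper is not stated, so this choice is an assumption, and the data are hypothetical.

```python
import math


def simple_ols_se(x, y):
    """OLS of y on a constant and x. Returns the slope together with its
    classical and White (HC0) heteroskedasticity-robust standard errors."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must vary")
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical variance assumes one common error variance for all observations.
    classical_var = sum(e ** 2 for e in resid) / (n - 2) / sxx
    # HC0 lets each observation contribute its own squared residual.
    robust_var = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, resid)) / sxx ** 2
    return slope, math.sqrt(classical_var), math.sqrt(robust_var)


# Hypothetical data in which the spread of purchases grows with income.
income = [1, 2, 3, 4, 5, 6, 7, 8]
purchases = [2, 3, 3, 6, 4, 9, 5, 12]
slope, classical_se, robust_se = simple_ols_se(income, purchases)
print(round(slope, 3), round(classical_se, 3), round(robust_se, 3))
```

The coefficient estimate is identical under both approaches; only the standard error, and therefore the inference, changes.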