A GUEST BLOG BY DR STEPHEN CLIFT
In Stephen’s last guest blog in this series, he demonstrates that a highly cited arts and health paper is a ‘fairy tale’ that has cast a collective spell over the field. Stephen wishes he had published this debunking in 2008. Now, 12 years later, here it is…
Stephen Clift (BA, PhD, PFRSPH) is Professor Emeritus, Canterbury Christ Church University, and former Director of the Sidney De Haan Research Centre for Arts and Health. He is a Professorial Fellow of the Royal Society for Public Health (RSPH) and is also Visiting Professor in the International Centre for Community Music, York St John University. Stephen has worked in the field of health promotion and public health for over thirty years, and has made contributions to research, practice and training on HIV/AIDS prevention, sex education, international travel and health and the health promoting school in Europe. His interests relate to arts and health and particularly the potential value of group singing for health and wellbeing. Stephen is one of the founding editors of the journal Arts & Health: An international journal for research, policy and practice. He was the founding Chair of the RSPH Special Interest Group for Arts, Health and Wellbeing, and a founding trustee of Arts Enterprise with a Social Purpose (AESOP). He is also co-editor with Professor Paul Camic of the Oxford Public Health Textbook on Creative Arts, Health and Wellbeing published in November 2015. Currently, he is working on developing a series of provocations in arts and health research, a special collection of critical papers on arts and health with Frontiers in Psychology, and a special issue of the International Journal for Community Music on the impacts of the COVID-19 pandemic.
THE NEED FOR ROBUST APPRAISAL OF RESEARCH IN ARTS AND HEALTH: A CASE OF ‘THE EMPEROR’S NEW CLOTHES’
This blog starts with a judgement made by David Cutler (2019) in his recent excellent report: Around the World in 80 Creative Ageing Projects. There, Cutler outlines the main benefits that research has documented for older people from participation in creative activities. He then says:
‘Indeed the landmark study in this field comes from the USA. Published in The Gerontologist (Vol 46, no 6, pages 726–734), the study followed 300 subjects with a median age of 80. One group was involved in arts programmes and the control group was not. The study suggested that involvement in the arts led to better health, fewer visits from doctors, less medication, increased physical activity and social engagement. This led to the claim that such programmes could result in a reduction of $6.3 billion dollars at that time to the US public purse.’ (2019, p.8)
This study (Cohen, Perlstein, Chapline, Kelly, et al., 2006, 2007) was indeed a ‘landmark’ event. The paper Cutler cites (which was concerned only with singing) was the first of two, with a further paper published a year later reporting on a follow-up evaluation. However, these papers were subjected to detailed critical scrutiny by Clift, Staricoff, Hancox and Whitmore (2008), who questioned the methodology, the statistical analysis, and the accuracy of the claims made. Details of the Clift et al. (2008) critique were not published in a peer-reviewed journal, but the Cohen papers are included in a subsequently published mapping review (Clift, Nicol, Raisbeck, Whitmore and Morrison, 2010) and reservations are expressed there about the weaknesses of the study.
I have looked at these papers again, and I’ve been shocked to discover that the study is more problematic than Clift et al. (2008) revealed. In addition to the many methodological and analytical problems, the findings Cohen et al. report are entirely subjective and essentially trivial.
Summary of the Cohen et al. 2006-7 study on the impacts of cultural programs for older people
These papers report on the impact of active participation in group singing. The study was part of a larger ambitious investigation aimed at ‘measuring the impact of professionally conducted, community-based cultural programs on the general health, mental health and social activities of older adults aged 65 and older’ (2006, p.726). Investigations were conducted at three separate sites and the programmes of activity ranged from ‘painting, writing, poetry, jewelry making and material culture, to music in the form of singing in chorales.’ (2006, p.726). A summary report on the investigation as a whole can be found here: https://www.arts.gov/sites/default/files/NEA-Creativity-and-Aging-Cohen-study.pdf
For the ‘singing in chorales’ element of the project, 166 ‘healthy ambulatory older adults’ were recruited from the same residential areas in Washington DC to participate in a quasi-experimental study. The authors do not explain why they approached recruitment in the way they did, with two separate notices circulated in the same residential areas and programmes for older people – one to recruit to participate in a singing group, and a second to recruit to a study to monitor health and wellbeing, without involvement in any activity. A preferred strategy would have been to recruit to the study as a whole and to randomise participants into the singing and control arms after baseline assessment (as in Coulton, Clift, Skingley and Rodriguez, 2015). Cohen et al. (2006) provide no explanation of why they did not adopt this more robust approach.
Of the people recruited, 90 served as an ‘intervention group’ and were involved in professionally facilitated singing activities and rehearsals for 30 weeks over two years, plus public performances. The remaining 76 served as a ‘comparison group’ and received no form of intervention except the assessments involved. Participants’ health and social activities were ‘measured’ before the start of the intervention and then one year and two years later. Health measures included a self-assessed overall health rating, information on health service utilisation, use of medication, number of falls and standardised questionnaires assessing morale, depression and loneliness. In addition, participants were asked to give detailed information on the ‘nature, frequency and duration’ of their social activities.
The outcomes of the study appear quite remarkable. The researchers sum up their findings after the first year of the intervention in the following way:
‘Results obtained from utilizing established assessment questionnaires and self-reported measures, controlling for any baseline differences, revealed positive findings for the intervention such that the intervention group (chorale) reported a higher overall rating of physical health, fewer doctor visits, less medication use, fewer instances of falls, and fewer other health problems than the comparison group. The intervention group also evidenced better morale and less loneliness than the comparison group. In terms of activity level, the comparison group experienced a significant decline in total number of activities, whereas the intervention group reported a trend toward increased activity.’ (2006 p.726)
Findings continued to be positive after two years, and are summed up as follows:
‘Results revealed positive intervention effects in relation to physical health, number of doctor visits, medication usage, depression, morale, and activity level.’ (2007, p.5)
Cohen et al. (2007) highlight the important implications of the improvements seen, given the high average age of the participants:
‘Moreover, the actual improvement reported in general health and the sustained level of involvement in overall activities 2 years into the study among subjects with an average age greater than life expectancy, reflects a reduction in risk factors driving the need for long-term care, through continuing involvement in a high-quality participatory art program – in this case, in an ongoing chorale directed by a professional conductor.’ (2007, p.20)
I will now give detailed attention to the 2006 paper, and set aside the results given in the 2007 paper, as that paper has substantial flaws. Clift et al. (2008) point out, for example, that the mental health data for the intervention group reported in the 2006 and 2007 papers are identical to the first decimal place. Given the 37% attrition rate between the end of year one and end of year two, this degree of similarity is highly implausible and indicates reporting errors in the 2007 paper.
The main problems with this study
There are problems with the presentation of data, the analytic strategy employed, the use of significance levels in drawing conclusions from the results, and with the results themselves.
- The results are consistently reported as means and standard deviations, when for some of the measures this was inappropriate. This is particularly clear for the falls data for the previous 12 months, where one might have expected to see simple frequencies reported. For the intervention group at baseline, for example, the mean number of falls is given as 0.4 with a standard deviation of 0.93. This mean could indicate that 40% of the sample had fallen once during the previous year and the remainder had not fallen at all. However, the standard deviation is nearly one, and if we assume that most of the data would fall within two standard deviations of the mean, the likelihood is that some respondents had fallen at least twice and maybe even three times. Clearly, a more appropriate form of analysis would have been to report the incidence of falling or not falling over the 12 months prior to the study, and during the first twelve months of the study itself.
- The analysis at baseline involves the total sample of people recruited into the study (N=166), whereas at 12-month follow-up, data are presented for 141 participants, which represents a loss of 25 participants (i.e. a 15% attrition rate). In cases where no significant difference between the groups was apparent at baseline, and a significant difference emerges at follow-up, it is possible that the difference could be attributable to attrition and this is not ruled out. It is also impossible to assess whether this issue affected the results obtained from using analysis of covariance.
- The researchers adopt a liberal significance level of 10% in performing their statistical analyses, because of the ‘exploratory nature of this study.’ It is somewhat surprising that they characterise the study in this way given that the investigation is theoretically grounded, fairly well controlled, involves a range of standardised instruments and uses statistical techniques to reject a series of implicit, but not clearly stated, null hypotheses. On the other hand, where a study is exploratory, a case could be made for using a significance level more stringent than the conventional 5% level, in order to reduce the likelihood of type I errors and gain greater confidence in conclusions that the intervention had an effect. In Table 2, nine significant effects are indicated with p values between 0.1 and 0.01, but only two measures achieve significance at the 0.01 level (overall health rating and over-the-counter medications).
- Cohen et al. are not explicit as to whether one-tailed or two-tailed tests are applied when testing mean differences, although the t-values reported for some post-test comparisons suggest that the lower one-tailed critical values were employed. Given the view that the study is ‘exploratory’, however, it would have been more appropriate to consistently employ two-tailed values in assessing significance.
- The researchers give no attention to possible ceiling and floor effects in their measures nor to the issue of skewed distributions. The possibility that these issues are a real consideration is particularly clear for the ‘depression’ scale employed – the Geriatric Depression Scale Short Form. In describing this measure, they report that the data for the entire sample had a range of 0-10, a mean of 1.73 and a standard deviation of 1.97. As is typical for most self-assessment questionnaires, possible scores on this scale are discrete – 0, 1, 2, 3 etc., and a mean of 1.73 indicates that on average participants scored themselves between 1 and 2 on an 11-point scale. In other words, on average, the study participants give no indication of being depressed. However, the standard deviation is close to two, and if we assume that a large majority of respondents fall within two standard deviations either side of the mean, this gives a range of 8 points. Clearly, respondents cannot score more than one standard deviation below the observed mean as this is the start of the scale, and so there is likely to be a distinct positive skew in the data with a few respondents scoring at the midpoint on the scale or beyond. From a mental health point of view, respondents scoring relatively highly may be experiencing problems with depression. It would have been relevant to know whether this was the case and whether those people who had higher depression scores initially showed any change in their scores on the depression scale that might have indicated improvement. No such data are reported by Cohen et al. This example also highlights the limitations of relying upon tests of statistical significance to judge the impact of an intervention, when the real interest lies in whether clinically significant changes have taken place on an individual level.
- It is surprising, given the controlled and quantitative nature of this study, that Cohen et al. give no attention to the power of their investigation and do not report effect sizes for the various measures employed. Given the use of well-known standardised measures it would have been a simple matter to estimate the size of likely changes (and even clinically significant changes) and the minimum sample size required for a satisfactory level of power.
- In addition to these technical issues, Cohen et al. do not address the possibility of bias due to study demand characteristics, particularly among the participants in the chorale intervention. The likelihood of such bias is clear from the account Cohen et al. give of how people were recruited for the chorale arm of the study: ‘The notice for the intervention group differed only in that it sought singers for a chorale; no singing experience was required, and the study’s purpose was to explore the impact of this activity on general health and mental health…’ (2006, p.728) It is of course essential from an ethical point of view that participants are given information about the nature of a study before they agree to participate. The issue here, however, is the likelihood that most participants would assume, even if it were not explicitly stated by the investigators, that the activity of singing is thought to be beneficial. Insomuch as participants enjoy and value the activity of singing, they are likely either consciously or unconsciously to adopt a bias in favour of the hypothesis under test. Such a bias may well have affected the way in which they responded to the questionnaires and scales employed and their preparedness to disclose information about health service utilisation, medication and social activities.
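Several of the points above rest on simple arithmetic that readers can check for themselves. The sketch below is illustrative only and uses nothing beyond the summary figures quoted in this blog (the falls mean and standard deviation, the depression scale mean and standard deviation, and the sample sizes at baseline and follow-up). The function and variable names are my own, and the calculation uses the population formula for a standard deviation, so a sample value would be very slightly larger.

```python
import math


def sd_if_each_faller_fell_once(mean_falls: float) -> float:
    """Population SD of a 0/1 count when a share `mean_falls` fell exactly once."""
    return math.sqrt(mean_falls * (1 - mean_falls))


# Falls figures quoted above for the intervention group at baseline.
falls_mean, falls_sd = 0.4, 0.93
single_fall_sd = sd_if_each_faller_fell_once(falls_mean)  # about 0.49

# Depression scale figures quoted above; the lowest possible score is 0.
gds_mean, gds_sd = 1.73, 1.97
sds_above_floor = gds_mean / gds_sd  # about 0.88

# Attrition between baseline (N=166) and 12-month follow-up (N=141).
attrition_rate = (166 - 141) / 166  # about 0.15

print(f"SD if every faller fell once: {single_fall_sd:.2f} (reported: {falls_sd})")
print(f"Depression mean lies {sds_above_floor:.2f} SDs above the scale floor")
print(f"Attrition rate: {attrition_rate:.1%}")
```

If 40% of participants had fallen once and nobody had fallen more often, the standard deviation could be no more than about 0.49. The reported value of 0.93 is nearly double that, which is only possible if some people fell several times. Similarly, the depression mean sits less than one standard deviation above the lowest possible score, which is what one would expect from a positively skewed distribution.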
Findings reported in the 2006 paper
I turn now to the results reported by Cohen et al. (2006). The title of the paper indicates three areas of interest – physical health, mental health and social functioning. Detailed attention will be given here to selected results on physical and mental health as these measures are most pertinent to the hypothesis that singing has health and wellbeing benefits. There are problems with all aspects of the findings.
- Overall health ratings were made on a scale of 0-10, and for the comparison group at follow-up the mean was 7.29 and for the intervention group the mean rating was 7.97 – so a difference of 0.68. What does this difference mean, and does it have any substantive importance? It is impossible to say, other than it reflects a difference in subjective ratings on the scale provided, and that the difference is very small and may well be trivial. In addition, for the intervention group, the mean ratings effectively remained unchanged, while the comparison group showed a small reduction in its mean.
- No significant difference in health ratings was found at baseline, and so Cohen et al. use this to justify the use of a simple t-test on the follow-up data and report a significant difference at the 1% level. It is clear, however, that they are employing a one-tailed critical value in this test. Given the claimed ‘exploratory’ nature of the study a one-tailed test is probably inappropriate, and the use of a two-tailed test would be more cautious. Caution would also dictate using a more stringent probability value for judging significance than the conventional 5% level. If both are applied, the difference found at follow-up is not statistically significant. It is therefore academic to consider the possible effects of attrition and other biases operating in the study. As this is the strongest finding reported, it follows that a more cautious approach to statistical inference means that the difference at follow-up is both trivial and statistically non-significant.
- Over-the-counter medications data are simple means of such medication use reported by participants. The results appear to show that for both intervention and comparison groups average medication levels increased from pre-test to post-test. Cohen et al. report that no significant difference was apparent at baseline and their analytical strategy would suggest that a t-test would again be used to compare means at follow-up. Somewhat surprisingly, however, an F value is reported, indicating the use of analysis of covariance. The results point to greater use of medication among the comparison group than the intervention group, but there is no way of estimating the changes involved as the baseline data for the reduced sample given attrition are not reported. In any event, the data are again subjective and there are serious question marks over the accuracy with which anyone could recall over-the-counter purchases of medicine over a previous year.
- Number of falls is reported as means and it was suggested above that this is inappropriate. Nevertheless, Cohen et al. indicate that no significant difference was found at baseline and that the mean number of falls was reduced in the intervention group and increased in the comparison group. At follow-up, a t-test is used to compare the groups and a statistically significant difference is reported at the 5% level. Again, however, a one-tailed test appears to have been applied, and the use of a more stringent two-tailed criterion means that the difference at follow-up is no longer significant. In addition, the idea that group singing could reduce falls is implausible, and this result has never been replicated.
- The overall picture obtained from the mental health measures, for morale, depression and loneliness, is very disappointing as none of the differences reported are significant at even the one-tailed 5% level. A large factor accounting for the lack of measured effects may well be the operation of floor and ceiling effects. Overall, the measures indicate that participants at baseline had good morale, were not depressed and were not lonely, and there may well have been little real scope for seeing substantial movement given the high level of wellbeing at the outset of the study. On the other hand, there is clearly variation on each of the scales, and it may well be that a few people involved in the study were indeed low in morale, depressed and lonely. Unfortunately, Cohen et al. provide no information on the number of such people in their study, and whether the intervention had any positive effects for them.
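The difference between one-tailed and two-tailed criteria, which matters for the health rating and falls results discussed above, can be illustrated with a short sketch. It is a simplified illustration rather than a re-analysis: it uses the normal approximation to the t distribution (reasonable with roughly 140 participants), the helper names are my own, and the test statistic of 1.80 is a hypothetical value chosen for the example, not a figure taken from Cohen et al.

```python
from statistics import NormalDist


def critical_value(alpha: float, two_tailed: bool) -> float:
    """Critical z value for a given significance level (normal approximation)."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


def is_significant(statistic: float, alpha: float, two_tailed: bool) -> bool:
    """Two-tailed tests use the absolute value; one-tailed tests use the
    statistic in the hypothesised (positive) direction only."""
    value = abs(statistic) if two_tailed else statistic
    return value >= critical_value(alpha, two_tailed)


hypothetical_statistic = 1.80  # illustrative only, not from the paper

for alpha in (0.10, 0.05, 0.01):
    one = critical_value(alpha, two_tailed=False)
    two = critical_value(alpha, two_tailed=True)
    print(f"alpha={alpha:.2f}: one-tailed {one:.2f}, two-tailed {two:.2f}")

print(is_significant(hypothetical_statistic, 0.05, two_tailed=False))
print(is_significant(hypothetical_statistic, 0.05, two_tailed=True))
```

At every significance level the two-tailed threshold is higher than the one-tailed threshold, so a result that just clears the one-tailed bar, like the hypothetical 1.80 at the 5% level, fails the two-tailed test at the same level.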
Lack of critique in reviews of the Cohen study
As noted earlier, Cutler describes the Cohen research as ‘the landmark study’ in the field of creative arts and the health and wellbeing of older people. And the Cohen study has been repeatedly cited, without qualification, as a major contribution to the field. This, for example, is what the APPG (2017) report ‘Creative Health’ has to say about the work:
‘In the USA, the late Dr Gene Cohen led the Creativity and Ageing Study, supported by the NEA at George Washington University, which looked at the impact of weekly participatory arts programmes over two years. This involved 300 ethnically diverse participants (half of whom formed a control group) aged between 65 and 103 and dispersed across three states. Activities included painting, pottery, dance, music, poetry and drama. The study found ‘true health promotion and disease prevention effects’, including increases in self-reported health and ‘reducing risk factors that drive the need for long term care’, including falls. Dr Cohen later reviewed research suggesting that social, psychological, and neurobiological mechanisms were at play.’ (2017, p.124)
Thus, the conclusions drawn by Cohen and his team are ‘recycled’ without critical commentary (and even without specifying that the peer-reviewed published papers concerned only singing), including the finding that the intervention reduced falls.
Daykin, Mansfield, Meads, Julier et al. (2018) report on a systematic review of ‘wellbeing outcomes for music and singing in adults’ which includes the Cohen report (wrongly cited as published in 2016). The review missed the second (2007) report, despite the systematic review process, and expresses no concerns regarding the methodological and reporting weaknesses identified by Clift et al. (2008). This is all that is said about the study and its findings:
‘… [a] study of 166 participants … compared a 30-week choral singing project with usual activity. This showed significant differences after 12months in morale, depression and loneliness for intervention groups compared with controls. While both groups evidenced a decline in morale and loneliness, this was slighter for the comparison group who showed a reduced risk of depression after 12months.’ (2018, p.42)
Other reviews of the literature on music and older people are similarly ‘descriptive’ and simply state the conclusions reached by Cohen et al. Creech, Hallam, McQueen and Varvarigou (2013) do, at least, cite both the 2006 and 2007 papers:
‘In the USA Cohen et al. (2006, 2007) carried out non-randomized controlled studies with 166 participants with a mean age of 80. Over the course of one year these participants were involved in 30 singing workshops and ten performances. The participants, in comparison with control groups, reported fewer health issues, fewer falls, fewer doctor visits and less use of medication.’ (2013, p.5)
Other sources citing Cohen et al. (2006) are content to use the research to support catch-all statements regarding the increasing body of evidence in support of the impacts of creative arts programmes. Camic and Chatterjee (2013), for example, give the Cohen report as the last of seven sources cited in support of the following claim:
‘There is increasing evidence from quantitative and qualitative studies in different countries that arts-based and other cultural programmes can reduce adverse psychological and physiological symptoms and are positive determinants for survival, well-being and quality of life and self-reported health.’ (2013, p.111)
Cohen et al. (2006) continues to be cited in publications from 2019 and 2020, but with no critical commentary. Fancourt and Finn (2019), for example, in the WHO Scoping Review ‘What is the Evidence on the Role of Arts in Improving Health and Wellbeing?’, follow the approach of Camic and Chatterjee and cite the Cohen paper (along with several others) in support of the following claims:
‘There is a large body of research showing how arts engagement can enhance multidimensional subjective well-being, including affective well-being (positive emotions in our daily lives), evaluative well-being (our life satisfaction) and eudemonic well-being (our sense of meaning, control, autonomy and purpose in our lives). For example, studies of specific arts interventions (including singing, group drumming, arts and crafts, magic, dancing, daily photography and visiting cultural heritage sites) have shown increases in all types of individual and social well-being.’ (2019, p.21)
A recent systematic review by Ghiga, Pitchforth, Lepetit, Miani et al. (2020), gives no details of the Cohen et al. (2006) study in the text, but they can be found in the Supplementary Tables, where the study is listed as one among many studies said to report positive findings. Again, there is no indication that the authors were aware of the study’s substantial weaknesses.
Since publication, Gene Cohen’s ‘ground-breaking’ study has been the object of uncritical admiration. Collectively, the field of arts and health has wanted to believe the fairy story he wove for us.
But we have all been taken in by The Emperor’s New Clothes. Sadly, Cohen and his colleagues were naked all along!
To academics who have conducted research on singing, health and wellbeing:
- Before reading the critique above, would you have agreed with David Cutler (Baring Foundation) that the work of Cohen et al. was ‘the landmark study’ in the field of creativity and healthy aging?
- Do you accept the criticisms made here of the Cohen et al. study?
- Why is it that the study appears to have been cited in so many reviews without explicit recognition of its weaknesses?
- Despite the weaknesses, do you believe that the key findings from the Cohen et al. study have been ‘replicated’ through subsequent research?
References
APPG (2017) Creative Health: The Arts for Health and Wellbeing. London: All Party Parliamentary Group for Arts, Health and Wellbeing. https://www.culturehealthandwellbeing.org.uk/appg-inquiry/
Camic, P. and Chatterjee, H. (2013) Museums and art galleries as partners for public health interventions, Perspectives in Public Health, 133, 1, 66-71. https://journals.sagepub.com/doi/pdf/10.1177/1757913912468523
Cohen, G.D., Perlstein, S., Chapline, J., Kelly, J. et al. (2006) The impact of professionally conducted cultural programs on the physical health, mental health, and social functioning of older adults, The Gerontologist, 46, 6, 726–734. https://doi.org/10.1093/geront/46.6.726
Cohen, G.D., Perlstein, S., Chapline, J., Kelly, J. et al. (2007) The impact of professionally conducted cultural programs on the physical health, mental health, and social functioning of older adults, Journal of Aging, Humanities and the Arts, 1, 1-2, 5-22. https://www.tandfonline.com/doi/abs/10.1080/19325610701410791
Cutler, D. (2019) Around the World in 80 Creative Ageing Projects. London: Baring Foundation. https://www.artshealthresources.org.uk/docs/around-the-world-in-80-creative-ageing-projects/
Clift, S., Nicol, J., Raisbeck, M., Whitmore, C. and Morrison, I. (2010) Group singing, well-being and health: A systematic mapping of research evidence. UNESCO Observatory Interdisciplinary Research in the Arts e-journal, 2, 1, 1-15. Available from: https://www.unescoejournal.com/wp-content/uploads/2020/03/2-1-11-clift-paper.pdf
Clift, S., Staricoff, R., Hancox, G. and Whitmore, C. (2008) A Systematic Mapping of Non-Clinical Research on Singing and Health. Canterbury: Canterbury Christ Church University. https://www.artshealthresources.org.uk/docs/singing-health-a-systematic-mapping-review-of-non-clinical-research/
Coulton, S., Clift, S., Skingley, A. and Rodriguez, J. (2015) Effectiveness and cost-effectiveness of community singing on mental health-related quality of life of older people: randomised controlled trial, British Journal of Psychiatry, 207, 3, 250-5. https://doi.org/10.1192/bjp.bp.113.129908
Creech, A., Hallam, S., McQueen, H. and Varvarigou, M. (2013) The power of music in the lives of older adults, Research Studies in Music Education, 35, 1, 87-102. https://journals.sagepub.com/doi/abs/10.1177/1321103X13478862
Daykin, N., Mansfield, L., Meads, C., Julier, G. et al. (2018) What works for wellbeing? A systematic review of wellbeing outcomes for music and singing in adults, Perspectives in Public Health, 138, 1, 39-46. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753835/
Fancourt, D. and Finn, S. (2019) What is the evidence on the role of the arts in improving health and well-being? A scoping review. Copenhagen: World Health Organization. https://www.euro.who.int/en/publications/abstracts/what-is-the-evidence-on-the-role-of-the-arts-in-improving-health-and-well-being-a-scoping-review-2019
Ghiga, I., Pitchforth, E., Lepetit, L., Miani, C. et al. (2020) The effectiveness of community-based social innovations for healthy ageing in middle- and high-income countries: a systematic review, Journal of Health Services Research and Policy, 25, 3, 202-210. https://journals.sagepub.com/doi/pdf/10.1177/1355819619888244