By Alan F. Blackwell
MRC Applied Psychology Unit
The best introduction to the research methods used in Psychology of Programming (PoP for the sake of levity) is David Gilmore's excellent chapter in the PoP book (Gilmore 1990). One area that David did not cover in his introduction is the use of survey methods, including questionnaires. This is no reflection on his contribution, because survey methods are not a normal part of the research repertoire of experimental psychology, which is after all our theoretical context. I, however, have been sufficiently unwise to include several surveys in the work carried out for my psychology PhD (Blackwell 1996a, 1996b, Whitley & Blackwell 1997).
At the last PPIG workshop, I discovered a co-offender in Helen Sharp, who with Jacqui Griffyth reported on a survey of programmers taking a course in object-oriented programming (Sharp & Griffyth 1998). A coffee-break conversation with Helen and Marian Petre revealed that Marian had a guilty stash of unanalysed, unpublished survey data - who knows how many others there are? Since then, I have tried to find out a little more about how we can use surveys in PoP - this is the (rather informal) result. I am indebted to Jackie Scott, a research methods specialist in the Social and Political Sciences Faculty in Cambridge, who has advised me and helped to clarify these ideas.
David Gilmore did, in his introduction to PoP methods, discuss interview studies. These are a type of survey, although the interviews he described are informal ones, and are usually conducted with very small sample sizes - this kind of survey is not intended as a basis for drawing general conclusions. When it is necessary to draw general conclusions, the following precautions are normally taken: a large sample is used, and the interviewer follows a question script (perhaps including options based on previous responses), from which they do not deviate. More exploratory interviews can be valuable (as in the study published by Marian Petre and me, Petre & Blackwell 1997), but it is important to recognise the potential influence that the interviewer has had on individual responses.
If the interviewer is reading questions from a script, the study starts to seem pretty much the same as a written questionnaire - once one has decided to carry out a formal survey, the choice of delivery method is not as important as the initial choice between formal and informal methods. Some of the analysis methods described below are applicable to informal surveys, however - the techniques used for coding of open format responses can also be used for analysis of informal protocols. They can even be used for text collected in non-research settings. In the first opinion survey I conducted (Blackwell 1996a), the opinions about programming that I analysed were collected from published computer science research papers. Kirsten Whitley and I later evaluated these alongside questionnaire responses, using identical techniques.
As in experimental design, the most straightforward survey design compares the responses of two or more sample groups. The groups can either be identified in advance (e.g. regular Windows programmers versus subscribers to a visual programming mailing list in Whitley & Blackwell 1997), or from a response to a forced choice question (e.g. "are you a professional programmer, or do you just write computer programs as part of your job?"). Once groups have been identified, all other responses can be compared between them, assuming that the measures are equivalent for each group. The main alternative to simple comparison of groups is to identify correlations between responses. This is a little more complicated, as described below. As in experimental design, surveys should be designed with research hypotheses in mind - there are other techniques (interviews, focus groups) that are more suitable for exploratory opinion investigations.
In questionnaire design, the most fundamental choice is whether each question should be open or closed. Closed format responses are hugely easier to analyse, but they have the disadvantage that one must anticipate what the respondents want to say - and you rarely succeed in doing this. You will quite likely miss the most interesting (because unexpected) findings. One solution is to run a pilot survey with open format responses, and use those results to create the set of response options for the main survey.
Closed format questions might ask respondents to agree or disagree with a statement, to select one of several categories as corresponding most closely to their opinion, or to express their opinion as a position on an ordered scale. The design of scales is the most complex of these, particularly in the choice of whether or not the scale should have a midpoint. The standard British social attitude scale is a 5-point scale; the standard American scale has 4 points. If you use a scale without a midpoint, this may be forcing an artefactual response - the respondent genuinely has no opinion, but they are forced to make something up on the spot (quite likely on the basis of preceding questions). If a midpoint is provided, you can't tell the difference between someone who has never thought about the question and someone who has mixed opinions. One approach is to include qualifying questions ("do you have an opinion about ...") before requesting opinion measures. Another is to include a specific "don't know" option - the relative advantages and disadvantages are discussed at length by Schuman & Presser (1981). A further problem in the use of rating scales is to account for the fact that each respondent will interpret the scale differently. The standard approach is to include a neutral "anchor" question as the first item in a series, and evaluate all other responses relative to the anchor.
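The anchor idea can be illustrated with a small sketch. It assumes that "relative to the anchor" means subtracting each respondent's own anchor rating from their other ratings; that is one simple reading, not necessarily the procedure used in any particular survey. The question ids and ratings are invented for illustration.

```python
def relative_to_anchor(ratings, anchor_key):
    """Return each rating minus the respondent's own anchor rating.

    Assumption: 'relative to the anchor' is taken to mean a simple difference.
    """
    anchor = ratings[anchor_key]
    return {q: score - anchor for q, score in ratings.items() if q != anchor_key}


# Two hypothetical respondents who use the same 5-point scale differently.
cautious = {"anchor": 3, "q1": 4, "q2": 2}
emphatic = {"anchor": 4, "q1": 5, "q2": 3}

print(relative_to_anchor(cautious, "anchor"))
print(relative_to_anchor(emphatic, "anchor"))
```

Although the raw ratings differ, both respondents end up with the same anchor-relative scores, which is the point of including the anchor.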
Where the response to different questions is to be compared, it is very important to minimise unnecessary differences between them (as in any controlled experiment). Some traps that I have nearly or actually fallen into include a) ordered scales in which the direction of the ordering changes from one question to the next and b) changes of tense between questions (implicitly asking for the respondent's current opinion in some questions, while others ask about their opinion in the past, or at some time in the future).
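If a questionnaire has already been administered with scales running in different directions, the responses can at least be aligned before comparison. The following is a minimal sketch, assuming scales are numbered from 1 upwards; the function name and parameters are illustrative. It does not undo any confusion the change of direction caused for respondents.

```python
def align_direction(score, points, reversed_item):
    """Map a rating onto a common direction.

    score: the rating given, from 1 to `points`.
    points: number of points on the scale (e.g. 5).
    reversed_item: True if this question's scale ran the opposite way.
    """
    if not 1 <= score <= points:
        raise ValueError("rating outside the scale")
    return points + 1 - score if reversed_item else score


# On a 5-point scale, a 1 on a reversed question corresponds to a 5.
print(align_direction(1, 5, reversed_item=True))
print(align_direction(2, 5, reversed_item=False))
```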
A well-known trap in survey design is induced bias. One should, of course, avoid introducing the survey, or phrasing individual questions, in a way that will bias the respondent toward a particular response.
I mentioned above the alternative survey administration methods of interviews and questionnaires. A third alternative, used in exploratory research, is the focus group. This has further advantages in discovering unexpected opinions. A valuable approach to pilot studies is to ask members of a focus group to complete a questionnaire that will be used in the main study, then ask them to discuss what they meant by their responses. This can identify problems in questionnaire design, as well as providing a framework for coding open format responses.
Three advantages of interviews over questionnaires (beyond the fact that they can be carried out on the telephone) are that questions will not be missed accidentally, validity of responses can be checked, and probe questions can be added in the case that a respondent makes a particular response or combination of responses. These advantages are also available in automated surveys. In the study described by Whitley & Blackwell (1997), we created an HTML form, and asked respondents to complete the form. When they submitted the form, their response was immediately checked by a CGI Perl script. If they had failed to respond to any question, the script generated a feedback page, suggesting that they complete the missing question. This script could also have asked probe questions.
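The completeness check described above can be sketched as follows. This is not the original CGI Perl script, which is not reproduced here; it is a Python illustration of the same idea, and the question ids, messages and function names are all assumptions.

```python
REQUIRED_QUESTIONS = ["q1", "q2", "q3"]  # hypothetical question ids


def missing_questions(form, required=REQUIRED_QUESTIONS):
    """Return the ids of required questions left blank in a submitted form."""
    missing = []
    for question in required:
        value = form.get(question)
        if value is None or not str(value).strip():
            missing.append(question)
    return missing


def feedback(form):
    """Build the text of a feedback page for one submission."""
    missing = missing_questions(form)
    if not missing:
        return "Thank you, your response has been recorded."
    return "Please complete the following questions: " + ", ".join(missing)


print(feedback({"q1": "agree", "q2": "  ", "q3": "disagree"}))
```

A probe question could be added in the same place, by testing for a particular response or combination of responses and returning a follow-up question instead of the thank-you text.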
If paper questionnaires are used, it is more difficult to guarantee a complete (or any) response. My experience is that personal contact is a great help. For the survey described in Blackwell (1996b) I dressed in my best suit, and stood at the door of a trade show for two days. Each person who arrived was politely invited to participate, and given a questionnaire form. I had arranged with the trade show organisers that completed questionnaires could be left in a box at their prominent desk. The questionnaire was a single page, and was clearly headed with the amount of time that it would take to complete (three minutes). These precautions resulted in a response rate approaching 50%, far higher than would be expected when people receive unsolicited questionnaires, or are asked to return them by post.
Analysis of closed questions is relatively easy. Many survey analyses use a Chi-square test to assess differences in response between groups. Where ordered scales are used, they are often treated as approximations to a normally distributed range of opinion. Used with caution, t-tests may therefore be sufficient for hypothesis testing that compares different groups. Similarly, paired t-tests can be used to compare systematic differences between the responses to two different questions. There are, of course, more sophisticated alternatives - but I won't report on them, because I haven't used them. A further alternative for the analysis of ordered rating responses is the use of correlation analysis to find out whether opinions on different questions are related.
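As an illustration of the first of these tests, here is a minimal sketch of the Pearson Chi-square statistic for a table of response counts, written without any statistics library. The counts are invented, the function name is illustrative, and the sketch assumes every row and column contains at least one response. It applies no continuity correction.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of counts.

    table: list of rows, one row per group, one column per response option.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if response were independent of group.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are two groups, columns are agree / disagree.
counts = [[30, 10],
          [15, 25]]
statistic, dof = chi_square(counts)
print(round(statistic, 2), dof)
```

For these invented counts the statistic is about 11.43 with one degree of freedom, which exceeds the conventional 5% critical value of 3.84, so the two groups would be judged to differ in their responses.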
Analysis of open questions is far more foreign to most PPIG people, although it has some similarity to protocol analysis techniques. It proceeds in two phases. The first is to create a coding frame - a set of rules for allocating responses to defined categories. The coding frame would ideally be based on the results of a pilot study, although I have always used a sample of the main survey in my own work. The coding frame is structured either to answer one or more specific questions derived from the study hypotheses, or to capture a range of response. The latter is perhaps a more likely application of open questions, and is more typical of my projects. In this case, the coding frame would be structured in a hierarchical manner, with broad response topics, each of which is usually subdivided.
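A hierarchical coding frame can be held as a simple nested structure once the coders have done their work. The sketch below assumes that coding is done by people and only tallies the results; the topics and subcategories are invented for illustration and are not taken from any of the studies cited here.

```python
from collections import Counter

# Hypothetical frame: broad topics, each subdivided into categories.
CODING_FRAME = {
    "usability": ["easy to learn", "easy to read"],
    "productivity": ["faster to write", "fewer errors"],
}


def tally(coded_responses, frame=CODING_FRAME):
    """Count respondents per category and per broad topic.

    coded_responses: one list of category codes per respondent.
    A respondent counts once per topic, however many of its categories apply.
    """
    parent = {sub: topic for topic, subs in frame.items() for sub in subs}
    category_counts = Counter()
    topic_counts = Counter()
    for codes in coded_responses:
        unique_codes = set(codes)
        for code in unique_codes:
            if code not in parent:
                raise ValueError("code not in coding frame: " + code)
            category_counts[code] += 1
        for topic in {parent[code] for code in unique_codes}:
            topic_counts[topic] += 1
    return topic_counts, category_counts


coded = [["easy to learn", "easy to read"], ["fewer errors"], []]
topics, categories = tally(coded)
print(dict(topics))
print(dict(categories))
```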
Assessing whether open responses are in favour or not in favour of some position within the coding frame is complicated. One should err on the side of non-classification, and then remember that the absence of a statement on a particular topic may reflect a firm opinion just as much as making an extreme statement does. People will often not mention their most deeply held beliefs, on the basis that these are obvious and do not need to be stated. Statistical comparisons should therefore always treat respondents who didn't make a statement on a given theme as a separate group.
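One way to keep the silent respondents separate is to give them their own column when tabulating coded responses. This is a minimal sketch; the group names, theme name and stance labels are assumptions made for the example.

```python
from collections import Counter


def stance_table(responses, theme):
    """Tabulate stances on one theme, by group.

    responses: list of (group, stances) pairs, where stances maps a theme
    to the coded stance. Respondents who said nothing on the theme are
    counted under 'not mentioned' rather than being dropped or merged.
    """
    table = {}
    for group, stances in responses:
        stance = stances.get(theme, "not mentioned")
        table.setdefault(group, Counter())[stance] += 1
    return table


# Hypothetical coded responses.
responses = [
    ("programmer", {"readability": "in favour"}),
    ("programmer", {}),
    ("researcher", {"readability": "against"}),
    ("researcher", {}),
]
for group, counts in stance_table(responses, "readability").items():
    print(group, dict(counts))
```

The resulting table has three columns per group, and can be passed to a test on counts without first discarding the "not mentioned" column.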
Coding of open responses according to the coding frame would normally be done by a coding panel of at least three people who are not aware of the study hypotheses. Reliability between coders should be "in the high 90% range" according to Jackie Scott. If this is not achieved, the coding frame should be refined. I should note that Dr. Scott usually deals with surveys having thousands of respondents. The largest study that I have been involved with attracted 227 responses (Whitley & Blackwell 1997).
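Agreement between coders can be checked with a few lines of code. The sketch below assumes that reliability is measured as simple percent agreement between each pair of coders, which fits the "high 90% range" figure; the text does not say which statistic is intended, and chance-corrected measures such as Cohen's kappa also exist. The coder names and codes are invented.

```python
from itertools import combinations


def percent_agreement(codes_a, codes_b):
    """Percentage of responses that two coders placed in the same category."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("coders must code the same non-empty set of responses")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def panel_agreement(codings):
    """Percent agreement for every pair of coders on the panel."""
    return {
        (first, second): percent_agreement(codings[first], codings[second])
        for first, second in combinations(sorted(codings), 2)
    }


# Hypothetical panel of three coders, each coding the same ten responses.
codings = {
    "coder_a": ["u", "u", "p", "p", "u", "p", "u", "u", "p", "p"],
    "coder_b": ["u", "u", "p", "p", "u", "p", "u", "p", "p", "p"],
    "coder_c": ["u", "u", "p", "p", "u", "p", "u", "u", "p", "p"],
}
for pair, agreement in panel_agreement(codings).items():
    print(pair, agreement)
```

Pairs that fall below the target would point to the categories in the coding frame that need refining.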
A difficult point for computer scientists to accept is that surveys measure opinions, rather than observable behaviour (as is the case in most HCI or PoP research). This makes them seem even more "soft" than experimental psychology, and perhaps less scientific. Nevertheless, research papers are still published in which computer scientists evaluate new user interfaces or programming systems by asking a small sample of raters to compare it to whatever they were using previously. ("Is my new system better, much better, or phenomenally better than the one you were using before?"). In the face of such habits, we can improve things substantially by at least taking a scientific approach to opinion analysis.
Things are slightly confused in some of my studies, where I was purposely measuring uninformed, naive, opinion (Blackwell 1996b). Discussion of these results requires a careful distinction regarding the object of study: for what purpose is it useful to know about uninformed opinions? They tell us about people's theories of programming, but not about how people actually do programming.
Some distinctions that are likely to be important include:
These are only some cautions that have already become clear to me. I'm still learning about this - if you are interested in using survey techniques, please feel free to contact me. If I get enough interest, I would like to run a tutorial session at the next PPIG workshop. In the meantime, I must read some more introductory texts on survey methods: Jackie Scott recommends Moser & Kalton (1979).
This report is cobbled together from other people's contributions - I knew nothing before I started. Jackie Scott was very generous in giving her time to someone so ignorant. Pat Wright at the APU helped design my first questionnaire. Kirsten Whitley suffered through my discovery process, having approached me as a collaborator because I knew more about surveys than she did! Thomas Green has been a stimulating and tolerant PhD supervisor. My research is funded by the Advanced Software Centre of Hitachi Europe.
Blackwell, A.F. (1996a). Metacognitive Theories of Visual Programming: What do we think we are doing? In Proceedings IEEE Symposium on Visual Languages, pp. 240-246.
Blackwell, A.F. (1996b). Do Programmers Agree with Computer Scientists on the Value of Visual Programming? In A. Blandford & H. Thimbleby (Eds.), Adjunct Proceedings of the 11th British Computer Society Annual Conference on Human Computer Interaction, HCI'96, pp. 44-47.
Gilmore, D.J. (1990). Methodological issues in the study of programming. In J.-M. Hoc, T.R.G. Green, R. Samurçay & D.J. Gilmore, Psychology of Programming. London: Academic Press, pp. 83-98.
Glenberg, A.M., Wilkinson, A.C. & Epstein, W. (1982). The illusion of knowing: Failure in the self-assessment of comprehension. Memory & Cognition, 10(6), 597-602.
Metcalfe, J. & Wiebe, D. (1987). Intuition in insight and noninsight problem solving. Memory & Cognition, 15(3), 238-246.
Moser, C. & Kalton, G. (1979). Survey methods in social investigation (2nd ed.). Aldershot: Gower.
Payne, S.J. (1995). Naive judgements of stimulus-response compatibility. Human Factors, 37(3), 473-494.
Petre, M. & Blackwell, A.F. (1997). A glimpse of expert programmers' mental imagery. In S. Wiedenbeck & J. Scholtz (Eds.), Proceedings of the 7th Workshop on Empirical Studies of Programmers, pp. 109-123.
Schuman, H. & Presser, S. (1981). Questions and answers in attitude surveys: Experiments on question form, wording and content. New York: Academic Press.
Sharp, H. & Griffyth, J. (1998). Acquiring object technology concepts: the role of previous software development experience. In J. Domingue & P. Mulholland (Eds.), Proceedings of the 10th Annual Workshop of the Psychology of Programming Interest Group, pp. 134-157.
Whitley, K.N. & Blackwell, A.F. (1997). Visual programming: the outlook from academia and industry. In S. Wiedenbeck & J. Scholtz (Eds.), Proceedings of the 7th Workshop on Empirical Studies of Programmers, pp. 180-208.