Humans differ in how they perceive, assess, and measure animal behaviour. This is problematic because strong observer bias can reduce statistical power and the accuracy of scientific inference and, in the worst cases, lead to spurious results. Unfortunately, reports of measurement reliability in animal behaviour studies are rare. Here, we investigated two aspects of measurement reliability in working dogs: inter-observer agreement and criterion validity (comparing novice ratings with those given by experts). To do so, we extended, for the first time, a powerful framework used in human psychological studies to investigate three potential aspects of (dis)agreement in nonhuman animal behaviour research: (a) that some behaviours are easier to observe than others; (b) that some subjects are easier to observe than others; and (c) whether observers with different levels of experience with the subject animal give the same or different ratings. We found that novice observers with the same level of experience agreed upon measures of a wide range of behaviours. We found no evidence that the age of the dogs affected agreement between these same novice observers. However, when observers with different levels of experience (i.e., novices vs. a working dog expert) assessed the same dogs, agreement appeared to be strongly affected by the measurement instrument used to assess behaviour. Given that animal behaviour research often relies on observers with different levels of experience, our results suggest that further tests of how different observers may measure behaviour in different ways are needed across a wider variety of organisms and measurement instruments.