Background: Beef carcass conformation and fat cover scores are assigned by trained technicians through subjective grading. Genetic evaluations account for the discrete nature of these scores by using a threshold model, which assumes an underlying continuous variable, called liability, that can be modelled in different ways.
Methods: Five threshold models were compared in this study. Three were threshold linear models: one with homogeneous thresholds that included slaughterhouse and sex effects along with other systematic effects, and two extensions with heterogeneous thresholds that varied across slaughterhouses and across slaughterhouse-sex combinations, respectively. The fourth was a generalised linear model with reverse extreme value errors, in which the underlying variable followed a Weibull distribution; this model is both a log-linear model and a grouped data model. The fifth model extended the grouped data model with score-dependent effects to allow for heterogeneous thresholds that varied across slaughterhouse and sex. The goodness-of-fit of these models was tested using bootstrap methodology. Field data included 2,539 carcasses of the Bruna dels Pirineus beef cattle breed.
Results: Differences in carcass conformation and fat cover scores among slaughterhouses could not be fully captured by a systematic slaughterhouse effect, as fitted in the threshold linear model with homogeneous thresholds, so different thresholds per slaughterhouse were estimated with a slaughterhouse-specific threshold model. This model corrected most of the fitting deficiencies for frequencies stratified by slaughterhouse, but it still failed to fit frequencies stratified by sex correctly, especially for fat cover, for which 5 of the 8 observed percentages fell outside the bootstrap interval. This indicates that scoring varied with sex and that a threshold linear model with thresholds specific to each sex within slaughterhouse should be used to guarantee the goodness-of-fit of the genetic evaluation model.
Grouped data models showed the same pattern: they avoided fitting deficiencies when slaughterhouse and sex effects were score-dependent.
Conclusions: Both threshold linear models and grouped data models can guarantee the goodness-of-fit of the genetic evaluation for carcass conformation and fat cover, but our results highlight the need for thresholds specific to sex and slaughterhouse to avoid fitting deficiencies.
© 2011 Tarrés et al; licensee BioMed Central Ltd.
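The threshold models and the bootstrap goodness-of-fit check described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions. First, liability is modelled as a linear predictor `eta` plus either a standard normal error (threshold linear model) or a reverse extreme value error with CDF 1 - exp(-exp(x)) (the complementary log-log form often used for grouped data models). Second, heterogeneous thresholds are represented by looking up a separate threshold vector for each slaughterhouse-sex class. In this parameterisation, adding a score-dependent group effect s_k to threshold t_k gives F(t_k + s_k - eta), so score-dependent effects and group-specific thresholds play the same role. Third, goodness-of-fit is checked with a parametric bootstrap that simulates a new score for every carcass from its fitted probabilities and asks whether each observed percentage falls inside a 95% percentile interval. The function names, the `THRESHOLDS` table, the linear predictors and the observed percentages are all hypothetical.

```python
import math
import random

# Minimal sketch (assumptions, not the authors' code): liability = eta + e,
# scores 0..K-1, thresholds that may differ by slaughterhouse-sex class.
# All numeric values below are hypothetical.


def normal_cdf(x):
    """Standard normal CDF: error distribution of a threshold linear (probit) model."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rev_extreme_value_cdf(x):
    """CDF of the reverse (minimum) extreme value distribution: 1 - exp(-exp(x))."""
    return 1.0 - math.exp(-math.exp(x))


def category_probs(eta, thresholds, cdf):
    """P(score = k), where score k is observed if t_{k-1} < eta + e <= t_k."""
    cum = [cdf(t - eta) for t in thresholds] + [1.0]  # P(liability <= t_k)
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]


def bootstrap_intervals(record_probs, replicates=1000, alpha=0.05, seed=1):
    """Percentile intervals for the percentage of carcasses in each score.

    record_probs holds one fitted probability vector per carcass in a stratum.
    Each replicate simulates a new score for every carcass from its vector.
    """
    rng = random.Random(seed)
    n, k = len(record_probs), len(record_probs[0])
    pct = [[] for _ in range(k)]
    for _ in range(replicates):
        counts = [0] * k
        for probs in record_probs:
            counts[rng.choices(range(k), weights=probs)[0]] += 1
        for j in range(k):
            pct[j].append(100.0 * counts[j] / n)
    lo = int(round(replicates * alpha / 2))  # lower percentile index
    hi = replicates - 1 - lo                 # symmetric upper index
    return [(sorted(p)[lo], sorted(p)[hi]) for p in pct]


# Hypothetical thresholds per (slaughterhouse, sex) class: heterogeneous thresholds.
THRESHOLDS = {
    ("A", "male"): [-1.0, 0.0, 1.2],
    ("A", "female"): [-1.4, -0.3, 0.8],
}

if __name__ == "__main__":
    etas = [0.1, -0.2, 0.4, 0.0, 0.3] * 20  # hypothetical linear predictors, 100 carcasses
    stratum = [category_probs(e, THRESHOLDS[("A", "female")], normal_cdf) for e in etas]
    observed = [10.0, 30.0, 40.0, 20.0]  # hypothetical observed percentages
    for obs, (lo, hi) in zip(observed, bootstrap_intervals(stratum)):
        print(f"observed {obs:5.1f}%  interval [{lo:5.1f}, {hi:5.1f}]  fits: {lo <= obs <= hi}")
```

In a real evaluation, `eta` would come from the fitted model for each carcass, `rev_extreme_value_cdf` would replace `normal_cdf` for the grouped data models, and the check would be repeated for every stratum (slaughterhouse, sex, or both) and for each trait.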