What is the Required Level of Data Cleaning? A Research Evaluation Case
Bibliometric methods depend heavily on the quality of data, and cleaning and disambiguating data are very timeconsuming. Therefore, quite some effort is devoted to the development of better and faster tools for disambiguating of the data (e.g., Gurney et al. 2012). Parallel to this, one may ask to what extent data cleaning is needed, given the intended use of the data. To what extent is there a trade-off between the type of questions asked and the level of cleaning and disambiguating required? When evaluating individuals, a very high level of data cleaning is required, but for other types of research questions, one may accept certain levels of error, as long as these errors do not correlate with the variables under study. In this paper, we present an earlier case study with a rather crude way of data handling as it was expected that the unavoidable error would even out. In this paper, we do a sophisticated data cleaning and disambiguation of the same dataset, and then do the same analysis as before. We compare the results and discuss conclusions about required data cleaning.