Journal of Scientometric Research, 2016, 5, 1, 07-12.
DOI: 10.5530/jscires.5.1.3
Published: May 2016
Type: Research Article
Peter van den Besselaar1*, Ulf Sandström2
1Department of Organization Sciences and Network Institute, VU University Amsterdam, Amsterdam, The Netherlands.
2Department of Industrial Economics and Management (INDEK), KTH Royal Institute of Technology, Stockholm, Sweden.
Abstract:
Bibliometric methods depend heavily on the quality of data, and cleaning and disambiguating data are very time-consuming. Considerable effort is therefore devoted to developing better and faster tools for disambiguating the data (e.g., Gurney et al. 2012). In parallel, one may ask to what extent data cleaning is needed, given the intended use of the data. To what extent is there a trade-off between the type of questions asked and the level of cleaning and disambiguation required? When evaluating individuals, a very high level of data cleaning is required, but for other types of research questions, one may accept certain levels of error, as long as these errors do not correlate with the variables under study. Here, we revisit an earlier case study that used a rather crude approach to data handling, under the expectation that the unavoidable errors would even out. We now perform a sophisticated cleaning and disambiguation of the same dataset and repeat the earlier analysis. We compare the results and draw conclusions about the level of data cleaning required.
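The "evening out" argument can be made concrete with a minimal simulation sketch (ours, not from the paper; all group sizes, publication rates, and error probabilities below are invented for illustration). It simulates publication counts for two hypothetical groups and compares matching errors that are independent of group membership with errors that correlate with it:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000  # hypothetical researchers per group

    # Invented "true" publication counts for two groups.
    a_true = rng.poisson(20, n)
    b_true = rng.poisson(25, n)
    print("true difference:       ", b_true.mean() - a_true.mean())

    # Uncorrelated error: both groups lose ~10% of their papers
    # to imperfect matching, independent of group membership.
    a_obs = rng.binomial(a_true, 0.90)
    b_obs = rng.binomial(b_true, 0.90)
    print("uncorrelated error:    ", b_obs.mean() - a_obs.mean())

    # Correlated error: matching fails far more often for group A
    # (e.g., because of more common author names).
    a_bad = rng.binomial(a_true, 0.70)
    b_bad = rng.binomial(b_true, 0.95)
    print("correlated error:      ", b_bad.mean() - a_bad.mean())

With uncorrelated error, the observed group difference shrinks slightly but the comparison remains intact; with error that correlates with group membership, the observed difference is distorted, which is exactly the situation in which cruder data handling would mislead.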