ABSTRACT
This paper presents the challenge of predicting collaborations between research scientists as a datathon experience. The focus of the challenge task is determining whether or not the author of a research paper will collaborate with another author in the future. The main aims of the datathon challenge are: (i) to show the feasibility of automatically identifying potential collaborations in a research network as a link prediction task; (ii) to propose a methodology for the environment configuration that covers the data collection, selection and preparation stages required for link prediction in a massive event; and (iii) to join the efforts of students from different fields of study in solving the task from a multi-disciplinary perspective. For this purpose, we created a corpus with DBLP data covering publications from 1990 to 2004. The created dataset has been made available for further research. Altogether, the datathon attracted 78 registered students, yielding 13 submissions from teams of six students each. In this paper, we compare their approaches and analyze their performance.
INTRODUCTION
A datathon is an event where groups of multidisciplinary students and early career researchers can work together on a new interdisciplinary project for a concentrated period, usually over the course of around three days. The aim is to bring together students with complementary skills and knowledge; they then work together to create an initial plan to solve a specific data-driven problem. This allows the participants to establish new and completely independent collaborations with minimal intrusion into their normal duties and, in the process, create a basis for the development of interdisciplinary data management projects. The number of datathons has increased dramatically since the format was introduced. They are used by industry, scientists and others to solve a wide variety of data management problems and to develop new strategies within a short period of time.[1,2]
Datathons have emerged recently as a new way of developing solutions in any application area, promoting rapid learning, innovation and collaboration. The organization of a datathon is an excellent opportunity for a data science challenge. For this reason, the Institute for Research in Applied Mathematics and Systems (IIMAS) of the National Autonomous University of Mexico (UNAM) organized a datathon in January 2020 that lasted three days. The first day was dedicated to delivering three workshops: Introduction to Network Analysis: Statistical Measurements and Network Connectivity; Influence and Centrality in Networks; and Prediction of Links. During the second day, the challenge statement was explained, the teams were formed and they were given access to the programming environment and the data. On the third day, the teams focused on the link prediction modeling. After the evaluation, the University awarded the best predictive models. The event involved students with no experience in data science and experts in the area as mentors, which resulted in an enriching and formative experience for all the participants.
The challenge presented in the IIMAS-UNAM datathon 2020 consists of predicting co-authorship links in a collaborative social network built from the DBLP database by applying graph analysis, statistics and machine learning techniques. The paper is organized as follows: the Datathon Challenge section describes the main issues concerning creating a predictive model of the collaboration of university scientists. The next two sections detail the steps the datathon organizers took in preparing the data and selecting the metrics to evaluate the teams. Subsequently, the outcomes obtained from the generated models are analyzed, the findings are concluded and future work is identified.
The Datathon Challenge: Predicting Collaborations among Scientists
The National Autonomous University of Mexico (UNAM) has 24 scientific research institutes, 7 research centers, 12 humanities institutes, 25 schools, 5 foreign campuses within the country and 14 offices abroad that carry out teaching projects, research and technical development related to computer science.
UNAM is a distinguished university worldwide for its teaching, scientific and technological contributions. Furthermore, these contributions have enormous potential for growth, specialization and application if collaboration is strengthened across all entities and university sites.[3]
Several researchers within the university are working on similar or complementary research projects. Due to the lack of contact or collaboration among such colleagues, many academics might take longer to obtain significant results.
The challenge consists of analyzing scientific papers to predict future collaborations between academics that could produce more and better scientific results. Regarding the data science challenge, the approach focuses on identifying, comparing and improving algorithms for link prediction in a social-scientific network. Previous work on this task has already been reported.[4]
To predict new collaborations between computer science paper authors, co-authorships are represented as edges in a graph (or social network). The graph consists of a set of nodes (in this case, authors) that are related to each other whenever they co-author a scientific article.
The problem of predicting links in a graph corresponds to finding the probability of a future association (co-authorship) between two nodes, knowing that there is no association between them in the current state of the network.
Consider an undirected graph G = (V, E), where each edge e = (u, v) ∈ E represents an interaction between nodes u and v at a particular time t. Such an interaction, in the domain of our problem, is defined as the co-authorship of a research article.
There are link prediction algorithms designed to estimate different influence rates within the links of a graph. They assume that a node with multiple connections may be more likely to receive additional links.[5]
For example, predicting co-authorship among scientists involves understanding how the authors relate to each other and, for instance, measuring the tendency of scientists who share connections in a research group to connect with each other to achieve their goals and publish new articles.[6] The sample dataset utilized in the IIMAS-UNAM Datathon 2020 corresponds to the Digital Bibliography and Library Project (DBLP) database,[7] specifically the DBLP monthly release from January 2019.[8] We decided to use this dataset since it is a free resource that provides open bibliographic information on major journals and reports of computer science conferences.
The following subsection analyzes the challenge presented to the teams and the considerations taken into account during the environment's configuration, both to provide the competitors with sufficiently prepared data, so that they could focus exclusively on modeling, and to identify which metrics would be in accordance with the data and the model.
Analysis of the Challenge
Given the short duration of the datathon, there were many issues that we tackled before it began:
The information stored in the DBLP database was not of optimal size, content or format to be analyzed directly. Typically, the most time-consuming stage in a data science project is pre-processing and transformation. Consequently, an initial pre-processing was required to let the participant teams concentrate on the analysis, generation and comparison of models, in order to achieve better performance in the prediction.
The link prediction problem can be approached as binary classification. However, in this case, the imbalanced nature of the classification model[9] arises because the sample datasets are not balanced; that is, there is a majority class, represented by the negative outcomes, which covers 75% of the possible cases, and a minority class, represented by the positive outcomes, which covers the remaining 25%.
Authors with fewer than three publications are not relevant to productive research groups, so they were not considered for prediction.
The nature of the link prediction problem in a social network requires supervised learning. So, to evaluate the performance of the participants' models, data from past and present collaborations must be gathered to compose the training and validation sets. Using data from future collaborations to test the participants' models is also necessary. It is, therefore, necessary to divide the DBLP database into periods of publication years.
There are various indicators or metrics for evaluating the performance of the models.[10] They depend on the type of learning to be carried out, the algorithms used, etc. Thus, it was necessary to identify the more suitable performance metric that would be applied during the model assessment.
The time required for the correct and fair evaluation of the proposals would be directly proportional to the number of teams formed. Therefore, a mechanism must be established to automate the model evaluation.
The following section presents the main steps carried out for setting up the link prediction environment to guarantee a successful event in terms of time and achievements.
Setting the Environment: the Current State Graph, the Training and Test Datasets
Three data sets were designed containing co-authorships carried out during different periods to address the link prediction problem through an imbalanced binary classification model.[9,11]
The first data set included authorships that occurred from 1990 to 2000 and corresponds to the current authorship graph. The second data set focuses on co-authorships carried out during 2001 and 2002 and is required to train the model. The third data set contains the co-authorships established from 2003 to 2004 and is used to evaluate the link prediction models. As the goal of the models is to predict links in a data set by successfully distinguishing positive classes, this problem was considered a binary classification problem that can be solved using effective features in a supervised learning framework.
Figure 1 shows the temporality and the overall flow of the process of extracting co-authorship sets from the DBLP database. These sets correspond to several-year intervals that do not overlap. The three data sets serve as a framework for addressing the link prediction problem through a binary classification model.
The prediction consists of finding the set of links formed at time t + Δ, called E(t + Δ), given the current state of the network (co-authorship graph) at time t, G(t) = {V(t), E(t)}.
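For illustration, the following minimal sketch (not part of the challenge materials; the node names and the use of networkx are only one possible way of encoding the problem) shows how the current state G(t) and the candidate pairs with no current association can be represented:

```python
import networkx as nx

# Current state of the network G(t): authors as nodes, co-authorships as edges.
G = nx.Graph()
G.add_edges_from([("author_a", "author_b"), ("author_b", "author_c")])

# Candidate links for E(t + delta) are the pairs with no association at time t.
candidates = list(nx.non_edges(G))
print(candidates)  # [('author_a', 'author_c')]
```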
Generating the current state of the graph (from 1990 to 2000)
The current state of the graph is a co-authorship network that can be used as the reference point to extract the features that allow the classification model to identify the positive class in the classification data sets. The positive class in this problem corresponds to the pairs of authors who will collaborate in the future. This section describes the preprocessing performed on the original DBLP dataset to create the current state of the graph with publications from the years 1990 to 2000.
The DBLP database is a free Extensible Markup Language (XML) resource that provides open bibliographic information on major journals and reports of computer science conferences worldwide. It was originally created at Trier University in 1993. Nowadays, the DBLP database is operated and developed by Schloss Dagstuhl. Further information can be found online.[8]
The XML dataset provided by DBLP consists of a series of structured tags. The root element is the tag <dblp>, which contains a sequence of bibliographic records. The DBLP dataset contains journal and magazine articles, conference papers, proceedings, books, in-collection entries (a part or chapter in a monograph) and master's and doctoral theses. The DTD document of the DBLP dataset is shown in Figure 2.
The DBLP elements used to build the current state of the graph G(t) are:
Authors: Represent the nodes of the network and are denoted by V (t).
Co-authoring: Represent the edges of the network and are denoted by E(t).
Absence of co-authoring: a pair of nodes that are not related (not connected by an edge) in the current state of the network.
One feature that significantly impacts link prediction is the number of articles that a pair of authors has published. The importance of this feature arises from the fact that authors with higher article counts are more prolific. If one or both authors are prolific, the probability that this pair will collaborate in a co-authorship is greater than in the case of unprolific authors.[4] Considering the above, to rule out unprolific authors, the current state of the graph should include only authors with at least three publications. This minimum number of publications was chosen empirically.
The generation of the current state of the graph was structured as a pipeline composed of 4 steps. Figure 3 describes the flow of the pipeline:
Step 1-Extracting authorships from DBLP
In this step, the authors’ data were extracted from the DBLP data set. The original first and last names were restricted to ISO-8859-1 characters, so they were converted to UTF-8. To reduce processing time during the programming and test phases of the datathon, the input data set was reduced to articles and conference papers and included only information from 1990 to 2000. The data cleansing and pre-processing are summarized as follows:
Translation of the XML document from Latin-1 (ISO-8859-1) to UTF-8.
Selection of the specific period of time (1990-2000).
Obtaining articles and conference records.
The output of this step is the authorships.csv file with the fields id article, author and year.
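As an illustration of Step 1, the following sketch shows one possible way to stream the DBLP XML and produce authorships.csv. It assumes a local dblp.xml with its DTD and uses lxml; the exact column names and filtering are illustrative, not the organizers' actual script:

```python
import csv
from lxml import etree

with open("authorships.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["id_article", "author", "year"])
    # Stream the large XML file; load_dtd resolves DBLP's character entities.
    context = etree.iterparse("dblp.xml", events=("end",),
                              tag=("article", "inproceedings"), load_dtd=True)
    for _, record in context:
        year_node = record.find("year")
        year = int(year_node.text) if year_node is not None and year_node.text else 0
        if 1990 <= year <= 2000:
            key = record.get("key")  # DBLP record key, used here as the article id
            for author in record.findall("author"):
                writer.writerow([key, author.text, year])  # written out as UTF-8
        record.clear()  # release memory for already-processed records
```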
Step 2-Generation of Nodes Catalog
In this step, a node catalog was generated with the following considerations:
Each author is a node.
The ID of each author is their full name.
Duplicate records were deleted.
Authors with two publications or fewer were discarded.
Generation of authors catalog with fields: author and id article.
The authorships.csv file, with 482998 authorship records, yielded a catalog of 43912 nodes named nodes.csv.
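A minimal sketch of Step 2, assuming the authorships.csv layout described above (column names are illustrative), could look as follows:

```python
import pandas as pd

authorships = pd.read_csv("authorships.csv")

# Count distinct publications per author and keep only prolific authors (>= 3).
counts = authorships.groupby("author")["id_article"].nunique()
prolific = counts[counts >= 3].index

nodes = authorships.loc[authorships["author"].isin(prolific), ["author", "id_article"]]
nodes = nodes.drop_duplicates()      # delete duplicate author/article records
nodes.to_csv("nodes.csv", index=False)
```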
Step 3-Authorships filtering
In this step, publications were filtered to preserve only those whose authors appear in the node catalog, according to the following:
Read nodes catalog (nodes.csv).
Read authorships file (authorships.csv).
Output authorships that have authors with more than two articles to the filtered authorships file (filteredAutorships.csv) with fields: id article, author and year.
Once the publications were filtered to preserve only those whose authors appear in the node catalog, the number of authorship records decreased to 306197, corresponding to 43912 authors.
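Step 3 can be sketched as a simple filter over the two files produced earlier (again assuming the illustrative column names):

```python
import pandas as pd

nodes = pd.read_csv("nodes.csv")
authorships = pd.read_csv("authorships.csv")

# Keep only the authorship records whose author appears in the nodes catalog.
kept = authorships[authorships["author"].isin(set(nodes["author"]))]
kept[["id_article", "author", "year"]].to_csv("filteredAutorships.csv", index=False)
```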
Step 4-Generation of Edges Set (Current state of the graph)
This step generates the list of edges that represents the co-authorship network in its current state. Edge weights are not taken into account, so duplicate edges were removed. Permutations of the edges were also discarded; for example, the edge (author1, author2) and its permutation (author2, author1) are treated as the same edge. The following process is performed to generate the co-authorship network:
Read filtered authorships (filteredAutorships.csv).
Create a pair of the authors that appear in the same article.
Generate the corresponding edges with the authors’ pairs.
Delete duplicate authors’ pairs.
Finally, the current state of the graph, which is undirected and unweighted, contains 43912 nodes and 95703 edges.
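A possible sketch of Step 4, assuming the filtered file from the previous step and an illustrative output file name (edges.csv), is:

```python
from itertools import combinations
import pandas as pd

filtered = pd.read_csv("filteredAutorships.csv")

edges = set()
for _, group in filtered.groupby("id_article"):
    authors = sorted(set(group["author"]))
    # Every unordered pair of co-authors of the same article yields one edge;
    # sorting removes permutations such as (author2, author1).
    edges.update(combinations(authors, 2))

pd.DataFrame(sorted(edges), columns=["author1", "author2"]).to_csv("edges.csv", index=False)
```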
Generating the Training and Testing Datasets
The training set contains data from 2001 to 2002 and the testing set contains data from 2003 to 2004. These files were generated following a pipeline of seven steps, where the first four steps follow the process already explained in Figure 3, and each set of author pairs must meet the following conditions:
Both authors must appear in the current state of the graph (1990-2000).
The authors did not publish any articles together in the current state of the graph (1990-2000).
If both conditions are met and the pair of authors published together in one of these datasets, the pair constitutes a positive sample; otherwise, it is a negative sample.
Figure 4 describes the pipeline of the additional three steps required to generate the training and testing datasets.
The entire process for generating the training and testing sets is described through seven steps:
Step 1-Authorship Extraction
This step follows the same process as in the co-authorship network generation. The only difference is the periods considered: 2001-2002 for the training set and 2003-2004 for the testing set. After this process, the total number of authorship records extracted is 167611 for the training set and 222186 for the testing set.
Step 2-Generation of Nodes Catalog
As a result of selecting authors from the current-state node catalog and the authorships, the training data set node catalog comprised 23606 nodes. The testing data set, corresponding to the years 2003-2004, contained 22663 nodes.
Step 3-Authorship Filtering
As a result of filtering authors against the training set node catalog and the authorships during the years 2001-2002, there are 66936 authorship records. In the case of the testing data set (2003-2004), there are 69994 authorship records.
Step 4-Generation of the Edges Set
A full set of author pairs (edges) was generated by filtering the authors within a specific period of time: 2001-2002 for training, with 23246 edges, and 2003-2004 for testing, with 20678 edges.
Step 5-Positives sample generation
In this step, the positive samples of the training and testing datasets were obtained: the edges already present in the network’s current state were removed from the set generated in step 4, and the resulting edges were labeled with the letter P to identify them as positive samples. The 11753 remaining edges became the positive samples for the training set, so the minority class represents 25% of the possible cases. In the case of the testing dataset (2003-2004), 12764 edges were obtained as positive samples.
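A minimal sketch of Step 5 under the same illustrative file layout (the period-specific edge file name is hypothetical) might look like this:

```python
import pandas as pd

current = pd.read_csv("edges.csv")              # current state of the graph (1990-2000)
window = pd.read_csv("edges_2001_2002.csv")     # edge set produced in step 4

current_set = {tuple(sorted(p)) for p in current.itertuples(index=False)}

# Positive samples: pairs that co-authored in 2001-2002 but not in the current state.
mask = [tuple(sorted(p)) not in current_set for p in window.itertuples(index=False)]
positives = window[mask].copy()
positives["label"] = "P"
positives.to_csv("positives_2001_2002.csv", index=False)
```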
Step 6-Negative sample generation
The set of negative samples was randomly obtained. The number of negative samples maintained the proportion established for these sets, and the resulting edges were labeled with the letter N. The training data set resulted in 35259 negative samples and the testing data set contained 38292 negative samples.
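Step 6 can be sketched as rejection sampling over random author pairs; the node and output file names below are hypothetical, and the 3:1 ratio follows the proportion stated above:

```python
import random
import pandas as pd

nodes = pd.read_csv("train_nodes.csv")["author"].tolist()   # training node catalog
positives = pd.read_csv("positives_2001_2002.csv")
current = pd.read_csv("edges.csv")

# Pairs that may not be drawn as negatives: positive samples and current-state edges.
forbidden = {tuple(sorted((a, b)))
             for a, b in zip(positives["author1"], positives["author2"])}
forbidden |= {tuple(sorted(p)) for p in current.itertuples(index=False)}

rng = random.Random(42)
target = 3 * len(positives)          # keep the 75%/25% negative/positive proportion
negatives = set()
while len(negatives) < target:
    pair = tuple(sorted(rng.sample(nodes, 2)))
    if pair not in forbidden:
        negatives.add(pair)

out = pd.DataFrame(sorted(negatives), columns=["author1", "author2"])
out["label"] = "N"
out.to_csv("negatives_2001_2002.csv", index=False)
```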
Step 7-Yielding the final training and test datasets
The last step joined the positive and negative samples into a single dataset to generate the final training and testing sets separately. Both sets followed the proportion of 25% for positive samples and 75% for negative samples. The final training dataset consists of 23606 nodes, 23246 edges, 11753 positive samples and 35259 negative samples. The testing dataset contains 22663 nodes, 20678 edges, 12764 positive samples and 38292 negative samples.
A sample of the data sources generated during the present work is available in the repository of the event.[15]
Table 1 shows the number of authors and authorships obtained at each step of the process to generate the current state, training and testing datasets.
Dataset | Authorships | Nodes catalog | Authorships filtering | Edges set | Positive samples | Negative samples |
---|---|---|---|---|---|---|
Current state | 482998 | 43912 | 306107 | 95703 | – | – |
Training | 167611 | 23606 | 66936 | 23246 | 11753 | 35259 |
Testing | 222186 | 22663 | 69994 | 20678 | 12764 | 38292 |
Evaluation Framework
After the training data was released to the participants, the time period to submit the results was 16 hr. Each team received an identification number, which was used to identify and evaluate the methods developed during the datathon. All teams submitted their link prediction files to a server. A sample of the results is also available in the repository of the event.[2] Each file was evaluated using the metrics described in the following subsection.
Evaluation Metrics
As mentioned in the description of the datathon challenge, and as can be observed in Table 1, both the training and testing sets have a majority class represented by the negative link instances, which covers 75% of the possible cases. Thus, care must be taken in how the models presented during the competition are evaluated.
The statistical performance measures of a binary classification model are called rates: True Positives (TP) correspond to the number of instances the classifier predicted correctly in the positive class; False Negatives (FN) correspond to the number of instances incorrectly classified in the negative class, also known as type II errors; False Positives (FP) correspond to the number of instances incorrectly classified in the positive class, also known as type I errors; True Negatives (TN) correspond to the number of instances the classifier predicted correctly in the negative class. This section presents the most popular metrics for evaluating link prediction methods.
Precision
The precision metric[10] evaluates the exactness of the minority (positive) class predictions. It is the ratio of the number of correctly predicted positive samples (TP) to the total number of samples predicted in the positive class (the sum of True Positives (TP) and False Positives (FP)).
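In terms of the rates defined above:

$$\text{Precision} = \frac{TP}{TP + FP}$$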
Precision is a good measure to consider when the cost of false positives is high. For example, in spam detection, a false positive means that an email that is not spam (actual negative) has been identified as spam (positive). The mail user may lose important emails if the precision of the spam prediction model is not high.
Recall
The recall metric[10] is the ratio of the number of samples correctly predicted in the positive class to the number of all positive samples. Unlike precision, recall indicates the positive samples that the model failed to detect. In other words, recall provides a notion of positive class coverage.
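In terms of the rates defined above:

$$\text{Recall} = \frac{TP}{TP + FN}$$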
Recall should be used when a high cost is associated with false negatives. For example, in fraud detection, if a fraudulent transaction (actual positive) is predicted as non-fraudulent (predicted negative), then the risk of losing large amounts of money in a financial institution would be very high. Precision and recall cannot fully describe a model’s predictability. A model can have very high precision and very low recall, or vice versa.
F1-score
The F1-score provides a combined measure of precision and recall; it corresponds to the harmonic mean of the two and is calculated through the following formula:
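$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$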
Example: Consider a dataset containing 12975 positive samples and 38925 negative samples, whose prediction results are presented as the confusion matrix in Table 2. No value is shown for the true negatives because they are not used by the metrics in question:
 | Predicted Negative | Predicted Positive |
---|---|---|
Actual Negative | TN: – | FP: 7136 |
Actual Positive | FN: 648 | TP: 12327 |
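Substituting the values from Table 2 into the formulas above gives, approximately:

$$\text{Precision} = \frac{12327}{12327 + 7136} \approx 0.63, \qquad \text{Recall} = \frac{12327}{12327 + 648} \approx 0.95, \qquad F_1 \approx 0.76$$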
From the previous measures, we could say that the model has low precision but excellent recall. Furthermore, the F1-score balances precision and recall, providing sufficient information to describe the model’s predictability when trained on an imbalanced dataset.
The F1-score was chosen to assess the methods for the co-authorship prediction problem addressed in the UNAM datathon because it is the most commonly used metric for imbalanced classification problems. The steps carried out during the evaluation process are explained in the following section.
Evaluation process
The evaluation process goes through three stages:
Delivering results
The participant teams received an unlabeled version of the test set to evaluate their co-authorship prediction models. The predictions obtained by their models were then sent to the competition judges.
Generating the Confusion Matrix
For each prediction file submitted by the participants, the judges generated a confusion matrix, using the labeled version of the test set to count the number of positive and negative samples correctly classified, taking into account the equivalences listed in Table 3.
Outcome | Meaning |
---|---|
TP | Co-authors labeled as positive that do exist in the test set |
FP | Co-authors labeled as positive that do not exist in the test set |
TN | Co-authors labeled as negative that do not exist in the test set |
FN | Co-authors labeled as negative that do exist in the test set |
Calculation of the evaluation metric
The F1-score is calculated from the values of the confusion matrix for each prediction file submitted by the participating teams.
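The judging step can be sketched as follows; the prediction and gold file formats (column names and P/N labels) are assumptions, not the official evaluation script:

```python
import pandas as pd

gold = pd.read_csv("test_labeled.csv")       # columns: author1, author2, label (P or N)
pred = pd.read_csv("team_predictions.csv")   # columns: author1, author2, prediction (P or N)

merged = gold.merge(pred, on=["author1", "author2"], how="left")
merged["prediction"] = merged["prediction"].fillna("N")   # missing pairs count as negative

tp = ((merged["label"] == "P") & (merged["prediction"] == "P")).sum()
fp = ((merged["label"] == "N") & (merged["prediction"] == "P")).sum()
fn = ((merged["label"] == "P") & (merged["prediction"] == "N")).sum()

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"TP={tp} FP={fp} FN={fn} F1={f1:.3f}")
```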
Overview of the Challenge Results
Each team chose its data science strategy and working plan and competed through a long, cheerful programming night and day against their opponents. Sixteen teams made it to the end of the competition and uploaded the results to the file system. Nine teams predicted the link with a reasonable F1-Score. Most teams approached the link prediction through hand-crafted rule-based methods and others programmed machine-learning-based methods. The rule-based methods focused on computing the similarities of disconnected pairs of nodes by analyzing the proximity of nodes, where every potential node pair would be assigned a score. A higher score means a higher probability of establishing a link in the future. The machine-learning-based methods focused on a binary classification task[4] using as feature set the same similarity metrics as the rule-based systems. If there is a potential link connecting a pair of nodes, this pair is labeled as positive, otherwise it is negative.
Table 4 shows, for each participating team, the team number, the language in which they developed the solution, the methodology followed (rules or a machine learning algorithm), the features computed (link prediction metrics) and the F1-score achieved, in descending order. It can be observed that the highest F1-score, 0.62, was achieved by Team #16. The submissions of the remaining teams could not be evaluated, for different reasons; Table 5 describes why these submissions failed. Team #3 provided numerical indexes instead of author names in their prediction files. Team #8 and Team #20 did not provide a prediction field. Team #9 did not provide any file for evaluation. Team #29 provided a file with a different character set. Team #15 provided a file with many more predictions than expected.
Team | Language | Method | Computed Metrics | F1-score |
---|---|---|---|---|
16 | Python | Rules | Common neighbours, preferential attachment and shortest path. | 0.62 |
14 | Python | Rules | Shortest path, secondary neighbours and Jaccard coefficient. | 0.59 |
30 | R | Log. Regr. | Jaccard coefficient and preferential attachment. | 0.56 |
4 | Python | Naïve Bayes | Jaccard coefficient, resource allocation, Adamic-Adar index, Soundarajan-Hopcroft index and within-inter cluster. | 0.47 |
10 | Python | XGBoost | Common neighbours, Jaccard coefficient, resource allocation index, Adamic-Adar index, preferential attachment, triadic closure left, triadic closure right, node centrality left, node centrality right and vector similarity. | 0.35 |
5 | Python | XGBoost | Common neighbours, Jaccard coefficient, Adamic-Adar index, preferential attachment, eccentricity and shortest path length. | 0.28 |
6 | Python | Rules | Shortest path, node connectivity, minimum node cut, edge connectivity and minimum edge cut. Distance measures based on eccentricity: diameter, radius, periphery and center. | 0.25 |
Team | F1-score | Observations |
---|---|---|
3 | 0 | The output format was incorrect (numerical indexes). |
8 | 0 | The output format was incorrect (no prediction field existed). |
19 | 0 | Not finished. |
20 | 0 | The output format was incorrect (no prediction field existed). |
29 | 0 | The file was not encoded in UTF-8. |
15 | 0 | The output contained far more predictions than expected (about 43 million). |
The best-performing team (Team #16) developed a rule-based system using three similarity methods and combined them to predict the co-authorship of two authors. The first method computes the common neighbors[12] of the pair of nodes: nodes with more neighbors in common are more likely to form an edge in the future.
The second method calculates the preferential attachment,[12] which estimates the probability of co-authorship of x and y by computing the product of the number of collaborators of x and y. The third method finds the length of the shortest path between the pair of nodes; if there is no path between the nodes, a large number is assigned. Once the three values were computed for each pair of nodes in the training and testing sets, the team performed a grid search on the training set and obtained a threshold for each metric. With the obtained thresholds, a binary value was assigned to each metric as follows: a) the number of common neighbours should be greater than or equal to 2, b) the preferential attachment score should be greater than or equal to 200 and c) the shortest path should be less than 5. Finally, if a pair of nodes meets any of the three conditions, it is classified as a possible future co-authorship.
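A minimal sketch of this rule set, using networkx and the illustrative file names from the earlier sketches, could be:

```python
import networkx as nx
import pandas as pd

G = nx.from_pandas_edgelist(pd.read_csv("edges.csv"), "author1", "author2")
pairs = pd.read_csv("test_pairs.csv")        # columns: author1, author2

def predict(u, v):
    if u not in G or v not in G:
        return "N"
    cn = len(list(nx.common_neighbors(G, u, v)))    # a) common neighbours >= 2
    pa = G.degree(u) * G.degree(v)                  # b) preferential attachment >= 200
    try:
        sp = nx.shortest_path_length(G, u, v)       # c) shortest path < 5
    except nx.NetworkXNoPath:
        sp = 10**6                                  # no path: assign a large number
    return "P" if (cn >= 2 or pa >= 200 or sp < 5) else "N"

pairs["prediction"] = [predict(u, v) for u, v in zip(pairs["author1"], pairs["author2"])]
pairs.to_csv("rule_based_predictions.csv", index=False)
```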
The second-best approach (Team #14) also used a similarity-based method, with a hierarchy of rules for predicting the nodes (authors) that will form a connection (co-authorship). The first rule is that there must be a path between a pair of nodes for them to form a relation in the future; if there is no path, the pair of nodes is automatically rejected as a possible relationship. The second rule is that the number of common neighbours between the nodes must be less than 4; if this condition is fulfilled, the pair of nodes is automatically classified as a possible relationship. The third rule computes the Jaccard coefficient, to which 1/400 of the common-neighbours measure is added; if this value exceeds a threshold, the pair of nodes is classified as a possible relationship.[13]
The third-best approach (team #30) used a machine-learning approach. They also computed similarity measures between nodes, but instead of programming fixed rules, they used these values as features to train a logistic regression algorithm. The computed similarity metrics were the Jaccard coefficient, preferential attachment and resource allocation index.[14]
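Although Team #30 worked in R, the same idea can be sketched in Python (for consistency with the earlier sketches) using networkx similarity scores as features for scikit-learn's logistic regression; file and column names are illustrative:

```python
import networkx as nx
import pandas as pd
from sklearn.linear_model import LogisticRegression

G = nx.from_pandas_edgelist(pd.read_csv("edges.csv"), "author1", "author2")
train = pd.read_csv("train_pairs.csv")       # columns: author1, author2, label (P or N)

def features(u, v):
    ebunch = [(u, v)]
    # Each networkx scorer yields (u, v, score) triples for the requested pair.
    jaccard = next(nx.jaccard_coefficient(G, ebunch))[2]
    pref_attach = next(nx.preferential_attachment(G, ebunch))[2]
    res_alloc = next(nx.resource_allocation_index(G, ebunch))[2]
    return [jaccard, pref_attach, res_alloc]

X = [features(u, v) for u, v in zip(train["author1"], train["author2"])]
y = (train["label"] == "P").astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
```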
Regarding the programming languages used by the participants in the datathon, most of the teams chose Python to develop their solutions and one team developed its solution with R.
It is important to remember that these results were obtained in approximately 18 hr of work by students who had no prior experience in data science projects and who, in a single day, took three workshops to identify and apply the actions required to develop and implement a solution, allowing them to try out new methods.
CONCLUSION
The IIMAS-UNAM data science datathon 2020 was organized by data scientists and researchers from this institution. The event aimed at strengthening and disseminating data science among the university community and was therefore conceived as a training and development activity for future data scientists. The datathon brought together people working in industry as data scientists or programmers, as well as actuarial science, physics, mathematics and computing students. The planning and implementation of the datathon achieved all its objectives by integrating enthusiastic youth groups with a single academic purpose: the development of future data scientists.
Datathons have become a new way of creating predictive models, used by industry, scientists and now by universities to solve a wide variety of problems and to develop new strategies within a short period of time. The datathon format undertaken here, which focuses on analyzing a specific data set, is a good strategy for obtaining a deep data analysis and promotes learning based on problem-solving. The participants have the opportunity to put their knowledge into practice by working in a collaborative environment within a multidisciplinary team. This allows the combination of skills from young researchers, industry programmers and data scientists who might not otherwise have had the chance to work closely together, building solutions by integrating small contributions from participants with different backgrounds.
As the original dataset covered many years and was in XML format, the organizers prepared the training and testing datasets for the teams to avoid wasting time preparing data, so that the teams could focus on analysis. In this paper, we describe the data preparation performed for the co-authorship prediction problem, which is freely available in the repository of the event.[15] We explain the evaluation process and the metrics used to evaluate the models. The results of the developed models show that a reasonable solution can be built in a very short time. Therefore, the proposed methodology for the environment configuration, which covers the data collection, selection and preparation stages required for link prediction in a massive event, was successful. The results allowed us to test some models and show the feasibility of automatically identifying potential collaboration in a research network as a link prediction task.
From an educational point of view, we can conclude that the promotion of data science among students and young researchers and the knowledge transfer from experienced data scientists were important parts of the event, which was enabled by working with peers across domains.
Two main alternatives have been considered to lay the groundwork for future work. The first corresponds to the reuse of the data source and the preparation process of the training and testing samples explained throughout this paper, while widening the current span to cover a broader year range and adding other bibliographic sources such as IEEE, Web of Science, SCOPUS, etc. The second alternative is to increase the complexity of the datathon challenge by asking the competitors to design a prediction model that automatically learns the network's topological characteristics using various machine learning and deep learning models. Furthermore, a wide range of topological features can provide information regarding the emergent properties of a social network through their predictive importance.[4] Regarding the data source, keywords mentioned in the academic articles can be incorporated to predict collaborations based on the topics covered in the articles.
Cite this article:
Angeles MDP, Adorno HG, Hernández-Guevara SD, Corza-Vargas VM. Predicting Collaborations among Research Scientists: A Datathon Experience. J Scientometric Res. 2024;13(3s):s2-s10.
ACKNOWLEDGEMENT
This research was partially funded by DGAPA-UNAM through PAPIIT projects TA101722, IN104424 and IN100719.
ABBREVIATIONS
DGAPA | Dirección General de Asuntos del Personal Académico |
---|---|
UNAM | Universidad Nacional Autónoma de México |
PAPIIT | Programa de Apoyos de Investigación e Innovación Tecnológica |
References
- Flus M, Hurst A. Design at hackathons: new opportunities for design research. Des Sci. 2022:7 [CrossRef] | [Google Scholar]
- Piza FM, Celi LA, Deliberato RO, Bulgarelli L, de Carvalho FR, Filho RR, et al. Assessing team effectiveness and affective learning in a datathon. Int J Med Inform. 2018;112:40-4. [PubMed] | [CrossRef] | [Google Scholar]
- Lane C. Top universities in the world 2021. Top Universities [Retrieved 2021]. Available from: https://www.topuniversities.com/university-rankings-articles/world-university-rankings/top-universities-world-2021 [Google Scholar]
- Hasan MA, Zaki MJ. A survey of link prediction in social networks. Soc Netw Data Anal. 2011;9:243-75. [CrossRef] | [Google Scholar]
- Namata G, Getoor L. Link prediction. In: Encyclopedia of machine learning and data mining. 2017:753-8. https://doi.org/10.1007/978-1-4899-7687-1_486 [CrossRef] | [Google Scholar]
- Huang H, Tang J, Liu L, Luo J, Fu X. Triadic closure pattern analysis and prediction in social networks. IEEE Trans Knowl Data Eng. 2015;27(12):3374-89. [CrossRef] | [Google Scholar]
- Ley M. DBLP: some lessons learned. Proc VLDB Endow. 2009;2(2):1493-500. [CrossRef] | [Google Scholar]
- The Dblp Team: DBLP Computer Science Bibliography. Available from: https://dblp.org/
- Brownlee J. How to calculate precision, recall, and F-measure for imbalanced classification. Machine Learning Mastery. 2020 [retrieved Mar 19, 2021]. Available from: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
- Koo PS. Accuracy, precision, recall or F1? Towards Data Science. 2018. Available from: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9 [Google Scholar]
- Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016;5(4):221-32. [CrossRef] | [Google Scholar]
- Newman ME. Clustering and preferential attachment in growing networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2001;64(2 Pt 2):025102. [PubMed] | [CrossRef] | [Google Scholar]
- Salton G, McGill M. Introduction to modern information retrieval. 1983 [PubMed] | [CrossRef] | [Google Scholar]
- Zhou T, Lü L, Zhang YC. Predicting missing links via local information. Eur Phys J B. 2009;71(4):623-30. [CrossRef] | [Google Scholar]
- Angeles MdP, Gomez-Adorno H, Corza-Vargas V, Hernandez-Guevara S. Current state of the graph, training and testing datasets for the link prediction problem. 2020. Available from: https://github.com/pilarang/Dathaton