Use of Research Organizations Registry (ROR) identifiers in author academic profiles: the case of Google Scholar Profiles

Persistent identifiers for research organizations reduce affiliation ambiguities, enable accurate institutional analyses, and favor the design of modern online scholarly databases suited for research discovery and research evaluation. However, few studies have attempted to quantify their degree of use. The purpose of this work is to determine the use of Research Organization Registry (ROR) IDs in author academic profiles, specifically in Google Scholar Profiles (GSP). To do this, all the Google Scholar profiles including the term ROR in any of the public descriptive fields were collected and analyzed. The results show a low use of ROR IDs (1,033 profiles), mainly from a few institutions (e.g., Pontificia Universidad Javeriana in Colombia and Escuela Superior Politécnica del Litoral in Ecuador hold 55.7% of all profiles), from authors with low citation-based impact (45.1% of profiles have fewer than 100 citations each), belonging mainly to Social Sciences (26.3%), Engineering fields (25.3%), and Natural Sciences (22.2%). Although Google Scholar does not facilitate the inclusion of identifiers, it seems that the world’s leading research institutions are not yet recommending that their researchers include these identifiers in their profiles.


Introduction
ROR (Research Organization Registry) 1 is a community-led project launched in 2019 that attempts to create standard identifiers for research organizations (Meadows, 2019).
Its origins date back to a series of meetings held between 2016 and 2018 by a large number of organizations (17), in which the need for a solution providing "resolvable, persistent and unique identifiers for organizations involved in research that can be used to describe researcher affiliations" was raised (Demeranville et al., 2016). Both a top-down process (a working group defining governance recommendations and product principles) 2 and a bottom-up process (a Request for Information) 3 were initiated to define a proposal. Finally, the California Digital Library, Crossref, DataCite, and Digital Science formed a new steering group to implement the proposal, with a donation of seed data from Digital Science's proprietary GRID database.
ROR was initially fed from the GRID database. Since the public retirement of GRID in September 2021, ROR has been maintained independently as a leading "open, stakeholder-governed infrastructure for research organization identifiers and their associated metadata", 4 covering more than 102,815 records as of 28 August 2022.
Like other research organization identifiers (e.g., Crossref Funder ID, GRID, ISNI, Ringgold Org ID, Wikidata), ROR helps reduce affiliation ambiguities while allowing research organizations worldwide to be counted and characterized. Moreover, the fact that ROR IDs are also embedded in URLs 5 eases online navigation between research objects (e.g., journal articles, bibliographic records) and organizations' metadata, and enables webometric studies and link-based analyses.
A future full and open integration of identifiers for publications (e.g., DOIs and Handles), authors (e.g., ORCIDs) and organizations (e.g., RORs) would also allow the design of modern hybrid scholarly databases, academic platforms and value-added online services, which would favor both research discoverability and research evaluation exercises.
Studies of the degree of use of ROR IDs are needed to understand how this standard is being incorporated into the ecosystem of science. Specifically, discovering the main objects incorporating ROR IDs (e.g., publications, authors, publishers, and institutions) and testing new ROR-based indicators would facilitate a better understanding of the economic, technical, and academic benefits of this new standard, and would help calibrate the feasibility of designing new academic products.

However, link-based analyses of ROR IDs show several methodological limitations. First, the ROR ID must be embedded in a URL, which prevents the identification of raw ROR IDs in texts. Second, link analysis is constrained by the features of the bots used to collect the linking webpages. Commercial link analysis tools leave out webpages with specific bot exclusion policies or web accessibility problems, as well as social media platforms (e.g., tweets, Facebook posts), which are generally closed to external crawlers when it comes to indexing internal pages (i.e., each profile). This limitation includes links and mentions from author profiles such as ResearchGate or Google Scholar Profiles (also referred to as Google Scholar Citations).
This study aims to fill this gap by analyzing the use of ROR IDs in author academic profiles via mention-based rather than link-based identification. To do this, the following research questions are set:
• RQ1. How many authors include ROR IDs in their academic profiles?
• RQ2. How are authors including their ROR IDs in their academic profiles?
• RQ3. Do highly cited authors include ROR IDs in their academic profiles?
• RQ4. Which research organizations do the authors who include ROR IDs in their academic profiles belong to?
• RQ5. Which research fields do the authors who include ROR IDs in their academic profiles belong to?
To answer these research questions, Google Scholar Profiles will be used as an author profile case study.

Research background
Google Scholar Profiles (GSP) is a free author academic profile service created by Google in 2011 (Jacsó, 2012). It allows users to collect their publications among the bibliographic data already indexed in Google Scholar, to edit these records (e.g., merging different versions, fixing bibliographic data errors) and to add basic personal information (name, picture, affiliation, topics of interest, coauthors). The profile automatically computes basic author-level metrics (h-index, i10, citations).
Due to its huge coverage and update speed (the same as Google Scholar), GSP offers advantages for meta-research and science studies. However, the interactive features offered by this author academic profile are basic, and only a few updates have been carried out since 2011. At the private level, GSP has introduced over the years automatic profile recommendations, improved the button to follow colleagues' profiles (October 2017), 6 enhanced options for article recommendations (e.g., saving articles) (February 2021), 7 and added automatic suggestions to fix bibliographic errors. At the public level, GSP updated the visual interface in August 2014, 8 included citation histograms over the years (September 2014), 9 disclosed information about public access to publications and enabled a procedure for uploading documents via Google Drive (March 2021), 10 and, more recently, added the possibility of including alternative author names (August 2022).
Among the public-level updates, GSP introduced an affiliation link in August 2015. While this feature made it easier to identify authors' affiliations by enabling a hyperlink and creating institutional lists of authors, several errors and shortcomings were pointed out (Orduña-Malea et al., 2017). The inclusion of ROR IDs in GSP profiles would enhance the affiliation information offered, disambiguate institutions, and fix errors.
At present, that possibility is not integrated in the profile, but authors can include the ROR ID within the description field text box. This study is intended to shed light on how and to what extent ROR IDs are being used in GSP.

Method
To accomplish the objectives of this study, the GSP search feature was used directly to collect all author profiles including the text string 'ROR'. All the profiles were extracted by 19 August 2021 via web scraping and exported to a spreadsheet for further analysis. For each profile, the name (and position description, when available), verified email domain, description, areas of interest (i.e., keywords added by the author) and total number of citations received were gathered.
GSP search does not allow searching in specific fields of each profile (e.g., name, description, verified address), except for author keywords (e.g., label=example). Therefore, a data cleansing process was carried out to filter out false positives.
Only 16 profiles were found to be false positives, a low value. In some cases 'ROR' appeared in the author's name (e.g., Shivani Ror), while in others the initials of the name matched the text string (e.g., Rafael Olivera Rondón Muniz). In the latter cases, Google Scholar wrongly creates a false author name (i.e., ROR Muniz), which contains the text searched. Curiously enough, 'ROR Muniz' (the same happens in other similar cases) does not appear in the author information, but in the publications' bibliographic records included in the profile (Figure 1).
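The cleansing step described above was carried out manually. As an illustration only, a pre-screening pass could separate profiles in which 'ROR' appears as a standalone token in the description field from those that need manual review (for example, because the match comes only from the author's name or from bibliographic records). This is a minimal sketch: the dictionary keys `name` and `description` and both function names are hypothetical and do not come from the study.

```python
import re

# 'ROR' as a standalone token, in any letter case.
ROR_TOKEN = re.compile(r"\bROR\b", re.IGNORECASE)


def mentions_ror_in_description(profile: dict) -> bool:
    """Return True when the description field contains 'ROR' as a token."""
    return bool(ROR_TOKEN.search(profile.get("description", "")))


def split_for_review(profiles: list) -> tuple:
    """Split profiles into (likely true positives, profiles to check by hand)."""
    keep, review = [], []
    for profile in profiles:
        if mentions_ror_in_description(profile):
            keep.append(profile)
        else:
            # The search hit came from some other field (e.g., the name),
            # so a human should decide whether it is a false positive.
            review.append(profile)
    return keep, review
```

A pass like this only narrows the manual work; a profile placed in the review list is not necessarily a false positive.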
Once the dataset was cleansed, the gender was manually determined for each author profile by using the Genderize app 11 . This tool infers the gender (female/male/unknown) from the author's first name (and, optionally, the country). Unfortunately, as the information provided by GSP is not accurate enough, only a binary categorization (female/male) was adopted through the author's name.
In case of ambiguity, the author picture, when available, was checked (e.g., F. Xavier).
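The name-based inference with a manual fallback can be sketched as follows. The sketch assumes the public Genderize.io endpoint, its `name` and `country_id` query parameters, and a JSON response with `gender` and `probability` keys. The 0.9 probability threshold and the function names are illustrative assumptions, not values or procedures reported in the study.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the Genderize.io service.
GENDERIZE_ENDPOINT = "https://api.genderize.io"


def build_query(first_name, country_id=None):
    """Build the request URL for a first name and an optional country code."""
    params = {"name": first_name}
    if country_id:
        params["country_id"] = country_id
    return f"{GENDERIZE_ENDPOINT}?{urlencode(params)}"


def resolve_gender(payload, min_probability=0.9):
    """Map a response payload to 'female', 'male' or 'check manually'.

    min_probability is a hypothetical cut-off below which the profile
    picture would be inspected instead.
    """
    gender = payload.get("gender")
    probability = payload.get("probability") or 0.0
    if gender in ("female", "male") and probability >= min_probability:
        return gender
    return "check manually"
```

Names given only as initials (such as the F. Xavier case mentioned above) would typically fall into the manual branch.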
Areas of interest (i.e., keywords) included by each author were subsequently used to determine the authors' disciplines. To do this, the following predefined classification was adopted: Arts [...]. When no keywords were included in the profile, the department information, the description field and the publications included were used to determine the discipline.
Finally, each profile was manually examined to identify the ROR ID included and the procedure used by the author to embed it. The following categories were established:
• Raw ROR ID: the ID is included without URL format (e.g., [...]).
• Incomplete: although the ROR acronym has been added to the profile, the ID is not included (e.g., grid.41312.35 / ROR).
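The manual categorization can be approximated with pattern matching. The sketch below is not the procedure used in the study. It assumes that a ROR ID consists of a leading 0, six Crockford base32 characters and two check digits, and it also covers the "full link" and "domain name" modes reported in the results. Their exact definitions are assumed here: an ID preceded by https://ror.org/ and by ror.org/ respectively. The ID used in the usage note is only a sample value with the right shape.

```python
import re

# Assumed shape of a ROR ID: 0 + six base32 characters + two check digits.
ROR_ID = r"0[a-hj-km-np-tv-z0-9]{6}[0-9]{2}"

FULL_LINK = re.compile(rf"https?://ror\.org/{ROR_ID}\b", re.IGNORECASE)
DOMAIN_NAME = re.compile(rf"\bror\.org/{ROR_ID}\b", re.IGNORECASE)
RAW_ID = re.compile(rf"\b{ROR_ID}\b", re.IGNORECASE)
ACRONYM = re.compile(r"\bROR\b", re.IGNORECASE)


def classify_ror_mention(text: str) -> str:
    """Classify how a ROR ID is embedded in a profile text field."""
    if FULL_LINK.search(text):
        return "full link"
    if DOMAIN_NAME.search(text):
        return "domain name"
    if ACRONYM.search(text) and RAW_ID.search(text):
        return "raw ROR ID"
    if ACRONYM.search(text):
        # The acronym is present but no ID-shaped token was found.
        return "incomplete"
    return "no mention"
```

For instance, `classify_ror_mention("grid.41312.35 / ROR")` returns `"incomplete"`, while `classify_ror_mention("ROR: 03etyjw28")` returns `"raw ROR ID"`. The check order matters: URL forms are tested first because they also contain an ID-shaped token.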

Authors
1,033 profiles in GSP include a ROR ID. One profile is invented (Nombre Apellido1-Apellido2) and another one is institutional (FCNM ESPOL). Considering the remaining 1,031 author profiles, 58.6% are male authors and 41.4% are female authors.
Most authors (83.7%) have included the ROR ID as a raw code in the description field, while the full link (11.5%) and the domain name (1.0%) are minority options. The number of profiles (38) that include the acronym but omit the ROR ID stands out (Figure 2). Finally, one profile was not public at the time of the analysis.

Research organizations
Most of the profiles included an institutional affiliation via the verified email domain (1017; 98.6%). A total of 144 research organizations have been identified. The Pontificia Universidad Javeriana (javeriana.edu.co) in Colombia is the institution with the most public profiles in GSP including a ROR ID (400 profiles), followed by the Escuela Superior Politécnica del Litoral-ESPOL (espol.edu.ec) in Ecuador with 167 profiles. However, the number of citations attained by the profiles of these institutions is limited (Table 1). The third research institution is the Consejo Superior de Investigaciones Científicas (csic.es) in Spain, with fewer profiles (37) but a higher median of citations received (2,288). To reflect the prestige of profiles by institution, the maximum citation value is also provided.
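The per-institution figures reported here (number of profiles, median citations and maximum citations) can be derived from a list of (verified email domain, citations) pairs. The following is a minimal sketch rather than the authors' procedure; the function name and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_by_domain(profiles):
    """Return (domain, n_profiles, median_citations, max_citations) rows.

    `profiles` is an iterable of (verified_email_domain, citations) pairs.
    Rows are sorted by number of profiles, largest first.
    """
    grouped = defaultdict(list)
    for domain, citations in profiles:
        grouped[domain].append(citations)
    rows = [
        (domain, len(values), median(values), max(values))
        for domain, values in grouped.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The median is less sensitive than the mean to a few highly cited profiles, which is why it is paired with the maximum to give a sense of the top of each distribution.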
The results offered by Table 1 [...]

Research organizations type
Out of the 144 research organizations identified, the institution type was determined for 140 (Table 3).
Universities stand out (75.7% of all institutions), followed at a distance by research institutes (9) and government bodies (6). 52 Spanish universities have at least one profile including a ROR ID.

[...] was expected, the low percentage of Health and Medicine related profiles (12.7%) is remarkable (Figure 5).
Natural Sciences show the highest median number of citations received (500), followed at some distance by Applied Sciences II-Health and Medicine (140). In contrast, both Applied Sciences I-Engineering (96) and Social Sciences (108) exhibit lower median citation values, as expected given the general citation patterns of each discipline.

Discussion
This study has analyzed the presence of ROR IDs in Google Scholar Profiles, identifying the number of public profiles including this identifier and the way in which it is included. In addition, some characteristics of the authors including ROR IDs (gender, citation-based impact, organization type, country, research field) have been pointed out.
Considering the GSP population, currently estimated at around 4 million profiles, 12 the inclusion of ROR IDs in Google Scholar profiles can be considered low. Moreover, these profiles are heavily concentrated in two institutions. A plausible explanation is that the inclusion responds to specific guidelines provided by ad hoc courses on research promotion (e.g., guides from the Pontificia Universidad Javeriana) 13 or to specific research policies. Two facts reinforce the hypothesis that authors were trained to create a digital research identity that includes the ROR identifier: 45.1% of authors have fewer than 100 citations (early-career researchers), and 85.8% of the 1,031 profiles including a ROR ID also include the author's ORCID ID in the name, description, or website fields (a specific and advanced self-promotion action).
An unbalanced gender distribution has been found. A lower presence of women in GSP was already pointed out by Tsou et al. (2016). This might indicate that male scholars are more interested in maintaining their profiles than female scholars. These results reinforce the need for universities to carry out gender studies in their research promotion activities, to better understand the different uses of author profiles from a gender perspective. In this case, and considering the lower presence of women, similar patterns of ROR ID inclusion have been found. However, the inclusion of ROR IDs as a raw code and as a full link (the most frequently used modes in the sample) is significantly higher for men.
The absence of North American, Northern European, Australian, or Asian institutions is quite remarkable. Spain, Colombia, and Ecuador provide 88.7% of all the profiles. These results contrast with previous findings pointing to a bias towards Anglo-Saxon countries when measuring webpages linking to URL-based ROR IDs (Orduña-Malea & Bautista-Puig, 2022). The geographic origin of links to ROR IDs thus appears unrelated to the geographic presence of authors including ROR IDs in their Google Scholar Profiles. Arguably, the inclusion of ROR IDs in Google Scholar Profiles is not being promoted in those countries so far, even though the scholarly databases based there have started to implement ROR IDs in their bibliographic records.
The fact that Google Scholar Profiles do not provide specific fields for identifiers, whether ORCIDs or RORs, might explain the low penetration level, as authors must voluntarily include the identifier in other descriptive fields. This is counterproductive, as authors introduce the identifiers in fields not designed for that purpose, limiting their properties and utility. We also observe this effect with ORCIDs, which are mainly included in the name field. This is not the best option because the ORCID ID is embedded as a plain text string within the author's name, which is inaccurate and might make the author harder to find, as Google does not interpret the ORCID ID as a code but as part of the author's name. The use of the website field constitutes a better option, as an active link is provided and the reader can click through to the author's ORCID profile.
Finally, the high presence of Social Science researchers is also quite remarkable, as the distribution of fields observed in Figure 4 does not correspond to the publication patterns of scientific results or the size of research communities. Again, the effect of specific training from and to this community might explain the results obtained.

Limitations
The outcomes of this study are subject to several limitations that should be pointed out to discuss and contextualize the results properly.
First, the search method is limited to finding the string 'ROR' in the public descriptive fields of each profile. This strategy prevents the discovery of ROR IDs embedded directly without the ROR acronym. In any case, the authors believe that the strategy used collected most of the existing profiles and that the number of omitted profiles is negligible.
Second, the data rely on the public information provided by the authors in their profiles. However, this information can be incomplete, erroneous or even fake. For example, the total number of citations can be misleading if the profile is not curated (e.g., if it includes publications not written by the profile owner). While profiles have not been examined to check their accuracy, results have been displayed via aggregated indicators (mean, median), which minimizes the effects on the results.
Third, affiliation data has been collected from the verified email provided by each author. The main advantage of this procedure is that domain names are unique, which allows institutions to be disambiguated. However, it limits each author to one institution, arguably the main one. In addition, public profiles can be published without a verified email. In those cases, no institutional data was collected. However, the number of profiles with no verified email in the sample was non-significant (13; 1.7%).
Fourth, research fields were identified through the areas of interest, that is, up to five keywords provided by the authors. However, authors can include typos, inaccurate or misleading terms, or terms related to their interests rather than to their publications. In some cases, authors do not include keywords (22 profiles in the sample do not show areas of interest). In other cases, the department/school mentioned by the author, if any, was misaligned with the keywords included, although research and teaching are not necessarily related in higher education institutions. Additionally, a very generic thematic classification was employed as a pragmatic solution (e.g., the line between engineering and natural sciences is sometimes very thin). For all these reasons, the disciplinary classification should be read with a tolerable margin of error in mind.
Finally, the low use of ROR IDs found in this study is consistent with the results previously shown by Orduña-Malea & Bautista-Puig (2022), who found that few scholarly databases embedded links to ROR IDs. The authors estimate that data quality differences between ROR and GRID records could limit ROR implementation and, consequently, its popularity among scholars and scientists.

Conclusions
This study has revealed a low degree of inclusion of ROR IDs in Google Scholar Profiles. The use of ROR IDs is limited to a few institutions, probably because of specific research advisory or training activities. Authors from the world's leading universities and research institutions are not yet including ROR IDs in their profiles.
The Google Scholar team is encouraged to provide specific descriptive fields for identifiers (mainly ORCID and ROR) to promote the use of these standards, which might improve the users' browsing experience and support more accurate research studies.
Future studies should analyze the degree of use of ROR IDs by monitoring their inclusion in other author profiles and social media platforms, as well as their web connectivity with ORCIDs and DOIs, which might lead meta-researchers to a new generation of web-based science studies.