When each line of code is written, it is surrounded by a sea of context: who in the community it’s for, what problem we’re trying to solve, what technical assumptions we’re making, what we already tried but didn’t work, how much coffee we’ve had today. All of these have an effect on the software we write.
By the time the next person looks at that code, some of that context will have evaporated. There may be helpful code comments, tests, and specifications to explain how it should behave. But they don’t explain the path not taken, and why we didn’t take it. Or those occasions when the facts changed, so we changed our minds.
Some parts of our system are as old as Crossref itself. Whilst our process still involves coffee, it’s safe to say that most of our working assumptions have changed, and for good reasons! We have to be very careful when working with our oldest code. We always consider why it was written that way, and what might have changed since. We’re always on the lookout for Chesterton’s Fence!
Leaving a Trail
We’re building a new generation of systems at Crossref, and as we go we’re being deliberate about supporting the people who will maintain them.
When our oldest code was written, the software development team all worked in an office with a whiteboard or three, and the code was proprietary. Twenty years later, things are very different. The software development team is spread across eight time zones. Thanks to POSI, all the new code we write is open source, so the next people to read that code might not even be Crossref staff.
Working increasingly asynchronously, without that whiteboard, we need to record the options we consider, collect evidence, and peer-review our decisions within the team.
So for the past couple of years the software team has maintained a decision register. The first decision we recorded was that we should record decisions! Since then we have recorded significant decisions as they arise, plus some historical ones.
These aren’t functional specifications, which describe what the system should do; they record the decisions and trade-offs we made along the way to arrive at the how. Look out for another blog post about specifications.
By leaving a trail of explanations as we go, we make it easier for people to understand why code was written, and what has changed. We’re writing the story of our new systems. This makes it easier to alter the system in future in response to changes in our community, and the metadata they use.
Difficult Decisions
There are some fun challenges to building systems at Crossref. We have a lot of data. Our schema is very diverse, and has a vast amount of domain knowledge embedded in it. It’s changed over time to accommodate 20 years of scholarly publishing innovations. Our community is diverse too, from small one-person publishers with a handful of articles, through to large ones that publish millions.
What might be an obvious decision for a database table with a thousand rows doesn’t always translate to a million. When you get to a billion, things change again. An initially sensible choice might not scale. And a scalable solution might look over-engineered if we had millions of DOIs, rather than hundreds of millions.
The diversity of the data also poses challenges. A very simple feature might get complicated or expensive when it meets the heterogeneity of our metadata and membership. What might scale for journal article or grant metadata might not work for book chapters.
The big decisions need careful discussion, experimentation, and justification.
2NF or not 2NF
One such recent decision was how to structure the SQL schema for the database that powers our new ‘relationships’ REST API endpoint, currently in development.
The data model is simple: we have a table of Relationships, each of which connects a pair of Items, and each Item can have properties (such as a type). The way to model this is straightforward, following conventional normalization rules.
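To make that concrete, here’s a rough sketch of the normalized shape. The table and column names are illustrative, not our production DDL:

```sql
-- Illustrative sketch only: names and types are assumptions, not the real schema.

-- Every work, grant, dataset, etc. is an Item.
CREATE TABLE items (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    doi  TEXT NOT NULL UNIQUE
);

-- Properties, such as the item type, live in their own table.
CREATE TABLE item_properties (
    item_id  BIGINT NOT NULL REFERENCES items (id),
    name     TEXT   NOT NULL,  -- e.g. 'type'
    value    TEXT   NOT NULL,  -- e.g. 'journal-article', 'dataset'
    PRIMARY KEY (item_id, name)
);

-- A Relationship connects a pair of Items.
CREATE TABLE relationships (
    id             BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    subject_id     BIGINT NOT NULL REFERENCES items (id),
    relation_type  TEXT   NOT NULL,  -- e.g. 'references', 'is-supplement-to'
    object_id      BIGINT NOT NULL REFERENCES items (id)
);
```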
We built the API around it, and all was well.
We then added a feature which lets you look up relationships based on the properties of the subject or object: for example, “find citations where the subject is an article and the object is a dataset”. This design worked well in our initial testing. We loaded more data into it, and it continued to work well.
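Under the normalized model, answering that question means joining through the property table for both ends of each relationship. Roughly, and again only as an illustration against the sketch above:

```sql
-- Find citation relationships where the subject is a journal article
-- and the object is a dataset (illustrative query, not the real one).
SELECT r.id, s.doi AS subject_doi, o.doi AS object_doi
FROM relationships r
JOIN items s ON s.id = r.subject_id
JOIN items o ON o.id = r.object_id
JOIN item_properties sp ON sp.item_id = s.id
                       AND sp.name = 'type' AND sp.value = 'journal-article'
JOIN item_properties op ON op.item_id = o.id
                       AND op.name = 'type' AND op.value = 'dataset'
WHERE r.relation_type = 'references';
```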
And then the context changed. Once we tested loading a billion relationships into the database, performance dropped. The characteristics of the data (size, shape, and distribution) reached a point where the database could no longer run queries in a timely way. The PostgreSQL query planner became unpredictable and occasionally produced some quite exciting query plans (to non-technical readers: databases are neither the time nor the place for excitement).
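If you want to see what the planner is up to with a query like the one above, PostgreSQL will tell you: EXPLAIN shows the plan it intends to use, and EXPLAIN ANALYZE runs the query and reports what actually happened. An illustrative example against the sketch tables:

```sql
-- Run the query and report the plan the planner actually chose, with row
-- counts, timings, and buffer usage (illustrative, using the sketch above).
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM relationships r
JOIN item_properties sp ON sp.item_id = r.subject_id
                       AND sp.name = 'type' AND sp.value = 'journal-article'
JOIN item_properties op ON op.item_id = r.object_id
                       AND op.name = 'type' AND op.value = 'dataset'
WHERE r.relation_type = 'references';
```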
This is a normal experience in scaling up a system. We expected that something like this would happen at some point, but you don’t know when until you try. We bounced around some ideas and came up with a couple of alternatives. Each made trade-offs around processing time, data storage, and query flexibility. The best way to evaluate them was to use real data at a representative scale.
One of the options was denormalisation. This is a conventional solution to this kind of problem, but it was not our first choice, as it involves extra machinery to keep the data up to date, and more storage. It would not have been the correct solution for a smaller dataset. But we had the evidence that the other two approaches would not scale predictably.
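Sketching the idea, with illustrative names rather than the real schema: denormalising here means copying the properties we filter on, such as the item types, onto the relationship rows themselves, so the lookup no longer needs any joins.

```sql
-- Illustrative denormalised version: subject and object properties are copied
-- onto each relationship row. These names are assumptions, not the real schema.
CREATE TABLE relationships_denormalised (
    id             BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    subject_doi    TEXT NOT NULL,
    subject_type   TEXT NOT NULL,  -- e.g. 'journal-article'
    relation_type  TEXT NOT NULL,  -- e.g. 'references'
    object_doi     TEXT NOT NULL,
    object_type    TEXT NOT NULL   -- e.g. 'dataset'
);

-- The same lookup becomes a single-table query that indexes well.
SELECT id, subject_doi, object_doi
FROM relationships_denormalised
WHERE relation_type = 'references'
  AND subject_type  = 'journal-article'
  AND object_type   = 'dataset';
```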
By combining the data into one table, we can serve up API requests much more predictably, and with much better performance. That code is now running with the performance we need. Technical readers should note that this is a simplified picture; the real SQL schema is a little different.
Without writing this history down, and explaining what we tried, someone might misunderstand the reason for the code and try to simplify it. Decision record DR-0500 guards against that.
But one day, when the context changes, future developers will be able to come back and modify the code, because they will understand why it was like that in the first place.