Wikipedia:Wikipedia Signpost/2023-12-04/In focus

Trappist the monk
In focus

Tens of thousands of freely available sources flagged

Over the weekend of 25 November, citation templates received some updates. One change in particular goes a long way toward flagging freely available resources. Here's a short history of what was needed for the most recent changes to fully pay off.

Step 1: access locks get rolled out

In October 2016, so-called "access locks" were deployed in CS1 and CS2 templates (see Signpost coverage). After a few RfCs on visual appearance, things settled on the current scheme of three lock icons, which indicate, respectively:

– a full version of a source that is freely accessible, with no conditions
– a full version of a source that is freely accessible, with some conditions (e.g. free registration is required, only the first 5 reads are free, etc.)
– a full version of a source that is not freely accessible (e.g. a paid subscription is required).

Step 2: bots get involved

Access locks for always-free resources, like papers hosted on arXiv or papers with PMCIDs, were rolled out automatically. But the main identifier for scientific articles is the DOI, which is only sometimes free, so a citation needs |doi-access=free to signal that a particular DOI link is free to read.

For those unfamiliar with DOIs, they are roughly the equivalent of what ISBNs are for books, and usually point to individual academic papers published in peer-reviewed journals. Their structure is 10.xxxxx/foobar, with the 10.xxxxx part being the DOI prefix, identifying who has registered the DOI in question. DOI registrants can be access platforms like JSTOR (10.2307), individual journals like Notre Dame Journal of Formal Logic (10.1305), or publishers like the IEEE (10.1109).
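To make the prefix/suffix structure concrete, here is a minimal Python sketch that splits a DOI at its first slash and looks up the registrant. The lookup table holds only the three registrants named above, and the function names `split_doi` and `registrant` are made up for illustration; this is not part of any citation template or bot.

```python
# Registrants mentioned in this article only; a real registry is far larger.
KNOWN_REGISTRANTS = {
    "10.2307": "JSTOR",
    "10.1305": "Notre Dame Journal of Formal Logic",
    "10.1109": "IEEE",
}


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.strip().partition("/")
    # A usable DOI needs a "10." prefix, a slash, and a non-empty suffix.
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def registrant(doi: str) -> str:
    """Return the registrant for a DOI, or 'unknown' if the prefix is not listed."""
    prefix, _suffix = split_doi(doi)
    return KNOWN_REGISTRANTS.get(prefix, "unknown")


if __name__ == "__main__":
    # The suffix below is a placeholder, not a real paper.
    print(split_doi("10.1109/foobar"))
    print(registrant("10.2307/foobar"))
```

Everything after the first slash is treated as the suffix, so suffixes that themselves contain slashes stay intact.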

While the initial roll-out of DOI access locks was done manually and semi-automatically with WP:AWB, OA Bot greatly assisted in flagging free-to-read resources on select articles. However, OA Bot tends to be user-activated on specific articles, rather than systematically crawling every article on Wikipedia.

One way to find swathes of free DOIs is to identify DOI prefixes belonging to known open-access publishers. For example, 10.3389 belongs to the (in)famous Frontiers Media, while 10.3390 belongs to the equally controversial MDPI. It's then a simple matter to have Citation bot flag them. It worked pretty well for the big publishers, so an effort was made to identify more open-access DOI prefixes, and the bot was updated accordingly.

Step 3: search and flag

Targeted Citation bot runs were done from database dumps, and rather efficiently to begin with. But while database scans are good at finding articles containing specific DOI prefixes, they are bad at finding articles containing unflagged DOIs with these prefixes. This means that if, hypothetically, 92% of all articles with MDPI DOIs were already flagged, 92% of the processing power spent on articles with MDPI prefixes would be wasted. As of writing, 12,151 articles contain MDPI prefixes, so well over 11,000 of them would be processed for nothing in order to catch the other ~1,000. And the next time, if 98% are flagged, the run will be even more inefficient.
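The arithmetic behind that hypothetical can be checked with a few lines of Python. The function name `scan_cost` is invented for this sketch; the 12,151 figure and the 92% and 98% rates are the ones quoted in the text.

```python
def scan_cost(total_articles: int, flagged_fraction: float) -> tuple[int, int]:
    """Return (articles processed for nothing, articles that still need flagging)."""
    already_flagged = round(total_articles * flagged_fraction)
    return already_flagged, total_articles - already_flagged


if __name__ == "__main__":
    for fraction in (0.92, 0.98):
        wasted, useful = scan_cost(12_151, fraction)
        print(f"{fraction:.0%} flagged: {wasted} wasted, {useful} useful")
    # 92% flagged: 11179 wasted, 972 useful
    # 98% flagged: 11908 wasted, 243 useful
```

The total number of articles scanned stays the same on every run, while the useful share keeps shrinking.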

Luckily, with the recent update to the CS1 and CS2 citation templates, we have a solution: Category:CS1 maint: unflagged free DOI. This category specifically tracks whether a citation has a) a known free DOI prefix and b) a DOI that has not been flagged as free. As of writing, a bit over 16,000 Wikipedia articles have been identified and processed. Here's an example edit: flagging 2 DOIs with prefix 10.3847, belonging to the American Astronomical Society. Here's another: flagging 4 DOIs with prefixes 10.1186, associated with BioMed Central journals, and 10.1073, associated with Proceedings of the National Academy of Sciences of the United States of America.
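The two-part test, and the edit that follows from it, can be sketched in Python as below. This is a simplified illustration, not the actual CS1 module or Citation bot code. It assumes citation templates contain no nested templates, it uses only the prefixes named in this article as its free list, and all names (`FREE_PREFIXES`, `needs_flag`, `flag_free_dois`) are made up for the example.

```python
import re

# Prefixes this article names as free to read; the real list is longer.
FREE_PREFIXES = {"10.3389", "10.3390", "10.3847", "10.1186", "10.1073"}

# Simplification: matches {{cite ...}} or {{citation ...}} without nested templates.
TEMPLATE_RE = re.compile(r"\{\{\s*(?:cite \w+|citation)\b[^{}]*\}\}", re.IGNORECASE)
# Matches "|doi=VALUE" but not "|doi-access=..." (the hyphen blocks the "=").
DOI_RE = re.compile(r"\|\s*doi\s*=\s*([^|}\s]+)")
ACCESS_RE = re.compile(r"\|\s*doi-access\s*=\s*free\b")


def needs_flag(template: str) -> bool:
    """True if the template has a) a known free DOI prefix and b) no free flag."""
    match = DOI_RE.search(template)
    if match is None:
        return False
    prefix = match.group(1).partition("/")[0]
    return prefix in FREE_PREFIXES and ACCESS_RE.search(template) is None


def flag_free_dois(wikitext: str) -> str:
    """Add |doi-access=free after the DOI of every citation that needs it."""

    def fix(match: re.Match) -> str:
        template = match.group(0)
        if not needs_flag(template):
            return template
        return DOI_RE.sub(lambda m: m.group(0) + " |doi-access=free", template, count=1)

    return TEMPLATE_RE.sub(fix, wikitext)


if __name__ == "__main__":
    # Placeholder citation; the DOI suffix is not a real paper.
    print(flag_free_dois("{{cite journal |title=Example |doi=10.3847/foobar}}"))
```

Running the function a second time on its own output changes nothing, which is the property that makes repeated bot runs over the category safe.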

The hope is to have the category mostly cleared by the end of December, after which it should contain only new additions. Those should be easily handled by daily bot runs.

Where to next?

About 2 to 3% of the 16,000 or so articles seem to have a free DOI that is unflagged in Wikidata; these are (mostly) the ones remaining in the category. Sadly, {{cite q}} makes it impossible to fix this from Wikipedia's side, and the same goes for the many other issues Citation bot is otherwise able to correct. Hopefully Wikidata people can look at the updates to the CS1 and CS2 templates, work out what is happening on their side, and update things accordingly.

It should be a relatively straightforward task for someone who understands how Wikidata works. That someone isn't me. But it could be you!