Deduplication of article references
Article references must be deduplicated before screening. Duplicates arise for many reasons: journal publishing artifacts such as copyright notices or footers, reviews and corrections that share a title with the original article, vague or inconsistent metadata and document identifiers, technical encoding issues, or simple differences in case or abbreviation. Judging whether two records really describe the same article is not a simple task for a computer system.
SYRAS provides deduplication tools to assist you with this process. The deduper can be run after import, with the following matching options available.
– Exact Match – a very fast, strict option that produces few false positives. It’s recommended to run this one first to quickly remove the obvious duplicates and reduce the processing burden on the more complex scans.
– SRA Matcher – a third-party tool from Bond University’s “SRA” software. It performs a range of checks on title, author, ISBN and other fields, plus abstract analysis. It seems to be the best method available in SYRAS, but is slower on a large corpus.
The SRA Matcher inspects the DOI and ISBN, which allows faster detection but can produce false duplicates in collected works, where many articles may share the same ISBN. If your dataset suffers from this problem, you can fine-tune the SRA algorithm to ignore either or both of these identifiers. It will then fall back more often on deeper text searching of the title and/or abstract, which is slower on a larger corpus.
– Title Search Match – performs a “fuzzy” match on titles, as a search engine might (it does not compare abstracts, authors etc.). It can be somewhat “permissive”, however, and lead to false positives.
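The difference between a strict match and a fuzzy title match can be illustrated with a minimal Python sketch. This is not SYRAS's or SRA's actual code: the record fields (title, year), the normalisation rules and the use of difflib are all assumptions made for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def exact_match(a: dict, b: dict) -> bool:
    """Strict rule (hypothetical): normalised title and year must be identical."""
    return (normalise(a["title"]), a.get("year")) == (
        normalise(b["title"]),
        b.get("year"),
    )


def title_similarity(a: dict, b: dict) -> float:
    """Fuzzy rule (hypothetical): similarity ratio of normalised titles, 0.0 to 1.0."""
    return SequenceMatcher(None, normalise(a["title"]), normalise(b["title"])).ratio()


original = {"title": "Deep Learning for Screening: A Review", "year": 2020}
reimport = {"title": "deep learning for screening - a review.", "year": 2020}
correction = {"title": "Deep Learning for Screening: A Review (Correction)", "year": 2021}

print(exact_match(original, reimport))       # case and punctuation differences are ignored
print(exact_match(original, correction))     # the strict rule rejects the correction
print(title_similarity(original, correction))  # but a fuzzy title score is still high
```

The last line shows why a title-only fuzzy match can be permissive: a correction notice scores close to the original article even though it is a different record.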
SYRAS can perform the following actions after duplicates are detected:
– Delete – instantly delete the duplicate candidates.
– Review – places them in a “queue” for you to review and process.
If you have imported new article references but not yet run the deduper, you will see a prompt. If you have any pending duplicates not yet resolved you will also notice a message pinned to the top of the project page.
When you run the deduplication scan, it continues in the background because it can take a long time: as the number of references in a corpus grows, the number of pairs that need to be cross-checked grows much faster. You can get on with other tasks, such as setting up the team, but the reference import tools are locked while the scan runs. A progress bar shows an estimate of how long the scan might take and how many duplicates have been found so far.
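To give a feel for that growth, the short sketch below counts the pairs in a corpus, assuming a naive scan that compares every reference with every other one. That assumption is ours: SYRAS may well use indexing or other shortcuts that avoid most of these comparisons.

```python
def pair_count(n: int) -> int:
    """Number of distinct pairs among n references: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} references -> {pair_count(n):>13,} pairs")
```

Under this assumption, ten times as many references means roughly a hundred times as many pairs, which is why running the fast Exact Match first helps the slower scans.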
After running the dedupe you are taken to the duplicate review pages, which list the articles that have duplicates. Often these are just pairs (one duplicate), but sometimes you will see larger groups where more duplicate relations were found. SYRAS sets up a proposed action, KEEP or DELETE, for each reference. You may modify these actions after reviewing each pair.
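One way to picture how pairs turn into larger groups is the sketch below, which merges pairwise matches with a small union-find and then proposes an action per reference. The function names are hypothetical, and the rule "keep the lowest reference ID, delete the rest" is purely an assumption for illustration; it is not a description of how SYRAS chooses its proposals.

```python
from collections import defaultdict


def group_duplicates(pairs):
    """Merge duplicate pairs such as (1, 2) and (2, 3) into groups like [1, 2, 3]."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller so the lowest ID stays the root.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for ref in list(parent):
        groups[find(ref)].append(ref)
    return sorted(sorted(members) for members in groups.values())


def propose_actions(group):
    """Assumed rule: keep the first (lowest) ID and mark the others for deletion."""
    keep, *rest = group
    return {keep: "KEEP", **{ref: "DELETE" for ref in rest}}


for group in group_duplicates([(1, 2), (2, 3), (7, 8)]):
    print(group, propose_actions(group))
```

Here references 1, 2 and 3 end up in one group even though 1 and 3 were never matched directly, which is the kind of larger group the review pages can show.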
Visual Difference Tool – you may use this tool to compare the title, abstract and authors of the articles in a group.
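As a rough idea of what a visual difference shows, the sketch below marks word-level changes between two titles using Python's difflib. The markers ([-removed-] and {+added+}) and the function name are assumptions for illustration and do not reflect how the SYRAS tool renders differences.

```python
from difflib import SequenceMatcher


def word_diff(old: str, new: str) -> str:
    """Return one line with removed words as [-...-] and added words as {+...+}."""
    a, b = old.split(), new.split()
    out = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 != i2:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 != j2:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("Effects of aspirin on stroke", "Effects of aspirin in stroke"))
# prints: Effects of aspirin [-on-] {+in+} stroke
```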
Deduplication is a complex topic and can always be improved, so any feedback or suggestions are welcome. Example datasets containing duplicates that SYRAS did not handle properly are especially useful for improving the algorithms.