Title: A dataset used to determine a semantic similarity metric based on UMLS for PMC-OA

Type: Dataset

Garcia Castro, Leyla Jael, Berlanga, Rafael, Garcia, Alexander (2014): A dataset used to determine a semantic similarity metric based on UMLS for PMC-OA. Zenodo. Dataset. https://zenodo.org/record/13323

Authors: Garcia Castro, Leyla Jael (Universitat Jaume I); Berlanga, Rafael (Universitat Jaume I); Garcia, Alexander (Linking Data LLC)

Summary

We performed a series of in silico experiments to determine a semantic similarity metric based on UMLS annotations for PubMed Central Open Access (PMC-OA). The data used for and obtained from those experiments are stored here. We worked with relevant and partially relevant articles from the TREC-2005 Genomics Track Collection, hereafter referred to as the initial collection, comprising a total of 4240 unique PubMed articles. Of those 4240 articles, only 62 had full text publicly available; those 62 articles form the full-text collection.

Our data comprise flat files using tabs as separators and one Excel sheet. Tab-separated files always include a first row with headings:

Stems extracted from title and abstract for articles in the initial collection. Each row contains a stem with its inverse document frequency (IDF) within the initial collection. Stems were calculated following the Porter algorithm (available at http://tartarus.org/martin/PorterStemmer/java.txt).
  stems.TA.tsv

Article profiles, i.e., terms (either word stems or UMLS concepts) found in the articles with term frequency (TF) and IDF. The first two columns correspond to the PubMed identifier (PMID) and the PubMed Central identifier (PMC). The PMC identifier was set to 0 whenever full text was not available.
  profiles.TA.tsv: Profiles according to word stems in title and abstract for the initial collection
  profiles.PMID.tsv: Profiles according to UMLS concepts in title and abstract for the initial collection
  profiles.PMC_TA.tsv: Profiles according to UMLS concepts in title and abstract for the full-text collection
  profiles.PMC.tsv: Profiles according to UMLS concepts in the full text for the full-text collection

Similarity matrixes calculated on the article profiles with the PubMed Related Article metric (PMRA), BM25, and Cosine. There are matrixes for terms found in title-and-abstract as well as in full text. In a similarity matrix, a reference article (one for which interest has already been expressed) corresponds to a row, while the columns correspond to all the other articles for which the similarity was calculated.

Matrixes for our initial collection
  similarity.PMRA.TA.profiles.TA.tsv: Similarity matrix for profiles.TA.tsv following the algorithm PMRA. This matrix is considered the baseline for further analyses
  similarity.PMRA.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm PMRA
  similarity.BM25_1.2_0.75.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm BM25 with k=1.2 and b=0.75
  similarity.COSINE.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm Cosine

Matrixes for our full-text collection
  similarity.PMRA.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm PMRA
  similarity.PMRA.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm PMRA
  similarity.BM25.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm BM25 with k=1.2 and b=0.75
  similarity.BM25.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm BM25 with k=1.2 and b=0.75
  similarity.COSINE.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm Cosine
  similarity.COSINE.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm Cosine

Correlation matrixes for similarities calculated for title-and-abstract, taking as reference the similarity values obtained with PMRA for word stems on title-and-abstract.
  pearsonCorrelation.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv
  pearsonCorrelationTopic.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv broken down by TREC topic
  pearsonCorrelation.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv
  pearsonCorrelationTopic.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv broken down by TREC topic
  pearsonCorrelation.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv
  pearsonCorrelationTopic.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv broken down by TREC topic

Precision and recall summaries for the similarities calculated based on title-and-abstract.
  StatsAllSummary.xlsx: Precision and recall at a global level, i.e., without considering TREC topics. This file includes information for BM25 with multiple values for constants k and b
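The file descriptions above mention TF/IDF article profiles, Cosine and BM25 similarity (k=1.2, b=0.75), and Pearson correlation between similarity values. The following is a minimal sketch of how such quantities are commonly computed, not the authors' implementation. The in-memory `profiles` dictionary, the article identifiers, and all function names are hypothetical; the actual column layout of the profiles.*.tsv files is not assumed. Using the terms of the reference article as the BM25 query is likewise an assumption made for illustration.

```python
import math

# Hypothetical article profiles: {article_id: {term: (tf, idf)}}.
# These values are made up; they are not taken from the dataset.
profiles = {
    "A": {"gene": (3, 1.2), "protein": (1, 0.8)},
    "B": {"gene": (1, 1.2), "cell": (2, 1.5)},
    "C": {"cell": (4, 1.5)},
}

def cosine(p, q):
    """Cosine similarity between two profiles using TF*IDF term weights."""
    wp = {t: tf * idf for t, (tf, idf) in p.items()}
    wq = {t: tf * idf for t, (tf, idf) in q.items()}
    dot = sum(w * wq[t] for t, w in wp.items() if t in wq)
    norm_p = math.sqrt(sum(w * w for w in wp.values()))
    norm_q = math.sqrt(sum(w * w for w in wq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def bm25(ref, other, avg_len, k=1.2, b=0.75):
    """Okapi BM25 score of `other`, with the terms of `ref` as the query.

    Treating the reference article as the query is an assumption.
    """
    length = sum(tf for tf, _ in other.values())  # document length in term counts
    score = 0.0
    for term in ref:
        if term in other:
            tf, idf = other[term]
            score += idf * tf * (k + 1) / (tf + k * (1 - b + b * length / avg_len))
    return score

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# One row of a similarity matrix: reference article "A" against all others.
avg_len = sum(sum(tf for tf, _ in p.values()) for p in profiles.values()) / len(profiles)
others = [i for i in sorted(profiles) if i != "A"]
cosine_row = [cosine(profiles["A"], profiles[i]) for i in others]
bm25_row = [bm25(profiles["A"], profiles[i], avg_len) for i in others]
```

Comparing two such rows with `pearson(cosine_row, bm25_row)` mirrors, in spirit, how one similarity matrix can be correlated against a baseline matrix.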

Visualizations of the correlation matrixes, as well as scatter plots for full-text based similarity, are available at http://ljgarcia.github.io/semsim.benchmark

More information

DOI: 10.5281/zenodo.13323

Subjects

Semantic similarity, semantic annotation, scientific literature, similarity metrics

Dates

Publication date: 2014
Issued: December 19, 2014

Rights

https://opensource.org/licenses/Apache-2.0 Apache License 2.0
info:eu-repo/semantics/openAccess Open Access

Format

electronic resource

Related items

Description	Item type	Relationship	Uri
		IsPartOf	https://zenodo.org/communities/zenodo
