This is a limited proof of concept to search for research data, not a production system.

Title: A dataset used to determine a semantic similarity metric based on UMLS for PMC-OA

Type: Dataset

Citation: Garcia Castro, Leyla Jael; Berlanga, Rafael; Garcia, Alexander (2014): A dataset used to determine a semantic similarity metric based on UMLS for PMC-OA. Zenodo. Dataset. https://zenodo.org/record/13323

Authors: Garcia Castro, Leyla Jael (Universitat Jaume I); Berlanga, Rafael (Universitat Jaume I); Garcia, Alexander (Linking Data LLC)

Summary

We performed a series of in-silico experiments to determine a semantic similarity metric based on UMLS annotations for PubMed Central Open Access. This record stores the data used for and obtained from those experiments. We worked with relevant and partially relevant articles from the TREC-2005 Genomics Track Collection, hereafter referred to as the initial collection, comprising a total of 4240 unique PubMed articles. Of those 4240 articles, only 62 had full text publicly available; those 62 articles form the full-text collection.

Our data comprise tab-separated flat files and one Excel sheet. Tab-separated files always include a first row with headings.

  • Stems extracted from title and abstract for articles in the initial collection. Each row contains a stem with its inverse document frequency (IDF) within the initial collection. Stems were calculated following the Porter algorithm (available at http://tartarus.org/martin/PorterStemmer/java.txt).
      • stems.TA.tsv
  • Article profiles, i.e., terms (either word stems or UMLS concepts) found in the articles with term frequency (TF) and IDF. The first two columns correspond to the PubMed Identifier (PMID) and the PubMed Central identifier (PMC). The PMC identifier was set to 0 whenever full text was not available.
      • profiles.TA.tsv: Profiles according to word stems in title and abstract for the initial collection
      • profiles.PMID.tsv: Profiles according to UMLS concepts in title and abstract for the initial collection
      • profiles.PMC_TA.tsv: Profiles according to UMLS concepts in title and abstract for the full-text collection
      • profiles.PMC.tsv: Profiles according to UMLS concepts in the full text for the full-text collection
  • Similarity matrices calculated on the article profiles with the PubMed Related Article metric (PMRA), BM25, and Cosine. There are matrices for terms found in title-and-abstract as well as in full text. In a similarity matrix, a reference article (one for which interest has already been expressed) corresponds to a row, while the columns correspond to all the other articles for which the similarity was calculated.
      • Matrices for our initial collection
          • similarity.PMRA.TA.profiles.TA.tsv: Similarity matrix for profiles.TA.tsv following the algorithm PMRA. This matrix is considered the baseline for further analyses
          • similarity.PMRA.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm PMRA
          • similarity.BM25_1.2_0.75.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm BM25 with k=1.2 and b=0.75
          • similarity.COSINE.profiles.PMID.tsv: Similarity matrix for profiles.PMID.tsv following the algorithm Cosine
      • Matrices for our full-text collection
          • similarity.PMRA.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm PMRA
          • similarity.PMRA.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm PMRA
          • similarity.BM25.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm BM25 with k=1.2 and b=0.75
          • similarity.BM25.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm BM25 with k=1.2 and b=0.75
          • similarity.COSINE.profiles.PMC_TA.tsv: Similarity matrix for profiles.PMC_TA.tsv following the algorithm Cosine
          • similarity.COSINE.profiles.PMC.tsv: Similarity matrix for profiles.PMC.tsv following the algorithm Cosine
  • Correlation matrices for similarities calculated for title-and-abstract, taking as reference the similarity values obtained with PMRA for word stems on title-and-abstract.
      • pearsonCorrelation.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv
      • pearsonCorrelationTopic.PMRA.tsv: Correlation for similarity.PMRA.profiles.PMID.tsv broken down by TREC topic
      • pearsonCorrelation.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv
      • pearsonCorrelationTopic.BM25_1.2_0.75.tsv: Correlation for similarity.BM25_1.2_0.75.profiles.PMID.tsv broken down by TREC topic
      • pearsonCorrelation.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv
      • pearsonCorrelationTopic.COSINE.tsv: Correlation for similarity.COSINE.profiles.PMID.tsv broken down by TREC topic
  • Precision and recall summaries for the similarities calculated based on title-and-abstract.
      • StatsAllSummary.xlsx: Precision and recall at a global level, i.e., without considering TREC topics. This file includes information for BM25 with multiple values for the constants k and b
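
To illustrate how a profile file could feed one row of a cosine similarity matrix, here is a minimal Python sketch. It is not the authors' implementation. It assumes a layout of one row per article-term pair, and the column names (PMID, PMC, term, TF, IDF) are hypothetical, so check the heading row of the real files. It also assumes that terms are weighted by TF times IDF, which the record does not state. The sample values are made up.

```python
import csv
import io
import math
from collections import defaultdict

# Made-up sample in an ASSUMED layout: one row per (article, term) pair.
# The column names are hypothetical; the real heading row may differ.
SAMPLE = (
    "PMID\tPMC\tterm\tTF\tIDF\n"
    "111\t0\tC0001\t2\t1.5\n"
    "111\t0\tC0002\t1\t0.5\n"
    "222\t9001\tC0001\t1\t1.5\n"
    "222\t9001\tC0003\t3\t2.0\n"
    "333\t0\tC0004\t1\t1.0\n"
)


def load_profiles(handle):
    """Return {pmid: {term: weight}} with weight = TF * IDF (an assumption)."""
    profiles = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        profiles[row["PMID"]][row["term"]] = float(row["TF"]) * float(row["IDF"])
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile shares nothing with any article
    return dot / (norm_a * norm_b)


def similarity_row(reference, profiles):
    """One matrix row: the reference article against all other articles."""
    return {
        pmid: cosine(profiles[reference], profile)
        for pmid, profile in profiles.items()
        if pmid != reference
    }


profiles = load_profiles(io.StringIO(SAMPLE))
row = similarity_row("111", profiles)
for pmid, score in sorted(row.items()):
    print(pmid, round(score, 4))
```

To try this on a real file, replace io.StringIO(SAMPLE) with an opened file handle, for example open("profiles.PMID.tsv", newline="", encoding="utf-8"), and adjust the column names to match its headings.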

Visualizations of the correlation matrices, as well as scatter plots for full-text-based similarity, are available at http://ljgarcia.github.io/semsim.benchmark
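
The correlation files compare each metric against the PMRA word-stem baseline. As a rough illustration of that kind of comparison, the following Python sketch computes a Pearson correlation between two similarity rows for the same reference article. The similarity values and PMIDs are made up, and the choice to correlate only articles present in both rows is an assumption, not a description of how the authors built their files.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up similarity rows for one reference article, keyed by PMID.
baseline = {"222": 0.80, "333": 0.10, "444": 0.40}                 # e.g. PMRA on stems
candidate = {"222": 0.70, "333": 0.20, "444": 0.50, "555": 0.90}   # e.g. Cosine on concepts

# Assumption: compare only articles that appear in both rows.
shared = sorted(baseline.keys() & candidate.keys())
r = pearson([baseline[p] for p in shared], [candidate[p] for p in shared])
print(f"Pearson r over {len(shared)} shared articles: {r:.4f}")
```

A value of r close to 1 would mean that the candidate metric ranks articles much like the baseline does for that reference article.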

More information

  • DOI: 10.5281/zenodo.13323

Subjects

  • Semantic similarity, semantic annotation, scientific literature, similarity metrics

Dates

  • Publication date: 2014
  • Issued: December 19, 2014

Format

electronic resource

Related items

  • Relationship: IsPartOf
  • URI: https://zenodo.org/communities/zenodo