Datasets

Parallel Corpora

ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

This repository contains a parallel corpus built from the open-access Google Patents dataset. It covers 74 language pairs and comprises more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the 22 largest language pairs, while the remaining pairs were aligned at the abstract (i.e. paragraph) level. We demonstrate the capabilities of our corpus by training Neural Machine Translation (NMT) models for the 9 main language pairs, for a total of 18 models. Our parallel corpus is freely available in TSV format.

DOI: https://doi.org/10.6084/m9.figshare.12627632
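
Since the corpus is distributed as TSV, a minimal Python sketch for iterating over aligned pairs could look like the following. The column layout (source text in the first column, target text in the second) and the file name in the comment are assumptions made for illustration; check the downloaded files for the actual layout before relying on it.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple


def read_parallel_tsv(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from a tab-separated file handle.

    Assumption: column 0 holds the source text and column 1 the target text.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, target = row[0].strip(), row[1].strip()
        if source and target:
            yield source, target


# Illustrative in-memory sample; for a real file (hypothetical name) use:
#   with open("en-pt.tsv", encoding="utf-8", newline="") as handle: ...
sample = io.StringIO("A device for mixing liquids.\tUm dispositivo para misturar líquidos.\n")
pairs = list(read_parallel_tsv(sample))
```

Reading the file as a stream rather than loading it whole keeps memory use flat, which matters for the larger language pairs.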

If you use this corpus, please cite the following work:

@inproceedings{soares-etal-2020-parapat,
    title = "{P}ara{P}at: The Multi-Million Sentences Parallel Corpus of Patents Abstracts",
    author = "Soares, Felipe  and
      Stevenson, Mark  and
      Bartolome, Diego  and
      Zaretskaya, Anna",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.465",
    pages = "3769--3774",
    language = "English",
    ISBN = "979-10-95546-34-4",
}


Parallel corpus of full-text articles in Portuguese, English and Spanish from SciELO

A parallel corpus of full-text scientific articles collected from the SciELO database in the following languages: English, Portuguese and Spanish. The corpus is sentence-aligned for all language pairs, and trilingually aligned for a small subset of sentences. Alignment was carried out using the Hunalign algorithm.

DOI: https://doi.org/10.6084/m9.figshare.5382757.v2

If you use this corpus, please cite the following work:

@inproceedings{soares2018large,
  title={A Large Parallel Corpus of Full-Text Scientific Articles},
  author={Soares, Felipe and Moreira, Viviane and Becker, Karin},
  booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)},
  year={2018}
}

Parallel corpus of theses and dissertation abstracts in Portuguese and English from CAPES

A parallel corpus of thesis and dissertation abstracts in English and Portuguese, collected from the website of CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil). The corpus is sentence-aligned. Approximately 240,000 documents were collected and aligned using the Hunalign algorithm.

DOI: https://doi.org/10.6084/m9.figshare.5995519.v2

If you use this corpus, please cite the following work:

@inproceedings{soares2018parallel,
  title={A Parallel Corpus of Theses and Dissertations Abstracts},
  author={Soares, Felipe and Yamashita, Gabrielli Harumi and Anzanello, Michel Jose},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={345--352},
  year={2018},
  organization={Springer}
}

Embeddings

Medical Word Embeddings for Spanish: Development and Evaluation

In this paper, we describe the steps we took to adapt medical semantic evaluation datasets to Spanish for the first time, which is of particular relevance given the considerable volume of EHRs in this language, as well as the creation of in-domain medical word embeddings for Spanish using the state-of-the-art FastText model. We performed intrinsic evaluation with our adapted datasets, as well as extrinsic evaluation with a named entity recognition system, using a general-domain embedding as the baseline. Both experiments showed that our embeddings are suitable for use in medical NLP in Spanish and are more accurate than general-domain ones.

DOI: https://doi.org/10.6084/m9.figshare.7807928

If you use these embeddings, please cite the following work:

@inproceedings{soares-etal-2019-medical,
    title = "Medical Word Embeddings for {S}panish: Development and Evaluation",
    author = "Soares, Felipe  and
      Villegas, Marta  and
      Gonzalez-Agirre, Aitor  and
      Krallinger, Martin  and
      Armengol-Estap{\'e}, Jordi",
    booktitle = "Proceedings of the 2nd Clinical Natural Language Processing Workshop",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-1916",
    doi = "10.18653/v1/W19-1916",
    pages = "124--133",
}

Translation Models

Neural Machine Translation for the Biomedical Domain - WMT19

This package contains the files needed to use the Neural Machine Translation (NMT) system for the Biomedical Domain.

The available language directions for translation are:

  • English to Spanish
  • Spanish to English
  • English to Portuguese
  • Portuguese to English
  • Spanish to Portuguese
  • Portuguese to Spanish

GitHub: https://github.com/PlanTL-SANIDAD/Medical-Translator-WMT19

DOI: https://doi.org/10.5281/zenodo.3346802