This repository contains a parallel corpus built from the open-access Google Patents dataset in 74 language pairs, comprising more than 68 million sentences and 800 million tokens. Sentences were automatically aligned using the Hunalign algorithm for the 22 largest language pairs, while the remaining pairs were aligned at the abstract (i.e. paragraph) level. We demonstrate the capabilities of our corpus by training Neural Machine Translation (NMT) models for the 9 main language pairs, for a total of 18 models. Our parallel corpus is freely available in TSV format.
DOI: https://doi.org/10.6084/m9.figshare.12627632
If you use this corpus, please cite the following work:
@inproceedings{soares-etal-2020-parapat,
title = "{P}ara{P}at: The Multi-Million Sentences Parallel Corpus of Patents Abstracts",
author = "Soares, Felipe and
Stevenson, Mark and
Bartolome, Diego and
Zaretskaya, Anna",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.465",
pages = "3769--3774",
language = "English",
ISBN = "979-10-95546-34-4",
}
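Since the ParaPat corpus above is distributed as TSV, a minimal Python sketch for reading sentence pairs might look like the following. It assumes one pair per line with the source text in the first column and the target text in the second; the column order, the function name `read_parallel_tsv`, and the sample sentences are illustrative assumptions and are not taken from the released files.

```python
import io


def read_parallel_tsv(handle, src_col=0, tgt_col=1):
    """Yield (source, target) pairs from a tab-separated text stream.

    The column positions are assumptions; check the downloaded files
    and adjust src_col / tgt_col if their layout differs.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip lines that do not have both columns.
        if len(fields) <= max(src_col, tgt_col):
            continue
        src = fields[src_col].strip()
        tgt = fields[tgt_col].strip()
        # Skip pairs where either side is empty.
        if src and tgt:
            yield src, tgt


# Hypothetical in-memory sample standing in for a real corpus file.
SAMPLE = (
    "A device for cooling liquids.\tUm dispositivo para resfriar líquidos.\n"
    "malformed line without a tab\n"
    "\tmissing source\n"
    "The method comprises two steps.\tO método compreende duas etapas.\n"
)

pairs = list(read_parallel_tsv(io.StringIO(SAMPLE)))
for src, tgt in pairs:
    print(src, "|||", tgt)
```

For a real file, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to one of the downloaded TSV files.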
A parallel corpus of full-text scientific articles collected from the SciELO database in the following languages: English, Portuguese and Spanish. The corpus is sentence aligned for all language pairs, and trilingually aligned for a small subset of sentences. Alignment was carried out using the Hunalign algorithm.
DOI: https://doi.org/10.6084/m9.figshare.5382757.v2
If you use this corpus, please cite the following work:
@inproceedings{soares2018large,
title={A Large Parallel Corpus of Full-Text Scientific Articles},
author={Soares, Felipe and Moreira, Viviane and Becker, Karin},
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)},
year={2018}
}
A parallel corpus of thesis and dissertation abstracts in English and Portuguese, collected from the CAPES website (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior), Brazil. The corpus is sentence aligned for the English-Portuguese pair. Approximately 240,000 documents were collected and aligned using the Hunalign algorithm.
DOI: https://doi.org/10.6084/m9.figshare.5995519.v2
If you use this corpus, please cite the following work:
@inproceedings{soares2018parallel,
title={A Parallel Corpus of Theses and Dissertations Abstracts},
author={Soares, Felipe and Yamashita, Gabrielli Harumi and Anzanello, Michel Jose},
booktitle={International Conference on Computational Processing of the Portuguese Language},
pages={345--352},
year={2018},
organization={Springer}
}
In this paper, we describe the steps we took to adapt medical semantic evaluation datasets to Spanish for the first time, which is of particular relevance given the considerable volume of EHRs in this language. We also describe the creation of in-domain medical word embeddings for Spanish using the state-of-the-art FastText model. We performed an intrinsic evaluation with our adapted datasets, as well as an extrinsic evaluation with a named entity recognition system, using a general-domain embedding as the baseline. Both experiments showed that our embeddings are suitable for medical NLP in Spanish and are more accurate than general-domain ones.
DOI: https://doi.org/10.6084/m9.figshare.7807928
If you use these embeddings, please cite the following work:
@inproceedings{soares-etal-2019-medical,
title = "Medical Word Embeddings for {S}panish: Development and Evaluation",
author = "Soares, Felipe and
Villegas, Marta and
Gonzalez-Agirre, Aitor and
Krallinger, Martin and
Armengol-Estap{\'e}, Jordi",
booktitle = "Proceedings of the 2nd Clinical Natural Language Processing Workshop",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota, USA",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W19-1916",
doi = "10.18653/v1/W19-1916",
pages = "124--133",
}
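To illustrate what an intrinsic evaluation of the word embeddings above can involve, here is a minimal Python sketch that scores word pairs by cosine similarity and compares the ranking against human similarity judgements with Spearman's rank correlation. This is a common setup for similarity datasets and is assumed here; it is not necessarily the exact protocol used in the paper. The vectors, word pairs, and gold scores are toy values invented for illustration; real vectors would be loaded from the released embedding files.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy 3-dimensional vectors (hypothetical, not from the released model).
vectors = {
    "fiebre": [0.9, 0.1, 0.0],
    "temperatura": [0.8, 0.2, 0.1],
    "fractura": [0.1, 0.9, 0.2],
    "hueso": [0.3, 0.7, 0.4],
    "aspirina": [0.3, 0.1, 0.9],
}

# Hypothetical gold similarity judgements: (word1, word2, human score).
gold = [
    ("fiebre", "temperatura", 3.5),
    ("fractura", "hueso", 3.0),
    ("fiebre", "hueso", 1.0),
    ("aspirina", "fractura", 1.5),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in gold]
human_scores = [score for _, _, score in gold]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A higher correlation means the embedding similarities order the word pairs more like the human annotators do, which is how in-domain and general-domain embeddings can be compared on the same dataset.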
This package contains the files needed to use the Neural Machine Translation (NMT) system for the Biomedical Domain.
The available language directions for translation are:
Github: https://github.com/PlanTL-SANIDAD/Medical-Translator-WMT19