This is a limited proof of concept to search for research data, not a production system.

Title: huggingface/pytorch-transformers: DistilBERT, GPT-2 Large, XLM multilingual models, bug fixes

Type: Software

Thomas Wolf, Lysandre Debut, Victor SANH, Denis, Matt, Grégory Châtel, Julien Chaumond, Tim Rault, Catalin Voss, Fei Wang, Malte Pietsch, Davide Fiocco, dhanajitb, Stefan Schweter, Ananya Harsh Jha, yzy5630, Yongbo Wang, Shijie Wu, Guillem García Subies, Weixin Wang, Zeyao Du, Chi-Liang, Liu, Nikolay Korolev, Joel Grus, Jade Abbott, David Pollack, matej-svejda, Clement, Ailing, Abhishek Rao (2019): huggingface/pytorch-transformers: DistilBERT, GPT-2 Large, XLM multilingual models, bug fixes. Zenodo. Software. https://zenodo.org/record/3385998

Authors: Thomas Wolf (@huggingface) ; Lysandre Debut (Hugging Face) ; Victor SANH (@huggingface) ; Denis ; Matt ; Grégory Châtel (DisAItek & Intel AI Innovators) ; Julien Chaumond (Hugging Face) ; Tim Rault (@huggingface) ; Catalin Voss (Stanford University) ; Fei Wang (@ShannonAI) ; Malte Pietsch (deepset) ; Davide Fiocco ; dhanajitb ; Stefan Schweter ; Ananya Harsh Jha ; yzy5630 ; Yongbo Wang (Red Hat) ; Shijie Wu ; Guillem García Subies ; Weixin Wang ; Zeyao Du ; Chi-Liang, Liu (@ntu-spml-lab @Yoctol) ; Nikolay Korolev (@JetBrains) ; Joel Grus (@allenai) ; Jade Abbott (@RetroRabbit) ; David Pollack (i2x) ; matej-svejda ; Clement (@huggingface) ; Ailing ; Abhishek Rao (@microsoft) ;

Summary

New model architecture: DistilBERT

Adding Hugging Face's new transformer architecture, DistilBERT, described in Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.

This new model architecture comes with two pretrained checkpoints:

  • distilbert-base-uncased: the base DistilBert model
  • distilbert-base-uncased-distilled-squad: DistilBert model fine-tuned with distillation on SQuAD
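
As a quick orientation, the sketch below loads the base checkpoint by its shortcut name and extracts hidden states. The class names (DistilBertModel, DistilBertTokenizer) follow the library's naming pattern but are not listed in this record, so treat them as assumptions rather than a verbatim example from the release.

```python
import torch
from pytorch_transformers import DistilBertModel, DistilBertTokenizer  # assumed class names

# Load the base DistilBERT checkpoint by the shortcut name listed above.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.eval()

# Encode a sentence and fetch the last hidden states.
input_ids = torch.tensor([tokenizer.encode("DistilBERT is smaller, faster, cheaper, lighter.")])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]   # shape: (batch, sequence, hidden)
print(last_hidden_states.shape)
```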

An awaited new pretrained checkpoint: GPT-2 large (774M parameters)

The third OpenAI GPT-2 checkpoint (GPT-2 large) is available in the library under the shortcut name gpt2-large: 774M parameters, 36 layers, and 20 heads.
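
A minimal sketch of loading the new checkpoint by its shortcut name and doing a greedy next-token prediction with the library's existing GPT-2 classes; note that the 774M-parameter weights require a correspondingly large download and memory footprint.

```python
import torch
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

# 'gpt2-large' is the shortcut name introduced in this release (774M parameters).
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')
model.eval()

input_ids = torch.tensor([tokenizer.encode("Research data should be")])
with torch.no_grad():
    logits = model(input_ids)[0]               # shape: (batch, sequence, vocab)
next_token_id = int(torch.argmax(logits[0, -1]))
print(tokenizer.decode([next_token_id]))       # greedy next-token prediction
```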

New XLM multilingual pretrained checkpoints in 17 and 100 languages

We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.
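
The record does not list the shortcut names of the two new checkpoints, so the sketch below assumes xlm-mlm-17-1280 for the 17-language model; it also uses the tokenizer's language-to-id mapping mentioned in the change list further down, which should likewise be treated as an assumption.

```python
import torch
from pytorch_transformers import XLMModel, XLMTokenizer

# Assumption: 'xlm-mlm-17-1280' is the shortcut name of the 17-language checkpoint.
tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-17-1280')
model = XLMModel.from_pretrained('xlm-mlm-17-1280')
model.eval()

input_ids = torch.tensor([tokenizer.encode("Ceci est une phrase en français.")])
# Language embeddings: map the language code to its id via the tokenizer's lang2id mapping.
langs = torch.full_like(input_ids, tokenizer.lang2id['fr'])
with torch.no_grad():
    hidden_states = model(input_ids, langs=langs)[0]
print(hidden_states.shape)
```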

New dependency: sacremoses

Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). We now rely on sacremoses, a Python port of the Moses tokenizer, truecaser and normalizer by @alvations, for XLM word tokenization.
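
To illustrate what the new dependency provides, the snippet below runs sacremoses' punctuation normalizer and Moses tokenizer directly; it is illustrative only and not the library's internal tokenization code.

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer

# Moses-style preprocessing, as ported to Python by sacremoses.
normalizer = MosesPunctNormalizer(lang='en')
tokenizer = MosesTokenizer(lang='en')

text = normalizer.normalize("Isn't Moses tokenization nice?")
tokens = tokenizer.tokenize(text, escape=False)
print(tokens)   # a list of Moses-style word tokens
```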

In a few languages (Thai, Japanese and Chinese), the XLM tokenizer requires additional dependencies. These dependencies are optional at the library level; using the XLM tokenizer in one of these languages without the corresponding dependency installed will raise an error message with installation instructions. The additional optional dependencies are listed below (a minimal sketch of this lazy-import behaviour follows the list):

  • pythainlp: Thai tokenizer
  • kytea: Japanese tokenizer, a wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
  • jieba: Chinese tokenizer *

* XLM used the Stanford Segmenter. However, the wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead and will be deprecated. Jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument that lets users segment the sentence themselves and bypass the segmenter. As a reference, I also include nltk.tokenize.stanford_segmenter in this PR.
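
The sketch promised above shows the general lazy-import pattern for optional language-specific tokenizers; it is not the library's actual code, just an illustration of how a missing dependency can produce an error with installation instructions.

```python
def thai_tokenize(text):
    """Tokenize Thai text, importing the optional dependency lazily."""
    try:
        from pythainlp.tokenize import word_tokenize
    except ImportError:
        raise ImportError("Thai tokenization for XLM requires pythainlp: pip install pythainlp")
    return word_tokenize(text)

def chinese_tokenize(text):
    """Tokenize Chinese text with jieba (pip-installable, faster than the Stanford Segmenter)."""
    try:
        import jieba
    except ImportError:
        raise ImportError("Chinese tokenization for XLM requires jieba: pip install jieba")
    return list(jieba.cut(text))
```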

Bug fixes and improvements to the library modules

  • Bertology script has seen major improvements (@tuvuumass)
  • Iterative tokenization is now faster and accepts an arbitrary number of added tokens (@samvelyan)
  • Added RoBERTa to AutoModels and AutoTokenizers (@LysandreJik)
  • Added the GPT-2 Large 774M model (@thomwolf)
  • Added language model fine-tuning with GPT/GPT-2 (CLM) and BERT/RoBERTa (MLM) (@LysandreJik @thomwolf)
  • Multi-GPU training has been patched (@FeiWang96)
  • Scripts are updated to reflect PyTorch 1.1.0 changes (scheduler, optimizer) (@Morizeyao, @adai183)
  • Updated the in-depth BERT fine-tuning scripts to pytorch-transformers (@Morizeyao)
  • Models with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (@LysandreJik @thomwolf)
  • Added proxies and force_download options to the from_pretrained() method, to use proxies and update cached models/tokenizers (@thomwolf) (see the sketch after this list)
  • Added a shortcut to each special token with _id properties (e.g. tokenizer.cls_token_id for the vocabulary id of tokenizer.cls_token) (@thomwolf)
  • Fixed the GPT-2 and RoBERTa tokenizers so that sentences to be tokenized always begin with at least one space (see the note by the fairseq authors) (@thomwolf)
  • Fixed and cleaned up byte-level BPE tests (@thomwolf)
  • Updated the test classes for OpenAI GPT and GPT-2 so that these models are tested against the common tests (@LysandreJik)
  • Fixed a warning raised when the decode method is called for a model with no sep_token, like GPT-2 (@LysandreJik)
  • Updated the tokenizer saving method (@boy2000-007man)
  • SpaCy tokenizers have been updated in the tokenizers (@GuillemGSubies)
  • Stable EnvironmentErrors have been added to utility files (@abhishekraok)
  • Fixed a distributed barrier hang (@VictorSanh)
  • Encoding functions now return the input tokens instead of throwing an error when not implemented in a child class (@LysandreJik)
  • Changed the layer norm code to PyTorch's native layer norm (@dhpollack)
  • Improved tokenization of XLM for multilingual inputs (@shijie-wu)
  • Added language input and access to language-to-id conversion in the XLM tokenizer (@thomwolf)
  • Added pretrained configuration properties for tokenizers, with serialization logic (saving/reloading the tokenizer configuration) (@thomwolf)
  • Added new AutoModels: AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering (@LysandreJik)
  • Torch.hub is now based on AutoModels (@LysandreJik @thomwolf)
  • Fixed the Transformer-XL attention mask dtype to be bool (@CrafterKolyan)
  • Added the DistilBert model architecture and checkpoints (@VictorSanh @LysandreJik @thomwolf)
  • Fixes to the DistilBert configuration and training script (@stefan-it)
  • Fixed the XLNet attention mask for fp16 (@ziliwang)
  • Documentation auto-deploy (@LysandreJik)
  • Fix to add a tuple of tokens (@epwalsh)
  • Updated the fp16 apex implementation in scripts (@anhnt170489)
  • Fixed XLNet bias resizing when adding/removing tokens (@LysandreJik)
  • Fixed tokenizer reloading in the example scripts (@rabeehk)
  • Fixed a byte-level decoding error when using added tokens (@thomwolf @LysandreJik)
  • Fixed the epsilon value in RoBERTa pretrained checkpoints (@julien-c)
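
The sketch referenced in the list exercises a few of the additions above: the new AutoModel/AutoTokenizer classes, the proxies and force_download options on from_pretrained(), and the new special-token _id shortcuts. The proxy address is a placeholder, not a real endpoint.

```python
from pytorch_transformers import AutoModelWithLMHead, AutoTokenizer

# New AutoModels resolve the architecture from the shortcut name.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelWithLMHead.from_pretrained(
    'gpt2',
    force_download=False,                          # set True to refresh the cached weights
    proxies={'https': 'http://10.10.1.10:3128'},   # placeholder proxy address
)

# New *_id shortcuts for special tokens, e.g. on a BERT tokenizer.
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(bert_tokenizer.cls_token, bert_tokenizer.cls_token_id)
print(bert_tokenizer.sep_token, bert_tokenizer.sep_token_id)
```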

More information

  • DOI: 10.5281/zenodo.3385998

Dates

  • Publication date: 2019
  • Issued: September 04, 2019

Rights

  • Open Access (info:eu-repo/semantics/openAccess)


Format

electronic resource

Related items

  • IsSupplementTo: https://github.com/huggingface/pytorch-transformers/tree/1.2.0
  • IsVersionOf: https://doi.org/10.5281/zenodo.3385997
  • IsPartOf: https://zenodo.org/communities/zenodo