Title: huggingface/transformers: FlauBERT, MMBT, Dutch model, improved documentation, training from scratch, clean Python code
Type: Software
Thomas Wolf, Lysandre Debut, Victor SANH, Julien Chaumond, Aymeric Augustin, Rémi Louf, Funtowicz Morgan, Stefan Schweter, Denis, erenup, Matt, Piero Molino, Grégory Châtel, Patrick von Platen, Tim Rault, MOI Anthony, Catalin Voss, Bilal Khan, Bram Vanroy, Fei Wang, Julien Plu, Malte Pietsch, Louis Martin, Davide Fiocco, dhanajitb, Jinoo, Ananya Harsh Jha, Juha Kiili, Guillem García Subies, Clement (2020): huggingface/transformers: FlauBERT, MMBT, Dutch model, improved documentation, training from scratch, clean Python code. Zenodo. Software. https://zenodo.org/record/3633003
Summary
FlauBERT, MMBT
MMBT was added to the list of available models, as the first multi-modal model to make it into the library. It accepts a transformer model as well as a computer vision model in order to classify text together with images. The MMBT model is from Supervised Multimodal Bitransformers for Classifying Images and Text by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz and Davide Testuggine (https://github.com/facebookresearch/mmbt/). Added by @suvrat96.
A new Dutch BERT model was added under the wietsedv/bert-base-dutch-cased identifier. Added by @wietsedv.
A new French model, FlauBERT, based on XLM, was added. The FlauBERT model is from FlauBERT: Unsupervised Language Model Pre-training for French (https://github.com/getalp/Flaubert). Four checkpoints are added: small, base uncased, base cased and large.
New TF architectures (@jplu)
TensorFlow XLM-RoBERTa was added (@jplu). TensorFlow CamemBERT was added (@jplu).
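The new checkpoints above are loaded like any other pretrained model through the auto classes. A minimal sketch using the Dutch identifier quoted above (FlauBERT and the TensorFlow classes follow the same from_pretrained pattern; downloading the weights requires network access):

```python
from transformers import AutoModel, AutoTokenizer

# Community checkpoint added in this release; weights are downloaded on first use.
tokenizer = AutoTokenizer.from_pretrained("wietsedv/bert-base-dutch-cased")
model = AutoModel.from_pretrained("wietsedv/bert-base-dutch-cased")

input_ids = tokenizer.encode("Hallo, wereld!", return_tensors="pt")
outputs = model(input_ids)  # a tuple; outputs[0] is the last hidden state
```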
Python best practices (@aaugustin)
The quality of the source code was greatly improved by leveraging black, isort and flake8. A test, check_code_quality, was added; it checks that contributions respect the contribution guidelines related to those tools. Similarly, optional imports are better handled and raise more precise errors. Several requirements files were cleaned up, the contribution guidelines were updated, and the necessary dev dependencies are now taken from setup.py. You can clean up your code for a PR with (more details in CONTRIBUTING.md):
make style
make quality
Documentation (@LysandreJik)
The documentation was made uniform and better guidelines have been defined. This work is part of an ongoing effort to make transformers accessible to a larger audience. A glossary has been added, with definitions for the most frequently used inputs.
Furthermore, some tips are given for each model on its documentation page.
The code samples are now tested on a weekly basis alongside other slow tests.
Improved repository structure (@aaugustin)
The source code was moved from ./transformers to ./src/transformers. Since this changes the location of the source code, contributors must update their local development environment by uninstalling and re-installing the library.
Python 2 is not supported anymore (@aaugustin)
Version 2.3.0 was the last version to support Python 2. As we begin the year 2020, official Python 2 support has been dropped.
Parallel testing (@aaugustin)
Tests can now be run in parallel.
Sampling sequence generator (@rlouf, @thomwolf)
An abstract method, generate, was added to PreTrainedModel and is implemented in all models trained with CLM. It offers an API for text generation:
- with or without a prompt
- with or without beam search
- with greedy decoding or sampling
- with any combination of top-k, top-p and penalized repetitions
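A minimal sketch of the API, assuming the gpt2 checkpoint and the keyword names matching the options above:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Encode an optional prompt; generation also works without one.
input_ids = tokenizer.encode("The Eiffel Tower", return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_length=40,
        do_sample=True,          # sampling instead of greedy decoding
        top_k=50,                # top-k filtering
        top_p=0.95,              # nucleus (top-p) filtering
        repetition_penalty=1.2,  # penalize repeated tokens
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```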
Resuming training when interrupted (@bkkaggle)
Previously, when a training run was stopped, the only saved values were the model weights and configuration. The scripts now also save several other values: the global step, the current epoch, and the number of steps trained in the current epoch. When resuming a training run, all of those values are leveraged to correctly resume the training.
This applies to the following scripts: run_glue, run_squad, run_ner, run_xnli.
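Internally this boils down to checkpointing a few counters alongside the model. The snippet below is only an illustrative sketch under assumed names (training_state.pt, save_training_state and load_training_state are hypothetical), not the scripts' actual code:

```python
import os
import torch

def save_training_state(output_dir, model, optimizer, global_step, epoch, steps_in_epoch):
    # Illustrative only: save the extra training state alongside the model weights.
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(output_dir)  # weights + configuration, as before
    torch.save(
        {
            "global_step": global_step,
            "epoch": epoch,
            "steps_trained_in_current_epoch": steps_in_epoch,
            "optimizer": optimizer.state_dict(),
        },
        os.path.join(output_dir, "training_state.pt"),
    )

def load_training_state(output_dir, optimizer):
    # Restore the counters and optimizer state to resume where training stopped.
    state = torch.load(os.path.join(output_dir, "training_state.pt"))
    optimizer.load_state_dict(state["optimizer"])
    return state["global_step"], state["epoch"], state["steps_trained_in_current_epoch"]
```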
CLI (@julien-c, @mfuntowicz)
Model upload: the CLI now has better documentation, and files can now be removed.
Pipelines: the number of underlying FastAPI workers is now exposed, async forward methods were added, and the USE_TF and USE_TORCH environment variables were fixed so that they don't fight each other anymore.
Training from scratch (@julien-c)
The run_lm_finetuning.py script now handles training from scratch.
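Training from scratch means starting from a randomly initialized model defined only by a configuration, rather than from pretrained weights. A minimal sketch of the idea (the script itself wires this up through its own command-line arguments):

```python
from transformers import BertConfig, BertForMaskedLM

# Load only the architecture definition -- no pretrained weights.
config = BertConfig.from_pretrained("bert-base-uncased")

# Instantiating the model from the configuration gives randomly
# initialized weights, ready to be trained from scratch.
model = BertForMaskedLM(config)
```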
Changes in the configuration (@julien-c)
The configuration files now contain the architecture they refer to. There is no need to have the architecture in the file name, as was necessary before. This should ease the naming of community models.
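For example, the architecture can be read back from the configuration itself (a minimal sketch; the architectures attribute mirrors the field stored in the configuration files):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.architectures)  # e.g. ['BertForMaskedLM']
```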
New Auto models (@thomwolf)
A new type of AutoModel was added: AutoModelForPreTraining. It returns the base model that was used during pre-training. For most models this is the base model alongside a language modeling head, whereas for others it is a different model, e.g. BertForPreTraining for BERT.
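A minimal sketch:

```python
from transformers import AutoModelForPreTraining

# For BERT this resolves to BertForPreTraining (masked LM + next-sentence heads).
model = AutoModelForPreTraining.from_pretrained("bert-base-uncased")
print(type(model).__name__)  # BertForPreTraining
```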
HANS dataset (@ns-moosavi)
The HANS dataset was added to the examples. It allows for testing a model with an adversarial evaluation of natural language inference.
[BREAKING CHANGES] Ignored indices in PyTorch loss computing (@LysandreJik)
When using PyTorch, certain values can be ignored when computing the loss. In order for the loss function to understand which indices must be ignored, those have to be set to a certain value. Most of our models required those indices to be set to -1. We decided to set this value to -100 instead, as it is PyTorch's default value. This removes the discrepancy between user-implemented losses and the losses integrated in the models.
Further help from @r0mainK.
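Concretely, label positions set to -100 are skipped by PyTorch's cross-entropy loss, since -100 is its default ignore_index; a minimal sketch outside of any model:

```python
import torch
from torch.nn import CrossEntropyLoss

logits = torch.randn(4, 30522)              # scores for 4 token positions over a vocabulary
labels = torch.tensor([42, -100, 7, -100])  # positions set to -100 are ignored

loss_fct = CrossEntropyLoss()               # PyTorch's default ignore_index is -100
loss = loss_fct(logits, labels)             # only positions 0 and 2 contribute to the loss
```

When passing labels to the models, the -1 used previously should therefore be replaced with -100.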
Community additions/bug-fixes/improvements
- Can now save and load PreTrainedEncoderDecoder objects (@TheEdoardo93)
- RoBERTa now bears more similarity to the FairSeq implementation (@DomHudson, @thomwolf)
- Examples now better reflect the defaults of the encoding methods (@enzoampil)
- TFXLNet now has a correct input mask (@thomwolf)
- run_squad was fixed to allow better training for XLNet (@importpandas)
- Tokenization performance improvement (3-8x) (@mandubian)
- RoBERTa was added to the run_squad script (@erenup)
- Fixed the tokenization of special and added tokens (@vitaliyradchenko)
- Fixed an issue with language generation for XLM when the batch size is greater than 1 (@patrickvonplaten)
- Fixed an issue with the generate method which did not correctly handle the repetition penalty (@patrickvonplaten)
- Completed the documentation for repeating_words_penalty_for_language_generation (@patrickvonplaten)
- run_generation now leverages cached past input for models that have access to it (@patrickvonplaten)
- Finally managed to patch a rarely occurring bug with DistilBERT, eventually named DistilHeisenBug or HeisenDistilBug (@LysandreJik, with the help of @julien-c and @thomwolf)
- Fixed an import error in run_tf_ner (@karajan1001)
- Feature conversion for GLUE now has improved logging messages (@simonepri)
- Patched an issue with GPUs and run_generation (@alberduris)
- Added support for ALBERT and XLMRoBERTa to run_glue
- Fixed an issue with the DistilBERT tokenizer not loading correct configurations (@LysandreJik)
- Updated the SQuAD for distillation script to leverage the new SQuAD API (@LysandreJik)
- Fixed an issue with T5 related to its rp_bucket (@mschrimpf)
- PPLM now supports repetition penalties (@IWillPull)
- Modified the QA pipeline to consider all features for each example (@Perseus14)
- Patched an issue with a file lock (@dimagalat, @aaugustin)
- The bias is now resized together with the weights when resizing a vocabulary projection layer with a new vocabulary size (@LysandreJik); see the sketch after this list
- Fixed misleading token type IDs for RoBERTa: it doesn't leverage token type IDs, and this has been clarified in the documentation (@LysandreJik). Same for XLM-R (@maksym-del)
- Fixed prepare_for_model when tensorizing and returning token type IDs (@LysandreJik)
- Fixed the XLNet model which wouldn't work with torch 1.4 (@julien-c)
- Fetch all possible files remotely (@julien-c)
- BERT's BasicTokenizer respects never_split parameters (@DeNeutoy)
- Added a lower bound to the tqdm dependency (@brendan-ai2)
- Fixed GLUE processors failing on TensorFlow datasets (@neonbjb)
- XLMRobertaTokenizer can now be serialized (@brandenchan)
- A classifier dropout was added to ALBERT (@peteriz)
- The ALBERT configurations for v2 models were fixed to be identical to those output by Google (@LysandreJik)
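Regarding the vocabulary-resizing fix mentioned in the list above, a minimal sketch of the affected workflow (adding tokens, then resizing the embeddings and the tied output projection, bias included):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Add new tokens, then grow the embeddings -- and, with this fix,
# the output projection bias -- to the new vocabulary size.
tokenizer.add_tokens(["[NEW_TOKEN_1]", "[NEW_TOKEN_2]"])
model.resize_token_embeddings(len(tokenizer))
```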
More information
- DOI: 10.5281/zenodo.3633003
Dates
- Publication date: 2020
- Issued: January 31, 2020
Rights
- Open Access (info:eu-repo/semantics/openAccess)
Format
electronic resource
Related items
Relationship | URI
---|---
IsSupplementTo | https://github.com/huggingface/transformers/tree/v2.4.0
IsVersionOf | https://doi.org/10.5281/zenodo.3385997
IsPartOf | https://zenodo.org/communities/zenodo