Title: huggingface/transformers: BART, organizations, community notebooks, lightning examples, dropping Python 3.5

Type Software Thomas Wolf, Lysandre Debut, Julien Chaumond, Victor SANH, Patrick von Platen, Aymeric Augustin, Rémi Louf, Funtowicz Morgan, Stefan Schweter, Denis, erenup, Sam Shleifer, Manuel Romero, Matt, Piero Molino, Grégory Châtel, Bram Vanroy, Tim Rault, Gunnlaugur Thor Briem, Anthony MOI, Malte Pietsch (2020): huggingface/transformers: BART, organizations, community notebooks, lightning examples, dropping Python 3.5. Zenodo. Software. https://zenodo.org/record/3726146

Authors: Thomas Wolf (@huggingface) ; Lysandre Debut (Hugging Face) ; Julien Chaumond (Hugging Face) ; Victor SANH (@huggingface) ; Patrick von Platen ; Aymeric Augustin (@canalplus) ; Rémi Louf ; Funtowicz Morgan (HuggingFace) ; Stefan Schweter ; Denis ; erenup ; Sam Shleifer (Huggingface) ; Manuel Romero ; Matt ; Piero Molino ; Grégory Châtel (DisAItek & Intel AI Innovators) ; Bram Vanroy (@UGent) ; Tim Rault (@huggingface) ; Gunnlaugur Thor Briem (Qlik) ; Anthony MOI (Hugging Face) ; Malte Pietsch (deepset)

Summary

New Model: BART (added by @sshleifer)

BART is one of the first sequence-to-sequence models in the library, and achieves state-of-the-art results on text generation tasks such as abstractive summarization. Three sets of pretrained weights are released:

  • bart-large: the pretrained base model
  • bart-large-cnn: the base model finetuned on the CNN/Daily Mail abstractive summarization task
  • bart-large-mnli: the base model finetuned on the MNLI classification task
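As an illustrative sketch (not part of the release notes), summarizing with the bart-large-cnn weights looks roughly like this. The BartForConditionalGeneration class and the namespaced checkpoint name facebook/bart-large-cnn follow the current library and may differ slightly from the exact names in v2.6.0:

```python
# Hedged sketch: class and checkpoint names follow the modern library;
# in v2.6.0 the checkpoint was published as plain "bart-large-cnn".
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "PG&E stated it scheduled the blackouts in response to forecasts ..."
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")

# Beam search is the decoding strategy typically used for abstractive summarization.
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```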

Related:

  • Paper
  • Model pages are at https://huggingface.co/facebook
  • Docs
  • Blog post

Big thanks to the original authors, especially Mike Lewis, Yinhan Liu, and Naman Goyal, who helped answer our questions.

Model sharing CLI: support for organizations

The Hugging Face API for model upload now supports organizations.

Notebooks (@mfuntowicz)

A few beginner-oriented notebooks were added to the library, aiming to demystify the two libraries huggingface/transformers and huggingface/tokenizers. Contributors are welcome to contribute links to their own notebooks as well.

pytorch-lightning examples (@srush)

Examples leveraging pytorch-lightning were added, led by @srush. The first example added is the NER example; the second is a lightning GLUE example, added by @nateraw.

New model architectures: CamembertForQuestionAnswering, AlbertForTokenClassification

  • CamembertForQuestionAnswering was added to the library and to the SQuAD script (@maximeilluin)
  • AlbertForTokenClassification was added to the library and to the NER example (@marma)

Multiple fixes were done on the fast tokenizers to make them entirely compatible with the Python tokenizers (@mfuntowicz)

Most of these fixes were done in the 2.5.1 patch release. Fast tokenizers should now have exactly the same API as the Python ones, with some additional functionality.
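A minimal sketch of that API parity, assuming the BertTokenizerFast class shipped in this release line; after the 2.5.1 fixes both tokenizers are expected to behave identically on plain text:

```python
# Hedged sketch: the slow (Python) and fast (Rust-backed) tokenizers
# expose the same methods and should agree on ordinary input.
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "Fast and slow tokenizers share one API."
# Same method, same token IDs for plain text like this.
assert slow.encode(text) == fast.encode(text)
```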

Docker images (@mfuntowicz)

Docker images for transformers were added.

Generation overhaul (@patrickvonplaten)

  • Special token ID logic was improved in run_generation and in the corresponding tests.
  • Slow generation tests were added for pretrained LM models.
  • Greedy generation is supported when doing beam search.
  • Sampling is supported when doing beam search.
  • The generate functionality was added to TF2, with beam search, greedy generation and sampling; integration tests were added.
  • A no_repeat_ngram_size kwarg was added to avoid redundant generations (@sshleifer); see the sketch after this list.
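A minimal sketch of the overhauled generate() API, using the no_repeat_ngram_size kwarg named in these notes:

```python
# Hedged sketch: no_repeat_ngram_size=2 blocks any bigram from being
# generated twice, curbing the repetition loops common in greedy decoding.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("In this release, the library", return_tensors="pt")
output = model.generate(
    input_ids,
    max_length=40,
    num_beams=4,             # beam search, which now also supports sampling
    no_repeat_ngram_size=2,  # never produce the same bigram twice
    early_stopping=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```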

Encoding methods now output only model-specific inputs

Models such as DistilBERT and RoBERTa do not make use of token type IDs. These inputs are no longer returned by the encoding methods, unless explicitly requested at tokenizer initialization.
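A minimal sketch of the difference, comparing a model that uses token type IDs with one that does not:

```python
# BERT uses token type IDs, so its encoding includes them;
# DistilBERT does not, so they are no longer returned.
from transformers import BertTokenizer, DistilBertTokenizer

bert = BertTokenizer.from_pretrained("bert-base-uncased")
distil = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

print(bert.encode_plus("hello world").keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(distil.encode_plus("hello world").keys())
# dict_keys(['input_ids', 'attention_mask'])
```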

Pipelines support summarization (@sshleifer)

The default architecture is bart-large-cnn, with the generation parameters published in the paper.
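A minimal sketch: with no model specified, the new summarization pipeline falls back to that default architecture.

```python
# Hedged sketch: pipeline("summarization") loads bart-large-cnn by default,
# per the release notes, with the paper's generation parameters.
from transformers import pipeline

summarizer = pipeline("summarization")
print(summarizer("A long article to condense into a couple of sentences ..."))
```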

Models may now re-use the cache every time without prompting S3 (@BramVanroy)

Previously, every attempt to load a model from a pretrained checkpoint would check that the locally cached file's etag still matched the one hosted on S3. This has been updated so that a local_files_only argument skips the check, which can be useful when a firewall is involved.
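A minimal sketch of the new argument, as named in these notes:

```python
# Hedged sketch: local_files_only=True loads only from the local cache and
# skips the S3 etag check, so no network round-trip is attempted.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)
```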

Usage examples for common tasks (@LysandreJik)

In a continuing effort to onboard new users (whether new to the library or to NLP in general), some usage examples were added to the documentation. These usage examples show how to run inference on several tasks:

  • NER
  • Sequence classification
  • Question answering (see the sketch after this list)
  • Causal language modeling
  • Masked language modeling
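For instance, a minimal sketch of extractive question answering via the pipeline API; the other tasks follow the same pattern, and the exact snippets live in the documentation:

```python
# Hedged sketch: the question-answering pipeline extracts an answer
# span from the given context.
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(
    question="Which model does the summarization pipeline use by default?",
    context="The summarization pipeline defaults to the bart-large-cnn checkpoint.",
)
print(result["answer"])  # e.g. "bart-large-cnn"
```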

Test suite on GPU (@julien-c)

CI now runs on GPU, for both PyTorch and TensorFlow.

Padding token ID needs to be set in order to pad (@patrickvonplaten)

Older tokenizers could pad even when no padding token was defined. This version updates them to match the expected behavior, which is the fast tokenizers' behavior: define a pad token, or an error is raised when trying to batch without one.
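A minimal sketch, using GPT-2, which ships without a padding token; the pad_to_max_length kwarg is the era-appropriate spelling (newer releases use padding=True):

```python
# Hedged sketch: batching with padding now raises unless a pad token is
# defined first; here the EOS token doubles as the pad token.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # define a pad token before padding

batch = tokenizer.batch_encode_plus(
    ["a short sequence", "a noticeably longer input sequence"],
    pad_to_max_length=True,
)
print(batch["input_ids"])
```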

Python >= 3.6

We're now dropping Python 3.5 support.

Community additions/bug-fixes/improvements

  • Added a warning when using add_special_tokens with the fast tokenizer encoding methods (@LysandreJik)
  • encode_plus was modified and tested to have exactly the same behaviour as encode, but batching input
  • Cleaned up the DistilBERT code (@guillaume-be)
  • Only use F.gelu for torch >= 1.4.0 (@sshleifer)
  • Added a get_vocab method to tokenizers, which can be used to retrieve the full vocabulary (@joeddav); see the sketch after this list
  • Corrected the behaviour of special_tokens_mask when add_special_tokens=False (@LysandreJik)
  • Removed the untested Model2LSTM and the non-working Model2Model
  • kwargs were passed to both the model and the configuration in AutoModels, which made the model crash (@LysandreJik)
  • Corrected transfo-xl tokenization regarding punctuation (@patrickvonplaten)
  • Better docstrings for XLNet (@patrickvonplaten)
  • Better operations for TPU support (@srush)
  • The XLM-R tokenizer is now tested and bug-free (@LysandreJik)
  • The XLM-R model and tokenizer now have integration tests (@patrickvonplaten)
  • Better documentation for tokenizers and pipelines (@LysandreJik)
  • All tests (slow and non-slow) now pass (@julien-c, @LysandreJik, @patrickvonplaten, @sshleifer, @thomwolf)
  • Correct attention mask with GPT-2 when using past (@patrickvonplaten)
  • Fixed the n_gpu count when the no_cuda flag is activated, in all examples (@VictorSanh)
  • Tested TF GPT-2 for correct behaviour regarding the past and attention mask variables (@patrickvonplaten)
  • Fixed a bug where some missing keys would not be identified (@LysandreJik)
  • Correct num_labels initialization (@LysandreJik)
  • Model special tokens were added to the pretrained configurations (@patrickvonplaten)
  • QA models for XLNet, XLM and FlauBERT are now set to their "simple" architectures when used in a pipeline
  • GPT-2 XL was added to TensorFlow (@patrickvonplaten)
  • The NER pytorch-lightning example was updated (@shubhamagarwal92)
  • Improved error message when loading a config/model with .from_pretrained() (@patrickvonplaten, @julien-c)
  • Cleaner special token initialization in modeling_xxx.py (@patrickvonplaten)
  • Fixed the learning rate scheduler placement in the run_ner.py script (@erip)
  • Use AutoModels in examples (@julien-c, @lifefeel)
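A minimal sketch of the new get_vocab method mentioned above: it returns the full token-to-index mapping as a plain dict.

```python
# get_vocab() exposes the tokenizer's entire vocabulary as {token: id}.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()
print(len(vocab))      # vocabulary size, 30522 for bert-base-uncased
print(vocab["hello"])  # integer ID for the token "hello"
```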

More information

  • DOI: 10.5281/zenodo.3726146

Dates

  • Publication date: 2020
  • Issued: March 24, 2020

Rights

  • Open Access (info:eu-repo/semantics/openAccess)


Format

electronic resource

Related items

  • IsSupplementTo: https://github.com/huggingface/transformers/tree/v2.6.0
  • IsVersionOf: https://doi.org/10.5281/zenodo.3385997
  • IsPartOf: https://zenodo.org/communities/zenodo