
Title: huggingface/transformers: Pegasus, DPR, self-documented outputs, new pipelines and MT support

Type Software Thomas Wolf, Lysandre Debut, Julien Chaumond, Patrick von Platen, Sam Shleifer, Victor SANH, Sylvain Gugger, Manuel Romero, Funtowicz Morgan, Aymeric Augustin, Rémi Louf, Stefan Schweter, Stas Bekman, Denis, erenup, Suraj Patil, Matt, Grégory Châtel, Piero Molino, Anthony MOI, Bram Vanroy, Clement, Julien Plu, Joe Davison, Gunnlaugur Thor Briem, Teven, Kevin Canwen Xu, Tim Rault, Catalin Voss, Lorenzo Ampil (2020): huggingface/transformers: Pegasus, DPR, self-documented outputs, new pipelines and MT support. Zenodo. Software. https://zenodo.org/record/4010585

Authors: Thomas Wolf (@huggingface) ; Lysandre Debut (Hugging Face) ; Julien Chaumond (Hugging Face) ; Patrick von Platen ; Sam Shleifer (Huggingface) ; Victor SANH (@huggingface) ; Sylvain Gugger ; Manuel Romero ; Funtowicz Morgan (HuggingFace) ; Aymeric Augustin (@qonto) ; Rémi Louf ; Stefan Schweter ; Stas Bekman (Stasosphere Online Inc.) ; Denis ; erenup ; Suraj Patil (Wynum) ; Matt ; Grégory Châtel (DisAItek & Intel AI Innovators) ; Piero Molino ; Anthony MOI (Hugging Face) ; Bram Vanroy (@UGent) ; Clement (@huggingface) ; Julien Plu (Leboncoin Lab) ; Joe Davison (Hugging Face) ; Gunnlaugur Thor Briem (Qlik) ; Teven (HuggingFace) ; Kevin Canwen Xu ; Tim Rault (@huggingface) ; Catalin Voss (Stanford University) ; Lorenzo Ampil (@thinkingmachines) ;


Summary

Pegasus, mBART, DPR, self-documented outputs and new pipelines

Pegasus

The Pegasus model from PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu, was added to the library in PyTorch.

Model implemented as a collaboration between Jingqing Zhang and @sshleifer in #6340

PegasusForConditionalGeneration (torch version) #6340
add pegasus finetuning script #6811 (warning: the script is very slow)
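A minimal sketch of summarizing with the new Pegasus model; the google/pegasus-xsum checkpoint and the input text are illustrative assumptions, not part of the release notes:

```python
# Hypothetical sketch: abstractive summarization with Pegasus (PyTorch).
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"  # assumed fine-tuned checkpoint from the model hub
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "The tower is 324 metres tall, about the same height as an 81-storey building."
inputs = tokenizer([text], truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**inputs)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```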

DPR

The DPR model from Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih was added to the library in PyTorch.

Add DPR model #5279 (@lhoestq)
Fix tests imports dpr #5576 (@lhoestq)
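A minimal sketch of encoding a question and a passage with the new DPR classes and scoring them by dot product; the checkpoint names, example texts, and use of return_dict are assumptions for illustration:

```python
# Hypothetical sketch: dense retrieval scoring with DPR (PyTorch).
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = q_tok("What is open-domain question answering?", return_tensors="pt")
passage = ctx_tok("Open-domain QA answers questions from a large text corpus.", return_tensors="pt")

q_emb = q_enc(**question, return_dict=True).pooler_output   # question embedding
p_emb = ctx_enc(**passage, return_dict=True).pooler_output  # passage embedding
score = torch.matmul(q_emb, p_emb.T)  # higher dot product = more relevant passage
print(score)
```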

DeeBERT

The DeeBERT model from DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference by Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin has been added to the examples/ folder alongside its training script, in PyTorch.

Add DeeBERT (entropy-based early exiting for *BERT) #5477 (@Ji-Xin)

Self-documented outputs

In addition to returning tuples, PyTorch and TensorFlow models can now return an appropriate subclass of ModelOutput. A ModelOutput is a dataclass containing everything a model returns, which allows for easier inspection and for self-documenting model outputs.

Change model outputs types to self-document outputs #5438 (@sgugger)
Tf model outputs #6247 (@sgugger)

Models return tuples by default, and return self-documented outputs if the return_dict configuration flag is set to True or if the return_dict=True keyword argument is passed to the forward/call method.
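For example (a minimal sketch using BERT; any model in the library behaves the same way):

```python
# Minimal sketch: tuple outputs vs. self-documented ModelOutput.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")

outputs = model(**inputs)                      # default: a plain tuple
last_hidden_state = outputs[0]                 # access by position

outputs = model(**inputs, return_dict=True)    # opt in to self-documented outputs
last_hidden_state = outputs.last_hidden_state  # access by name, easier to inspect
```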

Encoder-Decoder framework

The encoder-decoder framework has been enhanced to allow more encoder-decoder model combinations, e.g. Bert2Bert, Bert2GPT2, Roberta2Roberta, Longformer2Roberta, and more.

[EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer #6411 (@patrickvonplaten)
[EncoderDecoder] Add Cross Attention for GPT2 #6415 (@patrickvonplaten)
[EncoderDecoder] Add functionality to tie encoder decoder weights #6538 (@patrickvonplaten)

Multiple combinations of EncoderDecoder models have been fine-tuned and evaluated on CNN/Daily-Mail summarization: https://huggingface.co/models?search=cnn_dailymail-fp16 (@patrickvonplaten)
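A minimal sketch of building one such combination (a Bert2Bert model) from two pretrained checkpoints; the checkpoint names and the toy training step are assumptions for illustration, not the release's own example:

```python
# Hypothetical sketch: build a Bert2Bert encoder-decoder from pretrained checkpoints.
from transformers import EncoderDecoderModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",  # encoder checkpoint
    "bert-base-uncased",  # decoder checkpoint (gains cross-attention and an LM head)
)

# The combined model still needs seq2seq fine-tuning before it generates useful text.
inputs = tokenizer("A long article to summarize ...", return_tensors="pt")
labels = tokenizer("A short summary.", return_tensors="pt")["input_ids"]
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels, labels=labels)
print(outputs[0])  # first element is the LM loss when labels are provided
```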

TensorFlow as a first-class citizen

As we continue working towards making TensorFlow a first-class citizen, we are continually improving the TensorFlow API and models.

[Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile #5395 (@patrickvonplaten)
[Benchmark] Add benchmarks for TF Training #5594 (@patrickvonplaten)

Machine Translation

MarianMTModel

en-zh and 357 other checkpoints for machine translation were added from the Helsinki-NLP group's Tatoeba Project (@sshleifer + @jorgtied). There are now > 1300 supported pairs for machine translation.

Marian converter updates #6342 (@sshleifer)
Marian distill scripts + integration test #6799 (@sshleifer)
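A minimal sketch of translating with one of these checkpoints; the Helsinki-NLP/opus-mt-en-zh model name matches the en-zh pair mentioned above, and the example sentence is an assumption:

```python
# Hypothetical sketch: English-to-Chinese translation with a Marian checkpoint.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-zh"  # one of the newly added pairs
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Machine translation is fun."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```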

mBART

The mBART model from Multilingual Denoising Pre-training for Neural Machine Translation can now be accessed through MBartForConditionalGeneration.

Add mbart-large-cc25, support translation finetuning #5129 (@sshleifer)
[mbart] prepare_translation_batch passes **kwargs to allow DeprecationWarning #5581 (@sshleifer)
MBartForConditionalGeneration #6441 (@patil-suraj)
[fix] mbart_en_ro_generate test now identical to fairseq #5731 (@sshleifer)
[Doc] explaining romanian postprocessing for MBART BLEU hacking #5943 (@sshleifer)
[test] partial coverage for train_mbart_enro_cc25.sh #5976 (@sshleifer)
MbartTokenizer: do not hardcode vocab size #5998 (@sshleifer)
MBART: support summarization tasks where max_src_len > max_tgt_len #6003 (@sshleifer)
Fix #6096: MBartTokenizer's mask token #6098 (@sshleifer)
[s2s] Document better mbart finetuning command #6229 (@sshleifer)
mBART Conversion script #6230 (@sshleifer)
[s2s] add BartTranslationDistiller for distilling mBART #6363 (@sshleifer)
[Doc] add more MBart and other doc #6490 (@patil-suraj)

examples/seq2seq

examples/seq2seq/finetune.py now supports --task translation. All sequence-to-sequence tokenizers (T5, Bart, Marian, Pegasus) expose a prepare_seq2seq_batch method that makes batches for sequence-to-sequence training.
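A minimal sketch of prepare_seq2seq_batch with the mBART tokenizer and model; the facebook/mbart-large-en-ro checkpoint, the sentence pair, and the exact returned keys are assumptions (key names have shifted slightly between releases):

```python
# Hypothetical sketch: building a seq2seq batch with prepare_seq2seq_batch and translating with mBART.
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["UN Chief Says There Is No Military Solution in Syria"],
    tgt_texts=["Şeful ONU declară că nu există o soluţie militară în Siria"],
    src_lang="en_XX",
    tgt_lang="ro_RO",
    return_tensors="pt",
)
print(batch.keys())  # source ids/mask plus target ids; exact key names depend on the version

# For mBART generation, the decoder is usually started from the target language code.
generated = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```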

PRs:

Seq2SeqDataset uses linecache to save memory #5792 (@Pradhy729)
[examples/seq2seq]: add --label_smoothing option #5919 (@sshleifer)
seq2seq/run_eval.py can take decoder_start_token_id #5949 (@sshleifer)
[examples (seq2seq)] fix preparing decoder_input_ids for T5 #5994 (@patil-suraj)
[s2s] add support for overriding config params #6149 (@stas00)
s2s: fix LR logging, remove some dead code. #6205 (@sshleifer)
[s2s] tiny QOL improvement: run_eval prints scores #6341 (@sshleifer)
[s2s] fix label_smoothed_nll_loss #6344 (@patil-suraj)
[s2s] fix --gpus clarg collision #6358 (@sshleifer)
[s2s] Script to save wmt data to disk #6403 (@sshleifer)
rename prepare_translation_batch -> prepare_seq2seq_batch #6103 (@sshleifer)
Mult rouge by 100: standard units #6359 (@sshleifer)
allow spaces in bash args with "$@" #6521 (@sshleifer)
[seq2seq] MAX_LEN env var for MT commands #5837 (@sshleifer)
[seq2seq] distillation.py accepts trainer arguments #5865 (@sshleifer)
[s2s] Use prepare_translation_batch for Marian finetuning #6293 (@sshleifer)
[BartTokenizer] add prepare s2s batch #6212 (@patil-suraj)
[T5Tokenizer] add prepare_seq2seq_batch method #6122 (@patil-suraj)
[s2s] round runtime in run_eval #6798 (@sshleifer)
[s2s README] Add more dataset download instructions #6737 (@sshleifer)
[s2s] round bleu, rouge to 4 digits #6704 (@sshleifer)
[s2s] command line args for faster val steps #6833

New documentation

Several new documentation pages have been added, and older documentation has been tweaked to be more accurate and understandable. An "Open in Colab" button has been added to the tutorial pages.

Guide to fixed-length model perplexity evaluation #5449 (@joeddav)
Improvements to PretrainedConfig documentation #5642 (@sgugger)
Document model outputs #5673 (@sgugger)
docs(wandb): explain how to use W&B integration #5607 (@borisdayma)
Model utils doc #6005 (@sgugger)
ONNX documentation #5992 (@mfuntowicz)
Tokenizer documentation #6110 (@sgugger)
Pipeline documentation #6175 (@sgugger)
Encoder decoder config docs #6195 (@afcruzs)
Colab button #6389 (@sgugger)
Generation documentation #6470 (@sgugger)
Add custom datasets tutorial #6466 (@joeddav)
Logging documentation #6852 (@sgugger)

Trainer updates

New additions to the Trainer:

Added data collator for permutation (XLNet) language modeling and related calls #5522 (@shngt)
Trainer support for iterabledataset #5834 (@Pradhy729)
Adding PaddingDataCollator #6442 (@sgugger)
Add hyperparameter search to Trainer #6576 (@sgugger)
[examples] Add trainer support for question-answering #4829 (@patil-suraj)
Adds comet_ml to the list of auto-experiment loggers #6176 (@dsblank)
Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644 (@HuangLianzhe)

New models & model architectures

The following model architectures have been added to the library:

FlaubertForTokenClassification #5644 (@stas00)
TFXLMForTokenClassification #5614 (@LysandreJik)
TFXLMForMultipleChoice #5614 (@LysandreJik)
TFFlaubertForTokenClassification #5614 (@LysandreJik)
TFFlaubertForMultipleChoice #5614 (@LysandreJik)
TFElectraForSequenceClassification #6227 (@jplu)
TFElectraForMultipleChoice #6227 (@jplu)
TF Longformer #5764 (@patrickvonplaten)
CamembertForCausalLM #6577 (@patil-suraj)

Regression testing on TPU & TPU CI

Thanks to @zcain117 we now have access to TPU CI for the PyTorch/xla framework. This enables regression testing on the TPU aspects of the Trainer, and offers very simple regression testing on model training performance.

Test XLA examples #5583
Add setup for TPU CI to run every hour. #6219 (@zcain117)
Add missing docker arg for TPU CI. #6393 (@zcain117)
Get GKE logs via kubectl logs instead of gcloud logging read. #6446 (@zcain117)

New pipelines

New pipelines have been added:

Zero shot classification pipeline #5760 (@joeddav)
Addition of a DialoguePipeline #5516 (@guillaume-be)
Add targets arg to fill-mask pipeline #6239 (@joeddav)
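A minimal sketch of two of the new pipeline features; the example texts, candidate labels, the bert-base-uncased fill-mask model, and the chosen target words are assumptions (the zero-shot pipeline downloads its default English NLI model on first use):

```python
# Hypothetical sketch: zero-shot classification and the new `targets` argument for fill-mask.
from transformers import pipeline

# Zero-shot classification: score a sequence against arbitrary candidate labels.
classifier = pipeline("zero-shot-classification")
print(classifier(
    "The stock market rallied after the earnings report.",
    candidate_labels=["economics", "politics", "sports"],
))

# Fill-mask restricted to chosen candidate tokens via the new targets argument.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Paris is the [MASK] of France.", targets=["capital", "center"]))
```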

Community notebooks

Fine-tune Electra and interpret with Integrated Gradients #6321 (@elsanns)
Update ONNX notebook to include section on quantization. #6831 (@mfuntowicz)

Centralized logging

Logging is now centralized. The library offers methods to handle the verbosity level of all loggers contained in the library.
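A minimal sketch of the verbosity helpers; the import path from transformers.utils is an assumption based on where the centralized module lives (later releases also expose it as transformers.logging):

```python
# Hypothetical sketch: controlling library-wide verbosity after the logging centralization.
from transformers.utils import logging

logging.set_verbosity_info()    # show info-level messages from every transformers module
print(logging.get_verbosity())  # the current verbosity level as an int
logging.set_verbosity_error()   # silence everything below error level
```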

Centralize logging #6434 (@LysandreJik)

Bug fixes and improvements

[Reformer] Adapt Reformer MaskedLM Attn mask #5560 (@patrickvonplaten)
Make T5 compatible with ONNX #5518 (@abelriboulot)
[Bart] enable test_torchscript, update test_tie_weights #5457 (@sshleifer)
[docs] fix model_doc links in model summary #5566 (@patil-suraj)
[Benchmark] Readme for benchmark #5363 (@patrickvonplaten)
Fix Inconsistent NER Grouping (Pipeline) #4987 (@enzoampil)
QA pipeline BART compatible #5496 (@mfuntowicz)
More explicit error when failing to tensorize overflowing tokens #5633 (@LysandreJik)
Should check that torch TPU is available #5636 (@LysandreJik)
Add forum link in the docs #5637 (@sgugger)
Fixed TextGenerationPipeline on torch + GPU #5629 (@TevenLeScao)
Fixed use of memories in XLNet (caching for language generation + warning when loading improper memoryless model) #5632 (@TevenLeScao)
[squad] add version tag to squad cache #5669 (@lazovich)
Deprecate old past arguments #5671 (@sgugger)
Pipeline model type check #5679 (@JetRunner)
rename the functions to match the rest of the test convention #5692 (@stas00)
doc improvements #5688 (@stas00)
Fix Trainer in DataParallel setting #5685 (@sgugger)
[Longformer] fix longformer global attention output #5659 (@patrickvonplaten)
[Fix] github actions CI by reverting #5138 #5686 (@sshleifer)
[Reformer classification head] Implement the reformer model classification head for text classification #5198 (@as-stevens)
Cleanup bart caching logic #5640 (@sshleifer)
[AutoModels] Fix config params handling of all PT and TF AutoModels #5665 (@patrickvonplaten)
[cleanup] T5 test, warnings #5761 (@sshleifer)
[fix] T5 ONNX test: model.to(torch_device) #5769 (@mfuntowicz)
[Benchmark] fix benchmark non standard model #5801 (@patrickvonplaten)
[Benchmark] Fix models without architectures param in config #5808 (@patrickvonplaten)
[Longformer] fix longformer slow-down #5811 (@patrickvonplaten)
[seq2seq] pack_dataset.py rewrites dataset in max_tokens format #5819 (@sshleifer)
[seq2seq] Don't copy self.source in sortishsampler #5818 (@sshleifer)
[cleanups] make Marian save as Marian #5830 (@sshleifer)
[Reformer] - Cache hidden states and buckets to speed up inference #5578 (@patrickvonplaten)
Lightning Updates for v0.8.5 #5798 (@nateraw)
Update tokenizers to 0.8.1.rc to fix Mac OS X issues #5867 (@sepal)

Xlnet outputs #5883 (@TevenLeScao)

DataParallel fixes #5733 (@stas00)

[cleanup] squad processor #5868 (@sshleifer)
Improve doc of use_cache #5912 (@sgugger)
[Fix] seq2seq pack_dataset.py actually packs #5913 (@sshleifer)
Add AlbertForPretraining to doc #5914 (@sgugger)
DataParallel fix: multi gpu evaluation #5926 (@csarron)

Clarify arg class #5916 (@sgugger)

[CI] self-scheduled runner tests examples/ #5927 (@sshleifer)

Update doc to new model outputs #5946 (@sgugger)
[CI] Install examples/requirements.txt #5956 (@sshleifer)
Expose padding_strategy on squad processor to fix QA pipeline performance regression #5932 (@mfuntowicz)
[docs] Add integration test example to copy pasta template #5961 (@sshleifer)
Cleanup Trainer and expose customization points #5982 (@sgugger)
Avoid unnecessary warnings when loading pretrained model #5922 (@sgugger)
Ensure OpenAI GPT position_ids is correctly initialized and registered at init. #5773 (@mfuntowicz)
[CI] Don't test apex #6021 (@sshleifer)
add a summary report flag for run_examples on CI #6035 (@stas00)
don't complain about missing W&B when WANDB_DISABLED=true #6036 (@stas00)
Allow to set Adam beta1, beta2 in TrainingArgs #5592 (@gonglinyuan)

Fix the return documentation rendering for all model outputs #6022 (@sgugger)

Fix typo (model saving TF) #5734 (@Colanim)

Add new AutoModel classes in pipeline #6062 (@patil-suraj)
[pack_dataset] don't sort before packing, only pack train #5954 (@sshleifer)
CL util to convert models to fp16 before upload #5953 (@sshleifer)
Add fire to setup.cfg to make isort happy #6066 (@sgugger)
[fix] no warning for position_ids buffer #6063 (@sshleifer)
Pipelines should use tuples instead of namedtuples #6061 (@LysandreJik)
Moving transformers package import statements to relative imports in some files #5796 (@afcruzs)
github issue template suggests who to tag #5790 (@sshleifer)
Make all data collators accept dict #6065 (@sgugger)
Add inference widget examples #5825 (@clmnt)
[s2s] Delete useless method, log tokens_per_batch #6081 (@sshleifer)
Logs should not be hidden behind a logger.info #6097 (@LysandreJik)
Fix zero-shot pipeline single seq output shape #6104 (@joeddav)
[fix] add bart to LM_MAPPING #6099 (@sshleifer)
[Fix] position_ids tests again #6100 (@sshleifer)
Fix deebert tests #6102 (@sshleifer)
Use FutureWarning to deprecate #6111 (@sgugger)
Added capability to quantize a model while exporting through ONNX. #6089 (@mfuntowicz)
XLNet PLM Readme #6121 (@LysandreJik)
Fix TF CTRL model naming #6134 (@jplu)
Use google style to document properties #6130 (@sgugger)
Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleChoice} #5614
Rework TF trainer #6038 (@jplu)
Actually the extra_id are from 0-99 and not from 1-100 #5967 (@orena1)
add another e.g. to avoid confusion #6055 (@orena1)
Tf trainer cleanup #6143 (@sgugger)
Switch from return_tuple to return_dict #6138 (@sgugger)
Fix FlauBERT GPU test #6142 (@LysandreJik)
Enable ONNX/ONNXRuntime optimizations through converter script #6131 (@mfuntowicz)
Add Pytorch Native AMP support in Trainer #6151 (@prajjwal1)
enable easy checkout switch #5645 (@stas00)
Replace mecab-python3 with fugashi for Japanese tokenization #6086 (@polm)
parse arguments from dict #4869 (@patil-suraj)
Harmonize both Trainers API #6157 (@sgugger)
Model output test #6155 (@sgugger)
[s2s] clean up + doc #6184 (@stas00)
Add script to convert BERT tf2.x checkpoint to PyTorch #5791 (@mar-muel)
Empty assert hunt #6056 (@TevenLeScao)
Fix saved model creation #5468 (@jplu)
Adds train_batch_size, eval_batch_size, and n_gpu to to_sanitized_dict output for logging. #5331 (@jaymody)
[DataCollatorForLanguageModeling] fix labels #6213 (@patil-suraj)
Fix _shift_right function in TFT5PreTrainedModel #6214 (@maurice-g)
Remove outdated BERT tips #6217 (@JetRunner)
run_hans label fix #6221 (@VictorSanh)
Make the order of additional special tokens deterministic #5704 (@gonglinyuan)
cleanup torch unittests #6196 (@stas00)
test_tokenization_common.py: Remove redundant coverage #6224 (@sshleifer)
[Reformer] fix reformer fp16 test #6237 (@patrickvonplaten)
[Reformer] Make random seed generator available on random seed and not on model device #6244 (@patrickvonplaten)
Update to match renamed attributes in fairseq master #5972 (@LilianBordeau)
[WIP] lightning_base: support --lr_scheduler with multiple possibilities #6232 (@stas00)
Trainer + wandb quality of life logging tweaks #6241 (@TevenLeScao)
Add strip_accents to basic BertTokenizer. #6280 (@PhilipMay)
Argument to set GPT2 inner dimension #6296 (@TevenLeScao)
[Reformer] fix default generators for pytorch < 1.6 #6300 (@patrickvonplaten)
Remove redundant line in run_pl_glue.py #6305 (@xujiaze13)
[Fix] text-classification PL example #6027 (@bhashithe)
fix the shuffle agrument usage and the default #6307 (@stas00)
CI dependency wheel caching #6287 (@LysandreJik)
Patch GPU failures #6281 (@LysandreJik)
fix consistency CrossEntropyLoss in modeling_bart #6265 (@idoh)
Add a script to check all models are tested and documented #6298 (@sgugger)
Fix the tests for Electra #6284 (@jplu)

[examples] consistently use --gpus, instead of --n_gpu #6315 (@stas00)

refactor almost identical tests #6339 (@stas00)

Small docfile fixes #6328 (@sgugger)
Patch models #6326 (@LysandreJik)
Ci GitHub caching #6382 (@LysandreJik)
Fix links for open in colab #6391 (@sgugger)

[EncoderDecoderModel] add a add_cross_attention boolean to config #6377 (@patrickvonplaten)

Feed forward chunking #6024 (@Pradhy729)

add pl_glue example test #6034 (@stas00)
testing utils: capturing std streams context manager #6231 (@stas00)
Fix tokenizer saving and loading error #6026 (@yobekiko)
Warn if debug requested without TPU #6390 (@dmlap)
[Performance improvement] "Bad tokens ids" optimization #6064 (@guillaume-be)
pl version: examples/requirements.txt is single

More information

  • DOI: 10.5281/zenodo.4010585

Dates

  • Publication date: 2020
  • Issued: September 01, 2020

Rights

  • info:eu-repo/semantics/openAccess Open Access


Format

electronic resource

Related items

  • IsSupplementTo: https://github.com/huggingface/transformers/tree/v3.1.0
  • IsVersionOf: https://doi.org/10.5281/zenodo.3385997
  • IsPartOf: https://zenodo.org/communities/zenodo