Title: huggingface/transformers: Rust Tokenizers, DistilBERT base cased, Model cards

Type: Software

Citation: Thomas Wolf, Lysandre Debut, Julien Chaumond, Victor SANH, Aymeric Augustin, Rémi Louf, Funtowicz Morgan, Stefan Schweter, Denis, erenup, Matt, Piero Molino, Patrick von Platen, Grégory Châtel, Tim Rault, MOI Anthony, Bram Vanroy (2020): huggingface/transformers: Rust Tokenizers, DistilBERT base cased, Model cards. Zenodo. Software. https://zenodo.org/record/3675440

Authors: Thomas Wolf (@huggingface); Lysandre Debut (Hugging Face); Julien Chaumond (Hugging Face); Victor SANH (@huggingface); Aymeric Augustin (@canalplus); Rémi Louf; Funtowicz Morgan (HuggingFace); Stefan Schweter; Denis; erenup; Matt; Piero Molino; Patrick von Platen; Grégory Châtel (DisAItek & Intel AI Innovators); Tim Rault (@huggingface); MOI Anthony (Hugging Face); Bram Vanroy (@UGent)

Summary

Rust tokenizers (@mfuntowicz, @n1t0)

  • Tokenizers for Bert, Roberta, OpenAI GPT, OpenAI GPT2 and TransformerXL now leverage the tokenizers library for fast tokenization.
  • AutoTokenizer now defaults to the fast tokenizer implementation when one is available.
  • Calling batch_encode_plus on the fast version of a tokenizer makes better use of CPU cores: tokenizers backed by the native implementation use all CPU cores by default. You can change this behavior by setting the environment variable RAYON_NUM_THREADS=N (see the sketch below).
  • An exception is raised when tokenizing an input with pad_to_max_length=True but no padding token is defined.
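A minimal sketch of the fast-tokenizer path described above; the model name, example sentences and the RAYON_NUM_THREADS value are illustrative assumptions, not part of the release notes.

```python
import os

# Optionally cap the number of CPU cores the Rust-backed tokenizer uses
# (illustrative value; set before any tokenization happens).
os.environ["RAYON_NUM_THREADS"] = "4"

from transformers import AutoTokenizer

# AutoTokenizer defaults to the fast (Rust) implementation when one is available.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# batch_encode_plus on a fast tokenizer spreads the work across CPU cores.
batch = tokenizer.batch_encode_plus(
    ["Hello world!", "Fast tokenizers are written in Rust."],
    pad_to_max_length=True,  # raises an exception if the model defines no padding token
    max_length=16,
)
print(batch["input_ids"])
```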

Known Issues:

  • The RoBERTa fast tokenizer implementation produces slightly different output than the original Python tokenizer (< 1% of cases).
  • The SQuAD examples are not yet compatible with the new fast tokenizers, so they default to the plain Python implementation.

DistilBERT base cased (@VictorSanh)

The distilled version of the bert-base-cased BERT checkpoint has been released.
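A minimal sketch of loading the new checkpoint with the standard DistilBERT classes; the input sentence is illustrative.

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

# Load the newly released distilled, cased checkpoint.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")
model = DistilBertModel.from_pretrained("distilbert-base-cased")

inputs = tokenizer.encode("DistilBERT base cased is here.", return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs)
print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)
```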

Model cards (@julien-c)

Model cards are now stored directly in the repository.

CLI script for environment information (@BramVanroy)

We now host a CLI script that gathers all the relevant environment information to include when reporting an issue. The issue templates have been updated accordingly.

Contributors visible on the repository (@clmnt)

The main contributors, as identified by Sourcerer, are now visible directly on the repository.

From fine-tuning to pre-training (@julien-c)

The language fine-tuning script has been renamed from run_lm_finetuning to run_language_modeling, as it is now able to train language models from scratch.

Extracting archives is now available from cached_path (@thomwolf)

Slight modification to cached_path so that zip and tar archives can be automatically extracted.

  • Archives are extracted in the same directory as the (possibly downloaded) archive, inside a newly created extraction directory named after the archive.
  • Automatic extraction is activated by setting extract_compressed_file=True when calling cached_path (see the sketch below).
  • The extraction directory is re-used to avoid extracting the archive again, unless force_extract=True is set, in which case the cached extraction directory is removed and the archive is extracted again.
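A minimal sketch of the extraction behaviour described above, assuming cached_path is imported from transformers.file_utils as in the 2.x releases; the archive URL is a hypothetical placeholder.

```python
from transformers.file_utils import cached_path

# Hypothetical archive URL; any downloadable zip or tar archive works the same way.
archive_url = "https://example.com/some-model-files.tar.gz"

# Download (and cache) the archive, then extract it next to the cached file
# into an extraction directory named after the archive.
extracted_dir = cached_path(archive_url, extract_compressed_file=True)

# A second call re-uses the cached extraction directory; force_extract=True
# removes it and extracts the archive again.
extracted_dir = cached_path(archive_url, extract_compressed_file=True, force_extract=True)

print(extracted_dir)
```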

New activations file (@sshleifer)

Several activation functions (relu, swish, gelu, tanh and gelu_new) can now be accessed from the activations.py file and used in the different PyTorch models.
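A minimal sketch of picking an activation by name, assuming the module exposes a get_activation helper (the name used in later transformers releases).

```python
import torch
from transformers.activations import get_activation  # assumption: helper exposed by activations.py

# Look up one of the listed activations by name and apply it to a tensor.
gelu_new = get_activation("gelu_new")
x = torch.randn(2, 4)
print(gelu_new(x).shape)  # torch.Size([2, 4])
```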

Community additions / bug fixes / improvements

  • Remove redundant hidden states that broke encoder-decoder architectures (@LysandreJik)
  • Cleaner and more readable code in test_attention_weights (@sshleifer)
  • XLM can be trained on SQuAD in different languages (@yuvalpinter)
  • Improve test coverage on several models that were ill-tested (@LysandreJik)
  • Fix issue where TFGPT2 could not be saved (@neonbjb)
  • Multi-GPU evaluation on run_glue now behaves correctly (@peteriz)
  • Fix issue with the TransfoXL tokenizer that couldn't be saved (@dchurchwell)
  • More robust conversion from ALBERT/BERT original checkpoints to huggingface/transformers models (@monologg)
  • FlauBERT bug fix: only add language embeddings when the model handles more than one language (@LysandreJik)
  • Fix CircleCI error with TensorFlow 2.1.0 (@mfuntowicz)
  • More specific testing advice in the contributing guidelines (@sshleifer)
  • BERT decoder: fix failure with the default attention mask (@asivokon)
  • Fix a few issues regarding the data preprocessing in run_language_modeling (@LysandreJik)
  • Fix an issue with leading spaces and the RobertaTokenizer (@joeddav)
  • Added pipeline: TokenClassificationPipeline, an alias over NerPipeline (@julien-c); see the sketch after this list
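A minimal sketch of the last item, using the generic pipeline factory; the example sentence is illustrative and the default model is downloaded automatically.

```python
from transformers import pipeline

# "ner" resolves to the token-classification pipeline (NerPipeline / TokenClassificationPipeline).
ner = pipeline("ner")
print(ner("Hugging Face is based in New York City."))
```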

More information

  • DOI: 10.5281/zenodo.3675440

Dates

  • Publication date: 2020
  • Issued: February 19, 2020

Rights

  • Open Access (info:eu-repo/semantics/openAccess)

Format

electronic resource

Related items

  • IsSupplementTo: https://github.com/huggingface/transformers/tree/v2.5.0
  • IsVersionOf: https://doi.org/10.5281/zenodo.3385997
  • IsPartOf: https://zenodo.org/communities/zenodo