Title: huggingface/transformers: Downstream NLP task API (feature extraction, text classification, NER, QA), Command-Line Interface and Serving – models: T5 – community-added models: Japanese & Finnish BERT, PPLM, XLM-R
Type: Software
Thomas Wolf, Lysandre Debut, Victor SANH, Julien Chaumond, Rémi Louf, Funtowicz Morgan, Stefan Schweter, Denis, erenup, Matt, Piero Molino, Grégory Châtel, Tim Rault, Catalin Voss, Fei Wang, Louis Martin, Malte Pietsch, Davide Fiocco, Bilal Khan, dhanajitb, Jinoo, Ananya Harsh Jha, Simon Layton, yzy5630, Yongbo Wang, Shijie Wu, Nikolay Korolev, Masatoshi Suzuki, Juha Kiili, Guillem García Subies (2019): huggingface/transformers: Downstream NLP task API (feature extraction, text classification, NER, QA), Command-Line Interface and Serving – models: T5 – community-added models: Japanese & Finnish BERT, PPLM, XLM-R. Zenodo. Software. https://zenodo.org/record/3588551
Links
- Item record in Zenodo
- Digital object URL
Summary
New class Pipeline (beta): easily run and use models on down-stream NLP tasks
We have added a new class called Pipeline to simply run and use models for several down-stream NLP tasks.
A Pipeline is just a tokenizer + model wrapped so they can take human-readable inputs and output human-readable results.
The Pipeline takes care of: tokenizing input strings => converting them to tensors => running them through the model => post-processing the output.
Currently, we have added the following pipelines with a default model for each:
- feature extraction (can be used with any pretrained and finetuned models) – inputs: strings/list of strings – output: list of floats (last hidden-states of the model for each token)
- sentiment classification (DistilBert model fine-tuned on SST-2) – inputs: strings/list of strings – output: list of dict with label/score of the top class
- Named Entity Recognition (XLM-R finetuned on CoNLL2003 by the awesome @stefan-it) – inputs: strings/list of strings – output: list of dict with label/entities/position of the named-entities
- Question Answering (Bert Large whole-word version fine-tuned on SQuAD 1.0) – inputs: dict of strings/list of dict of strings – output: list of dict with text/position of the answers

There are three ways to use pipelines:
- in Python:

```python
from transformers import pipeline

# Test the default model for QA (Bert large finetuned on SQuAD 1.0)
nlp = pipeline('question-answering')
nlp(question="Where does Amy live ?", context="Amy lives in Amsterdam.")
# => {'answer': 'Amsterdam', 'score': 0.9657156007786263, 'start': 13, 'end': 21}

# Test a specific model for NER (XLM-R finetuned by @stefan-it on CoNLL03 English)
nlp = pipeline('ner', model='xlm-roberta-large-finetuned-conll03-english')
nlp("My name is Amy. I live in Paris.")
# => [{'word': 'Amy', 'score': 0.9999586939811707, 'entity': 'I-PER'},
#     {'word': 'Paris', 'score': 0.9999983310699463, 'entity': 'I-LOC'}]
```

- in bash (using the command-line interface):

```bash
$ echo -e "Where does Amy live?\tAmy lives in Amsterdam" | transformers-cli run --task question-answering
{'score': 0.9657156007786263, 'start': 13, 'end': 22, 'answer': 'Amsterdam'}
```

- as a REST API:

```bash
transformers-cli serve --task question-answering
```

This new feature is currently in beta and will evolve in the coming weeks.
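The feature-extraction pipeline listed above follows the same pattern. A minimal sketch, assuming the task name is 'feature-extraction' and that the output is a nested list of floats (verify both against the pipeline documentation):

```python
from transformers import pipeline

# Feature extraction with the default model: last hidden states for each token.
nlp = pipeline('feature-extraction')
features = nlp("Amy lives in Amsterdam.")

# The result is a (nested) list of floats; the exact nesting (with or without a
# leading batch dimension) is an assumption here, so inspect it before using it.
print(type(features), len(features))
```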
CLI tool to upload community models @julien-c

Users can now create accounts on the huggingface.co website and then login using the transformers CLI. Doing so allows users to upload their models to our S3 in their respective directories, so that other users may download said models and use them in their tasks.
Users may upload files or directories.
It's been tested by @stefan-it for a German BERT and by @singletongue for a Japanese BERT.
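A rough sketch of that workflow, scripted from Python. The subcommand names ('login', 'upload') are assumptions here, not taken from this summary; check `transformers-cli --help` for the exact interface:

```python
import subprocess

# Log in with your huggingface.co account (interactive prompt for credentials).
# Subcommand names are assumptions; verify with `transformers-cli --help`.
subprocess.run(["transformers-cli", "login"], check=True)

# Upload a single file or a whole directory (config, vocab, weights) to S3.
subprocess.run(["transformers-cli", "upload", "./my_finetuned_model"], check=True)
```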
New model architectures: T5, Japanese BERT, PPLM, XLM-RoBERTa, Finnish BERT

- T5 (PyTorch & TF) (from Google) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
- Japanese BERT (PyTorch & TF) from CL-tohoku, implemented by @singletongue.
- PPLM (PyTorch & TF) (from Uber AI) released with the paper Plug and Play Language Models: a Simple Approach to Controlled Text Generation, by Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu.
- XLM-RoBERTa (PyTorch & TF) (from FAIR, implemented by @stefan-it) released with the paper Unsupervised Cross-lingual Representation Learning at Scale, by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov.
- Finnish BERT (PyTorch & TF) (from TurkuNLP) released with the paper Multilingual is not enough: BERT for Finnish, by Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, Sampo Pyysalo.
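As an illustration, the new architectures are loaded through the usual from_pretrained interface. A minimal sketch, where the checkpoint identifier "xlm-roberta-base" and the availability of the new models through the Auto* classes are assumptions to verify against the pretrained model list:

```python
from transformers import AutoTokenizer, AutoModel

# Checkpoint identifier is an assumption; the model-specific classes
# (e.g. XLMRobertaModel, T5Model) can be used instead of the Auto* classes.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

input_ids = tokenizer.encode("Amy lives in Amsterdam.", return_tensors="pt")
outputs = model(input_ids)
print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)
```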
Refactoring the SQuAD example

The run_squad script has been massively refactored. The reasons are the following:

- it was made to work with only a few models (BERT, XLNet, XLM and DistilBERT), which had three different ways of encoding sequences. The script had to be individually modified in order to train different models, which would not scale as other models are added to the library.
- the utilities did not rely on the QOL adjustments that were made to the encoding methods these past months.

It now leverages the full capacity of encode_plus, easing the addition of new models to the script. A new method, squad_convert_examples_to_features, encapsulates all of the tokenization. This method can handle tensorflow_datasets as well as squad v1 json files and squad v2 json files.
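A minimal sketch of that new tokenization path. The processor class and keyword arguments below are assumptions about the exact signature; check transformers.data.processors.squad before relying on them:

```python
from transformers import BertTokenizer
from transformers.data.processors.squad import (
    SquadV1Processor,
    squad_convert_examples_to_features,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Load SQuAD v1 examples from a local JSON file (path and filename are placeholders).
processor = SquadV1Processor()
examples = processor.get_train_examples(data_dir=".", filename="train-v1.1.json")

# Convert the examples into model-ready features in a model-agnostic way.
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
```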
ALBERT was added to the SQuAD script.

BertAbs summarization

A contribution by @rlouf building on the encoder-decoder mechanism to do abstractive summarization:
- Utilities to load the CNN/DailyMail dataset
- BertAbs now usable as a traditional library model (using from_pretrained())
- ROUGE evaluation

New Models

Additional architectures

@alexzubiaga added XLNetForTokenClassification and TFXLNetForTokenClassification.
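A minimal sketch of the new token-classification head. The checkpoint name "xlnet-base-cased" and the output layout (logits first) are assumptions to check against the model documentation:

```python
import torch
from transformers import XLNetTokenizer, XLNetForTokenClassification

# Checkpoint name is an assumption; any XLNet checkpoint should work the same way.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForTokenClassification.from_pretrained("xlnet-base-cased")

input_ids = torch.tensor([tokenizer.encode("My name is Amy. I live in Paris.")])
with torch.no_grad():
    outputs = model(input_ids)
logits = outputs[0]  # assumed shape: (batch_size, sequence_length, num_labels)
print(logits.shape)
```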
New model cards

Community additions/bug-fixes/improvements

- Added mish activation function @digantamisra98
- run_bertology.py was updated with correct imports and the ability to overwrite the cache
- Training can be exited and relaunched safely, while keeping the epochs, global steps, scheduler steps and other variables in run_lm_finetuning.py @bkkaggle
- Tests now run on cuda @aaugustin @julien-c
- Cleaned up the pytorch to tf conversion script @thomwolf
- Progress indicator improvements when downloading pre-trained models @leopd
- from_pretrained() can now load from urls directly. New tests to check that all files are accessible on HuggingFace's S3 @rlouf
- Updated tf.shape and tensor.shape to all use shape_list @thomwolf
- Valohai integration @thomwolf
- Always use SequentialSampler in run_squad.py @ethanjperez
- Stop using GPU when importing transformers @ondewo
- Fixed the XLNet attention output @roskoN
- Several QOL adjustments: removed dead code, deep cleaned tests and removed pytest dependency @aaugustin
- Fixed an issue with the Camembert tokenization @thomwolf
- Correctly create an encoder attention mask from the shape of the hidden states @rlouf
- Fixed a non-deterministic behavior when encoding and decoding empty strings @pglock
- Fixed tensor creation in encode_plus @LysandreJik
- Removed usage of tf.mean, which does not exist in TF2 @LysandreJik
- A segmentation fault error was fixed (due to scipy 1.4.0) @LysandreJik
- Started sunsetting support of Python 2
- An example usage of Model2Model was added to the quickstart

More information
- DOI: 10.5281/zenodo.3588551
Dates
- Publication date: 2019
- Issued: December 20, 2019
Rights
- info:eu-repo/semantics/openAccess Open Access
Format
electronic resource
Related items

| Description | Item type | Relationship | URI |
|---|---|---|---|
| | | IsSupplementTo | https://github.com/huggingface/transformers/tree/v2.3.0 |
| | | IsVersionOf | https://doi.org/10.5281/zenodo.3385997 |
| | | IsPartOf | https://zenodo.org/communities/zenodo |