🇬🇧 T5-large

The model corrects spelling errors and typos by normalizing every word in the text to standard English. The corrector was trained on top of the T5-large model. The training corpus is a large dataset with “artificial” errors: text was collected from English-language Wikipedia and news blogs, and typos and spelling errors were then introduced into it automatically using the SAGE library.
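To make the idea of “artificial” errors concrete, here is a minimal sketch of character-level corruption. It is not the SAGE implementation: the function `inject_typos`, its parameters, and the four edit operations are assumptions chosen purely for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def inject_typos(text: str, error_rate: float = 0.1, seed: int = 0) -> str:
    """Corrupt letters in `text` with probability `error_rate` (hypothetical helper).

    Each selected letter is deleted, duplicated, swapped with its right
    neighbour, or substituted by a random lowercase letter.
    """
    rng = random.Random(seed)  # local RNG keeps the corruption reproducible
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.isalpha() and rng.random() < error_rate:
            op = rng.choice(["delete", "duplicate", "swap", "substitute"])
            if op == "delete":
                pass  # drop the character entirely
            elif op == "duplicate":
                out.extend([ch, ch])
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], ch])
                i += 1  # the neighbour has already been emitted
            else:
                # "substitute", or "swap" on the very last character
                out.append(rng.choice(ALPHABET))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


clean = "If you bought something gorgeous, you will be very happy."
noisy = inject_typos(clean, error_rate=0.15, seed=42)
# A (noisy, clean) pair is the kind of example a corrector can be trained on.
print(noisy)
print(clean)
```

A fixed seed makes the same corrupted text come out on every run, which is convenient when a synthetic corpus has to be regenerated.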

Public references

Examples

Input

Output

Th festeivΠ°l was excelzecnt in many ways, and in particular it beinganinternational festjival sss a chΠ°llenging, bet brilli an t ea.

The festival was excellent in many ways, and in particular it beinganinternational festival is a challenging, but brilliant one to see.

That β€˜s why I believe in the solution which is the closest to human nature and can help us to avoid boredome. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There wo n’t be any problem with being up - do - date .

That’s why I believe in the solution which is the closest to human nature and can help us to avoid boredom. I am sure that eventually we will take off our clothes and in the future we will be undressed and free. There won’t be any problem with being up - do - date.

If you bought something goregous, you well be very happy.

If you bought something gorgeous, you will be very happy.

Metrics

Below are automatic metrics for assessing the quality of spell checkers. We compare our solution with open automatic spell checkers and with the ChatGPT family of models on two available datasets:
  • BEA60K: English spelling errors collected from several domains;
  • JFLEG: 1601 English sentences containing about 2 thousand spelling errors.

BEA60K

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 66.5 | 83.1 | 73.9 |
| ChatGPT gpt-3.5-turbo-0301 | 66.9 | 84.1 | 74.5 |
| ChatGPT gpt-4-0314 | 68.6 | 85.2 | 76.0 |
| ChatGPT text-davinci-003 | 67.8 | 83.9 | 75.0 |
| Bert (https://github.com/neuspell/neuspell) | 65.8 | 79.6 | 72.0 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 62.2 | 80.3 | 72.0 |

JFLEG

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| T5-large-spell | 83.4 | 84.3 | 83.8 |
| ChatGPT gpt-3.5-turbo-0301 | 77.8 | 88.6 | 82.9 |
| ChatGPT gpt-4-0314 | 77.9 | 88.3 | 82.8 |
| ChatGPT text-davinci-003 | 76.8 | 88.5 | 82.2 |
| Bert (https://github.com/neuspell/neuspell) | 78.5 | 85.4 | 81.8 |
| SC-LSTM (https://github.com/neuspell/neuspell) | 80.6 | 86.1 | 83.2 |
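As a reading aid for the Precision, Recall and F1 columns, the sketch below assumes F1 is the harmonic mean of the reported precision and recall. The helper `f1_score` is illustrative and is not the evaluation code used to produce the tables; under that assumption it reproduces the T5-large-spell rows up to rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# T5-large-spell rows from the tables above
print(round(f1_score(66.5, 83.1), 1))  # BEA60K -> 73.9
print(round(f1_score(83.4, 84.3), 1))  # JFLEG  -> 83.8
```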

How to use

from transformers import T5ForConditionalGeneration, AutoTokenizer

path_to_model = "ai-forever/T5-large-spell"
model = T5ForConditionalGeneration.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

prefix = "grammar: "
sentence = "If you bought something goregous, you well be very happy."
sentence = prefix + sentence
encodings = tokenizer(sentence, return_tensors="pt")
generated_tokens = model.generate(**encodings)
answer = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(answer)

# ["If you bought something gorgeous, you will be very happy."]

API

class sage.spelling_correction.t5_correctors.T5ModelForSpellingCorruption(model_name_or_path: str | PathLike)

Bases: Corrector

T5-based models.

batch_correct(sentences: List[str], batch_size: int, prefix: str | None = '', **generation_params) -> List[List[Any]]

Corrects multiple sentences.

Parameters:
  • sentences (list of str) – input sentences to correct;

  • batch_size (int) – number of input sentences processed per batch;

  • prefix (str) – prompt prefix prepended to each sentence, which some models require;

  • generation_params (dict) – parameters passed to the generate method of the HuggingFace model;

Returns:

corresponding corrections

Return type:

list of list of str
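The roles of batch_size and prefix described above can be sketched as follows. This is not the SAGE implementation: `iter_batches`, `correct_in_batches` and the `correct_batch` callable are hypothetical names, the model call is injected as a plain function so the batching logic can be read on its own, and the result is a flat list rather than the list of lists the real method returns.

```python
from typing import Callable, Iterator, List


def iter_batches(sentences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


def correct_in_batches(
    sentences: List[str],
    correct_batch: Callable[[List[str]], List[str]],
    batch_size: int,
    prefix: str = "",
) -> List[str]:
    """Prepend `prefix` to every sentence and correct them batch by batch."""
    results: List[str] = []
    for batch in iter_batches(sentences, batch_size):
        prompted = [prefix + sentence for sentence in batch]
        results.extend(correct_batch(prompted))  # one model call per batch
    return results


# Stand-in for a real model call: echoes its input so the flow is visible.
def echo(batch: List[str]) -> List[str]:
    return list(batch)


print(correct_in_batches(["a", "b", "c"], echo, batch_size=2, prefix="grammar: "))
```

Injecting the model call keeps the sketch independent of any particular checkpoint; in practice `correct_batch` would wrap tokenization, generation and decoding.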

correct(sentence: str, prefix: str | None = '', **generation_params) -> List[str]

Corrects a single input sentence.

Parameters:
  • sentence (str) – the source sentence;

  • prefix (str) – prompt prefix prepended to the sentence, which some models require;

  • generation_params (dict) – parameters passed to the generate method of the HuggingFace model;

Returns:

corresponding corrected sentence

Return type:

list of str

evaluate(dataset_name_or_path: str | PathLike | None, metrics: List, batch_size: int, prefix: str = '', dataset_split: str = 'test', **generation_params) -> Dict[str, float]

Evaluates the model on a spellcheck dataset.

Parameters:
  • dataset_name_or_path (str) – path to a local dataset or the name of a dataset on HuggingFace;

  • metrics (list of str) – metrics used to report performance;

  • batch_size (int) – number of input sentences processed per batch;

  • prefix (str) – prompt prefix prepended to each sentence, which some models require;

  • dataset_split (str) – the split to evaluate on: train, test or dev;

  • generation_params (dict) – parameters passed to the generate method of the HuggingFace model;

Returns:

mapping between metric’s name and its corresponding value

Return type:

dict[str, float]

classmethod from_pretrained(model_name_or_path: str | PathLike)

Initializes the T5-type corrector from a pre-trained checkpoint, which can be either a local checkpoint or the name of a model on HuggingFace.

Parameters:

model_name_or_path (str or os.PathLike) – the model name or the path to the checkpoint;

Returns:

corrector initialized from pre-trained weights

Return type:

object of T5ModelForSpellingCorruption
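Putting the documented methods together, a call through the SAGE class might look like the sketch below. It relies only on the import path and signatures listed in this section; the wrapper `correct_with_sage` is a hypothetical helper, and the checkpoint name and "grammar: " prefix are taken from the "How to use" snippet on the assumption that the same values apply here.

```python
from typing import List


def correct_with_sage(
    sentence: str,
    model_name: str = "ai-forever/T5-large-spell",
    prefix: str = "grammar: ",
) -> List[str]:
    """Correct one sentence via the documented SAGE corrector (hypothetical wrapper)."""
    # Imported lazily: the sage package and the checkpoint are heavy
    # dependencies, so nothing is loaded until the function is called.
    from sage.spelling_correction.t5_correctors import T5ModelForSpellingCorruption

    corrector = T5ModelForSpellingCorruption.from_pretrained(model_name)
    return corrector.correct(sentence, prefix=prefix)


# Example call (downloads the checkpoint on first use):
# print(correct_with_sage("If you bought something goregous, you well be very happy."))
```

For repeated calls it would be cheaper to create the corrector once and reuse it, since loading the weights dominates the cost.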

Resources

License

The T5-large model, on which our solution is based, and its source code are supplied under the Apache-2.0 license. Our solution is supplied under the MIT license.

Specifications

  • File size: 3 GB

  • Framework: PyTorch

  • Format: AI Service

  • Version: v1.0

  • Developer: SberDevices, AGI NLP

Contacts

nikita.martynov.98@list.ru