✏️ Augmentex
We implemented two methods for spelling corruption. Statistic-based Spelling Corruption (SBSC) aims to mimic human behaviour when making an error, while Augmentex relies on rule-based heuristics and on common errors and mistypings, especially those made while typing text on a keyboard.
🚀 Both methods proved effective for spelling correction systems and yielded substantial performance gains, which are fully reported in our Paper.
Augmentex combines rule-based heuristics with common error statistics (powered by the KartaSlov project) to insert errors into text. It is fully described in the Paper and in this 🗣️Talk.
🖇️ Augmentex lets you corrupt text at two levels of granularity and offers a set of methods suited to each level:
Word level:
replace - replace a random word with its incorrect counterpart;
delete - delete a random word;
swap - swap two random words;
stopword - add random words from a stop-list;
reverse - change the case of the first letter of a random word;
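To make the word-level operations above more concrete, here is a minimal standalone sketch of `delete`, `swap`, `stopword` and `reverse`. It does not use Augmentex and is not its implementation: the function names, the tiny `STOPWORDS` list and the way random positions are chosen are all assumptions made for illustration.

```python
import random

# Hypothetical stop-list, used only for this illustration.
STOPWORDS = ["well", "like", "so"]


def delete_word(words, rng):
    """Drop one randomly chosen word (no-op for fewer than two words)."""
    if len(words) < 2:
        return list(words)
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


def swap_words(words, rng):
    """Exchange two randomly chosen words at distinct positions."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out


def add_stopword(words, rng):
    """Insert a random stop-word at a random position."""
    i = rng.randrange(len(words) + 1)
    return words[:i] + [rng.choice(STOPWORDS)] + words[i:]


def reverse_case(words, rng):
    """Flip the case of the first letter of one random word."""
    if not words:
        return []
    i = rng.randrange(len(words))
    out = list(words)
    out[i] = out[i][:1].swapcase() + out[i][1:]
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    words = "the quick brown fox".split()
    for op in (delete_word, swap_words, add_stopword, reverse_case):
        print(op.__name__, "->", " ".join(op(words, rng)))
```

The `replace` operation is left out because it needs a dictionary of common misspellings, which this sketch does not try to reproduce.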
Character level:
shift - randomly swap upper / lower case in a string;
orfo - substitute correct characters with their common incorrect counterparts;
typo - substitute correct characters as if they are mistyped on a keyboard;
delete - delete a random character;
multiply - repeat a random character several times;
swap - swap two adjacent characters;
insert - insert a random character;
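The character-level operations can be illustrated in the same spirit. The sketch below is plain Python and only approximates the behaviour described in the list: the function names, the `mult_num` default and the `QWERTY_NEIGHBOURS` excerpt are assumptions for this example, not the tables or code that Augmentex actually ships.

```python
import random
import string

# Tiny, hypothetical excerpt of a QWERTY adjacency map (illustration only).
QWERTY_NEIGHBOURS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "o": "iklp"}


def delete_char(text, rng):
    """Remove one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def multiply_char(text, rng, mult_num=3):
    """Repeat one random character so it occurs mult_num times in a row."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] * mult_num + text[i + 1:]


def swap_adjacent(text, rng):
    """Swap a random pair of neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def insert_char(text, rng, alphabet=string.ascii_lowercase):
    """Insert a random character from the alphabet at a random position."""
    i = rng.randrange(len(text) + 1)
    return text[:i] + rng.choice(alphabet) + text[i:]


def typo(text, rng):
    """Replace one character with a keyboard neighbour, if the map knows it."""
    positions = [i for i, ch in enumerate(text) if ch in QWERTY_NEIGHBOURS]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + rng.choice(QWERTY_NEIGHBOURS[text[i]]) + text[i + 1:]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for op in (delete_char, multiply_char, swap_adjacent, insert_char, typo):
        print(op.__name__, "->", op("spelling", rng))
```

`orfo` and `shift` follow the same pattern as `typo`: pick positions, then substitute from a lookup table or flip the case.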
To use Augmentex, you only need a few lines:
from sage.spelling_corruption import CharAugConfig, CharAugCorruptor
config = CharAugConfig(
    unit_prob=0.3, # proportion of characters that will be edited
    min_aug=1, # minimum number of edits
    max_aug=5, # maximum number of edits
    mult_num=3 # repetition parameter for the `multiply` edit
)
corruptor = CharAugCorruptor.from_config(config)
… or, for word-level corruption, like this:
from sage.spelling_corruption import WordAugConfig, WordAugCorruptor
config = WordAugConfig(
    unit_prob=0.4, # proportion of words that will be edited
    min_aug=1, # minimum number of edits
    max_aug=5, # maximum number of edits
)
corruptor = WordAugCorruptor.from_config(config)
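The comments in the two configs suggest that `unit_prob` sets a target share of units (characters or words) to edit, while `min_aug` and `max_aug` bound the resulting number of edits. The sketch below shows one plausible reading of that interaction. The rounding rule (`math.ceil`) and the helper names `edit_budget` and `pick_positions` are assumptions for illustration, not something the source confirms about the library's internals.

```python
import math
import random


def edit_budget(n_units, unit_prob, min_aug, max_aug):
    """Number of edits: target share of units, clamped to [min_aug, max_aug].

    Assumption: the target is rounded up; the result never exceeds n_units.
    """
    raw = math.ceil(n_units * unit_prob)
    clamped = max(min_aug, min(raw, max_aug))
    return min(n_units, clamped)


def pick_positions(n_units, unit_prob, min_aug, max_aug, seed=None):
    """Choose which unit positions would be edited, without repetition."""
    rng = random.Random(seed)
    k = edit_budget(n_units, unit_prob, min_aug, max_aug)
    return sorted(rng.sample(range(n_units), k))


if __name__ == "__main__":
    # 20 characters at unit_prob=0.3 would give 6 edits, capped by max_aug=5.
    print(edit_budget(20, 0.3, min_aug=1, max_aug=5))
    # 2 words at unit_prob=0.4 round up to 1 edit, which min_aug=1 also allows.
    print(edit_budget(2, 0.4, min_aug=1, max_aug=5))
    print(pick_positions(20, 0.3, min_aug=1, max_aug=5, seed=0))
```

Under this reading, `max_aug` keeps long inputs from being corrupted beyond recognition and `min_aug` ensures that short inputs still receive at least one edit.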
Augmentex was created by our fellow team. The project has its own repo, so do not forget to take a look!