✏️ Augmentex

We implemented two methods for spelling corruption. Statistic-based Spelling Corruption (SBSC) aims to mimic human behaviour when making errors, while Augmentex relies on rule-based heuristics and on common errors and mistypings, especially those made while typing on a keyboard.

🚀 Both methods have proved effective for spelling correction systems and yielded substantial performance gains, which are fully reported in our Paper.

Augmentex combines rule-based heuristics with common error statistics (powered by the KartaSlov project) to insert errors into text. It is fully described in the Paper and in this 🗣️Talk.

🖇️ Augmentex lets you corrupt text at two levels of granularity and offers a set of methods suited to each level:

  • Word level:

    • replace - replace a random word with its incorrect counterpart;

    • delete - delete a random word;

    • swap - swap two random words;

    • stopword - add random words from a stop-list;

    • reverse - change the case of the first letter of a random word;

  • Character level:

    • shift - randomly swap upper / lower case in a string;

    • orfo - substitute correct characters with their common incorrect counterparts;

    • typo - substitute correct characters as if they were mistyped on a keyboard;

    • delete - delete a random character;

    • multiply - repeat a random character several times;

    • swap - swap two adjacent characters;

    • insert - insert a random character;
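To make a few of these operations concrete, here is a minimal pure-Python sketch of the character-level `swap`, `delete` and `multiply` edits and the word-level `swap` edit. It is only an illustration of what the descriptions above say: the function names are hypothetical, and this is not Augmentex's actual implementation, which may choose positions and repetition counts differently.

```python
import random


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def delete_char(text: str, rng: random.Random) -> str:
    """Delete one randomly chosen character."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def multiply_char(text: str, rng: random.Random, times: int = 3) -> str:
    """Repeat one randomly chosen character `times` times (assumed behaviour)."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] * times + text[i + 1:]


def swap_words(text: str, rng: random.Random) -> str:
    """Swap two randomly chosen whitespace-separated words."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    sentence = "spelling correction is hard"
    for edit in (swap_adjacent_chars, delete_char, multiply_char, swap_words):
        print(f"{edit.__name__}: {edit(sentence, rng)}")
```

Passing an explicit `random.Random` instance keeps each edit reproducible and makes the helpers easy to test in isolation.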

To use Augmentex, you only need a few lines:

from sage.spelling_corruption import CharAugConfig, CharAugCorruptor

config = CharAugConfig(
    unit_prob=0.3,  # proportion of characters that will undergo edits
    min_aug=1,  # minimum number of edits
    max_aug=5,  # maximum number of edits
    mult_num=3,  # maximum number of character repetitions for the `multiply` edit
)
corruptor = CharAugCorruptor.from_config(config)

… or like this, for word-level corruption:

from sage.spelling_corruption import WordAugConfig, WordAugCorruptor

config = WordAugConfig(
    unit_prob=0.4,  # proportion of words that will undergo edits
    min_aug=1,  # minimum number of edits
    max_aug=5,  # maximum number of edits
)
corruptor = WordAugCorruptor.from_config(config)
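The comments on `unit_prob`, `min_aug` and `max_aug` suggest that the number of edits is a share of the text's units, bounded from below and above. The sketch below shows one plausible reading of how the three parameters could interact. The helper `edit_count` is hypothetical, and the rounding rule (ceiling) and the cap at the number of available units are assumptions, not taken from the Augmentex source.

```python
import math


def edit_count(n_units: int, unit_prob: float, min_aug: int, max_aug: int) -> int:
    """Number of edits for a text of `n_units` characters or words.

    Assumed rule: take the requested share of units, rounded up,
    clamp it to [min_aug, max_aug], and never exceed the units available.
    """
    target = math.ceil(n_units * unit_prob)
    clamped = max(min_aug, min(target, max_aug))
    return min(n_units, clamped)


if __name__ == "__main__":
    # A short and a long input under the same hypothetical settings.
    for n in (2, 8, 100):
        print(n, edit_count(n, unit_prob=0.25, min_aug=1, max_aug=5))
```

Under this reading, `min_aug` guarantees that short inputs still receive at least one edit, while `max_aug` keeps long inputs from being corrupted beyond recognition.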

Augmentex was created by our fellow team. The project has its own repo, so do not forget to take a look!