Expander

Published:

See the source code here

Part of my work at Cognostics (see the CV) involved natural language processing. We had spoken input data that we transcribed to text with speech-to-text software. To automatically organise and classify the transcribed text, we needed to sanitize the input; one part of this was expanding common contractions such as don’t or shan’t to do not and shall not.

When we realised there was no definitive solution to this problem, it was suggested at Cognostics that we use deep learning methods. I thought this was overkill for the problem at hand, and my code served as a proof of concept that a simple statistical analysis based on Stanford’s existing POS-tagging and NER-tagging models was enough, without having to create a model from scratch.

The main idea behind the solution is to start with a list of common English contractions. If the expansion of a contraction is unambiguous (like shan’t to shall not), we are done. For the ambiguous expansions, we download sentence corpora included in NLTK. These sentences contain no contractions, so we take all relevant sentences, contract them, and look at the resulting POS tags to see which function a given contraction has in the sentence. The result is a dictionary, which I also share, containing the statistical distribution of expansions based on the grammatical role of the contraction as identified by the POS tagger.
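
To make the idea concrete, here is a minimal sketch of the approach, not the original code. It assumes the sentences are already POS-tagged (a tiny hand-tagged toy corpus stands in for the NLTK corpora and the Stanford tagger), and it simplifies "grammatical role" to the tag of the word that follows the contraction. All names (UNAMBIGUOUS, AMBIGUOUS, TOY_CORPUS, build_distribution, expand) are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict

# Unambiguous contractions expand by direct lookup (illustrative subset).
UNAMBIGUOUS = {"shan't": "shall not", "don't": "do not", "won't": "will not"}

# Ambiguous contractions have several candidate expansions (illustrative subset).
AMBIGUOUS = {"he's": ["he is", "he has"], "i'd": ["i would", "i had"]}

# Assumption: a hand-tagged toy corpus replaces real tagger output.
TOY_CORPUS = [
    [("He", "PRP"), ("is", "VBZ"), ("running", "VBG")],
    [("He", "PRP"), ("has", "VBZ"), ("gone", "VBN")],
    [("I", "PRP"), ("would", "MD"), ("go", "VB")],
    [("I", "PRP"), ("had", "VBD"), ("gone", "VBN")],
]


def build_distribution(tagged_sentences):
    """Count which expansion precedes which POS tag in contraction-free text."""
    # Reverse index: expanded form -> the contraction it would collapse to.
    reverse = {exp: c for c, exps in AMBIGUOUS.items() for exp in exps}
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        words = [word.lower() for word, _ in sent]
        for i in range(len(sent) - 2):
            expansion = f"{words[i]} {words[i + 1]}"
            if expansion in reverse:
                next_tag = sent[i + 2][1]
                counts[(reverse[expansion], next_tag)][expansion] += 1
    return counts


def expand(tagged_tokens, counts):
    """Expand contractions in a tagged sentence using the learned counts."""
    out = []
    for i, (token, _) in enumerate(tagged_tokens):
        # Normalise case and typographic apostrophes before the lookup.
        key = token.lower().replace("\u2019", "'")
        if key in UNAMBIGUOUS:
            out.append(UNAMBIGUOUS[key])
        elif key in AMBIGUOUS:
            has_next = i + 1 < len(tagged_tokens)
            next_tag = tagged_tokens[i + 1][1] if has_next else None
            seen = counts.get((key, next_tag))
            # Fall back to the first candidate when the context was never seen.
            out.append(seen.most_common(1)[0][0] if seen else AMBIGUOUS[key][0])
        else:
            out.append(token)
    return " ".join(out)


counts = build_distribution(TOY_CORPUS)
sentence = [("He's", "PRP"), ("gone", "VBN"), ("and", "CC"),
            ("I", "PRP"), ("shan't", "MD"), ("wait", "VB")]
print(expand(sentence, counts))  # he has gone and I shall not wait
```

A real version would tag the contracted sentences with an actual tagger and use a far larger corpus; the toy data here only shows how the distribution dictionary is built and queried. Note that this sketch does not restore the capitalisation of expanded words.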

This problem has been raised on Stack Overflow, but the questions attracted only a few basic solutions and no satisfying answers, so I posted my code as answers there (see here or here).

This code has since fallen into disrepair.