This tool automatically classifies text, line by line, as either natural language (e.g., English) or non-natural language (e.g., stack traces, code snippets, log output, file listings, URLs) using natural language processing (NLP). It is intended as a preprocessing step for NLP approaches applied to bug reports.
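To make the line-by-line idea concrete, here is a minimal sketch of filtering a bug report down to its natural language lines. It does not use the tool's trained model: the regular expressions below are a hypothetical rule-based stand-in, and the names `NON_NL_PATTERNS`, `is_natural_language`, and `keep_natural_language` are assumptions made for illustration only.

```python
import re

# Hypothetical hand-written rules standing in for the trained classifier.
# Each pattern flags one kind of non-natural-language line.
NON_NL_PATTERNS = [
    re.compile(r"^\s*at [\w.$<>]+\(.*\)\s*$"),                    # Java stack frame
    re.compile(r"^\s*(Caused by: )?[\w.$]+(Exception|Error)\b"),  # exception header
    re.compile(r"^\s*https?://\S+\s*$"),                          # line holding only a URL
    re.compile(r"[;{}]\s*$"),                                     # code-like line ending
    re.compile(r"^\s*\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}"),          # log timestamp prefix
]


def is_natural_language(line: str) -> bool:
    """Return True if no stand-in rule flags the line as non-natural language."""
    return not any(pattern.search(line) for pattern in NON_NL_PATTERNS)


def keep_natural_language(text: str) -> str:
    """Classify each line independently and keep only the natural language lines."""
    return "\n".join(
        line for line in text.splitlines() if is_natural_language(line)
    )


report = (
    "The app crashes when I click save.\n"
    "java.lang.NullPointerException: null\n"
    "    at com.example.Foo.bar(Foo.java:42)\n"
    "https://example.com/log.txt\n"
    "Any idea what is wrong?"
)
cleaned = keep_natural_language(report)
```

In this sketch, `cleaned` keeps only the first and last lines of `report`. Hand-written rules like these are brittle, which is presumably why the tool relies on a learned classifier instead.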
The following artifacts are available on GitHub and Zenodo: the Python implementation of the machine learning classifier, basic scripts for automatically creating a training set from GitHub issue tickets, a sample dataset sourced from 101 Java projects hosted on GitHub, and a scikit-learn transformer that wraps the pretrained model for use as a preprocessing step in a scikit-learn pipeline.
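As a rough illustration of how a transformer of this kind can sit in front of a feature extractor, the sketch below defines a custom scikit-learn transformer that drops non-natural-language lines before TF-IDF vectorization. The class name `NaturalLanguageFilter`, its `line_classifier` parameter, and the toy predicate `_looks_natural` are all assumptions made for this example; they are not the published transformer's API, and the predicate is a placeholder for the pretrained model.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def _looks_natural(line: str) -> bool:
    """Toy placeholder for the pretrained model (an assumption, not the real one)."""
    stripped = line.strip()
    return not (
        stripped.startswith("at ")
        or "://" in stripped
        or "Exception" in stripped
    )


class NaturalLanguageFilter(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that keeps only natural language lines."""

    def __init__(self, line_classifier=None):
        # Callable mapping one line to True (natural language) or False.
        self.line_classifier = line_classifier

    def fit(self, X, y=None):
        # Nothing to learn here: the line classifier is assumed to be pretrained.
        return self

    def transform(self, X):
        predict = self.line_classifier or _looks_natural
        return [
            "\n".join(line for line in doc.splitlines() if predict(line))
            for doc in X
        ]


reports = [
    "The app crashes when I click save.\n"
    "java.lang.NullPointerException: null\n"
    "    at com.example.Foo.bar(Foo.java:42)",
    "Login page loads slowly.\nSee https://example.com/trace.txt",
]

# The filter runs first, so TF-IDF only sees the natural language lines.
pipeline = Pipeline([
    ("nl_filter", NaturalLanguageFilter()),
    ("tfidf", TfidfVectorizer()),
])
features = pipeline.fit_transform(reports)
```

Because the filter runs before `TfidfVectorizer`, tokens that occur only in stack traces or URLs never enter the vocabulary. Passing a different callable as `line_classifier` swaps in another line-level model without changing the rest of the pipeline.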