Question : What are the different ways we can do basic text processing?
Answer: Here is a list of the different ways we can do basic text processing, ordered from simple to more complex. These approaches can also be combined, which often gives better results because they complement each other, e.g. word tokenisation + word normalisation and stemming.
1. Regular Expression
2. Word tokenisation
3. Word normalisation and stemming
- Regular Expression :
- Regular expressions are often the first model for any text processing task
- They can also be used as features in classifiers
- Word tokenisation : Follow these steps
- Merge upper and lower case (Apple vs apple – same word diff case)
- Sort and count
Unix code : tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
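The same lowercase, split, sort and count steps can be sketched in Python. This is only a rough equivalent of the pipeline: the function name `word_counts` and the sample sentence are made up for illustration, and, like the pipeline, it treats every run of ASCII letters as one word.

```python
import re
from collections import Counter

def word_counts(text: str) -> list[tuple[str, int]]:
    # Merge upper and lower case (Apple vs apple).
    lowered = text.lower()
    # Tokenise: every maximal run of ASCII letters is one word.
    tokens = re.findall(r"[a-z]+", lowered)
    # Count the tokens and sort by descending frequency.
    return Counter(tokens).most_common()

# Illustrative input only; the notes use shakes.txt.
sample = "Apple pie and apple tart. The apple, the pie!"
for word, count in word_counts(sample)[:3]:
    print(count, word)
```

Note that the regular expression splits on apostrophes as well, so "I'm" becomes the two tokens "i" and "m".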
Question : What are some of the problems with word tokenisation we discussed above?
Answer: Some of the issues are
- Window vs windows, grow vs grown
- I’m vs I am
- Language Issues:
- French: L’ensemble ~ un ensemble
- Chinese and Japanese Language – no spaces between words
“Sharapova now lives in US southeastern Florida”
One solution is greedy matching, i.e. maximum matching. It works well for Chinese but does not work well for English.
- German Language – compound nouns are written as one word without spaces, e.g. the single word for
‘life insurance company employee’
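A minimal sketch of greedy maximum matching, assuming we already have a word list for the language. The function name `max_match` and the tiny vocabulary are hypothetical and chosen only to show why the approach struggles with English text that has no spaces.

```python
def max_match(text: str, vocab: set[str]) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word fits.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens

# Toy vocabulary, for illustration only.
vocab = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", vocab))
# → ['theta', 'bled', 'own', 'there']
```

The intended reading is "the table down there", but the greedy match grabs "theta" first and never recovers.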
Question : How can we capture word variation, e.g. “U.S.A” vs “USA”, “window” vs “windows”, “washing” vs “wash”, “Fed” vs “fed”, “US” vs “us”?
Answer: We can use word normalisation and stemming techniques such as
- Normalise by defining equivalence rules manually, e.g.
- delete periods in terms, e.g. U.S.A to USA
Disadvantage : the rules are language specific and domain specific, so they are not applicable globally.
- Lemmatization : windows -> window, i.e. reduce inflectional forms to their base form, e.g.
- windows, window’s, window -> window
- am, are, is -> be
- Stemming : washing -> wash, i.e. reduce terms to their stems. It uses a rule-based approach.
Disadvantage : Stemming is also language specific and is not applicable globally.
- Case Folding: reduce all letters to lower case. What about US vs us then?
Solution : Introduce specific pattern-based rules, e.g. do not lowercase a word that is upper case in mid-sentence.
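The normalisation ideas above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names, the two suffix rules and the minimum stem length are hypothetical, not the rule set of any real stemmer.

```python
def delete_periods(token: str) -> str:
    # Equivalence rule: "U.S.A" and "USA" map to the same term.
    return token.replace(".", "")

def fold_case(tokens: list[str]) -> list[str]:
    folded = []
    for position, token in enumerate(tokens):
        # Pattern-based exception: a word starting with upper case in
        # mid-sentence is kept, so "Fed" is not merged with "fed".
        if position > 0 and token[:1].isupper():
            folded.append(token)
        else:
            folded.append(token.lower())
    return folded

# Hypothetical suffix rules, tried in order.
SUFFIX_RULES = [("ing", ""), ("s", "")]

def stem(token: str) -> str:
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three characters so short words like "is" survive.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token

print(delete_periods("U.S.A"))
print(fold_case("The Fed told us".split()))
print([stem(w) for w in ["washing", "windows"]])
```

The mid-sentence rule is only a heuristic: a sentence that begins with "US" would still be lowercased to "us", and hand-written suffix rules like these only fit English.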
1. Basic Text Processing