Question : What are the different ways we can do basic text processing?
Answer: Starting from simple to complex, here are the different ways we can do basic text processing. These approaches can also be mixed and used in conjunction with each other, which is often done for better results since they complement each other, e.g. word tokenisation + word normalisation and stemming.
1. Regular Expression
2. Word tokenisation
3. Word normalization and stemming


  1.  Regular Expression :
    • Regular expressions are often the first model used in any text processing
    • They can also be used as features in classifiers
  2. Word tokenisation : Follow these steps
    • Tokenize
    • Sort
    • Merge upper and lower case (Apple vs apple – same word, different case)
    • Sort and count
      Unix code : tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r

Simple word tokenisation in unix
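The same tokenise-fold-count pipeline can be sketched in a few lines of Python: `re.findall` plays the role of the regex/`tr` tokenisation step and `collections.Counter` replaces `sort | uniq -c | sort -n -r`. The sample sentence is a made-up stand-in for shakes.txt.

```python
import re
from collections import Counter

def word_counts(text):
    # Case-fold, tokenise on runs of letters, then count and rank --
    # the Python equivalent of the Unix tr/sort/uniq pipeline above.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()

sample = "Apple and apple are the same word; Apples differ."
print(word_counts(sample))  # 'apple' counted twice; 'apples' stays separate
```

Note that, just as in the Unix version, case folding merges "Apple" and "apple", but the plural "Apples" remains a distinct word type — that gap is what normalisation and stemming address later.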
Question : What are some of the problems with word tokenisation discussed above?
Answer: Some of the issues are

  •  Window vs windows, grow vs grown
  • I’m vs I am
  • Language Issues:
    • French : L’ensemble ~ un ensemble (should the clitic l’ be matched with un?)
    • Chinese and Japanese Language – no spaces between words
      莎拉波娃现在居住在美国东南部的佛罗里达。  ~~
      “Sharapova now lives in US southeastern Florida”
      One solution can be : “Greedy Match”, i.e. maximum matching. This works well for Chinese but does not work in English.
      (Figure : greedy match for word tokenisation in the Chinese language)
    • German Language – compound noun
      Lebensversicherungsgesellschaftsangestellter ~
      ‘life insurance company employee’
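The greedy maximum-matching idea can be sketched as follows: starting at the beginning of the string, take the longest dictionary word that matches, then repeat from the next position. The tiny lexicon here is illustrative, not a real Chinese dictionary; the English input demonstrates why the method fails for English.

```python
def max_match(text, dictionary):
    # Greedy maximum matching: at each position take the longest
    # dictionary entry that matches; fall back to a single character.
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Classic failure case for English: "the table down there" run together
# greedily segments into unrelated words.
lexicon = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", lexicon))
# → ['theta', 'bled', 'own', 'there']
```

Chinese works better under this scheme because its words are short (mostly 2–4 characters), so the greedy longest-match heuristic rarely overshoots the way it does on English.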

Question : How can we capture word variation, i.e. “U.S.A” vs “USA”, “window” vs “windows”, “washing” vs “wash”, “Fed” vs “fed”, “US” vs “us”?
Answer: We can use word normalisation and stemming techniques such as

  • Normalise by defining equivalence rules manually, i.e
    • delete periods in terms, i.e U.S.A to USA

    Disadvantage : such rules are language specific and domain specific, and not globally applicable.

  • Lemmatization : windows -> window, i.e reduce inflectional forms to their base form, e.g
    • windows, window’s, window -> window
    • am, are, is -> be
  • Stemming : washing -> wash, i.e reduce terms to their stems. It uses a rule-based stemming approach, as in the Porter stemmer
    (Figure : Porter’s English stemming rules)
    Disadvantage : stemming is also language specific and is not applicable globally.
  • Case Folding : reduce everything to lower case. But what about US vs us then?
    Solution : introduce specific pattern-based rules, i.e if a word is upper case in mid-sentence, do not lowercase it.
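The rules above can be sketched as one small normalisation function. Everything here is an illustrative stand-in — the acronym pattern, the toy lemma table, and the crude suffix list are not a real lemmatiser or the actual Porter stemmer.

```python
import re

# Toy lemma table and suffix list -- illustrative stand-ins only.
LEMMAS = {"am": "be", "are": "be", "is": "be"}
SUFFIXES = ["ing", "es", "s"]

def normalise(token, sentence_initial=False):
    # Equivalence rule: delete periods in dotted acronyms (U.S.A -> USA).
    if re.fullmatch(r"(?:[A-Za-z]\.)+[A-Za-z]?", token):
        return token.replace(".", "")
    # Case folding, with the pattern-based exception: keep mid-sentence
    # all-uppercase tokens, so "US" (the country) is not folded to "us".
    if token.isupper() and not sentence_initial:
        return token
    token = token.lower()
    # Lemmatisation by lookup, then crude suffix stripping as a fallback.
    if token in LEMMAS:
        return LEMMAS[token]
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

print(normalise("U.S.A"))    # → USA
print(normalise("washing"))  # → wash
print(normalise("windows"))  # → window
print(normalise("US"))       # → US (uppercase kept mid-sentence)
```

The `sentence_initial` flag stands in for the context a real tokeniser would supply: a capitalised sentence-start word can safely be folded, while uppercase inside a sentence is left alone.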


1. Basic Text Processing