
Eliminating words from a file through ngrams stored in another file


http://www.unix.com

Hello,

I have a large data file which contains a huge amount of garbage, i.e. words which do not exist in the language. An example will make this clear:

Code:
kpaware
nlupset
rrrbring

In other words, these words are invalid in English and constitute garbage in the data. I have identified such letter combinations (at least in word-initial position) and have prepared a file of them which, for lack of a better term, I call bigrams and trigrams. An example of such combos is given below:

Code:
nl
kp
rrr

Is there a script which could load the ngram file and check in the database which words do not meet ...
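One possible approach, as a minimal sketch in Python rather than a definitive answer: load the ngrams into a tuple and keep only the words that do not start with any of them. It assumes one ngram per line in the ngram file, one word per line in the data file, and exact case-sensitive matching at the start of the word only. The function names and the file paths passed to `filter_file` are hypothetical, chosen for illustration.

```python
def load_prefixes(lines):
    """Collect the non-empty ngrams, one per input line, as a tuple.

    A tuple is used because str.startswith accepts a tuple of prefixes.
    """
    return tuple(line.strip() for line in lines if line.strip())


def split_words(words, prefixes):
    """Separate words into (kept, dropped) by word-initial ngram.

    A word is dropped when it starts with any of the given prefixes.
    Blank lines are skipped. Matching is case-sensitive (an assumption).
    """
    kept, dropped = [], []
    for word in words:
        w = word.strip()
        if not w:
            continue
        if w.startswith(prefixes):
            dropped.append(w)
        else:
            kept.append(w)
    return kept, dropped


def filter_file(ngram_path, data_path):
    """Read both files (hypothetical paths) and return (kept, dropped)."""
    with open(ngram_path, encoding="utf-8") as ngram_file:
        prefixes = load_prefixes(ngram_file)
    with open(data_path, encoding="utf-8") as data_file:
        return split_words(data_file, prefixes)
```

Calling `filter_file("ngrams.txt", "data.txt")` would return the clean words and the rejected ones as two lists, so the rejected list can be reviewed before anything is deleted. If the data file holds several words per line, each line would need to be split first.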