
Problems using awk/sed/sort with a ucs-2le encoded file


http://www.linuxquestions.org – Hello,

I'm having trouble using (g)awk (and sed, sort) with a file encoded in ucs-2le. The overall command looks like this:

    sed '1d' ./ucs-2le_file.txt | sort -t '¤' -n -k 2,2 -k 3,3 -k 5,5 -T . | awk -F"¤" -f aggregate.awk > new_file.txt

Ideally I would like the new file to be created with the same ucs-2le encoding, but I don't think it is.

To give some background: previously, a large file encoded in ucs-2le was FTP'd to the server and then loaded into an Oracle table using SQL*Loader with a UTF16 character set (a parameter in the control file). To improve performance I'm trying to remove and aggregate data within the file so that SQL*Loader and the subsequent SQL have less data to process. I'm therefore keeping the existing process but adding a step that creates a smaller file (using awk/sort/sed as above), then using the same SQL*Loader control file to load the new file with the reduced dataset.

Unfortunately, after the new file has been created, the SQL*Loader step fails because it can no longer recognise the delimiter and end-of-line characters. (The control file specifies the characters' hex values, and when the new file is viewed in vim there are additional "^@" characters in between every 'normal' ASCII character I would expect to see.) This has led me to believe it's a character set/encoding problem. I've experimented with modifying the locale, but to no avail.

So, the question is: can awk/sort/sed support multi-byte character sets? (I've checked vim and it does.) If so, what do I need to do to allow this? If not, can someone suggest an alternative approach?

Server details (i.e. uname -a):

    Linux migdev 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 20:19:25 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Many thanks for your help.
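
One workaround that may be worth trying is to leave the locale alone and convert the data instead: decode the file to UTF-8 with iconv, run sed/sort/awk on the UTF-8 text, and re-encode the result as UCS-2LE before handing it to SQL*Loader. The "^@" characters in vim are the NUL bytes that make up half of every ASCII character in UCS-2, which suggests the tools are working on raw bytes. The sketch below is an illustration under assumptions, not a tested fix for the setup described above: the helper name as_utf8 and the sample records are made up, and the real sort keys and aggregate.awk would replace the simple sort shown here.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): decode UCS-2LE on
# stdin to UTF-8, run the given command over it, then re-encode the
# command's output as UCS-2LE on stdout.
as_utf8() {
    iconv -f UCS-2LE -t UTF-8 | "$@" | iconv -f UTF-8 -t UCS-2LE
}

# Made-up sample data: one header line and two records delimited by ¤.
demo_in=$(mktemp)
demo_out=$(mktemp)
printf 'id¤n\nb¤2\na¤1\n' | iconv -f UTF-8 -t UCS-2LE > "$demo_in"

# Drop the header and sort the remaining lines while the data is UTF-8.
# LC_ALL=C makes sort compare raw bytes, which is enough for this demo.
as_utf8 sh -c "sed '1d' | LC_ALL=C sort" < "$demo_in" > "$demo_out"

# Decode the output again, only to show that it is still valid UCS-2LE.
result=$(iconv -f UCS-2LE -t UTF-8 < "$demo_out")
printf '%s\n' "$result"

rm -f "$demo_in" "$demo_out"
```

For the real file the call would look something like as_utf8 sh -c "sed '1d' | sort ... | awk ..." < ucs-2le_file.txt > new_file.txt. Two things to check if you try this: '¤' is two bytes in UTF-8, and some sort builds reject a multi-byte -t separator, so the delimiter may need swapping for a single-byte character inside the pipeline; and if the file starts with a byte order mark, it will travel with the first line, which sed '1d' removes.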