iconv and sed help

Hi,

I have a UTF-8 file that I need to convert to ISO-8859-1.

The UTF-8 file contains characters like å/ä/ö, and I don't want these characters.

So I apply this sed command:

$ sed "s/å/aa/g; s/ä/aaa/g; s/ö/ooo/g" utf8.txt > output.txt


Now when I view this file, there are no characters like å/ä/ö.

Then I use the iconv command to convert that UTF-8 file (output.txt) to ISO-8859-1:

$ iconv -c -f UTF-8 -t ISO-8859-1 < output.txt > newfile


BUT when I check the file type using the file command, it reports ASCII, not ISO-8859-1:

$ file newfile
newfile: ASCII text, with CRLF line terminators


I don't understand what went wrong. I have also attached that UTF-8 file with this post.

Please help.

usmangt

Comments

  • mfillpot
    I went through your exact procedure on Slackware 13.1, and my output file shows as:
    ut3.txt: ISO-8859 text, with very long lines

    How the data is read and displayed may be controlled by a deeper configuration within your OS. Can you share which distro you use, so those familiar with it can tell you where those settings are?
  • I am using the Fedora 13 Linux distribution.
  • Hi,

    I am so sorry, I attached the wrong file (both have the same name but are in different folders on my machine).

    This is the one which is causing the problem.
  • Here is the file.

    I don't know why the name became so long when uploading.

    [file name=utf8-7a6351909c73ba4a81575d6ad10cf46f.txt size=1131]http://www.linux.com/media/kunena/attachments/legacy/files/utf8-7a6351909c73ba4a81575d6ad10cf46f.txt[/file]
  • mfillpot
    Now that I have processed your original file, I am getting the same issue; it appears that something is different between the files.

    The two files are very different. I have combined your two commands into a single pipeline:
    sed "s/å/aa/g; s/ä/aaa/g; s/ö/ooo/g" utf8.txt|iconv -c -f UTF-8 -t ISO-8859-1 -o out.txt
    

    When I ran that command against both files, I got the following output:
    matt:~/Desktop$rm *.txt.txt;for i in `ls|grep utf|grep -v "txt\.txt"`;do sed "s/å/aa/g; s/ä/aaa/g; s/ö/ooo/g" $i|iconv -c -f UTF-8 -t ISO-8859-1 -o $i.txt ;file $i;file $i.txt;done
    utf8.txt: UTF-8 Unicode text, with very long lines, with CRLF line terminators
    utf8.txt.txt: ISO-8859 text, with very long lines, with CRLF line terminators
    utf82.txt: UTF-8 Unicode text
    utf82.txt.txt: ASCII text
    

    Based on the output, it looks as though the line terminators in the second file are not ISO-8859-1 compliant, and the iconv application does not correct those.
  • Thank you for analyzing and checking it. Yes, I suspect the same thing, and I am also concerned about the ' - ' (minus symbol/character) in the file.

    Do you think there is a solution for this?


    Thank you

    usmangt
  • mfillpot
    Can you tell me whether the two files were created on different platforms, such as file1 being created on Windows and file2 on Linux?
  • Well, both were created on Linux.
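
A note on the ASCII result discussed in this thread: the file command guesses an encoding from the bytes it sees, and ISO-8859-1 encodes every ASCII character with the same single byte as ASCII does. If sed has already replaced å/ä/ö and nothing else above 0x7F remains, the converted file is probably indistinguishable from plain ASCII, so "ASCII text" would be the expected report rather than a sign of a failed conversion. The sketch below illustrates this under assumptions: the sample text, the temporary directory, and the helper count_high_bytes are made up for the demonstration and are not taken from the attached files.

```shell
#!/bin/sh
# Sketch only: sample data and names are hypothetical, not the files from the thread.
tmp=$(mktemp -d)

# Count bytes above 0x7F, i.e. bytes that plain ASCII cannot contain.
count_high_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# A small UTF-8 sample with å, ä, ö and a CRLF line ending.
printf 'blå bär öl\r\n' > "$tmp/utf8.txt"

# The same two steps as in the thread: substitute, then convert.
sed "s/å/aa/g; s/ä/aaa/g; s/ö/ooo/g" "$tmp/utf8.txt" > "$tmp/output.txt"
iconv -c -f UTF-8 -t ISO-8859-1 < "$tmp/output.txt" > "$tmp/newfile"

# For contrast: convert the untouched sample directly.
iconv -f UTF-8 -t ISO-8859-1 < "$tmp/utf8.txt" > "$tmp/latin1.txt"

after_sed=$(count_high_bytes "$tmp/newfile")
direct=$(count_high_bytes "$tmp/latin1.txt")

echo "high bytes after sed + iconv: $after_sed"   # 0
echo "high bytes after iconv only:  $direct"      # 3

# If the sed output was already pure ASCII, iconv leaves it byte-for-byte unchanged.
if cmp -s "$tmp/output.txt" "$tmp/newfile"; then
    unchanged=yes
else
    unchanged=no
fi
echo "conversion left the bytes unchanged: $unchanged"

rm -rf "$tmp"
```

With data like this, file would be expected to report ASCII for newfile and ISO-8859 for latin1.txt. A pure-ASCII file is also a valid ISO-8859-1 file, so a program expecting ISO-8859-1 input should be able to read it as-is.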
