[X-Unix] text file parsing?

Robert Frank robert.frank at unibas.ch
Wed May 3 07:40:55 PDT 2006


On 03/05/2006, at 15:40, x-unix-request at listserver.themacintoshguy.com wrote:

> I'm trying to write a bash script for something I thought would be
> simple, but haven't been able to figure it out.
>
> I have some files that are essentially text files, but have binary
> data in them.  For instance, using grep I need to use the "-a" option
> to get any output.
>
> When one opens these files in a text editor you see a readable text
Hmm, sounds like they might be UTF-8? Such files look like normal
text files except for the characters which aren't ASCII. In fact,
UTF-8 and UTF-16 text files will often (but not always; XML files,
for example, may not) start with a special byte sequence, the byte
order mark, which tells the reading software which encoding the file
uses.
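
To check for such a leading sequence from the shell, something along these lines might do. It is only a sketch: the sample file is made up for the demonstration, and a missing byte order mark does not rule out UTF-8.

```shell
# Hypothetical sample file: a UTF-8 byte order mark (EF BB BF) plus some text.
sample=$(mktemp)
printf '\357\273\277hello\n' > "$sample"

# Dump the first three bytes as hex and strip the whitespace od adds.
first3=$(head -c 3 "$sample" | od -An -tx1 | tr -d ' \n')

# Compare against the well-known byte order marks.
case "$first3" in
  efbbbf) kind="UTF-8 BOM" ;;
  fffe*)  kind="UTF-16 little-endian BOM" ;;
  feff*)  kind="UTF-16 big-endian BOM" ;;
  *)      kind="no BOM" ;;
esac
echo "$kind"
rm -f "$sample"
```

Replace the sample with one of your own files to see what it really starts with.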

As to the end, this could be some kind of 'signature' in a non-ASCII  
code.
Try opening them in TextEdit with an encoding other than the default,
just to see. Or use Xcode to re-encode/reinterpret the file contents.

You can use od on the command line to look at the byte sequences
('od -b' for octal output, 'od -c' for character output, where
non-displayable characters will be shown in octal or with a special
escape code).
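
Since the garbage sits at the start and the end, it is probably enough to dump just those parts rather than the whole file. A rough sketch, using an invented sample file and arbitrary byte counts:

```shell
# Hypothetical sample: two junk bytes, some text, then binary junk at the end.
sample=$(mktemp)
printf '\001\002garbage\nreadable text\nEND\n\000\377\376' > "$sample"

# First 16 bytes as characters; non-printable bytes appear as octal or escapes.
head -c 16 "$sample" | od -c

# Last 8 bytes as octal values.
tail -c 8 "$sample" | od -b

# The same idea captured in a variable (-An drops the offset column).
head_dump=$(head -c 4 "$sample" | od -An -c)
rm -f "$sample"
```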

Once you know exactly what you have and what you want to throw away,
it should be quite simple to use sed, awk, or perl to get the job
done. You could also just delete any characters less than a space
with tr -d 'character set', keeping in mind that tab and newline are
also less than a space.
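
As an illustration of the tr approach, here is one possible character set. Keeping tab, newline and carriage return is my assumption about what you want; adjust the octal ranges to taste. The sample data is invented.

```shell
# Hypothetical sample with a few control characters mixed into the text.
sample=$(mktemp)
printf 'one\001\002\ttwo\000\nthree\033\n' > "$sample"

# Delete control characters, but keep tab (\011), newline (\012) and
# carriage return (\015). LC_ALL=C makes tr work on raw bytes.
cleaned=$(LC_ALL=C tr -d '\000-\010\013\014\016-\037\177' < "$sample")
printf '%s\n' "$cleaned"
rm -f "$sample"
```

Note that this leaves bytes above 127 alone, so any non-ASCII garbage would survive it.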

If it is just the first line, and then everything from a well-known
point onwards, sed would be:

sed '1,1d; /pattern/,$d' file >new_file
where pattern is the unique pattern marking the first line to be
discarded; that line and everything after it are deleted.
You could also use a substitution to keep the matching line itself
and discard only what follows it.
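
A sketch of that variant, where the marker line is kept and anything after the marker is cut off. 'END OF TEXT' is just a placeholder for your own end string, and the sample file is invented:

```shell
# Hypothetical sample: junk first line, text, the marker followed by junk,
# then more junk lines.
sample=$(mktemp)
printf '\001\002junk\nline one\nline two\nEND OF TEXT\377\376\n\377\001more junk\n' > "$sample"

# 1d drops the first line. On the marker line, the substitution trims
# whatever follows the marker, and q prints that line and stops reading.
# LC_ALL=C lets sed treat the binary junk as plain bytes.
result=$(LC_ALL=C sed '1d; /END OF TEXT/{s/\(END OF TEXT\).*/\1/;q;}' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

If the marker contains characters that are special to sed (such as . * [ or /), they would need escaping in both places.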

There are, of course, many other ways of accomplishing the same!

Robert


> file.  However, the first line of the file is a few "garbage"
> characters, and the last 1-10 pages are all "garbage" binary  
> characters.
>
> All I'd like to do is make a script that would strip off the first
> line and then remove all the garbage characters from the end of the
> file.  The text of the files always end with the same set of
> characters so I had hoped to find a way to basically do something  
> like:
>
> 	delete from the end of $EndString to the end of the file
>
> 	where $EndString would be the last text I want to keep and is unique
> in the file.
>
> The fact that the file has binary data may make line counting hard,
> grep didn't seem to be able to return a line number.
>
> Ben

Robert Frank
Departement Informatik   FGB
Universität Basel
Klingelbergstrasse 50
CH-4056 Basel
Switzerland

tel. +41 (0)61 267 14 66
fax. +41 (0)61 267 14 61
Robert.Frank at unibas.ch
http://www.informatik.unibas.ch/personen/frank_r.html




