[X-Unix] text file parsing?
Robert Frank
robert.frank at unibas.ch
Wed May 3 07:40:55 PDT 2006
On 03/05/2006, at 15:40, x-unix-request at listserver.themacintoshguy.com wrote:
> I'm trying to write a bash script for something I thought would be
> simple, but haven't been able to figure it out.
>
> I have some files that are essentially text files, but have binary
> data in them. For instance, using grep I need to use the "-a" option
> to get any output.
>
> When one opens these files in a text editor you see a readable text
Hmm, sounds like they might be UTF-8? Such files look like normal
text files except for some characters which aren't ASCII. In fact,
UTF-8 and UTF-16 text files will often (but not always, XML files
being one exception) start with a special sequence, the byte order
mark, which tells the reading software which encoding the file uses.
As for the garbage at the end, this could be some kind of 'signature'
in a non-ASCII encoding.
Try opening them in TextEdit with an encoding other than the default,
just to see. Or use Xcode to re-encode or reinterpret the file contents.
You can use od on the command line to look at the byte sequences
('od -b' for octal byte output, 'od -c' for character output, where
non-displayable characters are shown in octal or as a special escape
code).
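As a rough illustration of what to look for, the sketch below builds a small sample file that starts with a UTF-8 byte order mark and inspects it with od. The sample file and its contents are made up for the example; your files may well start with something else entirely.

```shell
#!/bin/sh
# Hypothetical sample: a UTF-8 byte order mark (bytes EF BB BF,
# octal 357 273 277) followed by ordinary text.
sample=$(mktemp)
printf '\357\273\277hello\n' > "$sample"

# First three bytes in octal; '-A n' suppresses the offset column,
# and tr squeezes out the spacing so the value is easy to compare.
first_bytes=$(head -c 3 "$sample" | od -A n -b | tr -d ' \n')
echo "first bytes (octal): $first_bytes"

# The same file as characters; bytes that are not printable ASCII
# show up as octal numbers or escapes such as \n.
od -c "$sample"

rm -f "$sample"
```

If the first bytes of your own files come out as 357 273 277, that would point to UTF-8 with a byte order mark; anything else needs a closer look.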
Once you know exactly what you have and what you want to throw away,
it should be quite simple to use sed, awk, or perl to get the job
done. You could also just delete any characters less than a space
with tr -d 'character set'.
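A minimal sketch of the tr approach, assuming you want to keep tabs and newlines and drop every other byte below a space. The sample data is invented; LC_ALL=C is set so that tr works on raw bytes regardless of the current locale.

```shell
#!/bin/sh
# Hypothetical sample containing a few control bytes (octal 001 and 000)
# mixed into otherwise readable text.
sample=$(mktemp)
printf 'ab\001c\000d\tline\n' > "$sample"

# Delete bytes 000-010 and 013-037. Tab (011) and newline (012) are
# deliberately left out of the set so the text layout survives.
cleaned=$(LC_ALL=C tr -d '\000-\010\013-\037' < "$sample")
printf '%s\n' "$cleaned"

rm -f "$sample"
```

Note that this only removes control bytes; garbage bytes above 127 would pass through untouched, so it is no substitute for cutting the file at a known point.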
If you only need to drop the first line and then everything from a
well-known point onward, the sed command would be:
sed '1,1d; /pattern/,$d' file >new_file
where pattern is the unique pattern marking the first line to be
discarded; that line and everything after it are deleted.
You could also use a substitution to keep the line containing the
pattern and discard only what follows it.
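One possible way to do the substitution variant is sketched below. It assumes the end marker is the literal string END (a placeholder for your own end string) and that it occurs only once in the file. The script is written one command per line so that it should work with both BSD and GNU sed, and LC_ALL=C is set so sed treats the binary tail as plain bytes.

```shell
#!/bin/sh
# Hypothetical sample: a garbage first line, two lines of text, and
# trailing junk after the END marker.
sample=$(mktemp)
printf 'garbage\nline one\nline two END trailing junk\nmore junk\n' > "$sample"

# 1d        drop the first line
# /END/{..} on the line containing END: cut everything after END,
#           then quit (q prints the current line before exiting)
result=$(LC_ALL=C sed '1d
/END/{
s/END.*/END/
q
}' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Because sed quits at the marker line, it never has to read through the pages of binary data that follow.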
There are, of course, many other ways of accomplishing the same!
Robert
> file. However, the first line of the file is a few "garbage"
> characters, and the last 1-10 pages are all "garbage" binary
> characters.
>
> All I'd like to do is make a script that would strip off the first
> line and then remove all the garbage characters from the end of the
> file. The text of the files always end with the same set of
> characters so I had hoped to find a way to basically do something
> like:
>
> delete from the end of $EndString to the end of the file
>
> where $EndString would be the last text I want to keep and is unique
> in the file.
>
> The fact that the file has binary data may make line counting hard,
> grep didn't seem to be able to return a line number.
>
> Ben
Departement Informatik FGB tel +41 (0)61 267 14 66
Universität Basel fax. +41 (0)61 267 14 61
Robert Frank
Klingelbergstrasse 50 Robert.Frank at unibas.ch
CH-4056 Basel
Switzerland http://www.informatik.unibas.ch/personen/frank_r.html