[X-Unix] Parsing or splitting a text file

William H. Magill magill at mcgillsociety.org
Thu Sep 29 09:03:02 PDT 2005


On 28 Sep, 2005, at 16:07, Doug McNutt intoned:
> At 15:16 -0400 9/28/05, Richard Nagle wrote:
>
>> What would be the fastest way,
>> of splitting a large text document into separate doc,
>> via a common identifier "From ???@???"
>>
>> This From ???@???, is at the top of every email document,
>> inside this large single text file.
>> this file contains about 15,000 emails.
>
> That sounds like a job for perl.
>
> I was fussing with mailboxes produced by Eudora that look a lot  
> like that. (This list for instance.) I was collecting and sorting  
> by Subject: line to make one file for each subject. It's been  
> a while since I looked at it.  When I realized I needed to handle  
> "Antwort", "AW", and others I got discouraged.  Ask off line if  
> you'd like a copy of my last known perl script.

The original file sounds like it is in standard "mbox" format.
          The first line of each message begins with "From "
          (the word From and a space, not the "From:" header).
          Each message is separated from the next
          by a single blank line above the next "From " line.

(This format is the format used by Apple's Mail.app to store messages.)

Briefly, "mbox" (all messages in one file) is the standard unix  
"mail" program format, while one file per message is the standard  
format for the "mh" program. These two formats have existed "forever"  
and there are any number of tools around to convert from one to the  
other.

That being the case, there are a ton of "digestifiers" and  
"undigestifiers" floating around on the net (and there have been since  
the beginning of the ARPAnet). They will either combine single messages  
into one file or take the single file and split it out into one file  
per message.
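
If you would rather roll your own than hunt down one of those tools, the
split itself is short. What follows is only a rough sketch in Python, not
any particular tool's method. It assumes the file really is mbox-like: each
message starts at a line beginning with "From " that is either the first
line of the file or sits directly below a blank line. The function names
(split_mbox, write_messages) and the numbered output files (1, 2, 3, ...
in the style of an mh folder) are my own choices for illustration.

```python
import os


def split_mbox(text):
    """Split mbox-style text into a list of message strings.

    Assumption: a message starts at a line beginning with "From "
    (note the trailing space) that is the first line of the file or
    is preceded by a blank line.
    """
    messages = []
    current = []
    prev_blank = True  # treat the start of the file like a blank line
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and prev_blank and current:
            # Separator found: close off the message collected so far.
            messages.append("".join(current))
            current = []
        current.append(line)
        prev_blank = line.strip() == ""
    if current:
        messages.append("".join(current))
    return messages


def write_messages(messages, out_dir):
    """Write each message to its own numbered file and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, message in enumerate(messages, start=1):
        path = os.path.join(out_dir, str(number))
        # newline="" keeps the original line endings untouched.
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(message)
        paths.append(path)
    return paths
```

To use it, read the big file into a string, pass it to split_mbox, and
hand the result to write_messages along with a directory name. A body line
that happens to start with "From " right after a blank line will be taken
for a separator, so check the message count against what you expect.
Python's standard "mailbox" module also reads mbox files and may be a
sturdier choice for a real mailbox.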

T.T.F.N.
William H. Magill
# Beige G3 [Rev A motherboard - 300 MHz 768 Meg] OS X 10.2.8
# Flat-panel iMac (2.1) [800MHz - Super Drive - 768 Meg] OS X 10.4.1
# PWS433a [Alpha 21164 Rev 7.2 (EV56)- 64 Meg] Tru64 5.1a
# XP1000  [Alpha 21264-3 (EV6) - 256 meg] FreeBSD 5.3
# XP1000  [Alpha 21264-A (EV 6.7) - 384 meg] FreeBSD 5.3
magill at mcgillsociety.org
magill at acm.org
magill at mac.com
whmagill at gmail.com



