As I mentioned before, I recently needed to archive a text conversation thread off my phone.  Plus, as a general principle, if I back up my phone, what do I do with the backup?

The app 'SMS Backup & Restore' does a perfect job of backing up my phone.  It creates an XML file that I can archive to my computer.  But then what?  In my case, the XML file was over 3GB.  Since 2016, every phone upgrade I have done has automatically carried my old data over to the new device, so I had text messages going back almost 10 years.

I like the idea of keeping all these old text messages, but not necessarily on my phone.  If I regularly back up my phone, then 1) I don't have to worry as much about losing my phone, and 2) I don't have to keep 10+ years of data directly on my phone.

But back to the main problem: once the (3GB) XML file is on my computer (or in my case, on my home server), what do I do with it?

I found two applications that extract data from the backup XML file, but neither came close to what I needed.  I wanted to extract a single conversation, but more generally I wanted access to everything in the XML file.  Neither app did that.  So ... I wrote one.

I chose Java for several reasons:
Next I had to decide how I wanted it to work.
Just to speed things along, I started with the XML example app that comes with the Java 8 SDK.

First, I needed to understand the structure of the XML file I needed to parse.  For the text messages, it's basically a list of messages.  The elements look like this:

  <smss>
    <sms></sms>
    <sms></sms>
    ...
  </smss>
  <mmss>
    <mms></mms>
    <mms></mms>
    ...
  </mmss>

The SMS elements are simple.  Each one consists entirely of a single <sms> tag with a bunch of attributes.
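
Here is a minimal sketch of reading those attributes with the SAX parser that ships with Java.  It is not the code from my app.  The attribute names "address" and "body" are assumptions for illustration, so check them against your own backup file.  The class and method names are made up too.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SmsAttributeDemo {
    /** Collects one "address: body" string for every <sms> element seen. */
    static class SmsHandler extends DefaultHandler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("sms".equals(qName)) {
                // Assumed attribute names; everything lives on the tag itself.
                messages.add(attrs.getValue("address") + ": " + attrs.getValue("body"));
            }
        }
    }

    /** Parses an XML string and returns the collected messages. */
    static List<String> parse(String xml) throws Exception {
        SmsHandler handler = new SmsHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.messages;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<smss><sms address=\"5551234567\" body=\"Hello\"/></smss>";
        for (String message : parse(xml)) {
            System.out.println(message);
        }
    }
}
```

Because SAX streams the document, nothing here depends on the size of the file.  Each <sms> element is handled and forgotten as it goes by.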

The MMS elements consist of subelements:

  <parts>
    <part></part>
    ...
  </parts>
  <addrs>
    <addr></addr>
    ...
  </addrs>

Within the <part> elements are the text part of the message and the multimedia parts.  The <addr> elements provide a list of participants in the conversations.

Using the XML example Java application, it would have been easy to just recognize each of these elements.  Context wouldn't really matter for this application.  For example, the <addr> element only ever occurs inside the <addrs> element, which only ever occurs inside the <mms> element. So I could have just had one level with a big case statement for each of the tags, and it would have been sufficient for this application.  But that's bad practice, and I just couldn't do that (sorry).

So, how do I create context in an XML element stream?  I'm a big fan of recursive descent parsers for jobs like this, but the element stream only exists at the top, with the 'startElement()', 'endElement()' and 'endDocument()' methods called by the SAX parser.  So I decided to create sub-parsers with their own 'startElement()' and 'endElement()' methods, and have the top pass those method calls down.

For example:
    main->startElement()
        mmsParser->startElement()
            AddrParser->startElement()
    main->endElement()
        mmsParser->endElement()
            AddrParser->endElement()
At each level, when the parser sees an element, it creates a subparser for that element.
This is my cheap and dirty recursive descent for XML.
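
To make the idea concrete, here is a stripped-down sketch of that delegation.  It is an illustration and not the code from my app: the class names, the boolean "I'm done" return value from 'endElement()', and the "address" attribute are all assumptions.  It only collects the participants of each <mms> element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SubParserDemo {
    /** Handles one <addr> element; "address" is an assumed attribute name. */
    static class AddrParser {
        final String address;
        AddrParser(Attributes attrs) { address = attrs.getValue("address"); }
        void startElement(String name, Attributes attrs) { /* <addr> has no children here */ }
        /** Returns true once the <addr> element itself has closed. */
        boolean endElement(String name) { return "addr".equals(name); }
    }

    /** Handles one <mms> element and hands <addr> children to a sub-parser. */
    static class MmsParser {
        final List<String> addresses = new ArrayList<>();
        private AddrParser addr; // active sub-parser, or null

        void startElement(String name, Attributes attrs) {
            if (addr != null) addr.startElement(name, attrs);
            else if ("addr".equals(name)) addr = new AddrParser(attrs);
            // Wrapper elements such as <addrs> and <parts> are ignored in this sketch.
        }

        /** Returns true once the <mms> element itself has closed. */
        boolean endElement(String name) {
            if (addr != null) {
                if (addr.endElement(name)) { addresses.add(addr.address); addr = null; }
                return false;
            }
            return "mms".equals(name);
        }
    }

    /** The top level: the only object the SAX parser calls directly. */
    static class MainHandler extends DefaultHandler {
        final List<List<String>> participantsPerMms = new ArrayList<>();
        private MmsParser mms; // active sub-parser, or null

        @Override
        public void startElement(String uri, String local, String qName, Attributes attrs) {
            if (mms != null) mms.startElement(qName, attrs);
            else if ("mms".equals(qName)) mms = new MmsParser();
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (mms != null && mms.endElement(qName)) {
                participantsPerMms.add(mms.addresses);
                mms = null;
            }
        }
    }

    static List<List<String>> parse(String xml) throws Exception {
        MainHandler handler = new MainHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.participantsPerMms;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mmss><mms><addrs><addr address=\"111\"/><addr address=\"222\"/></addrs></mms></mmss>";
        System.out.println(parse(xml));
    }
}
```

Each level only knows about its direct children, so adding a parser for <part> would mean touching MmsParser and nothing else.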

The application has two parts to it:
  1. Parse the XML document
  2. Create the HTML files
The second part is all done in the 'endDocument()' method.  By then the parser has created a list of conversations, and each conversation has a list of messages.  Here I iterate through the conversations, and
  1. add an entry to the index page.
  2. create a conversation page with all the messages and multimedia.
This app reads the XML file, and writes out all the HTML and media files.
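
The two steps above can be sketched roughly like this.  It is a simplified illustration, not my actual output code: the map of conversation name to message texts, the file names, and the bare-bones HTML are all assumptions, and media is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HtmlWriterDemo {
    /** Escapes the characters that would otherwise break the HTML. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Writes index.html plus one page per conversation into outDir. */
    static void writeSite(Map<String, List<String>> conversations, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        List<String> index = new ArrayList<>();
        index.add("<html><body><ul>");
        int n = 0;
        for (Map.Entry<String, List<String>> conv : conversations.entrySet()) {
            String fileName = "conversation" + (n++) + ".html";
            // 1. Add an entry to the index page.
            index.add("<li><a href=\"" + fileName + "\">" + escape(conv.getKey()) + "</a></li>");
            // 2. Create the conversation page with all its messages.
            List<String> page = new ArrayList<>();
            page.add("<html><body><h1>" + escape(conv.getKey()) + "</h1>");
            for (String message : conv.getValue()) {
                page.add("<p>" + escape(message) + "</p>");
            }
            page.add("</body></html>");
            Files.write(outDir.resolve(fileName), page, StandardCharsets.UTF_8);
        }
        index.add("</ul></body></html>");
        Files.write(outDir.resolve("index.html"), index, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> conversations = new LinkedHashMap<>();
        conversations.put("Alice", Arrays.asList("Hi", "Lunch at 12?"));
        writeSite(conversations, Paths.get("archive-demo"));
    }
}
```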

ISSUES

Images:
The first main issue I had was storing the images.  Initially I just parsed the document and waited until 'endDocument()' to write anything out to files.  Well, I have a 3GB XML file.  Parsing it and holding everything in memory possibly took a lot more than 3GB.  That worked on my home workstation (with 128GB of RAM), but when I tried to work on this on my laptop while traveling, I couldn't, as I kept getting 'out of memory' Java exceptions.

Solution: Since I was going to store all these images to files later anyway, I am storing them to file immediately, and only keeping the filename for the end.
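
A rough sketch of the idea, assuming the media arrives as a Base64 string (that encoding, the class name, and the file naming scheme are assumptions for illustration, not details taken from my app):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

class MediaSpillDemo {
    private final Path mediaDir;
    private int counter = 0;

    MediaSpillDemo(Path mediaDir) throws IOException {
        this.mediaDir = Files.createDirectories(mediaDir);
    }

    /**
     * Decodes one attachment, writes it to disk right away, and returns
     * only the file name, so the bytes can be garbage collected.
     */
    String save(String base64Data, String extension) throws IOException {
        // The MIME decoder tolerates line breaks inside the encoded data.
        byte[] bytes = Base64.getMimeDecoder().decode(base64Data);
        String fileName = "media" + (counter++) + "." + extension;
        Files.write(mediaDir.resolve(fileName), bytes);
        return fileName;
    }
}
```

The parser would call 'save()' as soon as it sees an attachment and store just the returned name in the message, so at most one image is held in memory at a time.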

HEIC images:
I have HEIC images in my text messages, but Firefox can't display them.

Solution:  I have a script that postprocesses the output after parsing the XML file that finds and converts all the HEIC images to JPEG images.

Phone Numbers:
I explicitly wrote this to display phone numbers on the assumption that they are US-based numbers.  This will have problems if there are international phone numbers in the XML files.

Solution:
Write a bunch of code that normalizes phone numbers (with or without the + and with or without the country code).
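
As an illustration of the kind of normalization involved, here is a small US-centric sketch.  The rules (10 digits means a US number, 11 digits starting with 1 means a US number with its country code) and the output format are assumptions for this example, not necessarily what my app does.

```java
class PhoneNormalizer {
    /**
     * Reduces the common spellings of a US number to one form, "+1" followed
     * by ten digits. Numbers that already carry a "+" keep their country code.
     */
    static String normalize(String raw) {
        String trimmed = raw.trim();
        String digits = trimmed.replaceAll("\\D", "");
        if (digits.isEmpty()) {
            return trimmed; // nothing numeric to normalize
        }
        if (trimmed.startsWith("+")) {
            return "+" + digits; // already has a country code
        }
        if (digits.length() == 10) {
            return "+1" + digits; // US number without the country code
        }
        if (digits.length() == 11 && digits.startsWith("1")) {
            return "+" + digits; // US number with the country code but no "+"
        }
        return digits; // short codes and anything else are left alone
    }

    public static void main(String[] args) {
        System.out.println(normalize("(555) 123-4567"));
        System.out.println(normalize("1-555-123-4567"));
        System.out.println(normalize("+15551234567"));
    }
}
```

All three calls in 'main()' produce the same string, which is the point: the same person should end up in the same conversation no matter how the number was written.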

FUTURES

I'm tentatively planning on open-sourcing this app and putting it in a public Git repository.  Let me know if you would be interested in that.

I'm thinking that archiving text messages, and having access to them literally forever, might be useful to a lot of people, so I'm considering a cloud service that provides storage for all your backups and (using this app) access to all the messages in all your backups.  Let me know if you'd be interested in that too.

There are lots of tweaks and improvements I can make, but it does everything I want now.  We'll have to see where this goes.