Parsing USGS QuakeML files with Python

The USGS provides earthquake lists in a number of formats, including KML for Google Earth and easy-to-read comma-separated value (CSV) files that are simple to parse with any spreadsheet, MATLAB, Mathematica, or many Unix tools. More information is available in the JSON-formatted files, which are the preferred format for downloading the data feeds. The XML-formatted data, stored in the community-standard QuakeML dialect, contain the most information. XML is not an especially human-readable format, but it is compatible with the database-centric workflows often used in modern earthquake monitoring. Accessing the data directly from the databases makes translation errors less likely (or at least makes the data consistent).

My goal in these notes is not to explain the QuakeML (QML) format (I couldn't if I wanted to). You can read about QML at https://quake.ethz.ch/quakeml/; it's a complex beast of a data format that can store many of the different types of information used in earthquake analyses. Information items range from seismic event hypocenters and origin times (collectively called origins) to individual seismic-wave arrival-time and amplitude measurements. QML has an extensive schema and allows much of the seismic data to be encoded as they are measured. For example, an arrival-time pick can be created and stored before the earthquake location associated with that arrival is estimated. The same is true for seismic-wave amplitudes. Storing the information as it is measured is more robust and better suited to near-real-time monitoring flows. This helps produce quick locations and magnitudes, which in turn lead to quick earthquake alerts.

At the analysis end of the flow, this organization can be less convenient than many older formats. For example, information that a seismologist might like to list together may be split apart in the QML file. An arrival-time pick and/or a measured wave amplitude is stored as part of the event object, but a travel-time residual must be associated with an event origin object. And a single event may have multiple origins; in the simplest case, one origin could be the hypocenter estimated from body-wave arrival times, and another could be the rupture centroid estimated from a moment-tensor inversion of long-period waveforms. These files can be complicated. Another issue with USGS QML is that sometimes you get more information and other times less. For example, a recent earthquake file might include the station, network, channel, and location codes for the waveform used for a measurement, while a file for an event from a few years earlier may provide only the station ID. Changing content can have you pulling your hair out when trying to extract information.

QuakeML can also simply be confusing. Look at this encoding of an arrival-time pick:

.
.
.
        <arrival publicID="quakeml:us.anss.org/arrival/200081v8/us_3395_bpid-2001275574">
          <pickID>quakeml:us.anss.org/pick/200081v8/us_3395_bpid-2001275574</pickID>
          <phase>Pn</phase>
          <azimuth>226.495</azimuth>
          <distance>13.8121</distance>
          <takeoffAngle>
            <value>71.6</value>
          </takeoffAngle>
          <timeResidual>1.30</timeResidual>
          <timeWeight>0.6300</timeWeight>
          <creationInfo>
            <agencyID>us</agencyID>
            <agencyURI>smi:anss.org/metadata/agencyid/us</agencyURI>
            <author>manual</author>
            <creationTime>2016-12-17T10:59:33Z</creationTime>
          </creationInfo>
        </arrival>
.
.
.

Look carefully and you will see that the azimuth, distance, timeResidual, and timeWeight items hold their values directly, while the takeoffAngle item has a sublevel element to store its value. I don't know why. Variability in data-storage practice adds flexibility, but it also makes code hard to maintain.
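
With ElementTree (used below), the two cases have to be read differently. A minimal illustration, assuming xarr holds the arrival element above and ns is the namespace map defined later in this post:

# direct value: the element's text is the number
azimuth = xarr.find('d:azimuth', ns).text               # '226.495'
# nested value: descend one more level to the <value> element
takeoff = xarr.find('d:takeoffAngle/d:value', ns).text  # '71.6'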

Parsing USGS QuakeML with Python

Python provides a number of packages for parsing XML, but all require some custom code to create objects or dictionaries to hold the information. In the following, I use the ElementTree package, which is part of the standard Python distribution (since Python 3.3 the fast C implementation, formerly the separate cElementTree module, is used automatically).
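
The code below assumes the module has been imported under the name ElementTree:

import xml.etree.ElementTree as ElementTree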

The code below is what I use to start parsing USGS QuakeML; however, since the format is so "flexible", every time I try to unpack another event I find another "issue". I try to store most of the items from the original file in Python dictionaries, which I think are one of the nicest container classes available. Functionality could be improved if I made a class, but to be honest, I am often interested in a quick look at the data or in reformatting the information so I can import it into another tool like Mathematica. At least that part of the code is flexible, extensible, and easy to use.
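
The parsing functions lean on three small helpers: two that pull an optional item out of the tree and return the sentinel 'NA' when it is missing, and one that searches a list of dictionaries. Here are minimal versions (they assume the namespace map ns defined further down):

def get_xitem_as_text(node, tag):
    # text of a direct child element, or 'NA' if absent or empty
    item = node.find(tag, ns)
    if item is None or item.text is None:
        return 'NA'
    return item.text.strip()
#
def get_xitem_value_as_text(node, tag, subtag):
    # text of a nested child (e.g. <depth><value>...</value></depth>), or 'NA'
    item = node.find(tag, ns)
    if item is None:
        return 'NA'
    sub = item.find(subtag, ns)
    if sub is None or sub.text is None:
        return 'NA'
    return sub.text.strip()
#
def search_pdicts(key, value, dicts):
    # all dictionaries whose entry for key matches value, or None if no match
    matches = [d for d in dicts if d.get(key) == value]
    return matches if matches else None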

#
# To make outputting information simple, I ensure that certain values are in each dictionary,
#   whether they are defined in the xml or not. These dictionaries set up default values,
#   but as the xml is parsed, defined key-value pairs are updated.
#
defaultPick = {'stationCode':'--','networkCode':'--','channelCode':'--',
               'locationCode':'--','phase':'NA','time':'NA'}
#
defaultArrival = {'genericAmplitude':'NA','type':'NA','unit':'NA',
                  'period':'NA', 'evaluationMode':'NA','timeResidual':'NA',
                  'timeWeight':'NA'}
#
defaultAmplitude = {'pickID':'NA','genericAmplitude':'NA','period':'NA',
                  'unit':'NA', 'evaluationMode':'NA'}                  
#
#---------------------------------------------------------------------------------
def parse_origins(xevent):
    xorigins = xevent.findall('d:origin',ns)
    origins = []
    for xorigin in xorigins:
        anOrigin = xorigin.attrib.copy()
        anOrigin.update({
            'otime'      : get_xitem_value_as_text(xorigin,'d:time','d:value'),
            'latitude'   : get_xitem_value_as_text(xorigin,'d:latitude','d:value'),
            'longitude'  : get_xitem_value_as_text(xorigin,'d:longitude','d:value'),
            'depth'      : get_xitem_value_as_text(xorigin,'d:depth','d:value'),
            'dotime'     : get_xitem_value_as_text(xorigin,'d:time','d:uncertainty'),
            'dlatitude'  : get_xitem_value_as_text(xorigin,'d:latitude','d:uncertainty'),
            'dlongitude' : get_xitem_value_as_text(xorigin,'d:longitude','d:uncertainty'),
            'ddepth'     : get_xitem_value_as_text(xorigin,'d:depth','d:uncertainty')
        })
        origins.append(anOrigin)
    return origins
#
#---------------------------------------------------------------------------------   
def parse_magnitudes(xevent):
    xmags = xevent.findall('d:magnitude',ns)
    mags = []
    for xmag in xmags:
        mdict = xmag.attrib.copy()        
        mdict.update({'mag': get_xitem_value_as_text(xmag,'d:mag','d:value')})       
        mdict.update({'magType': get_xitem_as_text(xmag,'d:type')})       
        value = get_xitem_as_text(xmag,'d:evaluationMode')
        if(value!='NA'):
            mdict.update({"evaluationMode" : value})
            
        value = get_xitem_as_text(xmag,'d:originID')
        if(value!='NA'):
            mdict.update({"originID" : value})
            
        value = get_xitem_value_as_text(xmag,'d:creationInfo', 'd:agencyID')
        if(value!='NA'):
            mdict.update({"agencyID" : value})
        #
        mags.append(mdict)
    return mags
#
#---------------------------------------------------------------------------------
def parse_picks(xev):
    xpicks = xev.findall('d:pick',ns)
    picks = []
    for pick in xpicks:
        pdict = defaultPick.copy()
        pdict.update(pick.attrib.copy())
        
        value = get_xitem_value_as_text(pick,'d:time','d:value')
        if(value!='NA'):
            pdict.update({"time" :value})

        value = get_xitem_as_text(pick,'d:phaseHint')
        if(value!='NA'):
            pdict.update({"phase" :value})

        value = get_xitem_as_text(pick,'d:evaluationMode')
        if(value!='NA'):
            pdict.update({"evaluationMode" :value})

        xwid = pick.find('d:waveformID', ns)
        if xwid is not None:   # guard: some picks may lack a waveformID
            pdict.update(xwid.attrib)
        picks.append(pdict)
    return picks
#
#---------------------------------------------------------------------------------
def parse_arrivals(xorigin):
    xarrivals = xorigin.findall('d:arrival',ns)
    arrivals = []
    for xarr in xarrivals:
        adict = defaultArrival.copy()
        value = get_xitem_as_text(xarr,'d:pickID')
        if(value!='NA'):
            adict.update({"pickID" :value})
        value = get_xitem_as_text(xarr,'d:phase')
        if(value!='NA'):
            adict.update({"phase" :value})
        value = get_xitem_as_text(xarr,'d:azimuth')
        if(value!='NA'):
            adict.update({"azimuth" :value})
        value = get_xitem_as_text(xarr,'d:distance')
        if(value!='NA'):
            adict.update({"distance" :value})
        value = get_xitem_value_as_text(xarr,'d:takeoffAngle','d:value')  # takeoffAngle nests its value
        if(value!='NA'):
            adict.update({"takeoffAngle" :value})
        value = get_xitem_as_text(xarr,'d:timeResidual')
        if(value!='NA'):
            adict.update({"timeResidual" :value})
        value = get_xitem_as_text(xarr,'d:timeWeight')
        if(value!='NA'):
            adict.update({"timeWeight" :value})
        arrivals.append(adict)
    return arrivals    
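#
#  Aside: the repeated fetch-and-update pattern above could be collapsed into a
#    table-driven helper; a sketch, equivalent for the simple (non-nested) items:
#
def update_from_tags(adict, node, tags):
    for tag in tags:
        value = get_xitem_as_text(node, 'd:' + tag)
        if value != 'NA':
            adict[tag] = value
#
#  e.g. inside parse_arrivals:
#    update_from_tags(adict, xarr, ['pickID','phase','azimuth',
#                                   'distance','timeResidual','timeWeight'])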
    
#---------------------------------------------------------------------------------
#  Extract the amplitude items from the xml
#
def parse_amplitudes(xevent):
    xamplitudes = xevent.findall('d:amplitude',ns)
    amplitudes = []
    for xamp in xamplitudes:
        adict = defaultAmplitude.copy()   # defaults first, so parsed values take precedence
        adict.update(xamp.attrib)

        xwid = xamp.find('d:waveformID',ns)
        if xwid is not None:
            adict.update(xwid.attrib)
        
        value = get_xitem_value_as_text(xamp,'d:genericAmplitude','d:value')
        if(value!='NA'):
            adict.update({"genericAmplitude" :value})

        value = get_xitem_as_text(xamp,'d:unit')
        if(value!='NA'):
            adict.update({"unit" :value})

        value = get_xitem_value_as_text(xamp,'d:period','d:value')
        if(value!='NA'):
            adict.update({"period" :value})
        
        value = get_xitem_as_text(xamp,'d:evaluationMode')
        if(value!='NA'):
            adict.update({"evaluationMode" :value})
        
        value = get_xitem_as_text(xamp,'d:twindowbegin')
        if(value!='NA'):
            adict.update({"twindowbegin" :value})
        
        value = get_xitem_as_text(xamp,'d:twindowend')
        if(value!='NA'):
            adict.update({"twindowend" :value})
        
        value = get_xitem_as_text(xamp,'d:twindowref')
        if(value!='NA'):
            adict.update({"twindowref" :value})
             
        amplitudes.append(adict)

    return amplitudes

#---------------------------------------------------------------------------------
#
# 'distance', 'timeResidual', 'publicID', 'timeWeight', 'time', 
#     'networkCode', 'evaluationMode', 'stationCode', 'pickID', 
#     'azimuth', 'phase', 'channelCode', 'takeoffAngle', 'locationCode'
#
#---------------------------------------------------------------------------------
def merge_arrivals_picks(arrivals,picks):
    merged = []
    for a in arrivals:
        pid = a['pickID']
        p = search_pdicts('publicID', pid, picks)
        m = a.copy()
        if(p != None):
            m.update(p[0])
        merged.append(m)
    return merged

#---------------------------------------------------------------------------------
# Make a simple tab-separated table of the picks with weights greater
#    than or equal to minWeight
def list_arrival_time_picks(arrivalTimePicks, minWeight=0.0):
    print('StationChannel\tphase\ttime\tdistance\tazimuth\tResidual\tWeight')
    for ap in arrivalTimePicks:
        if float(ap['timeWeight']) >= minWeight:
            try:
                s0 = ap['stationCode']+'-'+ap['networkCode']+'-'+ap['channelCode']+'-'+ap['locationCode']
                s0 += '\t'+ap['phase']+'\t'+ap['time']
                s0 += '\t'+ap['distance']+'\t'+ap['azimuth']
                s0 += '\t'+ap['timeResidual']+'\t'+ap['timeWeight']
                print(s0)
            except KeyError:
                print('Incomplete arrival time observation.')
#
#---------------------------------------------------------------------------------
def list_magnitudes(mags):
    print('magType\tagencyID\tmagnitude')
    for mag in mags:
        # agencyID is only set when present in the file, so use a default
        print("%s\t%s\t%s" % (mag['magType'], mag.get('agencyID','NA'), mag['mag']))
#
#---------------------------------------------------------------------------------
# get the preferred origin from the eventInfo dict and the origins list
#
def get_preferred_origin(eventInfo,origins):
    preforigin = eventInfo['preferredOriginID'].lower().split("/")[-1]
    for origin in origins:
        pID = origin['publicID'].lower().split("/")[-1]
        if(pID == preforigin):
            return origin
#

Processing a USGS QuakeML file

We can use the functions defined above to extract event, origin, magnitude, and phase-pick information from a file containing a list of events. Here is one version of the code to do so.

# namespaces employed at USGS
# (it would be nicer to parse these from the file itself; see the sketch
#   after parse_usgs_xml below)
#
ns = {"q":"http://quakeml.org/xmlns/quakeml/1.2",
      "d":"http://quakeml.org/xmlns/bed/1.2",
      "catalog":"http://anss.org/xmlns/catalog/0.1",
      "tensor":"http://anss.org/xmlns/tensor/0.1"}

def parse_usgs_xml(filepath):
    #
    # you can also load the xml from online, e.g.:
    #   import urllib.request
    #   url = 'http://earthquake.usgs.gov/realtime/product/phase-data/us20007z6r/us/1481210600040/quakeml.xml'
    #   with urllib.request.urlopen(url) as response:
    #       xroot = ElementTree.fromstring(response.read())
    #
    xtree = ElementTree.parse(filepath)
    xroot = xtree.getroot()
    #
    xeventParameters = xroot.findall('d:eventParameters',ns)
    #
    for ep in xeventParameters:
        xevents = ep.findall('d:event',ns)
        print "Found %d events." % (len(xevents))
    #    
    events = []
    #
    i = 0
    for xev in xevents:
        # build an event dictionary 
        ev = {}
        ev['eventid'] = xev.attrib["{http://anss.org/xmlns/catalog/0.1}eventid"]
        ev['publicID'] = xev.attrib['publicID']
        ev['eventsource'] = xev.attrib['{http://anss.org/xmlns/catalog/0.1}eventsource']
        ev['datasource'] = xev.attrib['{http://anss.org/xmlns/catalog/0.1}datasource']
        ev['preferredOriginID'] = xev.find("d:preferredOriginID",ns).text
        ev['preferredMagnitudeID'] = xev.find("d:preferredMagnitudeID",ns).text
        #
        mags = parse_magnitudes(xev)
        picks = parse_picks(xev)
        amplitudes = parse_amplitudes(xev)
        #
        preforigin = ev['preferredOriginID'].lower().split("/")[-1]
        xorigins = xev.findall('d:origin',ns)
        origins = parse_origins(xev)
        # default to the first origin, then search for the preferred one
        pxorigin = xorigins[0]
        for n, xorigin in enumerate(xorigins):
            pID = origins[n]['publicID'].lower().split("/")[-1]
            if(pID == preforigin):
                pxorigin = xorigin
        #
        arrivals = parse_arrivals(pxorigin)
        #
        events.append({'eventInfo':ev,'origins':origins,'magnitudes':mags,'picks':picks,\
                        'arrivals':arrivals,'amplitudes':amplitudes})
        #
        i += 1
        #
        print "parsed %d events." % (i)
        #
    return events
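
As an aside on the hard-coded namespace table above: ElementTree can recover the prefix-to-URI declarations from the file itself. A sketch (read_namespaces is a hypothetical helper; mapping the default, unprefixed namespace to the 'd' prefix matches the table used in this post):

def read_namespaces(filepath):
    # collect the prefix -> URI declarations as they appear in the document
    nsmap = {}
    for _, (prefix, uri) in ElementTree.iterparse(filepath, events=('start-ns',)):
        # the default (unprefixed) namespace comes back as ''
        nsmap[prefix if prefix else 'd'] = uri
    return nsmap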

Here's the structure of each event in the returned list:

an_event_dictionary = {'eventInfo':eventInfo, 'origins':origins, 'magnitudes':magnitudes,
                       'picks':picks, 'arrivals':arrivals, 'amplitudes':amplitudes}

Each event is stored as a dictionary of dictionaries, and the result is a list of these event dictionaries. The origins, magnitudes, picks, and amplitudes are extracted and stored in each event dictionary, along with the arrivals belonging to the preferred origin.

Example

You can use the functions to convert a USGS XML file into a python dictionary. Note that the functions do not extract all the information from these files (for example, I don’t parse the moment tensors).

events = parse_usgs_xml("20120411-OffShoreSumatra.xml")
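
Once parsed, the pieces are reachable by key. For example:

ev0 = events[0]
print(ev0['eventInfo']['eventid'])
print(len(ev0['picks']), 'picks and', len(ev0['arrivals']), 'preferred-origin arrivals')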

Using the Information

At this point, the event information is contained in a series of standard python dictionaries. I have focused on QML files that contain one event, but the basic building blocks can be used to process QML containing multiple events (with a little modification). Here are some example actions to take with the events list.

list_magnitudes(events[0]['magnitudes'])

   magType	agencyID	magnitude
      Mw	official	8.6
     Mww	US	8.6
      Mb	US	7.4
      Ms	US	8.5
     Mwc	US	8.6
      Me	US	9.0
     Mwc	GCMT	8.6
      Mw	PPT	8.7

Arrivals and picks in the XML are contained in separate lists (better for operational tasks – you can have many picks, and they may not all associate with an event). For most applications a seismologist wants information from both the pick item and the arrival item, so I merge arrivals and picks into arrivalTimePicks and then list the information.

#
#  print out the arrival times by merging the arrivals with the picks
#
arrivals = events[0]['arrivals']
picks = events[0]['picks']
#
arrivalTimePicks = merge_arrivals_picks(arrivals,picks)
#
if(len(arrivalTimePicks) > 0):
    list_arrival_time_picks(arrivalTimePicks,-1)
StationChannel	phase	time	distance	azimuth	Residual	Weight
GSI-IR------	Pn	2012-04-11T08:39:43.960Z	4.62	102.6	-1.6	1
LHMI-IR------	Pn	2012-04-11T08:39:47.830Z	4.83	53.2	-0.5	1
PSI-IR------	Pn	2012-04-11T08:40:02.970Z	5.87	86.3	0.2	1
KULM-IR------	Pn	2012-04-11T08:40:32.770Z	8.12	68.5	-0.8	1
BKNI-IR------	Pn	2012-04-11T08:40:33.630Z	8.22	103.9	-1.3	1
IPM-IR------	Pn	2012-04-11T08:40:33.820Z	8.26	74.1	-1.7	1
MYKOM-IR------	Pn	2012-04-11T08:41:10.070Z	10.79	92.6	-0.2	1
MNAI-IR------	Pn	2012-04-11T08:41:24.180Z	11.91	123.9	-1.4	1
PALK-IR------	Pn	2012-04-11T08:41:43.360Z	13.26	292.2	-0.7	1
.
.
.

Plotting the travel times requires converting the date-time strings into time differences in seconds. I use Python's datetime objects to handle this. The approach is not robust across time zones, but it works for UTC. I compute the travel times and then plot them against distance using matplotlib.

import datetime
def from_utc(utcTime,fmt="%Y-%m-%dT%H:%M:%S.%fZ"):
    return datetime.datetime.strptime(utcTime, fmt)
#
preferredOrigin = get_preferred_origin(events[0]['eventInfo'],events[0]['origins'])
originDateTime = from_utc(preferredOrigin['otime'])
# Extract the distance and travel time for data used in the location
x = []
y = []
for ap in arrivalTimePicks:
    if float(ap['timeWeight']) > 0.0:
        x.append(float(ap['distance']))          # distance in degrees
        dt = from_utc(ap['time']) - originDateTime
        y.append(dt.total_seconds())
# 
import matplotlib.pyplot as plt
%matplotlib inline
#
plt.plot(x,y,'ko')
plt.axis([0, 180, 0, 1400])
plt.ylabel('Travel Time (s)')
plt.xlabel('Distance (degrees)')
plt.show()

Summary

The scripts above are a starting point for pulling information out of the USGS XML files that you can access from the Origin tab of a USGS earthquake event page. Not all the information in the file is extracted; moment tensors, for example, are omitted. Copying and pasting this code would be a hassle, so below is a link to a streamlined Jupyter Notebook version of this post. Start there; it's online so I can access it too.

parse_USGS_QML_export.ipynb

A final comment: to the extent that USGS QuakeML follows the standard, you could also use this code to parse QML obtained elsewhere, but I have not tested it on QML from other sources. At a minimum, the namespaces would have to be changed, since the catalog and tensor entries look USGS-specific. I have also only ever used it on single-event XML files, so be careful if you use it.