I don’t like XML, how can the XML file be converted to table format?

By Steven Firth, s.k.firth@lboro.ac.uk, Loughborough University, UK

The 'REFIT_BUILDING_SURVEY.xml' in the REFIT dataset can be converted to flat csv files by:

  1. Importing the relevant python packages
  2. Reading the 'REFIT_BUILDING_SURVEY.xml' file into memory
  3. Recursing through the elements in the XML file and creating a flat files for export to csv format

This notepook is developed using the 'jupyter' software and Python implementation in the Anaconda platform.

For more information see the REFIT project website.

Step 1: Importing the relevant python packages

In [1]:
from lxml import etree
import pandas as pd

Step 2: Reading the 'REFIT_BUILDING_SURVEY.xml' file into memory

This loads the xml file into the lxml objectify data structure. 'root' can then be used to access the building survey data.

In [2]:
path=r'REFIT_BUILDING_SURVEY.xml'
tree = etree.parse(path)
root = tree.getroot()
NS={'a':'http://www.refitsmarthomes.org'}

Step 3: Recursing through the elements in the XML file and creating a flat files for export to csv format

This recurses through the elements in the 'REFIT_BUILDING_SURVEY.xml' file. The information in each elements is stored in a flat table. Tables are created for each element type, i.e. 'RefitXML', 'Stock', 'Building', 'Space', 'Surface' etc. Foreign key fields are used to relate an object to its parent. The tables are saved in csv format.

In [3]:
def create_tables(element,tables,FK_column,FK_value):
    """
    A recursive function which travels through 'element' and its children.
    During the recursion, the 'tables' dictionary is populated with DataFrame tables - one for each element type   
    """
    name=element.tag.split('}')[1]
    if not name in tables.keys(): tables[name]=pd.DataFrame()#columns=attributes)
    d={}
    if not FK_column is None:
        d[FK_column]=FK_value
    for a in element.attrib.keys():
        d[a]=element.get(a)
    tables[name]=tables[name].append(d,ignore_index=True)
    tables[name].index.name=name+'_PK'
    for child in element:  
        create_tables(child,tables,name+'FK',tables[name].index.max())
        
tables={}
create_tables(root,tables,None,None) 
for name in tables.keys():
    tables[name].to_csv(name+'.csv')