By Steven Firth, s.k.firth@lboro.ac.uk, Loughborough University, UK
The 'REFIT_BUILDING_SURVEY.xml' in the REFIT dataset can be converted to flat csv files by:
This notepook is developed using the 'jupyter' software and Python implementation in the Anaconda platform.
For more information see the REFIT project website.
Step 1: Importing the relevant python packages
from lxml import etree
import pandas as pd
Step 2: Reading the 'REFIT_BUILDING_SURVEY.xml' file into memory
This loads the xml file into the lxml objectify data structure. 'root' can then be used to access the building survey data.
path=r'REFIT_BUILDING_SURVEY.xml'
tree = etree.parse(path)
root = tree.getroot()
NS={'a':'http://www.refitsmarthomes.org'}
Step 3: Recursing through the elements in the XML file and creating a flat files for export to csv format
This recurses through the elements in the 'REFIT_BUILDING_SURVEY.xml' file. The information in each elements is stored in a flat table. Tables are created for each element type, i.e. 'RefitXML', 'Stock', 'Building', 'Space', 'Surface' etc. Foreign key fields are used to relate an object to its parent. The tables are saved in csv format.
def create_tables(element,tables,FK_column,FK_value):
"""
A recursive function which travels through 'element' and its children.
During the recursion, the 'tables' dictionary is populated with DataFrame tables - one for each element type
"""
name=element.tag.split('}')[1]
if not name in tables.keys(): tables[name]=pd.DataFrame()#columns=attributes)
d={}
if not FK_column is None:
d[FK_column]=FK_value
for a in element.attrib.keys():
d[a]=element.get(a)
tables[name]=tables[name].append(d,ignore_index=True)
tables[name].index.name=name+'_PK'
for child in element:
create_tables(child,tables,name+'FK',tables[name].index.max())
tables={}
create_tables(root,tables,None,None)
for name in tables.keys():
tables[name].to_csv(name+'.csv')