Understanding the format behind GML spatial data files
and loading them into Python
Data Cleaning & Preparation
Vocabulary, then a short RCN GML fragment. Syntax and namespaces follow in later sections.
Hierarchical text · elements and attributes
XML is plain-text markup: nested elements, optional attributes, one document tree. You define tags or adopt a standard (e.g. GML); a parser builds the tree.
Beyond this course
XML also underlies Office documents (.docx / .xlsx), RSS, Android layouts, build files
Declaration (line 1) · comments (2–6, ignored by parsers) · <gml:boundedBy … /> = empty element
One example record (LOK_1) as CSV, JSON, and XML. Later material assumes XML because GML is XML.
Comma-separated values · flat tables
Comma-Separated Values — one row per record, columns for fields. Typical for spreadsheets and SQL exports.
id,pow,cena
LOK_1,60.25,750000
JavaScript Object Notation · nested objects
JSON — objects { }, arrays [ ], key–value pairs. Common in REST
APIs and document stores.
{
  "id": "LOK_1",
  "pow": 60.25,
  "uom": "m2",
  "cena": 750000
}
eXtensible Markup Language · tagged trees
XML — elements in angle brackets, attributes, a strict document tree. Vocabularies such as GML are defined with schemas (e.g. XSD).
uom="m2" can sit on the same element as the value.
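A minimal sketch of the same LOK_1 record as XML, read with the standard library. The wrapper name `lokal` is invented for illustration; `id`, `pow`, `cena` and `uom` are taken from the CSV and JSON versions of the record.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of the LOK_1 record; <lokal> is an invented wrapper name.
xml_text = """
<lokal>
  <id>LOK_1</id>
  <pow uom="m2">60.25</pow>
  <cena>750000</cena>
</lokal>
"""

record = ET.fromstring(xml_text)

pow_el = record.find("pow")
area = float(pow_el.text)        # element text is always a string, so convert it
unit = pow_el.get("uom")         # the attribute sits on the same element as the value
price = int(record.find("cena").text)

print(record.find("id").text, area, unit, price)
```

Unlike the CSV row, the unit does not need a separate column: it is read from the element that holds the number.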
Elements and attributes — toy examples first, then the same ideas in RCN GML.
Opening tag · content · closing tag
<city>Poznań</city>
Same structure — longer names, namespace prefix
rcn: is a namespace prefix; the element is still opening tag · text · closing tag.
The full meaning of prefixes is covered in the namespaces section.
Bare value → unit metadata on the opening tag
<pow>60.25</pow>
60.25 has no unit (m², ha, km²…) — for analysis, units must travel with the
number, not only in a separate column or comment.
Put metadata on the same element
<pow uom="m2">60.25</pow>
Value unchanged; uom="m2" is unit of measure (real pattern from RCN_Lokal).
Syntax: name="value" inside the opening tag <…>; values are always quoted; several attributes are separated by spaces (order does not matter).
Fragments of real geometry and property fields
srsName, gml:id = CRS + stable id
count, srsDimension = how many coordinates, 2D vs 3D
uom = units (check before comparing numbers)
xsi:nil = “no value here” (e.g. missing geometry)
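A small sketch of how these attributes could be read in Python. The fragment is invented for illustration (the element name `geometria`, the id `P1` and the `EPSG:2177` value are assumptions, not taken from the RCN file). Prefixed attributes such as xsi:nil and gml:id are looked up by their full namespace URI.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
GML = "http://www.opengis.net/gml/3.2"

# Invented fragment: one element marked as nil and one point with CRS metadata.
fragment = f"""
<root xmlns:gml="{GML}" xmlns:xsi="{XSI}">
  <geometria xsi:nil="true"/>
  <gml:Point gml:id="P1" srsName="EPSG:2177" srsDimension="2">
    <gml:pos>1.0 2.0</gml:pos>
  </gml:Point>
</root>
"""

root = ET.fromstring(fragment)

# xsi:nil="true" means "no value here", so skip the element instead of parsing it
geom = root.find("geometria")
is_nil = geom.get(f"{{{XSI}}}nil") == "true"

point = root.find(f"{{{GML}}}Point")
point_id = point.get(f"{{{GML}}}id")      # prefixed attribute: namespace URI required
srs = point.get("srsName")                # unprefixed attribute: plain name
dim = int(point.get("srsDimension"))
coords = [float(v) for v in point.find(f"{{{GML}}}pos").text.split()]

print(is_nil, point_id, srs, dim, coords)
```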
One root element, parent–child nesting, and paths to data — then a real RCN_Lokal fragment.
Not a single table — a hierarchy of elements
Every element except the root has exactly one parent. Children are fully inside the parent; that nesting is how “this address belongs to this flat” is represented.
Root: a single top-level element (gml:FeatureCollection).
Nesting: gml:featureMember and typed features such as rcn:RCN_Lokal.
Path: the route from a parent down to a value (e.g. RCN_Adres → miejscowosc).
Scalars vs nested block (address)
FeatureCollection
└─ featureMember
└─ RCN_Lokal
├─ idLokalu · rooms · floor · area · price …
└─ adresBudynkuZLokalem
└─ RCN_Adres → miejscowosc · ulica · numer …
Flat fields sit directly under RCN_Lokal; the address is a subtree (container
→ RCN_Adres → fields).
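The outline above can be walked with path expressions. This is a sketch on an invented fragment: the element names follow the outline, the values are made up, and the two namespace URIs are the ones used for the RCN file elsewhere in these slides.

```python
import xml.etree.ElementTree as ET

ns = {
    "gml": "http://www.opengis.net/gml/3.2",
    "rcn": "urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0",
}

# Invented values; only the nesting mirrors the outline.
doc = """
<gml:FeatureCollection
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:rcn="urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0">
  <gml:featureMember>
    <rcn:RCN_Lokal gml:id="LOK_1">
      <rcn:idLokalu>LOK_1</rcn:idLokalu>
      <rcn:adresBudynkuZLokalem>
        <rcn:RCN_Adres>
          <rcn:miejscowosc>Poznań</rcn:miejscowosc>
          <rcn:ulica>Przykładowa</rcn:ulica>
        </rcn:RCN_Adres>
      </rcn:adresBudynkuZLokalem>
    </rcn:RCN_Lokal>
  </gml:featureMember>
</gml:FeatureCollection>
"""

root = ET.fromstring(doc)
lokal = root.find("gml:featureMember/rcn:RCN_Lokal", ns)

# Flat field: a direct child of RCN_Lokal
flat_id = lokal.findtext("rcn:idLokalu", namespaces=ns)

# Nested field: container -> RCN_Adres -> field
city = lokal.findtext(
    "rcn:adresBudynkuZLokalem/rcn:RCN_Adres/rcn:miejscowosc", namespaces=ns
)

print(flat_id, city)
```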
RCN_Lokal (trimmed)
Real excerpt: indentation follows depth
The opening tag carries gml:id (stable id in the file).
The area value carries uom="m2".
The address is nested (adresBudynkuZLokalem → RCN_Adres → fields).
Prefixes (gml:, rcn:, …) point at standard vocabularies; you bind them once, then
query with full names in code.
Many standards · same local names · different meaning
GML, RCN, XSD, and others all define tags like id or name. A namespace ties each prefix to one vocabulary so gml:id (geometry id) and rcn:id (cadastre id) never clash.
prefix:localName is a QName; the colon only separates the prefix from the local name.
gml: main geometry (FeatureCollection, Polygon, …).
rcn: RCN features (RCN_Lokal, RCN_Transakcja, …).
xsi: schema hints (xsi:nil, schemaLocation, …).
xmlns: on the root
Each line binds a prefix to a URI, a stable vocabulary id
gml / rcn / xsi: main geometry, RCN features, schema hints (see previous slide).
xlink: xlink:href points at another feature's gml:id.
gmd / gco / gts: ISO metadata / time; often declared; fewer direct paths in exercises.
Declared once on FeatureCollection, like imports for the whole document.
Identifiers — not necessarily pages to open in a browser
http://… and urn:… strings are globally unique vocabulary
names. Parsing does not require downloading them.
ns = {
    'gml': 'http://www.opengis.net/gml/3.2',
    'rcn': 'urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0',
    'xlink': 'http://www.w3.org/1999/xlink',
}

tree.findall('gml:featureMember/rcn:RCN_Lokal', ns)
Elements may appear as {http://www.opengis.net/gml/3.2}Polygon (Clark notation) in errors and repr.
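A short sketch of what Clark notation looks like in practice, using a one-element document made up for the purpose. ElementTree stores every tag as {namespace-uri}localName, whatever prefix the file used.

```python
import xml.etree.ElementTree as ET

ns = {"gml": "http://www.opengis.net/gml/3.2"}

root = ET.fromstring('<gml:Polygon xmlns:gml="http://www.opengis.net/gml/3.2"/>')

clark_tag = root.tag                      # the prefix is gone, the URI is spelled out
local_name = clark_tag.split("}")[-1]     # keep only the part after the closing brace

# ET.QName builds the same string from a URI and a local name
same_element = ET.QName(ns["gml"], "Polygon").text == clark_tag

print(clark_tag, local_name, same_element)
```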
What the Poznań .gml represents in business terms: sales, properties, parcels, buildings,
flats, and deeds — as linked records, not one nested story.
Rejestr Cen Nieruchomości — price and transaction facts
The file is a long sequence of gml:featureMember blocks. Each block is one business object (one transaction, one parcel, one apartment, …). The same real-world deal is split across several such objects, connected by identifiers, not by putting everything inside one big XML subtree.
Filter by feature type (RCN_Transakcja, RCN_Lokal, …) and follow links to assemble a full case.
Polish tag names — plain-language role
A graph of IDs — not one nested XML document per deal
Each feature has a gml:id. Elsewhere, xlink:href attributes store
references to another feature's id
(deed, property bundle, parcel, …). So the business view is: many rows + foreign-key-style
links, same idea as joins in a database.
Typical path to “full picture”: Transakcja → nieruchomosc href →
RCN_Nieruchomosc → hrefs to Dzialka / Budynek / Lokal;
podstawaPrawna → Dokument.
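Following a link can be sketched as a dictionary lookup: index every feature by gml:id once, then resolve each xlink:href against that index. The fragment and the ids TR_1 / NIER_1 are invented. Whether real hrefs carry a leading # is an assumption, so the helper strips one if present.

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml/3.2"
RCN = "urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0"
XLINK = "http://www.w3.org/1999/xlink"
ns = {"gml": GML, "rcn": RCN, "xlink": XLINK}

# Invented two-feature file: a transaction pointing at a property aggregate.
doc = f"""
<gml:FeatureCollection xmlns:gml="{GML}" xmlns:rcn="{RCN}" xmlns:xlink="{XLINK}">
  <gml:featureMember>
    <rcn:RCN_Transakcja gml:id="TR_1">
      <rcn:nieruchomosc xlink:href="NIER_1"/>
    </rcn:RCN_Transakcja>
  </gml:featureMember>
  <gml:featureMember>
    <rcn:RCN_Nieruchomosc gml:id="NIER_1"/>
  </gml:featureMember>
</gml:FeatureCollection>
"""

root = ET.fromstring(doc)

# Build the "primary key" index: gml:id -> feature element
by_id = {}
for member in root.findall("gml:featureMember", ns):
    for feature in member:
        feature_id = feature.get(f"{{{GML}}}id")
        if feature_id:
            by_id[feature_id] = feature


def resolve(element):
    """Return the feature that element's xlink:href points at, or None."""
    href = element.get(f"{{{XLINK}}}href")
    if href is None:
        return None
    return by_id.get(href.lstrip("#"))  # assumption: tolerate an optional leading '#'


transakcja = root.find("gml:featureMember/rcn:RCN_Transakcja", ns)
target = resolve(transakcja.find("rcn:nieruchomosc", ns))
target_type = target.tag.split("}")[-1]

print(target_type, target.get(f"{{{GML}}}id"))
```

This is the same operation as a database join on a foreign key, done in memory.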
Price on the row · rest via xlink:href
Wrapper + id · scalars (incl. gross price) · link to deed · link to property aggregate — then follow those ids elsewhere in the file.
The foundation of analysis and the consequences of a naive data loading approach.
Consequences of a naive approach
Proper data loading is more than just reading a file. It dictates the efficiency, accuracy, and feasibility of all subsequent Data Cleaning steps.
Key principles and a checklist to follow before writing your parsing code.
Checklist for loading spatial/XML data
Target the nodes you need, e.g. RCN_Lokal nodes.
Mind the namespaces (rcn:, gml:). Always define a namespace dictionary in your code, or your searches will return empty results.
Convert text to proper types (e.g. to_numeric) before analyzing.
Let's look at an example of how to extract notarial deeds (RCN_Dokument) from the 2026 dataset.
Parsing the file and targeting specific nodes
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('../data/Baza_danych_RCN_Poznań_2026.gml')
root = tree.getroot()

# Map prefixes to namespace URIs so that paths can use gml: and rcn:
ns = {
    'gml': 'http://www.opengis.net/gml/3.2',
    'rcn': 'urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0'
}

# Find every RCN_Dokument element at any depth below the root
document_nodes = root.findall('.//rcn:RCN_Dokument', ns)
Extracting elements into a flat list of dictionaries
document_list = []
for doc in document_nodes:
    doc_dict = {}
    for child in doc:
        # Strip the long namespace URI from the tag name
        clean_tag = child.tag.split('}')[-1]
        # Extract the text inside the element; missing or whitespace-only text becomes None
        value = (child.text or '').strip() or None
        doc_dict[clean_tag] = value
    document_list.append(doc_dict)
Without splitting by }, tags look like: {urn:gugik:...}oznaczenieDokumentu. We want clean column names.
Converting to Pandas and basic cleanup
df_documents = pd.DataFrame(document_list)
# Drop 'ghost' columns that XML might have created as entirely empty
df_documents = df_documents.dropna(axis=1, how='all')
print(df_documents.head(3))
print(df_documents.dtypes)
Elements without text become None (e.g. <gml:boundedBy />). This command removes columns consisting of 100% NaN values.
.dtypes will reveal that everything is currently an 'object' (text).
Open gml_xml_tasks.ipynb and work with
Baza_danych_RCN_Poznan_2021-2025.gml
Code walkthroughs for every exercise — five bug fixes included.
import matplotlib.pyplot as plt
import missingno as msno

msno.bar(df_lokale, figsize=(12, 4), color='steelblue')
plt.title('Non-null counts per column')
plt.tight_layout()
plt.show()

msno.matrix(df_lokale, figsize=(12, 5))
plt.title('Nullity matrix')
plt.tight_layout()
plt.show()

msno.heatmap(df_lokale, figsize=(8, 6))
plt.title('Missing-value correlation heatmap')
plt.tight_layout()
plt.show()
for col, dtype in COLUMN_TYPES.items():
    if dtype in ('float64', 'Int64'):
        df[col] = pd.to_numeric(df[col], errors='coerce')
    if dtype == 'Int64':
        df[col] = df[col].astype('int64')  # ← BUG
Line 5: 'int64' cannot hold NaN, so the cast crashes when liczbaIzb or nrKondygnacji have missing values.
for col, dtype in COLUMN_TYPES.items():
    if dtype in ('float64', 'Int64'):
        df[col] = pd.to_numeric(df[col], errors='coerce')
    if dtype == 'Int64':
        df[col] = df[col].astype('Int64')  # ← FIXED
int64 has no NaN representation: raises ValueError when NaN is present
Int64 (capital I) stores <NA> safely
pd.to_numeric(errors='coerce') first converts non-parsable strings to NaN before casting
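The difference between the two dtypes can be checked on a toy column. The values below are invented; only the two casts mirror the slide.

```python
import pandas as pd

raw = pd.Series(["3", "2", None, "n/a"])

# Non-parsable strings and missing entries both become NaN (dtype float64)
numeric = pd.to_numeric(raw, errors="coerce")

# NumPy-backed int64 cannot represent NaN, so this cast fails
try:
    numeric.astype("int64")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable extension dtype keeps the gaps as <NA>
nullable = numeric.astype("Int64")

print(plain_cast_failed)
print(nullable)
```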
funkcjaLokalu codes
FUNKCJA_LABELS = {
    1: 'mieszkalna',
    2: 'użytkowa',
    3: 'mieszkalno-użytkowa',
    4: 'rekreacji indywidualnej',
    5: 'zbiorowego zamieszkania',
    6: 'garażowa',
    7: 'inne',
}
df['funkcjaLokalu_label'] = df['funkcjaLokalu'].map(FUNKCJA_LABELS).fillna('nieznana')
.map() with a dict is vectorised, no loops needed
.fillna('nieznana') catches any code absent from the dict
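A toy run of the same pattern, with a shortened label dictionary and an invented code 99 to show the fallback:

```python
import pandas as pd

# Shortened copy of the label dictionary, for illustration only
FUNKCJA_LABELS = {1: "mieszkalna", 6: "garażowa"}

codes = pd.Series([1, 6, 99])  # 99 is an invented code that the dict does not cover

# Codes missing from the dict map to NaN, which fillna then replaces
labels = codes.map(FUNKCJA_LABELS).fillna("nieznana")

print(labels.tolist())
```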
def iqr_bounds(series, factor=3.0):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

price_lo, price_hi = iqr_bounds(df['cenaLokaluBrutto'].dropna())
area_lo, area_hi = iqr_bounds(df['powUzytkowaLokalu'].dropna())
df['is_outlier'] = (
    (df['cenaLokaluBrutto'] < price_lo) | (df['cenaLokaluBrutto'] > price_hi) |
    (df['powUzytkowaLokalu'] < area_lo) | (df['powUzytkowaLokalu'] > area_hi)
)
Bounds come from the raw column after .dropna(), not from a visualisation-clipped series
is_outlier preserves the rows for inspection rather than silently dropping them
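A worked example of the bounds on six invented prices. With Q1 = 112.5 and Q3 = 137.5 the IQR is 25, so factor 3.0 gives the interval 37.5 to 212.5, and only the extreme value is flagged.

```python
import pandas as pd


def iqr_bounds(series, factor=3.0):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


# Invented toy prices with one obvious extreme value
prices = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 10_000.0])

lo, hi = iqr_bounds(prices)
flags = (prices < lo) | (prices > hi)   # flag the rows, do not drop them

print(lo, hi)
print(flags.tolist())
```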
price_lo, price_hi = iqr_bounds(clip_series(df['cenaLokaluBrutto']))
area_lo, area_hi = iqr_bounds(clip_series(df['powUzytkowaLokalu']))
Lines 1–2: clip_series compresses the range to the 1st–99th percentile — the resulting IQR is
much tighter than the true one.
price_lo, price_hi = iqr_bounds(df['cenaLokaluBrutto'].dropna())
area_lo, area_hi = iqr_bounds(df['powUzytkowaLokalu'].dropna())
.dropna() on the raw column gives the correct Q1/Q3
df_bug3 = df.copy()
for col in IMPUTE_COLS:
    df_bug3[col] = df_bug3[col].fillna(df_bug3[col].median())  # global median
Line 3: the global median mixes residential apartments, commercial units, garages, which have very different price distributions.
df_median = df.copy()
for col in IMPUTE_COLS:
    group_median = df_median.groupby('funkcjaLokalu')[col].transform('median')
    df_median[col] = df_median[col].fillna(group_median)
.groupby().transform('median') returns a Series aligned to the original index, a drop-in for .fillna()
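A toy comparison of the two strategies. The six rows are invented: three flats (code 1) and three garages (code 6), each group with one missing price.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "funkcjaLokalu": [1, 1, 1, 6, 6, 6],
    "cenaLokaluBrutto": [500_000.0, 700_000.0, np.nan, 30_000.0, 50_000.0, np.nan],
})

# Global median: one number for flats and garages alike
global_fill = df["cenaLokaluBrutto"].fillna(df["cenaLokaluBrutto"].median())

# Group median: one number per funkcjaLokalu, aligned row by row
group_median = df.groupby("funkcjaLokalu")["cenaLokaluBrutto"].transform("median")
group_fill = df["cenaLokaluBrutto"].fillna(group_median)

print(global_fill.tolist())
print(group_fill.tolist())
```

With the global median the missing garage receives 275,000, far above any observed garage price; the group median gives it 40,000.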
def pmm_impute(df, target_col, predictors, k=5, random_state=42):
    rng = np.random.default_rng(random_state)
    preds = [p for p in predictors if p != target_col and p in df.columns]
    mask_obs = df[target_col].notna() & df[preds].notna().all(axis=1)
    mask_miss = df[target_col].isna() & df[preds].notna().all(axis=1)
    X_obs = df.loc[mask_obs, preds].values.astype(float)
    y_obs = df.loc[mask_obs, target_col].values.astype(float)
    X_miss = df.loc[mask_miss, preds].values.astype(float)
Split into observed rows (X_obs, y_obs) and missing rows (X_miss)
X_obs_a = np.column_stack([np.ones(len(X_obs)), X_obs])
X_miss_a = np.column_stack([np.ones(len(X_miss)), X_miss])
beta = np.linalg.lstsq(X_obs_a, y_obs, rcond=None)[0]
y_hat_obs = X_obs_a @ beta
y_hat_miss = X_miss_a @ beta
donors = [y_obs[np.argsort(np.abs(y_hat_obs - p))[:k]] for p in y_hat_miss]
imputed = np.array([rng.choice(d) for d in donors])
df.loc[mask_miss, target_col] = imputed
lstsq fits OLS in one call
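Put together, the two halves can be run on a toy frame. This is a sketch that mirrors the slide code; working on a copy and the final `return df` are assumptions added so that the function can be called, and the seven rows are invented.

```python
import numpy as np
import pandas as pd


def pmm_impute(df, target_col, predictors, k=5, random_state=42):
    df = df.copy()  # assumption: leave the caller's frame untouched
    rng = np.random.default_rng(random_state)
    preds = [p for p in predictors if p != target_col and p in df.columns]
    complete = df[preds].notna().all(axis=1)
    mask_obs = df[target_col].notna() & complete
    mask_miss = df[target_col].isna() & complete

    X_obs = df.loc[mask_obs, preds].to_numpy(dtype=float)
    y_obs = df.loc[mask_obs, target_col].to_numpy(dtype=float)
    X_miss = df.loc[mask_miss, preds].to_numpy(dtype=float)

    # OLS with an intercept column
    X_obs_a = np.column_stack([np.ones(len(X_obs)), X_obs])
    X_miss_a = np.column_stack([np.ones(len(X_miss)), X_miss])
    beta = np.linalg.lstsq(X_obs_a, y_obs, rcond=None)[0]
    y_hat_obs = X_obs_a @ beta
    y_hat_miss = X_miss_a @ beta

    # For each missing row: the k observed rows with the closest prediction are donors
    donors = [y_obs[np.argsort(np.abs(y_hat_obs - p))[:k]] for p in y_hat_miss]
    imputed = np.array([rng.choice(d) for d in donors])
    df.loc[mask_miss, target_col] = imputed
    return df  # assumption: the slide code does not show a return value


toy = pd.DataFrame({
    "powUzytkowaLokalu": [30.0, 45.0, 60.0, 75.0, 90.0, 50.0, np.nan],
    "cenaLokaluBrutto": [300_000.0, 450_000.0, 600_000.0, 750_000.0, 900_000.0,
                         np.nan, np.nan],
})
result = pmm_impute(toy, "cenaLokaluBrutto", ["powUzytkowaLokalu"], k=2)
print(result)
```

The 50 m² row receives a price actually observed for a similar flat (one of the two nearest donors), while the last row, whose predictor is also missing, stays NaN.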
mask_obs = df[target_col].notna() & df[preds].notna().all(axis=1)
mask_miss = df[target_col].isna() # BUG: doesn't check predictors
Line 2: if a predictor is also NaN, the matrix multiply produces NaN predictions; the distance ranking is then meaningless, and whatever comes out is silently written back.
mask_obs = df[target_col].notna() & df[preds].notna().all(axis=1)
mask_miss = df[target_col].isna() & df[preds].notna().all(axis=1) # FIXED
With the bug, .isnull().sum() still reports the column as fully imputed, so the error goes unnoticed
beta = np.linalg.lstsq(X_obs_a, y_obs, rcond=None)[0]
residuals = y_obs - X_obs_a @ beta
sigma = np.std(residuals) # residual σ
y_pred = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))
residuals = y_obs - X_obs_a @ beta
sigma = np.std(df[target_col].dropna()) # BUG: full-column std
y_pred = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))
Line 2: the full-column std includes explained variation — far larger than the residual std — producing implausibly wide imputed distributions.
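The gap between the two standard deviations can be seen on simulated data. The numbers below (price of about 8,000 per m² plus noise with σ = 20,000) are invented purely for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: price depends strongly on area, plus modest noise
x = rng.uniform(20, 120, size=500)
y = 8_000 * x + rng.normal(0, 20_000, size=500)

X = np.column_stack([np.ones(len(x)), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

residual_sigma = np.std(y - X @ beta)   # spread left over after the regression
full_sigma = np.std(y)                  # spread of the whole column, trend included

print(round(residual_sigma), round(full_sigma))
```

Drawing noise with the full-column value would add the variation already explained by the area a second time.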
residuals = y_obs - X_obs_a @ beta
sigma = np.std(residuals) # FIXED: residual std
y_pred = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))
msno.bar(df_clean, figsize=(12, 4), color='seagreen')
plt.title('Non-null counts after cleaning')
plt.tight_layout()
plt.show()
df_median.to_csv('../data/lokale_median.csv', index=False)
df_pmm.to_csv('../data/lokale_pmm.csv', index=False)
df_stochastic.to_csv('../data/lokale_stochastic.csv', index=False)
df_clean.to_csv('../data/lokale_clean.csv', index=False)
msno.bar as a final sanity check — no unexpected nulls remaining
index=False prevents an unnamed index column appearing in the output file