00
Poznań University of Economics and Business · Real Estate Market Analysis




Data Cleaning & Preparation

Understanding the format behind GML spatial data files
and loading them into Python

01
Data Cleaning & Preparation · PUEB REMA

Introduction to XML

Vocabulary, then a short RCN GML fragment. Syntax and namespaces follow in later sections.

Introduction to XML

What is XML?

Hierarchical text · elements and attributes

XML is plain-text markup: nested elements, optional attributes, one document tree. You define tags or adopt a standard (e.g. GML); a parser builds the tree.

Why it is used for exchange

  • Self-describing — tags name the data; you can read structure without a separate data dictionary.
  • Validation — schemas (e.g. XSD) define allowed elements and types before data enters a pipeline.
  • Rich fields — attributes carry units, IDs, language, or “nil” flags next to values.
  • Interoperability — long track record in government, GIS (OGC/ISO), and regulated sectors.
Introduction to XML

Where you see XML

Beyond this course

  • Geodata and graphics: GML, KML, SVG
  • Office and tools: OOXML (.docx / .xlsx), RSS, Android layouts, build files
  • Enterprise: SOAP, B2B payloads where schemas matter
Introduction to XML

RCN export (excerpt)

Declaration (line 1) · comments (2–6, ignored by parsers) · <gml:boundedBy … /> = empty element

02
Data Cleaning & Preparation · PUEB REMA

Format overview

One example record (LOK_1) as CSV, JSON, and XML. Later material assumes XML because GML is XML.

Format overview · CSV

CSV

Comma-separated values · flat tables

Comma-Separated Values — one row per record, columns for fields. Typical for spreadsheets and SQL exports.


id,pow,cena
LOK_1,60.25,750000
  • Works well for quick analysis in pandas or Excel.
  • No native hierarchy: nested data usually means extra columns, encoding tricks, or another file and a join.
  • Units and rich geometry have no standard column — often WKT in text or separate layers.
Format overview · JSON

JSON

JavaScript Object Notation · nested objects

JSON — objects { }, arrays [ ], key–value pairs. Common in REST APIs and document stores.


{
  "id": "LOK_1",
  "pow": 60.25,
  "uom": "m2",
  "cena": 750000
}
  • Nesting maps cleanly to Python dicts and JavaScript objects.
  • Units or typed values often need extra keys or wrapper objects.
  • GeoJSON is widely used; cadastre-grade pipelines often still use GML/XML.
Format overview · XML

XML

eXtensible Markup Language · tagged trees

XML — elements in angle brackets, attributes, a strict document tree. Vocabularies such as GML are defined with schemas (e.g. XSD).

  • Attributes such as uom="m2" can sit on the same element as the value.
  • Child elements express containment without a second table.
  • GML, KML, SVG, CityGML, and many government feeds use XML; XSD can validate structure before ingest.
03
Data Cleaning & Preparation · PUEB REMA

Building blocks

Elements and attributes — toy examples first, then the same ideas in RCN GML.

Building blocks

Elements

Opening tag · content · closing tag


<city>Poznań</city>
  • <city> — opening tag
  • Poznań — text content
  • </city> — closing tag
Building blocks

Elements in RCN data

Same structure — longer names, namespace prefix

rcn: is a namespace prefix — the element is still opening tag · text · closing tag. Full meaning of prefixes in the namespaces section.

Building blocks

Attributes

Bare value → unit metadata on the opening tag


<pow>60.25</pow>

60.25 has no unit (m², ha, km²…) — for analysis, units must travel with the number, not only in a separate column or comment.

Put metadata on the same element


<pow uom="m2">60.25</pow>

Value unchanged; uom="m2" is unit of measure (real pattern from RCN_Lokal).

Syntax: name="value" inside <…>; quotes; several attributes = space-separated (order does not matter).

Building blocks

Common attributes

Fragments of real geometry and property fields

  • srsName, gml:id — CRS + stable id
  • count, srsDimension — how many coordinates, 2D vs 3D
  • uom — units (check before comparing numbers)
  • xsi:nil — “no value here” (e.g. missing geometry)
04
Data Cleaning & Preparation · PUEB REMA

Document structure

One root element, parent–child nesting, and paths to data — then a real RCN_Lokal fragment.

Document structure

XML is a tree

Not a single table — a hierarchy of elements

Each element has at most one parent (except the root, which has none). Children are fully inside the parent — that nesting is how “this address belongs to this flat” is represented.

  • Root — one outer element wraps the whole document (GML: gml:FeatureCollection).
  • Records — wrappers like gml:featureMember and typed features such as rcn:RCN_Lokal.
  • Paths — nested fields are reached by walking the tree (e.g. … → RCN_Adresmiejscowosc).
  • Well-formed — matching tags, proper nesting, one root; invalid XML fails in the parser (official RCN exports are well-formed).
Document structure

One feature (schematic)

Scalars vs nested block (address)


FeatureCollection
 └─ featureMember
     └─ RCN_Lokal
          ├─ idLokalu · rooms · floor · area · price …
          └─ adresBudynkuZLokalem
               └─ RCN_Adres → miejscowosc · ulica · numer …

Flat fields sit directly under RCN_Lokal; the address is a subtree (container → RCN_Adres → fields).

Document structure

RCN_Lokal (trimmed)

Real excerpt — indentation follows depth

  • 1 / 14 — feature open/close + gml:id (stable id in the file).
  • 2–6 — scalar fields; area carries uom="m2".
  • 7–13 — nested address block (adresBudynkuZLokalemRCN_Adres → fields).
05
Data Cleaning & Preparation · PUEB REMA

Namespaces

Prefixes (gml:, rcn:, …) point at standard vocabularies; you bind them once, then query with full names in code.

Namespaces

Why prefixes?

Many standards · same local names · different meaning

GML, RCN, XSD, and others all define tags like id or name. A namespace ties each prefix to one vocabulary so gml:id (geometry id) and rcn:id (cadastre id) never clash. prefix:localName is a QName — the colon is only punctuation, not special syntax.

  • gml: — OGC geometry / exchange shell (FeatureCollection, Polygon, …).
  • rcn: — Polish RCN types (RCN_Lokal, RCN_Transakcja, …).
  • xsi: — schema instance: xsi:nil, schemaLocation, …
Namespaces

xmlns: on the root

Each line binds a prefix to a URI — a stable vocabulary id

  • gml / rcn / xsi — main geometry, RCN features, schema hints (see previous slide).
  • xlink:xlink:href points at another feature's gml:id.
  • gmd / gco / gts — ISO metadata / time; often declared; fewer direct paths in exercises.

Declared once on FeatureCollection — like imports for the whole document.

Namespaces

URIs and ElementTree

Identifiers — not necessarily pages to open in a browser

http://… and urn:… strings are globally unique vocabulary names. Parsing does not require downloading them.


ns = {
    'gml':   'http://www.opengis.net/gml/3.2',
    'rcn':   'urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0',
    'xlink': 'http://www.w3.org/1999/xlink',
}
tree.findall('gml:featureMember/rcn:RCN_Lokal', ns)

Elements may appear as {http://www.opengis.net/gml/3.2}Polygon — Clark notation — in errors and repr.

06
Data Cleaning & Preparation · PUEB REMA

RCN in the file

What the Poznań .gml represents in business terms: sales, properties, parcels, buildings, flats, and deeds — as linked records, not one nested story.

RCN in the file

What is in the export?

Rejestr Cen Nieruchomości — price and transaction facts

The file is a long sequence of gml:featureMember blocks. Each block is one business object (one transaction, one parcel, one apartment, …). The same real-world deal is split across several such objects, connected by identifiers — not by putting everything inside one big XML subtree.

  • You will filter by type (RCN_Transakcja, RCN_Lokal, …) and follow links to assemble a full case.
  • Geometry and address detail live on parcel / building / unit features; the transaction row carries price and pointers.
RCN in the file

Six feature types (domain)

Polish tag names — plain-language role

  • RCN_Transakcja — one reported deal: transaction price, type (e.g. sale vs other), links to the legal deed and to the property aggregate.
  • RCN_Nieruchomosc — the economic property sold as a bundle (may combine plot, building, unit; totals and roll-up attributes).
  • RCN_Dzialka — a cadastral plot (land parcel): identifiers, land use, polygon geometry.
  • RCN_Budynek — a building on a plot: footprint geometry, type/classification, address hooks.
  • RCN_Lokal — a dwelling unit (flat): rooms, floor, usable area, unit price — what you use for typical apartment analytics.
  • RCN_Dokument — the notarial deed (number, date, notary office) that is the legal basis of the transaction.
RCN in the file

How records tie together

A graph of IDs — not one nested XML document per deal

Each feature has a gml:id. Elsewhere, xlink:href attributes store references to another feature's id (deed, property bundle, parcel, …). So the business view is: many rows + foreign-key-style links, same idea as joins in a database.

Typical path to “full picture”: Transakcjanieruchomosc href → RCN_Nieruchomosc → hrefs to Dzialka / Budynek / Lokal; podstawaPrawnaDokument.

RCN in the file

One transaction in XML

Price on the row · rest via xlink:href

Wrapper + id · scalars (incl. gross price) · link to deed · link to property aggregate — then follow those ids elsewhere in the file.

07
Data Cleaning & Preparation · PUEB REMA

Why Data Preparation Matters

The foundation of analysis and the consequences of a naive data loading approach.

Why Data Preparation Matters

The Foundation of Analysis

Consequences of a naive approach

Proper data loading is more than just reading a file. It dictates the efficiency, accuracy, and feasibility of all subsequent Data Cleaning steps.

  • Memory Overload (RAM): Loading an entire multi-gigabyte GML file into memory at once can crash your environment.
  • Loss of Structure: Flattening nested XML elements incorrectly can sever crucial relationships (e.g., losing the address linked to a specific transaction).
  • Encoding Errors: Misinterpreting text encodings ruins string matching later (e.g., grouping by street names with broken Polish characters).
08
Data Cleaning & Preparation · PUEB REMA

Best Practices

Key principles and a checklist to follow before writing your parsing code.

Best Practices

What to pay attention to

Checklist for loading spatial/XML data

  • 1. Selective Loading: Don't load everything. Filter data at the parsing stage. If you only need apartments, extract only RCN_Lokal nodes.
  • 2. Namespace Management: XML files like GML rely heavily on prefixes (rcn:, gml:). Always define a namespace dictionary in your code, or your searches will return empty results.
  • 3. Input Type Awareness: Standard XML parsers treat absolutely everything as text. Prices and areas will be imported as strings. You must plan for type casting (e.g., to_numeric) before analyzing.
09
Data Cleaning & Preparation · PUEB REMA

Live Demo: XML to DataFrame

Let's look at the example how to extract notarial deeds (RCN_Dokument) from the 2026 dataset.

Live Demo

1. Loading & Filtering

Parsing the file and targeting specific nodes


              import pandas as pd
              import xml.etree.ElementTree as ET

              tree = ET.parse('../data/Baza_danych_RCN_Poznań_2026.gml')
              root = tree.getroot()

              ns = {
                  'gml': 'http://www.opengis.net/gml/3.2',
                  'rcn': 'urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0'
              }

              document_nodes = root.findall('.//rcn:RCN_Dokument', ns)
              
  • ET.parse: Reads the entire XML document into a tree structure.
  • Namespaces (ns): We bind the prefixes found in the file to their full URIs.
  • .findall(): The essence of Selective Loading. We grab only the notarial deed nodes, ignoring all plots and buildings.
Live Demo

2. Flattening the Data

Extracting elements into a flat list of dictionaries


              document_list = []
              for doc in document_nodes:
                  doc_dict = {}
                  for child in doc:
                      # Strip the long namespace URI from the tag name
                      clean_tag = child.tag.split('}')[-1]
                      
                      # Extract the text from inside the element (if it exists)
                      value = child.text.strip() if child.text else None
                      doc_dict[clean_tag] = value
                      
                  document_list.append(doc_dict)
              

Without splitting by }, tags look like: {urn:gugik:...}oznaczenieDokumentu. We want clean column names.

Live Demo

3. Building the DataFrame

Converting to Pandas and basic cleanup


              df_documents = pd.DataFrame(document_list)

              # Drop 'ghost' columns that XML might have created as entirely empty
              df_documents = df_documents.dropna(axis=1, how='all')

              print(df_documents.head(3))
              print(df_documents.dtypes)
              
  • pd.DataFrame: Instantly converts our list of dictionaries into a flat, tabular format.
  • .dropna(how='all'): GML files often contain empty structural tags (like <gml:boundedBy />). This command removes columns consisting of 100% NaN values.
  • All strings: .dtypes will reveal that everything is currently an 'object' (text).
10
Up next

Now let's parse it

Open gml_xml_tasks.ipynb and work with
Baza_danych_RCN_Poznan_2021-2025.gml

11
Data Cleaning & Preparation · PUEB REMA

Solutions: Data Cleaning & Imputation

Code walkthroughs for every exercise — five bug fixes included.

Data Cleaning · Task 1

Audit missing values


msno.bar(df_lokale, figsize=(12, 4), color='steelblue')
plt.title('Non-null counts per column')
plt.tight_layout()
plt.show()

msno.matrix(df_lokale, figsize=(12, 5))
plt.title('Nullity matrix')
plt.tight_layout()
plt.show()

msno.heatmap(df_lokale, figsize=(8, 6))
plt.title('Missing-value correlation heatmap')
plt.tight_layout()
plt.show()
              
  • bar — per-column fill rate at a glance
  • matrix — row-level pattern; reveals whether entire rows are block-missing together
  • heatmap — correlated missingness: which columns go missing together
Data Cleaning · Task 2 🐛 Bug 1

Bug 1 — int64 cannot hold NaN


for col, dtype in COLUMN_TYPES.items():
    if dtype in ('float64', 'Int64'):
        df[col] = pd.to_numeric(df[col], errors='coerce')
        if dtype == 'Int64':
            df[col] = df[col].astype('int64')  # ← BUG
              

Line 5: 'int64' cannot hold NaN — crashes when liczbaIzb or nrKondygnacji have missing values.

Data Cleaning · Task 2 ✓ Fix 1

Bug 1 — int64 cannot hold NaN


for col, dtype in COLUMN_TYPES.items():
    if dtype in ('float64', 'Int64'):
        df[col] = pd.to_numeric(df[col], errors='coerce')
        if dtype == 'Int64':
            df[col] = df[col].astype('Int64')  # ← FIXED
              
  • NumPy int64 has no NaN representation — raises ValueError when NaN is present
  • Pandas nullable Int64 (capital I) stores <NA> safely
  • pd.to_numeric(errors='coerce') first converts non-parsable strings to NaN before casting
Data Cleaning · Task 3

Decode funkcjaLokalu codes


FUNKCJA_LABELS = {
    1: 'mieszkalna',
    2: 'użytkowa',
    3: 'mieszkalno-użytkowa',
    4: 'rekreacji indywidualnej',
    5: 'zbiorowego zamieszkania',
    6: 'garażowa',
    7: 'inne',
} df['funkcjaLokalu_label'] = df['funkcjaLokalu'].map(FUNKCJA_LABELS).fillna('nieznana')
              
  • Integer codes are not self-describing in reports or plots
  • .map() with a dict is vectorised — no loops needed
  • .fillna('nieznana') catches any code absent from the dict
Data Cleaning · Task 4

Detect & flag outliers


def iqr_bounds(series, factor=3.0):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

price_lo, price_hi = iqr_bounds(df['cenaLokaluBrutto'].dropna())
area_lo,  area_hi  = iqr_bounds(df['powUzytkowaLokalu'].dropna())

df['is_outlier'] = (
    (df['cenaLokaluBrutto'] < price_lo) | (df['cenaLokaluBrutto'] > price_hi) |
    (df['powUzytkowaLokalu'] < area_lo)  | (df['powUzytkowaLokalu'] > area_hi)
)
              
  • IQR ×3 is a lenient bound — flags only extreme outliers, not just unusual values
  • Always compute bounds from raw data with .dropna(), not from a visualisation-clipped series
  • Boolean flag is_outlier preserves the rows for inspection rather than silently dropping them
Data Cleaning · Task 4 🐛 Bug 2

Bug 2 — IQR bounds on clipped data


price_lo, price_hi = iqr_bounds(clip_series(df['cenaLokaluBrutto']))
area_lo,  area_hi  = iqr_bounds(clip_series(df['powUzytkowaLokalu']))
              

Lines 1–2: clip_series compresses the range to the 1st–99th percentile — the resulting IQR is much tighter than the true one.

Data Cleaning · Task 4 ✓ Fix 2

Bug 2 — IQR bounds on clipped data


price_lo, price_hi = iqr_bounds(df['cenaLokaluBrutto'].dropna())
area_lo,  area_hi  = iqr_bounds(df['powUzytkowaLokalu'].dropna())
              
  • Clipping is for visualisation only — never feed clipped data into statistical computations
  • Compressed IQR → tighter bounds → legitimate values flagged as outliers
  • .dropna() on the raw column gives the correct Q1/Q3
Data Imputation · Task 1 🐛 Bug 3

Bug 3 — global vs grouped median


df_bug3 = df.copy()
for col in IMPUTE_COLS:
    df_bug3[col] = df_bug3[col].fillna(df_bug3[col].median())  # global median
              

Line 3: the global median mixes residential apartments, commercial units, garages — very different price distributions.

Data Imputation · Task 1 ✓ Fix 3

Bug 3 — global vs grouped median


df_median = df.copy()
for col in IMPUTE_COLS:
    group_median = df_median.groupby('funkcjaLokalu')[col].transform('median')
    df_median[col] = df_median[col].fillna(group_median)
              
  • Global median ignores premise type — mixes incomparable distributions
  • .groupby().transform('median') returns a Series aligned to the original index — drop-in for .fillna()
  • Each missing value is filled with the median of its own premise category
Data Imputation · Task 2

Predictive Mean Matching (PMM)


def pmm_impute(df, target_col, predictors, k=5, random_state=42):
    rng   = np.random.default_rng(random_state)
    preds = [p for p in predictors if p != target_col and p in df.columns]

    mask_obs  = df[target_col].notna() & df[preds].notna().all(axis=1)
    mask_miss = df[target_col].isna()  & df[preds].notna().all(axis=1)

    X_obs  = df.loc[mask_obs,  preds].values.astype(float)
    y_obs  = df.loc[mask_obs,  target_col].values.astype(float)
    X_miss = df.loc[mask_miss, preds].values.astype(float)
            
  • Both masks require all predictors to be non-null — prevents silent NaN propagation
  • Separate arrays for observed rows (X_obs, y_obs) and missing rows (X_miss)
Data Imputation · Task 2

Predictive Mean Matching (PMM)


    X_obs_a  = np.column_stack([np.ones(len(X_obs)),  X_obs])
    X_miss_a = np.column_stack([np.ones(len(X_miss)), X_miss])

    beta       = np.linalg.lstsq(X_obs_a, y_obs, rcond=None)[0]
    y_hat_obs  = X_obs_a  @ beta
    y_hat_miss = X_miss_a @ beta

    donors  = [y_obs[np.argsort(np.abs(y_hat_obs - p))[:k]] for p in y_hat_miss]
    imputed = np.array([rng.choice(d) for d in donors])
    df.loc[mask_miss, target_col] = imputed
            
  • Prepend a ones column → design matrix; lstsq fits OLS in one call
  • For each missing row: find the k observed rows with the closest predicted value
  • Randomly pick one of their actual values — only real observed values are ever imputed
Data Imputation · Task 2 🐛 Bug 4

Bug 4 — PMM ignores missing predictors


mask_obs  = df[target_col].notna() & df[preds].notna().all(axis=1)
mask_miss = df[target_col].isna()   # BUG: doesn't check predictors
              

Line 2: if a predictor is also NaN, the matrix multiply produces NaN predictions that are silently written back.

Data Imputation · Task 2 ✓ Fix 4

Bug 4 — PMM ignores missing predictors


mask_obs  = df[target_col].notna() & df[preds].notna().all(axis=1)
mask_miss = df[target_col].isna()  & df[preds].notna().all(axis=1)  # FIXED
              
  • If a predictor is NaN, the matrix product produces NaN predictions
  • NaN is silently written back — .isnull().sum() still reports the column as fully imputed
  • Fix: require both the target and all predictors to be non-null before attempting imputation
Data Imputation · Task 3

Stochastic Regression Imputation


beta      = np.linalg.lstsq(X_obs_a, y_obs, rcond=None)[0]
residuals = y_obs - X_obs_a @ beta
sigma     = np.std(residuals)                        # residual σ

y_pred = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))
              
  • Plain regression imputes the conditional mean — variance is compressed toward the fitted line
  • Adding N(0, σ_residual) noise restores the spread of the original distribution
  • σ must come from residuals — the unexplained part after fitting — not the full column
Data Imputation · Task 3 🐛 Bug 5

Bug 5 — sigma from full column


residuals = y_obs - X_obs_a @ beta
sigma     = np.std(df[target_col].dropna())   # BUG: full-column std
y_pred    = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))
              

Line 2: the full-column std includes explained variation — far larger than the residual std — producing implausibly wide imputed distributions.

Data Imputation · Task 3 ✓ Fix 5

Bug 5 — sigma from full column


residuals = y_obs - X_obs_a @ beta
sigma     = np.std(residuals)                 # FIXED: residual std
y_pred    = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))
              
  • Full-column std includes the variance explained by the model — much larger
  • Residual std is only the unexplained part — the correct noise scale
  • Oversized sigma produces imputed values far outside the plausible range
Data Imputation · Task 4

Comparing three imputation strategies

Group median
  • Simple & fast
  • Respects category structure
  • No variance preservation — creates artificial spikes at the median
PMM
  • Only real observed values are imputed
  • Preserves distribution shape
  • Requires enough observed neighbours (k donors)
Regresja stochastyczna
  • Restores variance via residual noise
  • Parametric — assumes linear relationship
  • Sensitive to predictor availability
Data Imputation · Task 5

Final inspection & export


msno.bar(df_clean, figsize=(12, 4), color='seagreen')
plt.title('Null counts after cleaning')
plt.tight_layout()
plt.show()

df_median.to_csv('../data/lokale_median.csv', index=False)
df_pmm.to_csv('../data/lokale_pmm.csv', index=False)
df_stochastic.to_csv('../data/lokale_stochastic.csv', index=False)
df_clean.to_csv('../data/lokale_clean.csv', index=False)
              
  • Re-run msno.bar as a final sanity check — no unexpected nulls remaining
  • Save one CSV per method for downstream comparison
  • index=False prevents an unnamed index column appearing in the output file