Poznań University of Economics and Business · Real Estate Market Analysis

Data Cleaning & Preparation

Understanding the format behind GML spatial data files
and loading them into Python

Data Cleaning & Preparation · PUEB REMA

Introduction to XML

Vocabulary, then a short RCN GML fragment. Syntax and namespaces follow in later sections.

Introduction to XML

What is XML?

Hierarchical text · elements and attributes

XML is plain-text markup: nested elements, optional attributes, one document tree. You define tags or adopt a standard (e.g. GML); a parser builds the tree.

Why it is used for exchange

Self-describing — tags name the data; you can read structure without a separate data dictionary.
Validation — schemas (e.g. XSD) define allowed elements and types before data enters a pipeline.
Rich fields — attributes carry units, IDs, language, or “nil” flags next to values.
Interoperability — long track record in government, GIS (OGC/ISO), and regulated sectors.

Self-describing vs CSV: column order alone does not document meaning; XML element names do. Still, semantics ultimately come from the standard (e.g. GML application schema), not from the angle brackets alone.

History: XML 1.0 — W3C Recommendation (1998), simplified from SGML for the web; Unicode; meant for documents and machine interchange. XSD (and others) describe valid structure.

XML is a generic interchange format, not “web pages” — unlike HTML’s fixed tag set. GML uses XML for schema validation, tooling, and ISO/OGC alignment.

Many students know JSON: both are tree-shaped text; XML adds attributes on elements and has a long trail in government and GIS procurement.

Introduction to XML

Where you see XML

Beyond this course

Geodata and graphics: GML, KML, SVG
Office and tools: OOXML (.docx / .xlsx), RSS, Android layouts, build files
Enterprise: SOAP, B2B payloads where schemas matter

Introduction to XML

RCN export (excerpt)

Declaration (line 1) · comments (2–6, ignored by parsers) · <gml:boundedBy … /> = empty element

Step through the four highlighted sections:

Line 1 — XML declaration (<?xml … ?>): version 1.0, UTF-8. Must be first in the file. standalone="yes" = no external DTD.
Lines 2–6 — comment block (). Export metadata from WEGA; parsers ignore comments entirely.
Lines 7–15 — root element FeatureCollection. Seven xmlns: lines declare namespace prefixes — normal in GML.
Lines 17–24 — first featureMember: RCN_Dokument (notarial deed), with xsi:nil geometry, deed number, date, notary.

Self-closing: <gml:boundedBy xsi:nil="true" /> — no text content; /> closes the tag. Same idea as <boundedBy></boundedBy> when empty.

Ask: "Has anyone opened a file like this before?" — The goal today is that by the end, this looks readable.

The course file Baza_danych_RCN_Poznan_2021-2025.gml is a full multi-year Poznań export (millions of lines). Use streaming (iterparse) or geopandas.read_file(), not a naive full-tree load.

Data Cleaning & Preparation · PUEB REMA

Format overview

One example record (LOK_1) as CSV, JSON, and XML. Later material assumes XML because GML is XML.

Format overview · CSV

CSV

Comma-separated values · flat tables

Comma-Separated Values — one row per record, columns for fields. Typical for spreadsheets and SQL exports.


id,pow,cena
LOK_1,60.25,750000

Works well for quick analysis in pandas or Excel.
No native hierarchy: nested data usually means extra columns, encoding tricks, or another file and a join.
Units and rich geometry have no standard column — often WKT in text or separate layers.

Format overview · JSON

JSON

JavaScript Object Notation · nested objects

JSON — objects { }, arrays [ ], key–value pairs. Common in REST APIs and document stores.


{
  "id": "LOK_1",
  "pow": 60.25,
  "uom": "m2",
  "cena": 750000
}

Nesting maps cleanly to Python dicts and JavaScript objects.
Units or typed values often need extra keys or wrapper objects.
GeoJSON is widely used; cadastre-grade pipelines often still use GML/XML.

Format overview · XML

XML

eXtensible Markup Language · tagged trees

XML — elements in angle brackets, attributes, a strict document tree. Vocabularies such as GML are defined with schemas (e.g. XSD).

Attributes such as uom="m2" can sit on the same element as the value.
Child elements express containment without a second table.
GML, KML, SVG, CityGML, and many government feeds use XML; XSD can validate structure before ingest.

Data Cleaning & Preparation · PUEB REMA

Building blocks

Elements and attributes — toy examples first, then the same ideas in RCN GML.

Building blocks

Elements

Opening tag · content · closing tag


<city>Poznań</city>

<city> — opening tag
Poznań — text content
</city> — closing tag

Building blocks

Elements in RCN data

Same structure — longer names, namespace prefix

rcn: is a namespace prefix — the element is still opening tag · text · closing tag. Full meaning of prefixes in the namespaces section.

Building blocks

Attributes

Bare value → unit metadata on the opening tag


<pow>60.25</pow>

60.25 has no unit (m², ha, km²…) — for analysis, units must travel with the number, not only in a separate column or comment.

Put metadata on the same element


<pow uom="m2">60.25</pow>

Value unchanged; uom="m2" is unit of measure (real pattern from RCN_Lokal).

Syntax: name="value" inside <…>; quotes; several attributes = space-separated (order does not matter).

Building blocks

Common attributes

Fragments of real geometry and property fields

srsName, gml:id — CRS + stable id
count, srsDimension — how many coordinates, 2D vs 3D
uom — units (check before comparing numbers)
xsi:nil — “no value here” (e.g. missing geometry)

Data Cleaning & Preparation · PUEB REMA

Document structure

One root element, parent–child nesting, and paths to data — then a real RCN_Lokal fragment.

Document structure

XML is a tree

Not a single table — a hierarchy of elements

Each element has at most one parent (except the root, which has none). Children are fully inside the parent — that nesting is how “this address belongs to this flat” is represented.

Root — one outer element wraps the whole document (GML: gml:FeatureCollection).
Records — wrappers like gml:featureMember and typed features such as rcn:RCN_Lokal.
Paths — nested fields are reached by walking the tree (e.g. … → RCN_Adres → miejscowosc).
Well-formed — matching tags, proper nesting, one root; invalid XML fails in the parser (official RCN exports are well-formed).

Document structure

One feature (schematic)

Scalars vs nested block (address)


FeatureCollection
 └─ featureMember
     └─ RCN_Lokal
          ├─ idLokalu · rooms · floor · area · price …
          └─ adresBudynkuZLokalem
               └─ RCN_Adres → miejscowosc · ulica · numer …

Flat fields sit directly under RCN_Lokal; the address is a subtree (container → RCN_Adres → fields).

Document structure

`RCN_Lokal` (trimmed)

Real excerpt — indentation follows depth

1 / 14 — feature open/close + gml:id (stable id in the file).
2–6 — scalar fields; area carries uom="m2".
7–13 — nested address block (adresBudynkuZLokalem → RCN_Adres → fields).

Data Cleaning & Preparation · PUEB REMA

Namespaces

Prefixes (gml:, rcn:, …) point at standard vocabularies; you bind them once, then query with full names in code.

Namespaces

Why prefixes?

Many standards · same local names · different meaning

GML, RCN, XSD, and others all define tags like id or name. A namespace ties each prefix to one vocabulary so gml:id (geometry id) and rcn:id (cadastre id) never clash. prefix:localName is a QName — the colon is only punctuation, not special syntax.

gml: — OGC geometry / exchange shell (FeatureCollection, Polygon, …).
rcn: — Polish RCN types (RCN_Lokal, RCN_Transakcja, …).
xsi: — schema instance: xsi:nil, schemaLocation, …

Namespaces

`xmlns:` on the root

Each line binds a prefix to a URI — a stable vocabulary id

gml / rcn / xsi — main geometry, RCN features, schema hints (see previous slide).
xlink: — xlink:href points at another feature's gml:id.
gmd / gco / gts — ISO metadata / time; often declared; fewer direct paths in exercises.

Declared once on FeatureCollection — like imports for the whole document.

Namespaces

URIs and ElementTree

Identifiers — not necessarily pages to open in a browser

http://… and urn:… strings are globally unique vocabulary names. Parsing does not require downloading them.


ns = {
    'gml':   'http://www.opengis.net/gml/3.2',
    'rcn':   'urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0',
    'xlink': 'http://www.w3.org/1999/xlink',
}
tree.findall('gml:featureMember/rcn:RCN_Lokal', ns)

Elements may appear as {http://www.opengis.net/gml/3.2}Polygon — Clark notation — in errors and repr.

Data Cleaning & Preparation · PUEB REMA

RCN in the file

What the Poznań .gml represents in business terms: sales, properties, parcels, buildings, flats, and deeds — as linked records, not one nested story.

RCN in the file

What is in the export?

Rejestr Cen Nieruchomości — price and transaction facts

The file is a long sequence of gml:featureMember blocks. Each block is one business object (one transaction, one parcel, one apartment, …). The same real-world deal is split across several such objects, connected by identifiers — not by putting everything inside one big XML subtree.

You will filter by type (RCN_Transakcja, RCN_Lokal, …) and follow links to assemble a full case.
Geometry and address detail live on parcel / building / unit features; the transaction row carries price and pointers.

RCN in the file

Six feature types (domain)

Polish tag names — plain-language role

RCN_Transakcja — one reported deal: transaction price, type (e.g. sale vs other), links to the legal deed and to the property aggregate.
RCN_Nieruchomosc — the economic property sold as a bundle (may combine plot, building, unit; totals and roll-up attributes).
RCN_Dzialka — a cadastral plot (land parcel): identifiers, land use, polygon geometry.
RCN_Budynek — a building on a plot: footprint geometry, type/classification, address hooks.
RCN_Lokal — a dwelling unit (flat): rooms, floor, usable area, unit price — what you use for typical apartment analytics.
RCN_Dokument — the notarial deed (number, date, notary office) that is the legal basis of the transaction.

RCN in the file

How records tie together

A graph of IDs — not one nested XML document per deal

Each feature has a gml:id. Elsewhere, xlink:href attributes store references to another feature's id (deed, property bundle, parcel, …). So the business view is: many rows + foreign-key-style links, same idea as joins in a database.

Typical path to “full picture”: Transakcja → nieruchomosc href → RCN_Nieruchomosc → hrefs to Dzialka / Budynek / Lokal; podstawaPrawna → Dokument.

RCN in the file

One transaction in XML

Price on the row · rest via xlink:href

Wrapper + id · scalars (incl. gross price) · link to deed · link to property aggregate — then follow those ids elsewhere in the file.

Data Cleaning & Preparation · PUEB REMA

Why Data Preparation Matters

The foundation of analysis and the consequences of a naive data loading approach.

Why Data Preparation Matters

The Foundation of Analysis

Consequences of a naive approach

Proper data loading is more than just reading a file. It dictates the efficiency, accuracy, and feasibility of all subsequent Data Cleaning steps.

Memory Overload (RAM): Loading an entire multi-gigabyte GML file into memory at once can crash your environment.
Loss of Structure: Flattening nested XML elements incorrectly can sever crucial relationships (e.g., losing the address linked to a specific transaction).
Encoding Errors: Misinterpreting text encodings ruins string matching later (e.g., grouping by street names with broken Polish characters).

Data Cleaning & Preparation · PUEB REMA

Best Practices

Key principles and a checklist to follow before writing your parsing code.

Best Practices

What to pay attention to

Checklist for loading spatial/XML data

1. Selective Loading: Don't load everything. Filter data at the parsing stage. If you only need apartments, extract only RCN_Lokal nodes.
2. Namespace Management: XML files like GML rely heavily on prefixes (rcn:, gml:). Always define a namespace dictionary in your code, or your searches will return empty results.
3. Input Type Awareness: Standard XML parsers treat absolutely everything as text. Prices and areas will be imported as strings. You must plan for type casting (e.g., to_numeric) before analyzing.

Data Cleaning & Preparation · PUEB REMA

Live Demo: XML to DataFrame

Let's look at the example how to extract notarial deeds (RCN_Dokument) from the 2026 dataset.

Live Demo

1. Loading & Filtering

Parsing the file and targeting specific nodes


              import pandas as pd
              import xml.etree.ElementTree as ET

              tree = ET.parse('../data/Baza_danych_RCN_Poznań_2026.gml')
              root = tree.getroot()

              ns = {
                  'gml': 'http://www.opengis.net/gml/3.2',
                  'rcn': 'urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0'
              }

              document_nodes = root.findall('.//rcn:RCN_Dokument', ns)

ET.parse: Reads the entire XML document into a tree structure.
Namespaces (ns): We bind the prefixes found in the file to their full URIs.
.findall(): The essence of Selective Loading. We grab only the notarial deed nodes, ignoring all plots and buildings.

Live Demo

2. Flattening the Data

Extracting elements into a flat list of dictionaries


              document_list = []
              for doc in document_nodes:
                  doc_dict = {}
                  for child in doc:
                      # Strip the long namespace URI from the tag name
                      clean_tag = child.tag.split('}')[-1]
                      
                      # Extract the text from inside the element (if it exists)
                      value = child.text.strip() if child.text else None
                      doc_dict[clean_tag] = value
                      
                  document_list.append(doc_dict)

Without splitting by }, tags look like: {urn:gugik:...}oznaczenieDokumentu. We want clean column names.

Live Demo

3. Building the DataFrame

Converting to Pandas and basic cleanup


              df_documents = pd.DataFrame(document_list)

              # Drop 'ghost' columns that XML might have created as entirely empty
              df_documents = df_documents.dropna(axis=1, how='all')

              print(df_documents.head(3))
              print(df_documents.dtypes)

pd.DataFrame: Instantly converts our list of dictionaries into a flat, tabular format.
.dropna(how='all'): GML files often contain empty structural tags (like <gml:boundedBy />). This command removes columns consisting of 100% NaN values.
All strings: .dtypes will reveal that everything is currently an 'object' (text).

Up next

Now let's parse it

Open gml_xml_tasks.ipynb and work with
Baza_danych_RCN_Poznan_2021-2025.gml

Brief summary of what was covered:

XML structure: elements, attributes, nesting — the three building blocks
Declaration, comments, empty tags — on the RCN excerpt; well-formedness — with document structure
Namespaces: 7 prefixes in the RCN file, the xmlns: declaration pattern, URIs as unique labels
RCN business structure: six feature types (transaction, property, plot, building, unit, deed) linked by gml:id / xlink:href

Hand off to next section: loading and querying this file in Python.

Remind students: Baza_danych_RCN_Poznan_2021-2025.gml is very large — use iterparse() instead of ElementTree.parse() to avoid loading everything into RAM. geopandas.read_file() can also read GML directly and is the fastest path to a GeoDataFrame.

Data Cleaning & Preparation · PUEB REMA

Solutions: Data Cleaning & Imputation

Code walkthroughs for every exercise — five bug fixes included.

Data Cleaning · Task 1

Audit missing values


msno.bar(df_lokale, figsize=(12, 4), color='steelblue')
plt.title('Non-null counts per column')
plt.tight_layout()
plt.show()

msno.matrix(df_lokale, figsize=(12, 5))
plt.title('Nullity matrix')
plt.tight_layout()
plt.show()

msno.heatmap(df_lokale, figsize=(8, 6))
plt.title('Missing-value correlation heatmap')
plt.tight_layout()
plt.show()

bar — per-column fill rate at a glance
matrix — row-level pattern; reveals whether entire rows are block-missing together
heatmap — correlated missingness: which columns go missing together

Data Cleaning · Task 2 🐛 Bug 1

Bug 1 — int64 cannot hold NaN


for col, dtype in COLUMN_TYPES.items():
    if dtype in ('float64', 'Int64'):
        df[col] = pd.to_numeric(df[col], errors='coerce')
        if dtype == 'Int64':
            df[col] = df[col].astype('int64')  # ← BUG

Line 5: 'int64' cannot hold NaN — crashes when liczbaIzb or nrKondygnacji have missing values.

Data Cleaning · Task 2 ✓ Fix 1

Bug 1 — int64 cannot hold NaN


for col, dtype in COLUMN_TYPES.items():
    if dtype in ('float64', 'Int64'):
        df[col] = pd.to_numeric(df[col], errors='coerce')
        if dtype == 'Int64':
            df[col] = df[col].astype('Int64')  # ← FIXED

NumPy int64 has no NaN representation — raises ValueError when NaN is present
Pandas nullable Int64 (capital I) stores <NA> safely
pd.to_numeric(errors='coerce') first converts non-parsable strings to NaN before casting

Data Cleaning · Task 3

Decode `funkcjaLokalu` codes


FUNKCJA_LABELS = {
    1: 'mieszkalna',
    2: 'użytkowa',
    3: 'mieszkalno-użytkowa',
    4: 'rekreacji indywidualnej',
    5: 'zbiorowego zamieszkania',
    6: 'garażowa',
    7: 'inne',
} df['funkcjaLokalu_label'] = df['funkcjaLokalu'].map(FUNKCJA_LABELS).fillna('nieznana')

Integer codes are not self-describing in reports or plots
.map() with a dict is vectorised — no loops needed
.fillna('nieznana') catches any code absent from the dict

Data Cleaning · Task 4

Detect & flag outliers


def iqr_bounds(series, factor=3.0):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

price_lo, price_hi = iqr_bounds(df['cenaLokaluBrutto'].dropna())
area_lo,  area_hi  = iqr_bounds(df['powUzytkowaLokalu'].dropna())

df['is_outlier'] = (
    (df['cenaLokaluBrutto'] < price_lo) | (df['cenaLokaluBrutto'] > price_hi) |
    (df['powUzytkowaLokalu'] < area_lo)  | (df['powUzytkowaLokalu'] > area_hi)
)

IQR ×3 is a lenient bound — flags only extreme outliers, not just unusual values
Always compute bounds from raw data with .dropna(), not from a visualisation-clipped series
Boolean flag is_outlier preserves the rows for inspection rather than silently dropping them

Data Cleaning · Task 4 🐛 Bug 2

Bug 2 — IQR bounds on clipped data


price_lo, price_hi = iqr_bounds(clip_series(df['cenaLokaluBrutto']))
area_lo,  area_hi  = iqr_bounds(clip_series(df['powUzytkowaLokalu']))

Lines 1–2: clip_series compresses the range to the 1st–99th percentile — the resulting IQR is much tighter than the true one.

Data Cleaning · Task 4 ✓ Fix 2

Bug 2 — IQR bounds on clipped data


price_lo, price_hi = iqr_bounds(df['cenaLokaluBrutto'].dropna())
area_lo,  area_hi  = iqr_bounds(df['powUzytkowaLokalu'].dropna())

Clipping is for visualisation only — never feed clipped data into statistical computations
Compressed IQR → tighter bounds → legitimate values flagged as outliers
.dropna() on the raw column gives the correct Q1/Q3

Data Imputation · Task 1 🐛 Bug 3

Bug 3 — global vs grouped median


df_bug3 = df.copy()
for col in IMPUTE_COLS:
    df_bug3[col] = df_bug3[col].fillna(df_bug3[col].median())  # global median

Line 3: the global median mixes residential apartments, commercial units, garages — very different price distributions.

Data Imputation · Task 1 ✓ Fix 3

Bug 3 — global vs grouped median


df_median = df.copy()
for col in IMPUTE_COLS:
    group_median = df_median.groupby('funkcjaLokalu')[col].transform('median')
    df_median[col] = df_median[col].fillna(group_median)

Global median ignores premise type — mixes incomparable distributions
.groupby().transform('median') returns a Series aligned to the original index — drop-in for .fillna()
Each missing value is filled with the median of its own premise category

Data Imputation · Task 2

Predictive Mean Matching (PMM)


def pmm_impute(df, target_col, predictors, k=5, random_state=42):
    rng   = np.random.default_rng(random_state)
    preds = [p for p in predictors if p != target_col and p in df.columns]

    mask_obs  = df[target_col].notna() & df[preds].notna().all(axis=1)
    mask_miss = df[target_col].isna()  & df[preds].notna().all(axis=1)

    X_obs  = df.loc[mask_obs,  preds].values.astype(float)
    y_obs  = df.loc[mask_obs,  target_col].values.astype(float)
    X_miss = df.loc[mask_miss, preds].values.astype(float)

Both masks require all predictors to be non-null — prevents silent NaN propagation
Separate arrays for observed rows (X_obs, y_obs) and missing rows (X_miss)

Data Imputation · Task 2

Predictive Mean Matching (PMM)


    X_obs_a  = np.column_stack([np.ones(len(X_obs)),  X_obs])
    X_miss_a = np.column_stack([np.ones(len(X_miss)), X_miss])

    beta       = np.linalg.lstsq(X_obs_a, y_obs, rcond=None)[0]
    y_hat_obs  = X_obs_a  @ beta
    y_hat_miss = X_miss_a @ beta

    donors  = [y_obs[np.argsort(np.abs(y_hat_obs - p))[:k]] for p in y_hat_miss]
    imputed = np.array([rng.choice(d) for d in donors])
    df.loc[mask_miss, target_col] = imputed

Prepend a ones column → design matrix; lstsq fits OLS in one call
For each missing row: find the k observed rows with the closest predicted value
Randomly pick one of their actual values — only real observed values are ever imputed

Data Imputation · Task 2 🐛 Bug 4

Bug 4 — PMM ignores missing predictors


mask_obs  = df[target_col].notna() & df[preds].notna().all(axis=1)
mask_miss = df[target_col].isna()   # BUG: doesn't check predictors

Line 2: if a predictor is also NaN, the matrix multiply produces NaN predictions that are silently written back.

Data Imputation · Task 2 ✓ Fix 4

Bug 4 — PMM ignores missing predictors


mask_obs  = df[target_col].notna() & df[preds].notna().all(axis=1)
mask_miss = df[target_col].isna()  & df[preds].notna().all(axis=1)  # FIXED

If a predictor is NaN, the matrix product produces NaN predictions
NaN is silently written back — .isnull().sum() still reports the column as fully imputed
Fix: require both the target and all predictors to be non-null before attempting imputation

Data Imputation · Task 3

Stochastic Regression Imputation


beta      = np.linalg.lstsq(X_obs_a, y_obs, rcond=None)[0]
residuals = y_obs - X_obs_a @ beta
sigma     = np.std(residuals)                        # residual σ

y_pred = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))

Plain regression imputes the conditional mean — variance is compressed toward the fitted line
Adding N(0, σ_residual) noise restores the spread of the original distribution
σ must come from residuals — the unexplained part after fitting — not the full column

Data Imputation · Task 3 🐛 Bug 5

Bug 5 — sigma from full column


residuals = y_obs - X_obs_a @ beta
sigma     = np.std(df[target_col].dropna())   # BUG: full-column std
y_pred    = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))

Line 2: the full-column std includes explained variation — far larger than the residual std — producing implausibly wide imputed distributions.

Data Imputation · Task 3 ✓ Fix 5

Bug 5 — sigma from full column


residuals = y_obs - X_obs_a @ beta
sigma     = np.std(residuals)                 # FIXED: residual std
y_pred    = X_miss_a @ beta + rng.normal(0, sigma, size=len(X_miss))

Full-column std includes the variance explained by the model — much larger
Residual std is only the unexplained part — the correct noise scale
Oversized sigma produces imputed values far outside the plausible range

Data Imputation · Task 4

Comparing three imputation strategies

Group median

Simple & fast
Respects category structure
No variance preservation — creates artificial spikes at the median

PMM

Only real observed values are imputed
Preserves distribution shape
Requires enough observed neighbours (k donors)

Regresja stochastyczna

Restores variance via residual noise
Parametric — assumes linear relationship
Sensitive to predictor availability

Data Imputation · Task 5

Final inspection & export


msno.bar(df_clean, figsize=(12, 4), color='seagreen')
plt.title('Null counts after cleaning')
plt.tight_layout()
plt.show()

df_median.to_csv('../data/lokale_median.csv', index=False)
df_pmm.to_csv('../data/lokale_pmm.csv', index=False)
df_stochastic.to_csv('../data/lokale_stochastic.csv', index=False)
df_clean.to_csv('../data/lokale_clean.csv', index=False)

Re-run msno.bar as a final sanity check — no unexpected nulls remaining
Save one CSV per method for downstream comparison
index=False prevents an unnamed index column appearing in the output file

Data Cleaning & Preparation

Introduction to XML

What is XML?

Why it is used for exchange

Where you see XML

RCN export (excerpt)

Format overview

CSV

JSON

XML

Building blocks

Elements

Elements in RCN data

Attributes

Common attributes

Document structure

XML is a tree

One feature (schematic)

RCN_Lokal (trimmed)

Namespaces

Why prefixes?

xmlns: on the root

URIs and ElementTree

RCN in the file

What is in the export?

Six feature types (domain)

How records tie together

One transaction in XML

Why Data Preparation Matters

The Foundation of Analysis

Best Practices

What to pay attention to

Live Demo: XML to DataFrame

1. Loading & Filtering

2. Flattening the Data

3. Building the DataFrame

Now let's parse it

Solutions: Data Cleaning & Imputation

Audit missing values

Bug 1 — int64 cannot hold NaN

Bug 1 — int64 cannot hold NaN

Decode funkcjaLokalu codes

Detect & flag outliers

Bug 2 — IQR bounds on clipped data

Bug 2 — IQR bounds on clipped data

Bug 3 — global vs grouped median

Bug 3 — global vs grouped median

Predictive Mean Matching (PMM)

Predictive Mean Matching (PMM)

Bug 4 — PMM ignores missing predictors

Bug 4 — PMM ignores missing predictors

Stochastic Regression Imputation

Bug 5 — sigma from full column

Bug 5 — sigma from full column

Comparing three imputation strategies

Final inspection & export

`RCN_Lokal` (trimmed)

`xmlns:` on the root

Decode `funkcjaLokalu` codes