00
Poznań University of Economics and Business · Real Estate Market Analysis




Data Cleaning & Preparation

Understanding the format behind GML spatial data files
and loading them into Python

01
Data Cleaning & Preparation · PUEB REMA

Introduction to XML

Vocabulary, then a short RCN GML fragment. Syntax and namespaces follow in later sections.

What is XML?

Hierarchical text · elements and attributes

XML is plain-text markup: nested elements, optional attributes, one document tree. You define tags or adopt a standard (e.g. GML); a parser builds the tree.
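
A minimal sketch of "a parser builds the tree", using Python's standard-library ElementTree. The toy document and its tag names are invented for illustration and are not taken from the RCN export.

```python
import xml.etree.ElementTree as ET

# Toy document, invented for illustration (not RCN data).
doc = '<flat id="LOK_1"><city>Poznań</city><pow uom="m2">60.25</pow></flat>'

root = ET.fromstring(doc)        # parse the text into an element tree

print(root.tag, root.attrib)     # the root element and its attributes
for child in root:               # direct children, in document order
    print(child.tag, child.attrib, child.text)
```

Iterating over an element yields only its direct children; deeper levels are reached by iterating again or by a path query.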

Why it is used for exchange

  • Self-describing — tags name the data; you can read structure without a separate data dictionary.
  • Validation — schemas (e.g. XSD) define allowed elements and types before data enters a pipeline.
  • Rich fields — attributes carry units, IDs, language, or “nil” flags next to values.
  • Interoperability — long track record in government, GIS (OGC/ISO), and regulated sectors.

Where you see XML

Beyond this course

  • Geodata and graphics: GML, KML, SVG
  • Office and tools: OOXML (.docx / .xlsx), RSS, Android layouts, build files
  • Enterprise: SOAP, B2B payloads where schemas matter

RCN export (excerpt)

Declaration (line 1) · comments (2–6, ignored by parsers) · <gml:boundedBy … /> = empty element
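
A small sketch of how these three pieces behave in ElementTree. The fragment is a hypothetical stand-in, not the real export; only the gml namespace URI is real. With the default parser, the declaration and the comments do not appear as elements, and an empty element has no text and no children.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in fragment (bytes, because it carries an encoding declaration).
doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- a comment before the root: skipped by the default parser -->
<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml/3.2">
  <!-- a comment inside the root: also skipped -->
  <gml:boundedBy/>
</gml:FeatureCollection>"""

root = ET.fromstring(doc)
children = list(root)            # only real elements, no comments
empty = children[0]

print(len(children))             # number of child elements
print(empty.tag)                 # boundedBy, written with its full namespace
print(empty.text, len(empty))    # no text, no children
```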

02

Format overview

One example record (LOK_1) as CSV, JSON, and XML. Later material assumes XML because GML is XML.

CSV

Comma-separated values · flat tables

Comma-Separated Values — one row per record, columns for fields. Typical for spreadsheets and SQL exports.


id,pow,cena
LOK_1,60.25,750000
  • Works well for quick analysis in pandas or Excel.
  • No native hierarchy: nested data usually means extra columns, encoding tricks, or another file and a join.
  • Units and rich geometry have no standard column — often WKT in text or separate layers.
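
A short sketch of what "flat table" means in practice, reading the two CSV lines above with the standard-library csv module. The text is held in memory here instead of a file. Every value arrives as a string, and the unit of pow is not stored anywhere in the file.

```python
import csv
import io

# The two CSV lines from the slide, held in memory instead of a file.
text = "id,pow,cena\nLOK_1,60.25,750000\n"

rows = list(csv.DictReader(io.StringIO(text)))
record = rows[0]
print(record)                    # every value is a string

# Types are not stored in the file, so convert explicitly.
area = float(record["pow"])
price = int(record["cena"])
print(price / area)              # price per unit of area; the unit itself is unknown
```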

JSON

JavaScript Object Notation · nested objects

JSON — objects { }, arrays [ ], key–value pairs. Common in REST APIs and document stores.


{
  "id": "LOK_1",
  "pow": 60.25,
  "uom": "m2",
  "cena": 750000
}
  • Nesting maps cleanly to Python dicts and JavaScript objects.
  • Units or typed values often need extra keys or wrapper objects.
  • GeoJSON is widely used; cadastre-grade pipelines often still use GML/XML.
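
A sketch of the same record loaded with the standard-library json module. Numbers keep their type, but the link between "uom" and "pow" is only a naming convention in this example; nothing in JSON itself ties the two keys together.

```python
import json

# The JSON object from the slide, as one line of text.
text = '{"id": "LOK_1", "pow": 60.25, "uom": "m2", "cena": 750000}'

record = json.loads(text)        # becomes a plain Python dict
print(type(record["pow"]).__name__, type(record["cena"]).__name__)

# The unit is a sibling key of the value it describes.
print(f'{record["pow"]} {record["uom"]}')
```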

XML

eXtensible Markup Language · tagged trees

XML — elements in angle brackets, attributes, a strict document tree. Vocabularies such as GML are defined with schemas (e.g. XSD).

  • Attributes such as uom="m2" can sit on the same element as the value.
  • Child elements express containment without a second table.
  • GML, KML, SVG, CityGML, and many government feeds use XML; XSD can validate structure before ingest.
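
For comparison with the CSV and JSON slides, a sketch that writes the LOK_1 record as XML with ElementTree. The element names lokal and cena are assumptions chosen to mirror the CSV columns; only the pow / uom pattern appears elsewhere in this deck.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, mirroring the CSV columns.
lokal = ET.Element("lokal", {"id": "LOK_1"})

pow_el = ET.SubElement(lokal, "pow", {"uom": "m2"})   # unit sits on the value's element
pow_el.text = "60.25"

cena_el = ET.SubElement(lokal, "cena")
cena_el.text = "750000"

xml_text = ET.tostring(lokal, encoding="unicode")
print(xml_text)
```

The unit now travels on the same element as the number, which neither the CSV nor the JSON version guarantees.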
03

Building blocks

Elements and attributes — toy examples first, then the same ideas in RCN GML.

Elements

Opening tag · content · closing tag


<city>Poznań</city>
  • <city> — opening tag
  • Poznań — text content
  • </city> — closing tag

Elements in RCN data

Same structure — longer names, namespace prefix

rcn: is a namespace prefix; the element is still opening tag · text · closing tag. Prefixes are explained in full in the namespaces section.

Attributes

Bare value → unit metadata on the opening tag


<pow>60.25</pow>

60.25 has no unit (m², ha, km²…) — for analysis, units must travel with the number, not only in a separate column or comment.

Put metadata on the same element


<pow uom="m2">60.25</pow>

Value unchanged; uom="m2" is unit of measure (real pattern from RCN_Lokal).

Syntax: name="value" inside the opening tag; the value is always quoted (single or double quotes); several attributes are separated by whitespace, and their order does not matter.
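
A sketch of reading attributes with ElementTree, using the two pow fragments above. The pos element in the second half is a hypothetical example used only to show that attribute order is irrelevant.

```python
import xml.etree.ElementTree as ET

bare = ET.fromstring("<pow>60.25</pow>")
with_unit = ET.fromstring('<pow uom="m2">60.25</pow>')

print(bare.get("uom"))           # None: the unit is simply not there
print(with_unit.get("uom"))      # the unit, read from the same element
print(float(with_unit.text))     # the value itself is unchanged

# Hypothetical element: same attributes, different order.
a = ET.fromstring('<pos srsDimension="2" count="4"/>')
b = ET.fromstring('<pos count="4" srsDimension="2"/>')
print(a.attrib == b.attrib)
```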

Common attributes

Fragments of real geometry and property fields

  • srsName, gml:id — CRS + stable id
  • count, srsDimension — how many coordinates, 2D vs 3D
  • uom — units (check before comparing numbers)
  • xsi:nil — “no value here” (e.g. missing geometry)
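
A sketch of checking these attributes in code. The fragment is made up: the element name geometria, the id p1 and the srsName value are assumptions, not checked against the real file. Prefixed attributes such as xsi:nil and gml:id are looked up by their full namespace URI in braces.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
GML = "http://www.opengis.net/gml/3.2"

# Made-up fragment; 'geometria', 'p1' and the srsName value are assumptions.
doc = f"""<root xmlns:xsi="{XSI}" xmlns:gml="{GML}">
  <geometria xsi:nil="true"/>
  <geometria><gml:Point gml:id="p1" srsName="EPSG:2177" srsDimension="2"/></geometria>
</root>"""

def is_nil(elem):
    """True when the element is flagged as carrying no value."""
    return elem.get(f"{{{XSI}}}nil") in ("true", "1")

root = ET.fromstring(doc)
nil_flags = [is_nil(g) for g in root.findall("geometria")]
print(nil_flags)

point = root.find(f"geometria/{{{GML}}}Point")
print(point.get(f"{{{GML}}}id"), point.get("srsName"), point.get("srsDimension"))
```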
04

Document structure

One root element, parent–child nesting, and paths to data — then a real RCN_Lokal fragment.

XML is a tree

Not a single table — a hierarchy of elements

Each element has exactly one parent, except the root, which has none. Children sit entirely inside their parent; that nesting is how “this address belongs to this flat” is represented.

  • Root — one outer element wraps the whole document (GML: gml:FeatureCollection).
  • Records — wrappers like gml:featureMember and typed features such as rcn:RCN_Lokal.
  • Paths — nested fields are reached by walking the tree (e.g. … → RCN_Adres → miejscowosc).
  • Well-formed — matching tags, proper nesting, one root; malformed XML fails in the parser (official RCN exports are well-formed).
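
A sketch of what "fails in the parser" looks like with ElementTree. The three strings are toy inputs; the exact wording of the error messages depends on the underlying parser, so it is printed rather than relied on.

```python
import xml.etree.ElementTree as ET

# Toy inputs: one well-formed document and two malformed ones.
samples = {
    "well-formed": "<a><b>1</b></a>",
    "bad nesting": "<a><b>1</a></b>",
    "two roots": "<a/><b/>",
}

results = {}
for label, text in samples.items():
    try:
        ET.fromstring(text)
        results[label] = "ok"
    except ET.ParseError as err:      # raised for any malformed input
        results[label] = f"ParseError: {err}"
    print(label, "->", results[label])
```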

One feature (schematic)

Scalars vs nested block (address)


FeatureCollection
 └─ featureMember
     └─ RCN_Lokal
          ├─ idLokalu · rooms · floor · area · price …
          └─ adresBudynkuZLokalem
               └─ RCN_Adres → miejscowosc · ulica · numer …

Flat fields sit directly under RCN_Lokal; the address is a subtree (container → RCN_Adres → fields).
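
A sketch of walking that path in code. Element names follow the schematic; the values are made up, and the assumption that every child element sits in the rcn namespace should be checked against the real file.

```python
import xml.etree.ElementTree as ET

ns = {
    "gml": "http://www.opengis.net/gml/3.2",
    "rcn": "urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0",
}

# Made-up values; structure follows the schematic.
doc = """<gml:FeatureCollection
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:rcn="urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0">
  <gml:featureMember>
    <rcn:RCN_Lokal gml:id="LOK_1">
      <rcn:idLokalu>LOK_1</rcn:idLokalu>
      <rcn:adresBudynkuZLokalem>
        <rcn:RCN_Adres>
          <rcn:miejscowosc>Poznań</rcn:miejscowosc>
          <rcn:ulica>Przykładowa</rcn:ulica>
          <rcn:numer>1</rcn:numer>
        </rcn:RCN_Adres>
      </rcn:adresBudynkuZLokalem>
    </rcn:RCN_Lokal>
  </gml:featureMember>
</gml:FeatureCollection>"""

root = ET.fromstring(doc)
lokal = root.find("gml:featureMember/rcn:RCN_Lokal", ns)

# Flat field: one step below RCN_Lokal.
unit_id = lokal.findtext("rcn:idLokalu", namespaces=ns)

# Nested field: container, then RCN_Adres, then the field.
city = lokal.findtext(
    "rcn:adresBudynkuZLokalem/rcn:RCN_Adres/rcn:miejscowosc", namespaces=ns
)
print(unit_id, city)
```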

RCN_Lokal (trimmed)

Real excerpt — indentation follows depth

  • 1 / 14 — feature open/close + gml:id (stable id in the file).
  • 2–6 — scalar fields; area carries uom="m2".
  • 7–13 — nested address block (adresBudynkuZLokalem → RCN_Adres → fields).
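
One possible way to turn such a feature into a single flat record, sketched under assumptions: the helper names local and flatten are hypothetical, the sample values are made up, and the dotted-key and @attribute naming is only one convention among many.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the {namespace} part from a tag or attribute name."""
    return tag.rsplit("}", 1)[-1]

def flatten(elem, prefix=""):
    """Collect leaf values under elem into a flat dict with dotted keys."""
    row = {}
    for child in elem:
        name = prefix + local(child.tag)
        if len(child):                       # has children: a nested block
            row.update(flatten(child, name + "."))
        else:                                # leaf: keep text and attributes
            row[name] = (child.text or "").strip()
            for key, value in child.attrib.items():
                row[f"{name}@{local(key)}"] = value
    return row

# Made-up, heavily trimmed feature.
doc = """<rcn:RCN_Lokal
    xmlns:rcn="urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0">
  <rcn:idLokalu>LOK_1</rcn:idLokalu>
  <rcn:pow uom="m2">60.25</rcn:pow>
  <rcn:adresBudynkuZLokalem>
    <rcn:RCN_Adres>
      <rcn:miejscowosc>Poznań</rcn:miejscowosc>
    </rcn:RCN_Adres>
  </rcn:adresBudynkuZLokalem>
</rcn:RCN_Lokal>"""

row = flatten(ET.fromstring(doc))
print(row)
```

A list of such dicts can be passed to a table library later; the unit stays next to the value as its own column.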
05

Namespaces

Prefixes (gml:, rcn:, …) point at standard vocabularies; you bind them once, then query with full names in code.

Why prefixes?

Many standards · same local names · different meaning

GML, RCN, XSD, and others all define tags like id or name. A namespace ties each prefix to one vocabulary, so gml:id (geometry id) and rcn:id (cadastre id) never clash even though they share the local name id. prefix:localName is a QName: the colon separates the prefix from the local name, and the prefix is only shorthand for the namespace URI.

  • gml: — OGC geometry / exchange shell (FeatureCollection, Polygon, …).
  • rcn: — Polish RCN types (RCN_Lokal, RCN_Transakcja, …).
  • xsi: — schema instance: xsi:nil, schemaLocation, …

xmlns: on the root

Each line binds a prefix to a URI — a stable vocabulary id

  • gml / rcn / xsi — main geometry, RCN features, schema hints (see previous slide).
  • xlink: the xlink:href attribute points at another feature's gml:id.
  • gmd / gco / gts — ISO metadata / time; often declared; fewer direct paths in exercises.

Declared once on FeatureCollection — like imports for the whole document.
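
A sketch of listing the prefix bindings declared on the root, using ElementTree's iterparse with the start-ns event. The root element here is a made-up stand-in with three real namespace URIs; for the course file, a file path would take the place of the in-memory bytes.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, empty root element carrying three declarations.
doc = b"""<gml:FeatureCollection
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>"""

declared = {}
for event, payload in ET.iterparse(io.BytesIO(doc), events=("start-ns", "start")):
    if event == "start":          # root element reached: stop reading
        break
    prefix, uri = payload         # a start-ns event carries (prefix, uri)
    declared[prefix] = uri

for prefix, uri in declared.items():
    print(prefix, "->", uri)
```

Stopping at the first start event avoids reading the rest of a large file just to see its declarations.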

URIs and ElementTree

Identifiers — not necessarily pages to open in a browser

http://… and urn:… strings are globally unique vocabulary names. Parsing does not require downloading them.


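import xml.etree.ElementTree as ET

# Prefix -> namespace URI. These prefixes are keys of this dict only; they need not match the file.
ns = {
    'gml':   'http://www.opengis.net/gml/3.2',
    'rcn':   'urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0',
    'xlink': 'http://www.w3.org/1999/xlink',
}
tree = ET.parse('Baza_danych_RCN_Poznan_2021-2025.gml')  # the course file
# RCN_Lokal features inside featureMember children of the root
tree.findall('gml:featureMember/rcn:RCN_Lokal', ns)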

Elements may appear as {http://www.opengis.net/gml/3.2}Polygon — Clark notation — in errors and repr.
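
A sketch of where Clark notation shows up and how to take it apart. The helper split_clark is a hypothetical name, and the gml:id value g1 is made up.

```python
import xml.etree.ElementTree as ET

doc = '<gml:Polygon xmlns:gml="http://www.opengis.net/gml/3.2" gml:id="g1"/>'
elem = ET.fromstring(doc)

print(elem.tag)                  # the prefix is gone; the full URI is stored
print(list(elem.attrib))         # prefixed attributes are stored the same way

def split_clark(tag):
    """Split '{uri}local' into (uri, local); unqualified names get uri None."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag

uri, local = split_clark(elem.tag)
print(uri, local)

# ET.QName builds the same string from its two parts.
print(ET.QName(uri, local).text == elem.tag)
```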

06

RCN in the file

What the Poznań .gml represents in business terms: sales, properties, parcels, buildings, flats, and deeds — as linked records, not one nested story.

What is in the export?

Rejestr Cen Nieruchomości — price and transaction facts

The file is a long sequence of gml:featureMember blocks. Each block is one business object (one transaction, one parcel, one apartment, …). The same real-world deal is split across several such objects, connected by identifiers — not by putting everything inside one big XML subtree.

  • You will filter by type (RCN_Transakcja, RCN_Lokal, …) and follow links to assemble a full case.
  • Geometry and address detail live on parcel / building / unit features; the transaction row carries price and pointers.
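
A sketch of filtering by type on a made-up miniature file with three empty features. The ids are invented; the type names and namespace URIs are the ones used in this deck.

```python
from collections import Counter
import xml.etree.ElementTree as ET

ns = {
    "gml": "http://www.opengis.net/gml/3.2",
    "rcn": "urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0",
}

# Made-up miniature file: one transaction, two flats.
doc = """<gml:FeatureCollection
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:rcn="urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0">
  <gml:featureMember><rcn:RCN_Transakcja gml:id="T1"/></gml:featureMember>
  <gml:featureMember><rcn:RCN_Lokal gml:id="L1"/></gml:featureMember>
  <gml:featureMember><rcn:RCN_Lokal gml:id="L2"/></gml:featureMember>
</gml:FeatureCollection>"""

root = ET.fromstring(doc)

# How many features of each type does the file hold?
counts = Counter(
    feature.tag.rsplit("}", 1)[-1]           # local name without the namespace
    for member in root.findall("gml:featureMember", ns)
    for feature in member
)
print(counts)

# Keep only one type.
lokale = root.findall("gml:featureMember/rcn:RCN_Lokal", ns)
print(len(lokale))
```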

Six feature types (domain)

Polish tag names — plain-language role

  • RCN_Transakcja — one reported deal: transaction price, type (e.g. sale vs other), links to the legal deed and to the property aggregate.
  • RCN_Nieruchomosc — the economic property sold as a bundle (may combine plot, building, unit; totals and roll-up attributes).
  • RCN_Dzialka — a cadastral plot (land parcel): identifiers, land use, polygon geometry.
  • RCN_Budynek — a building on a plot: footprint geometry, type/classification, address hooks.
  • RCN_Lokal — a dwelling unit (flat): rooms, floor, usable area, unit price — what you use for typical apartment analytics.
  • RCN_Dokument — the notarial deed (number, date, notary office) that is the legal basis of the transaction.

How records tie together

A graph of IDs — not one nested XML document per deal

Each feature has a gml:id. Elsewhere, xlink:href attributes store references to another feature's id (deed, property bundle, parcel, …). So the business view is: many rows + foreign-key-style links, same idea as joins in a database.

Typical path to “full picture”: Transakcja → nieruchomosc href → RCN_Nieruchomosc → hrefs to Dzialka / Budynek / Lokal; podstawaPrawna → Dokument.
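
A sketch of following such links in code: index every feature by gml:id, then resolve each xlink:href against that index. The miniature file is made up. The ids, the child name lokal, and the "#id" form of the href values are assumptions to be checked against the real export; resolve and local are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

ns = {
    "gml": "http://www.opengis.net/gml/3.2",
    "rcn": "urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0",
}
GML_ID = "{http://www.opengis.net/gml/3.2}id"    # gml:id in Clark notation
HREF = "{http://www.w3.org/1999/xlink}href"      # xlink:href in Clark notation

# Made-up miniature file; ids and the child name 'lokal' are assumptions.
doc = """<gml:FeatureCollection
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:rcn="urn:gugik:specyfikacje:gmlas:rejestrcennieruchomosci:1.0">
  <gml:featureMember>
    <rcn:RCN_Transakcja gml:id="T1">
      <rcn:podstawaPrawna xlink:href="#D1"/>
      <rcn:nieruchomosc xlink:href="#N1"/>
    </rcn:RCN_Transakcja>
  </gml:featureMember>
  <gml:featureMember><rcn:RCN_Dokument gml:id="D1"/></gml:featureMember>
  <gml:featureMember>
    <rcn:RCN_Nieruchomosc gml:id="N1">
      <rcn:lokal xlink:href="#L1"/>
    </rcn:RCN_Nieruchomosc>
  </gml:featureMember>
  <gml:featureMember><rcn:RCN_Lokal gml:id="L1"/></gml:featureMember>
</gml:FeatureCollection>"""

root = ET.fromstring(doc)

# Index every feature by its gml:id, like a primary key.
by_id = {
    feature.get(GML_ID): feature
    for member in root.findall("gml:featureMember", ns)
    for feature in member
}

def resolve(link):
    """Return the feature a link element's xlink:href points at, or None."""
    href = link.get(HREF)
    return by_id.get(href.lstrip("#")) if href else None

def local(elem):
    """Element name without its namespace."""
    return elem.tag.rsplit("}", 1)[-1]

transakcja = by_id["T1"]
dokument = resolve(transakcja.find("rcn:podstawaPrawna", ns))
nieruchomosc = resolve(transakcja.find("rcn:nieruchomosc", ns))
parts = [resolve(child) for child in nieruchomosc if child.get(HREF)]

print(local(dokument), local(nieruchomosc), [local(p) for p in parts])
```

The dictionary plays the role of a primary-key index, and each resolve call is one step of a join.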

One transaction in XML

Price on the row · rest via xlink:href

Wrapper + id · scalars (incl. gross price) · link to deed · link to property aggregate — then follow those ids elsewhere in the file.

07
Up next

Now let's parse it

Open gml_xml_tasks.ipynb and work with
Baza_danych_RCN_Poznan_2021-2025.gml