
FunTHOR

📄 Paper | 🌐 Project Page

FunTHOR is a synthetic dataset for functional 3D scene understanding, built on top of the AI2-THOR simulator. It provides part-level ground-truth geometry and dense, rule-based functional-relation annotations (e.g. knife slices apple, handle pulls to open door, stove knob turns on/off burner) for 12 indoor scenes, together with posed RGB-D sequences for each scene.

The dataset was introduced as a benchmark in our work on constructing probabilistic, open-vocabulary functional 3D scene graphs from posed RGB-D images. Compared to existing manually annotated datasets, FunTHOR is designed to provide dense annotations covering both part–object relations (a part operating its parent object) and object–object relations (one object acting on another).

Dataset at a glance

12 scenes (kitchens, living rooms, bedrooms, bathrooms) selected from AI2-THOR.
621 ground-truth nodes total (objects + functional parts), 92 of which are functional parts.
164 functional-relation edges total, annotated by transparent, inspectable rules.
60 posed RGB-D frames per scene (1200×680), randomly sampled from reachable viewpoints.
Object- and part-centric point clouds and an object–part hierarchy per scene.
A visible subset per scene that retains only nodes/edges observable from the sampled RGB-D frames.

Scene	nodes	parts	visible nodes	edges	frames
FloorPlan1_physics	117	32	113	45	60
FloorPlan5_physics	111	29	107	45	60
FloorPlan12_physics	76	6	73	12	60
FloorPlan202_physics	26	1	25	4	60
FloorPlan205_physics	39	1	39	5	60
FloorPlan206_physics	34	1	34	3	60
FloorPlan311_physics	41	3	41	8	60
FloorPlan313_physics	31	1	31	3	60
FloorPlan321_physics	28	1	28	3	60
FloorPlan401_physics	34	1	34	8	60
FloorPlan405_physics	40	6	37	12	60
FloorPlan422_physics	44	10	43	16	60
Total	621	92	605	164	720

Dataset structure

.
├── dataset_unique_labels.json       # all distinct object/part labels across the dataset
├── dataset_unique_relations.json    # all distinct functional-relation strings across the dataset
├── dataset_functional_labels.json   # labels that are categorized as functional elements for evaluation
├── annotation_rules/                # the rules used to auto-generate the functional edges (see below)
│   ├── functional_relations_config.json
│   └── manual_annotations/
│       └── FloorPlan*_physics.json
└── FloorPlan<ID>_physics/           # one folder per scene
    ├── node_list.pkl                # list of all ground-truth nodes (objects + parts)
    ├── object_metadata.json         # per-object metadata + object→parts hierarchy
    ├── annotations/                 # one JSON per node: maps node → point indices in pointcloud.ply
    │   └── node_XXXX_<Label>.json
    ├── annotations_aggregated.json  # all per-node annotations aggregated into one file
    ├── functional_relations.json    # ground-truth functional edges for this scene
    ├── pointcloud.ply               # dense scene point cloud (annotation indices reference this)
    ├── visible/                     # the visibility-filtered subset (see below)
    │   ├── node_list.pkl            # only nodes observable from the sampled frames
    │   └── visibility_stats.json    # per-node visible-point counts and visibility ratios
    └── dataset/                     # posed RGB-D capture for this scene
        ├── intrinsics.npy           # 3×3 pinhole camera intrinsics (shared by all frames)
        ├── color/000000.png … 000059.png   # RGB images, 1200×680, uint8
        ├── depth/000000.png … 000059.png   # depth images, 1200×680, uint16 (millimeters)
        └── pose/000000.npy … 000059.npy     # 4×4 camera-to-world matrices

Nodes (`node_list.pkl`)

Each scene's node_list.pkl is a pickled Python list of node dictionaries. Each node is either a whole object or a functional part. Fields:

key	type	description
`node_id`	int	unique node index within the scene
`object_id`	int	id of the owning object (a part shares its parent's `object_id`)
`label`	str	AI2-THOR label in UpperCamelCase, e.g. `StoveKnob`, `LightSwitch`
`is_part`	bool	`True` for functional parts (handles, knobs, buttons …), `False` for objects
`pcd`	`(N, 3)` float64	points sampled on the node's mesh surface, in world coordinates (meters)
`colors`	`(N, 3)` uint8	per-point RGB
`center`	`(3,)` float64	centroid of `pcd`

Nodes with label Undefined are placeholders and are typically skipped by loaders.
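A minimal loading sketch, assuming only the fields listed in the table above. The helper name `load_nodes` and its `skip_undefined` flag are illustrative and not part of any released loader. Unpickling the real files requires NumPy to be installed (the `pcd`, `colors`, and `center` fields are arrays), and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_nodes(scene_dir, skip_undefined=True):
    """Load node_list.pkl from a scene folder and split it into objects and parts.

    `load_nodes` and `skip_undefined` are hypothetical names used for this sketch.
    """
    with open(Path(scene_dir) / "node_list.pkl", "rb") as f:
        nodes = pickle.load(f)  # a list of node dictionaries

    if skip_undefined:
        # Placeholder nodes carry the label "Undefined" and are dropped here.
        nodes = [n for n in nodes if n["label"] != "Undefined"]

    objects = [n for n in nodes if not n["is_part"]]
    parts = [n for n in nodes if n["is_part"]]
    return objects, parts
```

Calling `load_nodes("FloorPlan1_physics")` would return the whole objects and the functional parts of that scene as two separate lists.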

Object metadata & part hierarchy (`object_metadata.json`)

A list of object records, one per object_id. Each record has the fields object_id, label, has_parts_annotation, and parts (the list of part labels belonging to the object). Together these encode the object→part hierarchy referenced by the node list.
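The hierarchy can be turned into a lookup table keyed by `object_id`. This is a sketch under the assumption that the file is a JSON list of records with exactly the four fields named above; the function name `load_part_hierarchy` is hypothetical.

```python
import json
from pathlib import Path


def load_part_hierarchy(scene_dir):
    """Return {object_id: {"label": str, "parts": [part labels]}}."""
    with open(Path(scene_dir) / "object_metadata.json") as f:
        records = json.load(f)

    hierarchy = {}
    for rec in records:
        # Assumption: `parts` may be empty or absent when has_parts_annotation is false.
        parts = list(rec.get("parts") or []) if rec.get("has_parts_annotation") else []
        hierarchy[rec["object_id"]] = {"label": rec["label"], "parts": parts}
    return hierarchy
```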

Per-node point annotations (`annotations/`)

Each annotations/node_XXXX_<Label>.json maps a node to its supporting points as indices into the scene's pointcloud.ply:

{ "node_id": 0, "object_id": 0, "label": "Tomato", "is_part": false, "point_indices": [50000, 50001, ...] }

annotations_aggregated.json contains the same information for all nodes in a single file.
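A short sketch of how the per-node files might be consumed, assuming each file has the layout shown in the example record. The helper names are hypothetical. Reading pointcloud.ply itself is left to whichever PLY reader you prefer; `select_points` only needs an indexable sequence of points.

```python
import json
from pathlib import Path


def load_point_indices(scene_dir):
    """Map node_id -> list of point indices, read from annotations/node_*.json."""
    index = {}
    for path in sorted((Path(scene_dir) / "annotations").glob("node_*.json")):
        with open(path) as f:
            ann = json.load(f)
        index[ann["node_id"]] = ann["point_indices"]
    return index


def select_points(points, point_indices):
    """Pick one node's points out of the scene cloud (any indexable sequence of xyz)."""
    return [points[i] for i in point_indices]
```

If the scene cloud is held in a NumPy array, `points[point_indices]` performs the same selection in one step.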

Functional relations (`functional_relations.json`)

A list of directed functional edges. Each edge connects a subject node (first_*) to an object node (second_*):

{
  "relation_type": "exact_match",
  "first_node_id": 40,  "first_object_id": 31, "first_label": "Faucet",
  "relation": "fill with water",
  "second_node_id": 96, "second_object_id": 68, "second_label": "Kettle"
}

relation_type is one of exact_match, proximity_based, part_based, or manual_annotation (see Annotation rules below). Node ids reference node_list.pkl. The dataset-level dataset_unique_relations.json lists all distinct relation strings.
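As an illustration, the edge list can be read and summarized with the standard library alone. The function names are hypothetical; the field names are the ones shown in the example edge.

```python
import json
from collections import Counter
from pathlib import Path


def load_relations(scene_dir):
    """Read the list of functional edges for one scene."""
    with open(Path(scene_dir) / "functional_relations.json") as f:
        return json.load(f)


def as_triplet(edge):
    """Reduce an edge to its (subject label, relation, object label) triplet."""
    return (edge["first_label"], edge["relation"], edge["second_label"])


def count_by_type(edges):
    """Count edges per relation_type (exact_match, proximity_based, ...)."""
    return Counter(e["relation_type"] for e in edges)
```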

Visible subset (`visible/`)

Because some objects are never observed from the sampled viewpoints (e.g. items inside closed cabinets), each scene also ships a visibility-filtered version intended for evaluation. visible/node_list.pkl holds the visible nodes and visible/visibility_stats.json records the per-node visible-point counts and visibility ratios.
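One plausible way to restrict the ground-truth edges to the visible subset is to keep an edge only when both of its endpoints appear in visible/node_list.pkl. This is a sketch, not the documented evaluation protocol: it assumes that visible nodes keep the same `node_id` they have in the full node list, and the function name is hypothetical.

```python
def filter_visible_edges(edges, visible_nodes):
    """Keep edges whose subject and object nodes are both in the visible node list.

    Assumption: node_id values in visible/node_list.pkl match those in the
    full node_list.pkl and in functional_relations.json.
    """
    visible_ids = {n["node_id"] for n in visible_nodes}
    return [
        e for e in edges
        if e["first_node_id"] in visible_ids and e["second_node_id"] in visible_ids
    ]
```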

Coordinate systems

World / scene frame. All node point clouds, centers, and pointcloud.ply are expressed in a single, metric, right-handed, Z-up world frame (units: meters). Note this differs from AI2-THOR's native Unity convention (left-handed, Y-up); the released data has already been converted to the Z-up right-handed frame above.

Camera frame. dataset/pose/NNNNNN.npy is a 4×4 camera-to-world transform T_wc (rotation has determinant +1). The camera uses the OpenCV convention: +x right, +y down, +z forward (into the scene).

Intrinsics. dataset/intrinsics.npy stores a single 3×3 pinhole matrix shared by all frames of a scene. Depth PNGs are 16-bit and store depth in millimeters (divide by 1000 for meters); a value of 0 denotes a missing or invalid measurement.
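The conventions above are enough to lift a depth frame into the world frame. The sketch below assumes a zero-skew pinhole model and that the stored value is depth along the camera's +z axis rather than distance along the viewing ray; neither is stated explicitly here, so check both against the data before relying on them. The function name is hypothetical, and the depth image is expected as an already-loaded uint16 array.

```python
import numpy as np


def backproject_depth(depth_mm, K, T_wc):
    """Lift a depth image in millimeters to world-frame points in meters.

    depth_mm: (H, W) uint16 array, 0 marks invalid pixels.
    K:        (3, 3) pinhole intrinsics (zero skew assumed).
    T_wc:     (4, 4) camera-to-world transform, OpenCV camera axes.
    """
    depth_m = depth_mm.astype(np.float64) / 1000.0
    v, u = np.nonzero(depth_m > 0)  # row (v) and column (u) of valid pixels
    z = depth_m[v, u]

    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * z / fx  # +x right
    y = (v - cy) * z / fy  # +y down

    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)  # (4, N) homogeneous
    return (T_wc @ pts_cam)[:3].T  # (N, 3) world coordinates
```

For a frame `NNNNNN`, `K` would come from `np.load` on dataset/intrinsics.npy and `T_wc` from `np.load` on dataset/pose/NNNNNN.npy.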

Annotation rules (`annotation_rules/`)

The functional edges are produced automatically and transparently from a small set of inspectable rules, rather than hand-labeled per scene. We include the exact rule configuration used to generate this release in annotation_rules/ so that the annotation logic is fully reproducible and auditable.

Each rule is a functional triplet (first_label, relation, second_label). Rules are grouped by matching strategy (annotation_rules/functional_relations_config.json):

exact_match_relations — annotate an edge whenever a node's label exactly matches first_label and another node's label exactly matches second_label (e.g. Knife → can slice or cut → Apple; Faucet → fill with water → Kettle).
proximity_based_relations — for each subject node, connect it to the nearest node matching second_label, provided the distance between centers is below a threshold (default 1 m, with optional per-rule distance_threshold overrides). Matching is greedy and globally distance-ordered so the result does not depend on rule ordering (e.g. Faucet → run water into → Sink; Faucet → run water into → Bathtub).
part_based_relations — for objects with toggleable/openable AI2-THOR properties that expose explicit functional parts, connect the part to its parent object (e.g. Lever → push down to start toasting → Toaster; Handle → pull to open → Door).
manual annotations — a few semantically ambiguous associations are recorded by hand in annotation_rules/manual_annotations/<scene>.json (currently only stove-knob → stove-burner pairings, with the relation turn on/off).
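The proximity-based strategy can be sketched as follows. This is an illustrative reimplementation, not the released annotation code. The rule layout (dicts with `first_label`, `relation`, `second_label`, and an optional `distance_threshold`) is an assumption and may not match the actual schema of functional_relations_config.json. The sketch also assumes that each subject node receives at most one proximity edge and that a target may be shared by several subjects.

```python
import math


def proximity_edges(nodes, rules, default_threshold=1.0):
    """Greedy, globally distance-ordered proximity matching (illustrative sketch)."""
    candidates = []
    for rule in rules:
        limit = rule.get("distance_threshold", default_threshold)
        for s in nodes:
            if s["label"] != rule["first_label"]:
                continue
            for t in nodes:
                if t["label"] != rule["second_label"] or t["node_id"] == s["node_id"]:
                    continue
                d = math.dist(tuple(s["center"]), tuple(t["center"]))
                if d < limit:
                    candidates.append((d, s["node_id"], t["node_id"], rule["relation"]))

    # Sorting all candidates by distance makes the result independent of rule order.
    candidates.sort()

    edges, used_subjects = [], set()
    for d, s_id, t_id, relation in candidates:
        if s_id in used_subjects:
            continue  # Assumption: one proximity edge per subject node.
        used_subjects.add(s_id)
        edges.append({
            "relation_type": "proximity_based",
            "first_node_id": s_id,
            "relation": relation,
            "second_node_id": t_id,
        })
    return edges
```

Under these assumptions, a faucet that sits within the threshold of both a sink and a bathtub is linked only to whichever is closer, regardless of which rule is listed first.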

The functional-relation rule set was constructed by referencing the AI2-THOR object type documentation, in particular each object type's Actionable Properties (e.g. sliceable, toggleable, openable, fillable), to decide which functional triplets are physically plausible.

Credits and acknowledgements

Ground-truth meshes and scenes. The object CAD models and AI2-THOR scenes used to generate FunTHOR's object- and part-centric ground-truth annotations come from the hssd/ai2thor-hab dataset (AI2-THOR–Habitat). We decomposed and re-annotated relevant assets into semantically meaningful parts to build the part-aware geometry. We gratefully acknowledge the HSSD / AI2-THOR–Habitat authors.
AI2-THOR. Scenes and the simulation infrastructure are based on AI2-THOR (Kolve et al., 2017).

Citation

If you use FunTHOR, please cite our paper:

@inproceedings{Fu_2026_funfact,
  title     = {FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning},
  author    = {Fu, Zhengyu and Zurbr\"ugg, Ren\'e and Qu, Kaixian and Pollefeys, Marc and Hutter, Marco and
               Blum, Hermann and Bauer, Zuria},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026}
}