Geospatial (Web Search) Query Detector

A binary SetFit classifier that distinguishes geospatial from non-geospatial web search queries. It was trained on 1,200 gold-labelled MS MARCO web search queries, whose labels were generated with weak supervision from Llama 3.1 and then manually verified. A preprint of the COSIT 2026 paper is available at https://arxiv.org/abs/2605.11336

The model achieves F1 = 0.931 on a held-out test set of 800 samples (421 non-spatial, 379 spatial). The model used for this evaluation was trained on 200 samples (105 non-spatial, 95 spatial); the deployed model was trained on the full 1,200 samples.
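For readers unfamiliar with the metric, here is a minimal sketch of how a binary F1 score can be computed from gold labels and predictions. It assumes the reported F1 refers to the geospatial (label 1) class, which this card does not state explicitly; the function name `binary_f1` and the toy labels are hypothetical and not the model's actual outputs.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only (assumption: 1 = geospatial, 0 = non-geospatial).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
precision, recall, f1 = binary_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

If the paper reports a macro-averaged F1 instead, the same function could be called once per class and the two F1 values averaged.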

What counts as a geospatial query?

As per Mai et al. (2021) and Kefalidis et al. (2024), a query is geospatial if it requires qualitative or quantitative geographic knowledge of Earth-bound features to be answered.

This is usually the case if the query involves:

  • A geographic entity (named place on Earth: city, country, river, POI, address)
  • A geographic concept (place type: city, lake, mountain, park, building)
  • A spatial relation (near, within, north of, between, borders, crosses, distance)

Non-geospatial queries include anatomical, microscopic, astronomical, fictional, or abstract 'where' questions, and any query that needs no geographic knowledge to answer.
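To illustrate why the definition above is hard to capture with rules, here is a toy keyword check built from the spatial relation terms listed earlier. It is a hypothetical baseline for illustration only, not part of this model; the lexicon `SPATIAL_RELATIONS` and the function name are assumptions.

```python
import re

# Spatial relation terms copied from the list above (an incomplete, hypothetical lexicon).
SPATIAL_RELATIONS = ["near", "within", "north of", "between", "borders", "crosses", "distance"]

# Match any term as a whole word or phrase, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in SPATIAL_RELATIONS) + r")\b",
    re.IGNORECASE,
)


def has_spatial_relation_term(query):
    """Return True if the query contains one of the listed spatial relation terms."""
    return _PATTERN.search(query) is not None


for q in ["hotels near the station", "between a rock and a hard place", "capital of France"]:
    print(q, "->", has_spatial_relation_term(q))
```

The check flags the idiom "between a rock and a hard place" and misses "capital of France", which contains a geographic entity but no relation term. Cases like these are what a learned classifier is meant to handle.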

Model details

  • Sentence Transformer body: BAAI/bge-small-en-v1.5
  • Classification head: LogisticRegression
  • Training data: 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids across the full embedding space of all 1M+ queries for representativeness
  • Labels: 1 = geospatial, 0 = non-geospatial
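The centroid-based sampling mentioned above might look roughly like the sketch below: cluster the embeddings with K-means, then keep the item closest to each centroid. This is an assumption about the procedure, since the card does not spell out the selection step. The code uses toy 2-D points in place of real sentence embeddings, and a plain Lloyd's algorithm with a fixed initialisation in place of a library implementation.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids


def representatives(points, centroids):
    """Return the index of the point closest to each centroid."""
    return [
        min(range(len(points)), key=lambda i: math.dist(points[i], c))
        for c in centroids
    ]


# Toy 2-D "embeddings" forming two well-separated groups (illustrative data only).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]
centroids = kmeans(points, k=2)
picked = representatives(points, centroids)
print(picked)
```

Picking one item per centroid spreads the labelled sample across the embedding space rather than concentrating it in the densest regions.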

Usage

from setfit import SetFitModel

# Load the classifier from the Hugging Face Hub
model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")

# Predict labels: 1 = geospatial, 0 = non-geospatial
preds = model([
  "nearest hospital",
  "far from the truth",
  "close to my heart",
  "flood risk in this area"
])
# => [1, 0, 0, 1]
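A common follow-up is to keep only the geospatial queries from a batch. The sketch below assumes the predictor returns one 0/1 label per query, as in the example above. The helper `filter_geospatial` and the stand-in `stub_predict` are hypothetical; the stub exists only so the snippet runs without downloading the model.

```python
def filter_geospatial(queries, predict):
    """Keep the queries that `predict` labels 1 (geospatial)."""
    labels = predict(queries)
    # int() also handles tensor or NumPy scalar labels.
    return [q for q, label in zip(queries, labels) if int(label) == 1]


def stub_predict(queries):
    """Stand-in predictor for illustration; replace with the loaded SetFit model."""
    return [1 if ("hospital" in q or "flood" in q) else 0 for q in queries]


queries = [
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
]
kept = filter_geospatial(queries, stub_predict)
print(kept)
```

With the real model, the call would be `filter_geospatial(queries, model)`.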

Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3, and were then manually verified. For evaluation, the SetFit model was trained for 3 epochs with batch size 64 and learning rate 2e-5 on 200 samples (95 positive, 105 negative). It was then retrained on the full gold dataset (1,200 samples) for production inference.
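The card does not say how the five Llama 3.1 runs per query were combined into one weak label. A majority vote is one plausible option, sketched below as an assumption. The function name `aggregate_votes` and the example votes are hypothetical.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (majority_label, agreement) for an odd number of 0/1 votes."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


# Five hypothetical runs for one query (1 = geospatial, 0 = non-geospatial).
runs = [1, 1, 0, 1, 1]
label, agreement = aggregate_votes(runs)
print(label, agreement)
```

An agreement score like this could also help prioritise manual verification, for example by reviewing low-agreement queries first.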

Model size: 33.4M params (Safetensors, tensor type F32)