Three Stanford graduate students developed a project called Predicting Image Geolocations (PIGEON), an artificial intelligence (AI) system capable of accurately geolocating photos, including ones it has never seen before. Originally designed to identify locations on Google Street View, PIGEON can now place a Street View image taken anywhere on Earth with high accuracy.
Global image geolocation remains a difficult problem because of the diversity of images originating from all over the world. Although vision transformer-based approaches have significantly improved geolocation accuracy, the success of previous literature is limited to narrow distributions of landmark images, and performance has not generalized to unseen places.
Although PIGEON may have useful applications, such as identifying locations in old photos or supporting biodiversity surveys, it also raises privacy concerns. Experts fear that this capability could be used for government surveillance, corporate tracking or harassment. Despite its potential benefits, PIGEON's effectiveness raises questions about privacy and how the technology will be used in the future.
Prediction pipeline and main contributions from PIGEON
Researchers at Stanford University have introduced a new geolocation system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function. Their work is also the first to search over clusters of locations to refine its guesses. The first model, PIGEON, was trained on data from the game GeoGuessr and places more than 40% of its guesses within 25 kilometers of the target on a global scale. The researchers also developed a bot and had PIGEON play a blind experiment against human players, in which it ranked in the top 0.01% of players.
In a series of six matches watched by millions of viewers, PIGEON took on one of GeoGuessr's top professional players and won every game. The second model, PIGEOTTO, differs in that it was trained on a dataset of images from Flickr and Wikipedia. It outperformed the previous state-of-the-art (SOTA) model by up to 7.7 percentage points in city-level accuracy and up to 38.8 percentage points at the country level, and performed strongly on a range of image geolocation benchmarks. The results suggest that PIGEOTTO is the first image geolocation model that generalizes effectively to unseen locations, paving the way for high-accuracy image geolocation systems on a global scale.
Specifications of geocells around Paris, France
Administrative boundaries and training data are hierarchically ranked, clustered, and divided into semantic geographic cells (geocells) using Voronoi tessellation. The geocell labels are then turned into continuous labels through haversine smoothing. A pretrained CLIP model and the OPTICS clustering algorithm are used to generate representations of location clusters.
During inference, an embedding of the input image is computed and first passed to a linear layer that produces geocell predictions, identifying the top geocell candidates. The embedding is then fed into a refinement process that improves predictions within and across geocells by minimizing the L2 distance between the inference image embedding and the location cluster representations in the candidate geocells. Finally, the prediction is refined within the best-matching cluster to produce the output geographic coordinates.
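The inference steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the toy weights, and the layout of `clusters` (geocell index mapped to a list of cluster representations with their coordinates) are all hypothetical, and the final refinement inside the chosen cluster is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_location(embedding, weights, bias, clusters, k=2):
    """Return the (lon, lat) of the closest location cluster among the top-k geocells.

    weights/bias: parameters of a linear layer with one row per geocell (assumed).
    clusters: dict mapping geocell index -> list of (representation, (lon, lat)).
    """
    # 1. Linear layer followed by softmax gives one probability per geocell.
    logits = [sum(w * e for w, e in zip(row, embedding)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)

    # 2. Keep only the k most probable geocells as candidates.
    top_k = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]

    # 3. Among the candidates, pick the cluster whose representation has the
    #    smallest L2 distance to the image embedding.
    candidates = [(math.dist(embedding, rep), coords)
                  for cell in top_k
                  for rep, coords in clusters[cell]]
    return min(candidates, key=lambda t: t[0])[1]


# Toy example: three geocells, two-dimensional embeddings.
weights = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
bias = [0.0, 0.0, 0.0]
clusters = {
    0: [([0.0, 1.0], (10.0, 20.0))],
    1: [([0.9, 0.1], (30.0, 40.0))],
    2: [([1.0, 0.0], (50.0, 60.0))],
}
guess = predict_location([1.0, 0.0], weights, bias, clusters, k=2)
```

In this toy setup the cluster in geocell 2 matches the embedding exactly, but it is ignored when `k=2` because geocell 2 is not among the top candidates. This shows how the classification step restricts the retrieval step.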
Definition of the image geolocation problem
Image geolocation is the problem of mapping an image to the coordinates where it was captured. The complexity of the problem lies not only in its global scope, but also in the difficulty of determining an accurate location despite variations in time of day, weather, season, lighting, climate, traffic, viewing angle and other factors.
The first modern attempt at global image geolocation was IM2GPS (Hays & Efros, 2008), a retrieval approach based on hand-crafted features. However, relying on nearest-neighbor search (Zamir & Shah, 2014) over hand-crafted visual features (Crandall et al., 2009) requires a very large database of reference images, which makes precise geolocation on a global scale virtually impossible. Subsequent work therefore narrowed its scope and focused on specific cities such as Orlando and Pittsburgh (Zamir & Shah, 2010) or San Francisco (Berton et al., 2022).
Some have chosen to target specific countries such as the United States or even more specific geographical features such as mountain ranges, deserts and beaches for security and privacy reasons.
Hierarchical image geolocation with distance-based label smoothing
Discretizing the image geolocation problem creates a trade-off between the granularity of the geocells and the accuracy of the predictions. Although finer geocells allow for more precise predictions, they make classification harder because of the higher cardinality. Previous literature has addressed this challenge by producing separate geolocation predictions at multiple levels of geographic granularity and refining the estimate at each subsequent level. Other work also presents architectures that share some model parameters between different hierarchy levels, thereby improving geolocation performance.
However, all of these previous approaches share a common limitation: the models operate in isolation and ignore which geocells are adjacent to each other. The PIGEON approach overcomes this limitation by sharing all parameters across multiple implicit levels of geographic hierarchy. This is achieved with a loss function that ties neighboring geocells together by smoothing the label according to the haversine distance, which measures the great-circle distance between two points on the Earth's surface. For two points p_1 = (λ_1, ϕ_1) and p_2 = (λ_2, ϕ_2) with longitude λ and latitude ϕ, the haversine distance Hav(p_1, p_2) in kilometers is calculated as follows, where r ≈ 6371 km is the Earth's mean radius:

$$\mathrm{Hav}(p_1, p_2) = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$
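As a rough illustration, the haversine distance can be computed in a few lines of Python. This is a minimal sketch of the standard haversine formula; the function name, the (longitude, latitude) tuple convention in degrees, and the mean Earth radius of 6371 km are assumptions made for the example, not details taken from the authors' code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(p1, p2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    # Haversine term: combines the latitude and longitude differences.
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example: approximate distance between Paris and London (lon, lat in degrees).
paris = (2.3522, 48.8566)
london = (-0.1276, 51.5072)
distance = haversine_km(paris, london)
```

Two points a quarter of the way around the equator, such as (0, 0) and (90, 0), come out at roughly 10,007.5 km with this radius, which is a convenient sanity check.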
We then smooth the original one-hot classification label of the geocell using this distance metric, according to the following equation for a given sample n and geocell i:

$$y_{n,i} = \exp\left(-\frac{\mathrm{Hav}(g_i, x_n) - \mathrm{Hav}(g_n, x_n)}{\tau}\right)$$
Here, g_i denotes the centroid coordinates of the polygon of geocell i, g_n the centroid coordinates of the true geocell, x_n the true coordinates of the sample for which the label is computed, and τ a temperature parameter, set to 75 for PIGEON and 65 for PIGEOTTO in our experiments. Haversine smoothing differs from classic label smoothing because the labels are not decayed by a constant factor but depend on both the distance to the correct geocell and the true location.
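A minimal Python sketch of this smoothing follows. It assumes the label takes the form y_{n,i} = exp(-(Hav(g_i, x_n) - Hav(g_n, x_n)) / τ), which is consistent with the symbol definitions above. The function names and the toy centroids are illustrative assumptions rather than the authors' code.

```python
import math


def haversine_km(p1, p2, radius_km=6371.0):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def smoothed_labels(centroids, true_idx, x_true, tau=75.0):
    """Haversine-smoothed target for every geocell (assumed formula, see lead-in).

    centroids: list of (lon, lat) geocell centroids g_i.
    true_idx:  index of the geocell that actually contains the sample.
    x_true:    true (lon, lat) of the sample, x_n.
    tau:       temperature; larger values spread the label over more geocells.
    """
    d_true = haversine_km(centroids[true_idx], x_true)
    return [math.exp(-(haversine_km(g, x_true) - d_true) / tau)
            for g in centroids]


# Toy example: three geocell centroids along the equator; the sample sits
# close to the first one.
centroids = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
labels = smoothed_labels(centroids, true_idx=0, x_true=(0.1, 0.0), tau=75.0)
```

The true geocell always receives a label of exactly 1, nearby geocells receive a sizeable fraction of it, and distant geocells decay toward zero. Under this formula a neighboring geocell can even exceed 1 when the sample lies closer to that neighbor's centroid than to its own.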
Because multiple geocells have a target y_{n,i} significantly larger than zero for each training example, our model simultaneously learns to predict the correct geocell and coarser levels of geographic granularity. Based on haversine smoothing, we design the following loss function for a given training sample n:

$$\mathcal{L}_n = -\sum_{i} \log(p_{n,i}) \cdot \exp\left(-\frac{\mathrm{Hav}(g_i, x_n) - \mathrm{Hav}(g_n, x_n)}{\tau}\right)$$
Here, p_{n,i} is the probability that our model assigns to geocell i for sample n. An additional advantage of using the loss defined by the previous equation is its positive impact on generalization, since the implied hierarchy varies with each training sample. Additionally, when a sample lies near the boundary between two geocells, this is reflected in approximately equal target labels for both geocells.
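The loss can be sketched as a cross-entropy against the smoothed targets. The snippet below is a hedged illustration that assumes the loss has the form L_n = -Σ_i log(p_{n,i}) · y_{n,i}, with p obtained from a softmax over per-geocell logits. The function name and the toy numbers are hypothetical.

```python
import math


def haversine_smoothed_loss(logits, labels):
    """Cross-entropy between softmax(logits) and continuous target labels.

    logits: raw model scores, one per geocell.
    labels: haversine-smoothed targets y_{n,i}, one per geocell.
    """
    # Log-softmax computed with the log-sum-exp trick for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]
    return -sum(lp * y for lp, y in zip(log_probs, labels))


# Toy example: the true geocell has label 1, a neighbor 0.3, a far cell ~0.
labels = [1.0, 0.3, 0.0]
uninformed = haversine_smoothed_loss([0.0, 0.0, 0.0], labels)
confident = haversine_smoothed_loss([4.0, 2.0, -3.0], labels)
```

With one-hot labels this reduces to the usual cross-entropy. With smoothed labels, probability mass placed on a neighboring geocell is penalized less than mass placed on a distant one.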
This is particularly useful for large geocells containing up to ten rural cells. Furthermore, since each target label y_{n,i} is now continuous and the difficulty of the classification problem can be freely adjusted through τ, any number of geocells can be used, provided that these geocells remain meaningful from a contextual point of view and contain a minimum number of samples.
Finally, note that our classification loss now depends directly on the distance from the actual location x_n of a given sample, thus circumventing the difficulties associated with regression in previous literature.
Effects of applying haversine smoothing to neighboring geocells for a location in Accra, Ghana
The three Stanford students presented a new multi-task approach to global image geolocation that achieves state-of-the-art performance while remaining robust to distribution shifts. To validate the approach, they trained and evaluated two different image geolocation models. First, they collected global Street View data to train PIGEON, a multi-task model that ranks among the top 0.01% of human players in the game GeoGuessr.
On a dataset of 5,000 Street View locations, PIGEON places 40.4% of its guesses within 25 kilometers of the target. The researchers then assembled a global dataset of over 4 million images from Flickr and Wikipedia to train the more general PIGEOTTO model, which substantially improves results on a broad range of geolocation benchmark datasets.
Looking forward, the question remains whether image geolocation technologies will achieve a truly global reach or remain focused on specific distributions. Regardless, the findings on the importance of semantic geocell creation, multimodal contrastive pretraining, and precise refinement within geocells, among others, point to critical building blocks for such systems.
However, as image geolocation technology continues to be deployed, its potential benefits must be weighed against its potential risks to ensure the responsible development of future computer vision systems.
The researchers' conclusion offers a positive assessment of their own work on geolocating images on a global scale. However, some points deserve critical consideration.
Critical analysis of Stanford students’ approach to image localization
First, the claim that their approach achieves state-of-the-art performance and robustness to distribution shifts calls for more detail and concrete supporting evidence. Comparisons with existing approaches, or an explanation of how the model specifically handles distribution shift, would be beneficial.
As for the specific results of the PIGEON and PIGEOTTO models, although the share of guesses placed within 25 kilometers of the target is mentioned, further evaluation with precision and recall metrics, together with comparisons against other existing models, could strengthen the credibility of these results.
Furthermore, the question raised as to whether image geolocation technologies will be global or focused on specific distributions is relevant but requires further investigation. Considerations about the ethical, social and political implications of such technologies would also be a valuable addition to the discussion.
Although the conclusion highlights critical foundational elements for image geolocation systems, a more nuanced approach, more detailed comparative data, and a deeper examination of the future impact of this technology could strengthen the quality and robustness of Stanford students’ work.
Source: Stanford University graduates
What about you?
Are the conclusions of Stanford University’s work relevant?
What is your opinion on this topic?