WiFi-based fingerprinting is promising for practical indoor localization with smartphones because this technique provides absolute estimates of the current position, while the WiFi infrastructure is ubiquitous in the majority of indoor environments. However, the application of WiFi fingerprinting for positioning requires pre-surveyed signal maps and is getting more restricted in the recent generation of smartphones due to changes in security policies. Therefore, we sought new sources of information that can be fused into the existing indoor positioning framework, helping users to pinpoint their position, even with a relatively low-quality, sparse WiFi signal map. In this paper, we demonstrate that such information can be derived from the recognition of camera images. We present a way of transforming qualitative information of image similarity into quantitative constraints that are then fused into the graph-based optimization framework for positioning together with typical pedestrian dead reckoning (PDR) and WiFi fingerprinting constraints. Performance of the improved indoor positioning system is evaluated on different user trajectories logged inside an office building at our University campus. The results demonstrate that introducing additional sensing modality into the positioning system makes it possible to increase accuracy and simultaneously reduce the dependence on the quality of the pre-surveyed WiFi map and the WiFi measurements at run-time.