Extracting building information from remote sensing data has important applications in many domains, such as urban planning, cadastral mapping, geographic information censuses, and land-cover change analysis. In recent years, deep learning algorithms with strong feature construction ability have been widely used for automatic building extraction. However, most methods based on semantic segmentation networks cannot obtain object-level building information, and some instance segmentation networks rely on predefined detectors that perform poorly on buildings with complex shapes and multiple scales. In addition, the advantages of multi-modal remote sensing data have not been effectively exploited to improve model performance when training samples are limited. To address these problems, we propose a CNN framework with an adaptive center point detector for the object-level extraction of buildings. The framework combines object detection and semantic segmentation and takes multi-modal data, including high-resolution aerial images and LiDAR data, as inputs. We also develop novel modules to optimize and fuse multi-modal features. Specifically, the local spatial–spectral perceptron allows semantic information and spatial features to compensate for each other, and the cross-level global context module enhances long-range feature dependence. The adaptive center point detector explicitly models deformable convolution to improve detection accuracy, especially for buildings with complex shapes. Furthermore, we construct a building instance segmentation dataset from multi-modal data for model training and evaluation. Quantitative analysis and visualized results show that the proposed network improves the accuracy and efficiency of building instance segmentation.
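
To illustrate how a center point detector can produce object-level output, the following minimal sketch assumes (this is not stated in the text above) that the detector outputs a per-pixel center heatmap, from which candidate building centers are read off as local maxima. The function name `extract_center_points`, the 3×3 neighbourhood, and the score threshold are hypothetical choices for illustration only. The sketch does not reproduce the deformable convolution or any other part of the proposed network.

```python
from typing import List, Sequence, Tuple


def extract_center_points(
    heatmap: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[Tuple[int, int, float]]:
    """Return (row, col, score) for heatmap cells that are strict local maxima.

    A cell is kept when its score is at least `threshold` and strictly
    greater than every neighbour in its 3x3 window (tied plateaus are
    therefore discarded). Results are sorted by descending score.
    """
    rows = len(heatmap)
    cols = len(heatmap[0]) if rows else 0
    peaks: List[Tuple[int, int, float]] = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the neighbours inside the image bounds.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score > n for n in neighbours):
                peaks.append((r, c, score))
    peaks.sort(key=lambda p: p[2], reverse=True)
    return peaks


# Toy 4x4 heatmap with two well-separated candidate centers.
toy_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.3, 0.2],
]
centers = extract_center_points(toy_heatmap, threshold=0.5)
```

On the toy heatmap, the cells at (1, 1) and (2, 3) are returned as candidate centers. In an instance segmentation pipeline of this kind, each such center would then be associated with one building mask.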