Extracting buildings from high-resolution remote sensing images is essential for many geospatial applications, such as building change detection, urban planning, and disaster emergency assessment. Owing to the diversity of building geometries and the blurred boundaries between adjacent buildings, accurately generating building footprints from the complex scenes of remote sensing images remains a challenging task. The rapid development of convolutional neural networks presents both new opportunities and challenges for building extraction from high-resolution remote sensing images. To capture multilevel contextual information, most deep learning methods extract buildings by integrating multilevel features; however, the differential responses between these features are often ignored, leading to blurred contours in the extraction results. In this study, we propose an end-to-end multitask building extraction method to address these issues: it exploits the rich contextual features of remote sensing images to assist building segmentation while preserving the shape of the extraction results. By combining boundary classification with boundary distance regression, the method generates clear contour maps and distance transformation maps that further improve the accuracy of building extraction. Multiple refinement modules then refine each part of the network to minimize the loss of image feature information. Experimental comparisons on the SpaceNet and Massachusetts building datasets show that the proposed method outperforms other deep learning methods in terms of building extraction results.
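As a rough illustration of the multitask objective summarized above, the training loss can be framed as a weighted sum of three terms: mask segmentation, boundary classification, and boundary distance regression. The sketch below is a minimal numpy version under assumed loss choices (binary cross-entropy for the two classification terms, mean squared error for the distance map) and illustrative weights; it is not the paper's exact formulation.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy averaged over all pixels; predictions are
    # clipped away from 0/1 for numerical stability.
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def multitask_loss(seg_pred, seg_gt, bnd_pred, bnd_gt, dist_pred, dist_gt,
                   w_seg=1.0, w_bnd=1.0, w_dist=1.0):
    # Assumed combination: building-mask BCE + boundary-classification BCE
    # + distance-transform-map MSE, with hypothetical weights w_*.
    l_seg = bce(seg_pred, seg_gt)
    l_bnd = bce(bnd_pred, bnd_gt)
    l_dist = float(np.mean((dist_pred - dist_gt) ** 2))
    return w_seg * l_seg + w_bnd * l_bnd + w_dist * l_dist

# Toy 2x2 example: imperfect mask/boundary probabilities vs. ground truth.
seg = np.array([[0.9, 0.1], [0.8, 0.2]])
gt = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = multitask_loss(seg, gt, seg, gt, np.zeros((2, 2)), np.zeros((2, 2)))
```

In this framing, the boundary and distance terms act as shape-aware regularizers on the mask prediction, which is one common way such multitask supervision sharpens building contours.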