Remote sensing products with high temporal and spatial resolution can be hardly obtained under the constrains of existing technology and cost. Therefore, the spatiotemporal fusion of remote sensing images has attracted considerable attention. Spatiotemporal fusion algorithms based on deep learning have gradually developed, but they also face some problems. For example, the amount of data affects the model`s ability to learn, and the robustness of the model is not high. The features extracted through the convolution operation alone are insufficient, and the complex fusion method also introduces noise. To solve these problems, we propose a multi-stream fusion network for remote sensing spatiotemporal fusion based on Transformer and convolution, called MSNet. We introduce the structure of the Transformer, which aims to learn the global temporal correlation of the image. At the same time, we also use a convolutional neural network to establish the relationship between input and output and to extract features. Finally, we adopt the fusion method of average weighting to avoid using complicated methods to introduce noise. To test the robustness of MSNet, we conducted experiments on three datasets and compared them with four representative spatiotemporal fusion algorithms to prove the superiority of MSNet (Spectral Angle Mapper (SAM) < 0.193 on the CIA dataset, erreur relative global adimensionnelle de synthese (ERGAS) < 1.687 on the LGC dataset, and root mean square error (RMSE) < 0.001 on the AHB dataset).