Self-supervised Pre-training for Robust and Generic Spatial-Temporal Representations