To address data imbalance in medical document classification, we propose a probabilistic, dictionary-based data augmentation approach that oversamples the minority class, creating new and highly varied documents by replacing medical entity terms with synonyms selected according to their similarity to the original term.
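As a rough illustration of the idea, the sketch below oversamples a minority class by substituting entity terms with synonyms drawn in proportion to their similarity scores. It is a minimal sketch under stated assumptions, not the exact procedure proposed here: the dictionary `SYNONYMS`, its similarity values, the functions `augment` and `oversample`, the weight of 1.0 given to the original term, and the use of plain substring replacement are all hypothetical choices made for the example.

```python
import random

# Hypothetical synonym dictionary (an assumption for illustration):
# entity term -> list of (synonym, similarity to the original term in (0, 1]).
SYNONYMS = {
    "hypertension": [("high blood pressure", 0.9),
                     ("elevated blood pressure", 0.7)],
    "myocardial infarction": [("heart attack", 0.95),
                              ("cardiac infarction", 0.8)],
}


def augment(doc, synonyms, rng):
    """Return a copy of doc with each known entity term replaced by a
    candidate sampled in proportion to its similarity score."""
    out = doc
    for term, candidates in synonyms.items():
        if term not in out:
            continue
        # Assumption: the original term stays in the candidate pool with weight 1.0.
        options = [term] + [syn for syn, _ in candidates]
        weights = [1.0] + [sim for _, sim in candidates]
        choice = rng.choices(options, weights=weights, k=1)[0]
        # Simplification: plain substring replacement, no entity recognizer.
        out = out.replace(term, choice)
    return out


def oversample(minority_docs, target_size, synonyms, seed=0):
    """Grow the minority class to target_size by adding augmented copies."""
    if not minority_docs:
        raise ValueError("minority_docs must not be empty")
    rng = random.Random(seed)
    augmented = list(minority_docs)  # keep the original documents
    while len(augmented) < target_size:
        source = rng.choice(minority_docs)
        augmented.append(augment(source, synonyms, rng))
    return augmented


if __name__ == "__main__":
    docs = ["patient with hypertension and prior myocardial infarction"]
    for d in oversample(docs, target_size=4, synonyms=SYNONYMS):
        print(d)
```

Sampling by similarity, rather than always taking the closest synonym, is what gives the generated documents variety while still favoring terms that are close in meaning to the original entity.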