sac2021

5th-place solution for the Samsung AI Challenge for Scientific Discovery (2021)


Keywords
artificial, intelligence, cheminformatics, molecular, property, prediction, quantum, chemistry, challenge, dacon, molecular-property-prediction, molecule, quantum-chemistry, solution
License
MIT
Install
pip install sac2021==0.0.3

Documentation

Samsung AI Challenge solution

This repository publishes the cleaned-up code of the 5th-place solution for the Samsung AI Challenge for Scientific Discovery, a competition held on DACON in 2021.

1. Overview

In this challenge, participants compete on the performance of machine learning algorithms that estimate the energy gap between S1 and T1 from the 3D structural information of molecules.

2. Approach

Model description

WIP

Training method

Pretraining

  • ์•„๋ž˜์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•˜์—ฌ HOMO ๋ฐ LUMO๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฉ€ํ‹ฐํƒœ์Šคํฌ ์‚ฌ์ „ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • Pretraining์— ์‚ฌ์šฉ๋˜๋Š” molecule sdf ๋ฐ์ดํ„ฐ์˜ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ(pretrain_metadata.csv)๋Š” ์—ฌ๊ธฐ์—์„œ ๋‹ค์šด๋กœ๋“œ ๋ฐ›์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Pretraining์„ ์œ„ํ•ด ํ”„๋กœ์„ธ์‹ฑ ์™„๋ฃŒ๋œ molecule sdf๋“ค์„ ๋ชจ์•„ ๋‘” ๋””๋ ‰ํ† ๋ฆฌ pretrain_sdf๋Š” ์—ฌ๊ธฐ์—์„œ ๋‹ค์šด๋กœ๋“œ ๋ฐ›์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Fine-tuning

  • ์‚ฌ์ „ํ•™์Šต๋œ stem์„ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ์ฒซ 9 epoch์€ pretrained weight๋ฅผ freeze ์‹œํ‚จ ์ƒํƒœ๋กœ ํ•™์Šตํ•˜๊ณ , 10 epoch ๋ถ€ํ„ฐ weight unfreeze ํ›„ ๋ชจ๋“  weight๋ฅผ ์—…๋ฐ์ดํŠธ ์‹œํ‚ต๋‹ˆ๋‹ค.
  • ์ œ๊ณต๋œ ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ S1-T1 gap๊ณผ, S1, T1 ๊ฐ๊ฐ์˜ ๊ฐ’์„ ์˜ˆ์ธกํ•˜๋Š” regression head๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
    • Gap, S1, T1 regression์€ MSE loss๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • Gap์˜ weight๋Š” 1.0์ด๊ณ , S1, T1 regression์˜ weight๋Š” 0.05๋กœ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • Optimizer = AdamW(lr=3e-5)
  • Scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=15, threshold=0.005, threshold_mode='rel')
  • Batch size = 64
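
The loss weighting and freeze schedule listed above can be sketched in a few lines. This is a minimal illustration in plain Python so that it runs without a deep-learning framework; it is not the package's actual training code. The names LOSS_WEIGHTS, FROZEN_EPOCHS, mse, multitask_loss, and stem_is_frozen are hypothetical, and the toy numbers are made up. It also assumes epochs are counted from 1, which is how the list above reads.

```python
# Loss weights and freeze length taken from the fine-tuning list above.
# All names in this sketch are hypothetical, not part of sac2021.
LOSS_WEIGHTS = {"gap": 1.0, "s1": 0.05, "t1": 0.05}
FROZEN_EPOCHS = 9  # epochs 1-9 keep the pretrained stem frozen


def mse(pred, target):
    """Mean squared error over two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(preds, targets, weights=LOSS_WEIGHTS):
    """Weighted sum of the per-task MSE losses (gap, s1, t1)."""
    return sum(w * mse(preds[name], targets[name]) for name, w in weights.items())


def stem_is_frozen(epoch):
    """True while the pretrained stem stays frozen (epoch is 1-indexed)."""
    return epoch <= FROZEN_EPOCHS


# Toy batch of two molecules; the values are invented for illustration.
preds = {"gap": [1.0, 2.0], "s1": [3.0, 4.0], "t1": [2.0, 2.0]}
targets = {"gap": [1.5, 2.0], "s1": [3.0, 5.0], "t1": [1.0, 2.0]}

# Per-task MSE: gap = 0.125, s1 = 0.5, t1 = 0.5
# Weighted sum: 1.0 * 0.125 + 0.05 * 0.5 + 0.05 * 0.5, about 0.175
loss = multitask_loss(preds, targets)
```

With these weights the gap term dominates the objective, while the S1 and T1 terms act as small auxiliary signals.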

3. Installation and usage

The solution code and model are distributed on PyPI and can be installed as follows.

  • Before use, the openbabel package must be installed in your environment. It can be installed with conda install -c conda-forge openbabel.
$ pip install sac2021
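
Before starting a long training run, it can help to confirm that both dependencies are importable. The following is a small sketch of such a check, assuming the Open Babel Python bindings are exposed as the openbabel module; check_environment is a hypothetical helper, not part of sac2021.

```python
import importlib.util


def check_environment():
    """Report whether the openbabel and sac2021 modules can be located."""
    # find_spec returns None for a top-level module that is not installed.
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("openbabel", "sac2021")
    }


status = check_environment()
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```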

Pretraining

  • ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ pretrain_metadata.csv์™€ sdf ํŒŒ์ผ ๋””๋ ‰ํ† ๋ฆฌ pretrain_sdf๋ฅผ ๋‹ค์šด๋กœ๋“œ ํ›„, ์•„๋ž˜ ๋ช…๋ น์˜ --meta์™€ --data ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์ ˆํžˆ ์„ค์ •ํ•˜์—ฌ ์‚ฌ์ „ํ•™์Šต์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
# --fold is for validation purposes (2.5% of the data will be held out).
$ python -m sac2021.pretrain \
    --meta [path/to/pretrain_metadata.csv] \
    --data [path/to/pretrain_sdf] \
    --output [OUTPUT_CHECKPOINT] \
    --model-id [ID] \
    --fold 0 \
    --loss mse

The pretraining logs can be viewed in this Weights & Biases project.

Fine-tuning

  • ํ•™์Šต ๋ฐ์ดํ„ฐ traindev.csv์™€ sdf ํŒŒ์ผ ๋””๋ ‰ํ† ๋ฆฌ traindev_sdf๋ฅผ ๋‹ค์šด๋กœ๋“œ ํ›„, ์•„๋ž˜ ๋ช…๋ น์˜ --meta์™€ --data ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์ ˆํžˆ ์„ค์ •ํ•˜์—ฌ fine-tuning์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
$ python -m sac2021.finetune \
    --meta [path/to/traindev.csv] \
    --data [path/to/traindev_sdf] \
    --ckpt [path/to/pretrained_checkpoint] \
    --output [OUTPUT_CHECKPOINT] \
    --model-id [ID] \
    --fold 0 \
    --loss mse
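
When the two commands are launched from a script, building the argument list in Python avoids shell line-continuation pitfalls. The sketch below uses only the flags shown in the commands above; build_command is a hypothetical helper, and the checkpoint file names and model ID are placeholders. It assumes --ckpt applies only to fine-tuning, since the pretraining command does not show that flag.

```python
import shlex
import sys


def build_command(stage, meta, data, output, model_id, fold=0, loss="mse", ckpt=None):
    """Build the argument list for `python -m sac2021.<stage>`."""
    if stage not in ("pretrain", "finetune"):
        raise ValueError("stage must be 'pretrain' or 'finetune'")
    if stage == "pretrain" and ckpt is not None:
        # The pretraining command above does not show a --ckpt flag.
        raise ValueError("ckpt is only used for fine-tuning")

    cmd = [sys.executable, "-m", f"sac2021.{stage}",
           "--meta", str(meta), "--data", str(data)]
    if ckpt is not None:
        cmd += ["--ckpt", str(ckpt)]
    cmd += ["--output", str(output), "--model-id", str(model_id),
            "--fold", str(fold), "--loss", loss]
    return cmd


# Placeholder paths and ID; replace them with your own.
cmd = build_command(
    "finetune",
    meta="traindev.csv",
    data="traindev_sdf",
    output="finetune.ckpt",
    model_id="demo",
    ckpt="pretrain.ckpt",
)
print(shlex.join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```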