Data Preparation
root@server:~# ls -al /home/data
-rw-rw-r-- 1 data data 175393 8월 30 08:40 test.en
-rw-rw-r-- 1 data data 247383 8월 30 08:40 test.ko
-rw-rw-r-- 1 data data 15706983 8월 30 08:40 train.en
-rw-rw-r-- 1 data data 21556330 8월 30 08:40 train.ko
-rw-rw-r-- 1 data data 184784 8월 30 08:40 valid.en
-rw-rw-r-- 1 data data 252103 8월 30 08:40 valid.ko
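fairseq expects one sentence per line, with the Korean and English files line-aligned. A quick sanity check before preprocessing is to compare line counts; the counts shown below are not from a real run but are inferred from the 163,721 sentence counts that fairseq-preprocess reports further down:
root@server:~# wc -l /home/data/train.ko /home/data/train.en
  163721 /home/data/train.ko
  163721 /home/data/train.en
  327442 total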
Data Preprocessing
Build the source/target dictionaries and convert the text into fairseq's binary format.
root@server:~# fairseq-preprocess \
--source-lang ko \
--target-lang en \
--trainpref /home/data/train \
--validpref /home/data/valid \
--destdir /home/data/bin \
--workers 20
...
2022-08-31 16:44:51 | INFO | fairseq_cli.preprocess | Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='/home/data/bin', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, simul_type=None, source_lang='ko', srcdict=None, suppress_crashes=False, target_lang='en', task='translation', tensorboard_logdir=None, testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='/home/data/train', use_plasma_view=False, user_dir=None, validpref='/home/data/valid', wandb_project=None, workers=20)
2022-08-31 16:44:55 | INFO | fairseq_cli.preprocess | [ko] Dictionary: 50264 types
2022-08-31 16:45:03 | INFO | fairseq_cli.preprocess | [ko] /home/data/train.ko: 163721 sents, 4125915 tokens, 0.0% replaced (by <unk>)
2022-08-31 16:45:03 | INFO | fairseq_cli.preprocess | [ko] Dictionary: 50264 types
2022-08-31 16:45:04 | INFO | fairseq_cli.preprocess | [ko] /home/data/valid.ko: 1958 sents, 48765 tokens, 0.73% replaced (by <unk>)
2022-08-31 16:45:04 | INFO | fairseq_cli.preprocess | [en] Dictionary: 51776 types
2022-08-31 16:45:09 | INFO | fairseq_cli.preprocess | [en] /home/data/train.en: 163721 sents, 3395821 tokens, 0.0% replaced (by <unk>)
2022-08-31 16:45:09 | INFO | fairseq_cli.preprocess | [en] Dictionary: 51776 types
2022-08-31 16:45:09 | INFO | fairseq_cli.preprocess | [en] /home/data/valid.en: 1958 sents, 40396 tokens, 0.871% replaced (by <unk>)
2022-08-31 16:45:09 | INFO | fairseq_cli.preprocess | Wrote preprocessed data to /home/data/bin
root@server:~# ls -al /home/data/bin
drwxrwxr-x 2 data data 4096 8월 31 16:45 ./
drwxrwxr-x 3 data data 4096 8월 31 16:44 ../
-rw-rw-r-- 1 data data 592011 8월 31 16:44 dict.en.txt
-rw-rw-r-- 1 data data 551126 8월 31 16:44 dict.ko.txt
-rw-rw-r-- 1 data data 3544 8월 31 16:45 preprocess.log
-rw-rw-r-- 1 data data 6791642 8월 31 16:45 train.ko-en.en.bin
-rw-rw-r-- 1 data data 1964678 8월 31 16:45 train.ko-en.en.idx
-rw-rw-r-- 1 data data 8251830 8월 31 16:45 train.ko-en.ko.bin
-rw-rw-r-- 1 data data 1964678 8월 31 16:45 train.ko-en.ko.idx
-rw-rw-r-- 1 data data 80792 8월 31 16:45 valid.ko-en.en.bin
-rw-rw-r-- 1 data data 23522 8월 31 16:45 valid.ko-en.en.idx
-rw-rw-r-- 1 data data 97530 8월 31 16:45 valid.ko-en.ko.bin
-rw-rw-r-- 1 data data 23522 8월 31 16:45 valid.ko-en.ko.idx
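The dict.*.txt files are fairseq's plain-text dictionaries: one token and its training-corpus frequency per line, most frequent first (the special symbols <s>, <pad>, </s>, <unk> are added implicitly at load time, not stored in the file). The *.bin/*.idx pairs are the binarized, memory-mapped datasets (dataset_impl='mmap' in the log above). A peek at a dictionary looks like this; the tokens and counts shown are illustrative only, since the actual contents depend on the corpus:
root@server:~# head -n 3 /home/data/bin/dict.en.txt
the 245721
, 198334
. 176902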
Training
root@server:~# fairseq-train \
/home/data/bin \
--arch transformer \
--dropout 0.1 \
--attention-dropout 0.1 \
--activation-dropout 0.1 \
--encoder-embed-dim 256 \
--encoder-ffn-embed-dim 512 \
--encoder-layers 3 \
--encoder-attention-heads 8 \
--encoder-learned-pos \
--decoder-embed-dim 256 \
--decoder-ffn-embed-dim 512 \
--decoder-layers 3 \
--decoder-attention-heads 8 \
--decoder-learned-pos \
--max-epoch 100 \
--optimizer adam \
--lr 0.0001 \
--seed 1 \
--no-progress-bar \
--save-interval 1 \
--save-dir /model/ \
--batch-size 30
...
2022-08-31 08:54:42 | INFO | fairseq.data.iterators | grouped total_num_itrs = 1336
2022-08-31 08:54:42 | INFO | fairseq.trainer | begin training epoch 1
2022-08-31 08:54:42 | INFO | fairseq_cli.train | Start iterating over samples
2022-08-31 08:54:46 | INFO | torch.nn.parallel.distributed | Reducer buckets have been rebuilt in this iteration.
2022-08-31 08:55:16 | INFO | train_inner | epoch 001: 100 / 1336 loss=12.192, ppl=4679.31, wps=5764.3, ups=3.26, wpb=1765.8, bsz=192, num_updates=100, lr=0.0001, gnorm=1.982, train_wall=31, gb_free=10.8, wall=34
2022-08-31 08:55:46 | INFO | train_inner | epoch 001: 200 / 1336 loss=9.397, ppl=674.37, wps=5845.4, ups=3.32, wpb=1762.1, bsz=192, num_updates=200, lr=0.0001, gnorm=1.46, train_wall=30, gb_free=10.8, wall=65
2022-08-31 08:56:16 | INFO | train_inner | epoch 001: 300 / 1336 loss=8.55, ppl=374.72, wps=6351.1, ups=3.32, wpb=1910.3, bsz=191.8, num_updates=300, lr=0.0001, gnorm=1.178, train_wall=30, gb_free=10.8, wall=95
2022-08-31 08:56:46 | INFO | train_inner | epoch 001: 400 / 1336 loss=8.186, ppl=291.2, wps=6069.1, ups=3.33, wpb=1820.6, bsz=192, num_updates=400, lr=0.0001, gnorm=1.098, train_wall=30, gb_free=10.8, wall=125
2022-08-31 08:57:17 | INFO | train_inner | epoch 001: 500 / 1336 loss=7.846, ppl=230.02, wps=6122.7, ups=3.32, wpb=1842.5, bsz=192, num_updates=500, lr=0.0001, gnorm=1.07, train_wall=30, gb_free=10.8, wall=155
2022-08-31 08:57:47 | INFO | train_inner | epoch 001: 600 / 1336 loss=7.645, ppl=200.1, wps=6206.6, ups=3.32, wpb=1870.6, bsz=192, num_updates=600, lr=0.0001, gnorm=1.086, train_wall=30, gb_free=10.8, wall=185
2022-08-31 08:58:17 | INFO | train_inner | epoch 001: 700 / 1336 loss=7.515, ppl=182.91, wps=5773.5, ups=3.33, wpb=1733.7, bsz=192, num_updates=700, lr=0.0001, gnorm=1.066, train_wall=30, gb_free=10.7, wall=215
2022-08-31 08:58:47 | INFO | train_inner | epoch 001: 800 / 1336 loss=7.44, ppl=173.66, wps=6056, ups=3.3, wpb=1837.6, bsz=192, num_updates=800, lr=0.0001, gnorm=1.033, train_wall=30, gb_free=10.8, wall=245
2022-08-31 08:59:17 | INFO | train_inner | epoch 001: 900 / 1336 loss=7.279, ppl=155.34, wps=6038, ups=3.29, wpb=1835.4, bsz=192, num_updates=900, lr=0.0001, gnorm=0.998, train_wall=30, gb_free=10.8, wall=276
2022-08-31 08:59:48 | INFO | train_inner | epoch 001: 1000 / 1336 loss=7.154, ppl=142.42, wps=5871.9, ups=3.29, wpb=1787, bsz=192, num_updates=1000, lr=0.0001, gnorm=1.028, train_wall=30, gb_free=10.8, wall=306
2022-08-31 09:00:18 | INFO | train_inner | epoch 001: 1100 / 1336 loss=7.053, ppl=132.8, wps=6091.3, ups=3.29, wpb=1852.8, bsz=192, num_updates=1100, lr=0.0001, gnorm=1.034, train_wall=30, gb_free=10.8, wall=337
2022-08-31 09:00:49 | INFO | train_inner | epoch 001: 1200 / 1336 loss=7.023, ppl=130.05, wps=5885.2, ups=3.31, wpb=1779.5, bsz=192, num_updates=1200, lr=0.0001, gnorm=1.047, train_wall=30, gb_free=10.7, wall=367
2022-08-31 09:01:19 | INFO | train_inner | epoch 001: 1300 / 1336 loss=6.931, ppl=122.03, wps=5821.9, ups=3.29, wpb=1767.5, bsz=192, num_updates=1300, lr=0.0001, gnorm=1.043, train_wall=30, gb_free=10.8, wall=397
2022-08-31 09:01:30 | INFO | fairseq_cli.train | begin validation on "valid" subset
2022-08-31 09:01:39 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 7.216 | ppl 148.7 | wps 54562.5 | wpb 1128.6 | bsz 191.9 | num_updates 1336
...
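Once training finishes, fairseq leaves checkpoint_best.pt (lowest validation loss) and checkpoint_last.pt under /model/. Because --testpref was not passed to fairseq-preprocess, the binarized directory contains no test split, so the sketch below scores the valid subset; to evaluate on test.ko/test.en, re-run fairseq-preprocess with --testpref /home/data/test first. The beam and batch sizes here are arbitrary example values:
root@server:~# fairseq-generate \
    /home/data/bin \
    --path /model/checkpoint_best.pt \
    --gen-subset valid \
    --beam 5 \
    --batch-size 64
fairseq-interactive takes the same --path and reads source sentences from stdin, which is convenient for spot-checking individual translations.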