'Preprocessing' 태그의 글 목록

Preprocessing

[코드리뷰] Attention is All You Need-Pytorch [1] Preprocess 2022.08.29

[코드리뷰] Attention is All You Need-Pytorch [1] Preprocess

dongdong93 2022. 8. 29. 21:03

2022. 8. 29. 21:03

728x90

Attention is all you need논문의 pytorch Implementation 코드를 리뷰하는 포스팅이다.

https://github.com/jadore801120/attention-is-all-you-need-pytorch

GitHub - jadore801120/attention-is-all-you-need-pytorch: A PyTorch implementation of the Transformer model in "Attention is All

A PyTorch implementation of the Transformer model in "Attention is All You Need". - GitHub - jadore801120/attention-is-all-you-need-pytorch: A PyTorch implementation of the Transformer mo...

github.com

Transformer 코드 리뷰는 세 개의 포스팅으로 나누어 진행을 할 것이고 본 포스팅은 그 첫 번째로 전처리 부분이다.

1. Prepocess

먼저 필요한 라이브러리를 임포트해주고 파라미터를 EasyDict를 통해 선언해주자.

import spacy
import torchtext.data # torchtext의 버전은 0.3.1을 사용
import torchtext.datasets
import dill as pickle
import easydict

opt = easydict.EasyDict({
        "lang_src" : 'de_core_news_sm',# source 언어 / spacy 사용시
        "lang_trg" : 'en_core_web_sm', # target 언어 / spacy 사용시
        "save_data" : True, 
        "data_src" : None, 
        "data_trg" : None,
        "max_len" :100, # 한 문장에 들어가는 최대 token 수 / 해당 갯수 이상이면 버림
        "min_word_count" : 3, # vocab을 만들어 줄때 최소 갯수 : 해당 갯수 미만이면 <unk>로 vocab이 형성됨
        "keep_case": True, # 대소문자 구분 
        "share_vocab" : "store_true",  # target vocab과 source vocab을 하나로 합치는지 여부
    })

1.1 Tokenizers

tokenizers는 문장을 개별 토큰으로 변환해주는 데 사용된다. 본 코드에서는 nlp를 쉽게 처리할 수 있도록 도와주는 패키지인 spacy를 이용하여 토큰화를 한다. 먼저 영어와 독일어 전처리 모듈을 설치하자.

!python -m spacy download en
!python -m spacy download de

모듈을 설치한 후 독일어와 영어를 토큰화 해주자.

src_lang_model = spacy.load('de_core_news_sm') # 독일어 토큰화
trg_lang_model = spacy.load('en_core_web_sm') # 영어 토큰화

토큰화의 예시는 다음과 같다.

# 토큰화 예시
tokenized = trg_lang_model.tokenizer('I am a student.')

for i, token in enumerate(tokenized):
    print(f'index {i}: {token.text}')

index 0: I
index 1: am
index 2: a
index 3: student
index 4: .

독일어와 영어의 토큰화 함수를 정의한다.

# 독일어 문장을 토큰화하는 함수
def tokenize_src(text):
    return [tok.text for tok in src_lang_model.tokenizer(text)]

# 영어 문장을 토큰화하는 함수
def tokenize_trg(text):
    return [tok.text for tok in trg_lang_model.tokenizer(text)]

default 토큰(padding unknown, start, end token)을 정의하고 torchtext의 Field 라이브러리를 사용하여 데이터 처리를 준비한다.

소스(SRC) : 독일어
목표(TRG) : 영어

Field는 데이터타입과 이를 텐서로 변환할 지시사항과 함께 정의하는 것이다. Field는 텐서로 표현될 수 있는 덱스트 데이터 타입을 처리하고, 각 토큰을 숫자 인덱스로 맵핑시켜주는 단어장(vocab) 객체가 있다. 또한 토큰화 하는 함수, 전처리 등을 지정할 수 있다.

PAD_WORD = '<blank>' # padding token
UNK_WORD = '<unk>' # unknown token
BOS_WORD = '<s>' # start token
EOS_WORD = '</s>' # end token

SRC = torchtext.data.Field(
    tokenize=tokenize_src, lower=False,
    pad_token=PAD_WORD, init_token=BOS_WORD, eos_token=EOS_WORD)

TRG = torchtext.data.Field(
    tokenize=tokenize_trg, lower=False,
    pad_token=PAD_WORD, init_token=BOS_WORD, eos_token=EOS_WORD)

이제 영어-독일어 번역 데이터셋인 Multi30k를 불러오자. (약 3만개의 영어, 독일어 문장)

MAX_LEN = opt.max_len # 문장의 최대 허용 token 갯수, token의 갯수보다 문장내 토큰의 갯수가 return False
MIN_FREQ = opt.min_word_count # vocab을 만들어 줄때 최소 갯수 : 해당 갯수 미만이면 <unk>로 vocab이 형성됨

def filter_examples_with_length(x):
    return len(vars(x)['src']) <= MAX_LEN and len(vars(x)['trg']) <= MAX_LEN

train, val, test = torchtext.datasets.Multi30k.splits(
        exts = ('.de', '.en'),
        fields = (SRC, TRG),
        filter_pred = filter_examples_with_length)
        
print(f"학습 데이터셋(training dataset) 크기: {len(train.examples)}개")
print(f"평가 데이터셋(validation dataset) 크기: {len(val.examples)}개")
print(f"테스트 데이터셋(testing dataset) 크기: {len(test.examples)}개")

학습 데이터셋(training dataset) 크기: 29000개
평가 데이터셋(validation dataset) 크기: 1014개
테스트 데이터셋(testing dataset) 크기: 1000개

학습 데이터 중 하나를 선택해 출력해보자.

# 학습 데이터 중 하나를 선택해 출력
print(vars(train.examples[25])['src'])
print(vars(train.examples[25])['trg'])

['Eine', 'Person', 'in', 'einem', 'blauen', 'Mantel', 'steht', 'auf', 'einem', 'belebten', 'Gehweg', 'und', 'betrachtet', 'ein', 'Gemälde', 'einer', 'Straßenszene', '.']
['A', 'person', 'dressed', 'in', 'a', 'blue', 'coat', 'is', 'standing', 'in', 'on', 'a', 'busy', 'sidewalk', ',', 'studying', 'painting', 'of', 'a', 'street', 'scene', '.']

'Eine Person in einem blauen Mantel steht auf einem belebten Gehweg und betrachtet ein Gemälde einer Straßenszene .'
=> 'A person in a blue coat stands on a busy sidewalk and looks at a painting of a street scene'
(파란 코트를 입은 사람이 붐비는 인도에 서서 거리의 풍경을 보고 있다.)

1.2 Build Vocab

field 객체의 build_vocab 메서드를 이용해 영어와 독일어 단어 사전을 생성한다. 여기서는 최소 3번 이상 등장한 단어만을 선택하는데 이때, 2번 이하로 나오는 단어는 '<unk>'token으로 변환된다.

SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TRG.build_vocab(train.trg, min_freq=MIN_FREQ)

print(f'len(SRC) : {len(SRC.vocab)}')
print(f'len(TRG) : {len(TRG.vocab)}')

len(SRC) : 5497
len(TRG) : 4727

영어 단어의 토큰화 예시를 살펴보자.

stoi : 토큰 문자열로 매핑하는 문자열 식별자.
itos : 숫자 식별자로 인덱싱 된 토큰 문자열 목록

print(TRG.vocab.stoi['abcabc']) # 없는 단어 : 0
print(TRG.vocab.stoi[TRG.pad_token]) # 패딩 : 1
print(TRG.vocab.stoi['<s>']) # <s> : 2
print(TRG.vocab.stoi['</s>']) # </s> : 3
print(TRG.vocab.stoi['python']) # 2번 이하 단어
print(TRG.vocab.stoi["world"])

본 코드에서는 share_vocab함수를 사용한다. 이 함수는 여러 개의 tokenizer를 사용할 때만 사용하는데 여러 개일 경우 같은 단어가 복수의 토큰 값을 가지게 되기 때문에 하나로 합치는 과정이 필요하기 때문에 사용한다. 본 논문은 독어와 영어에 중복되는 단어가 있기 때문에 사용한 것 같다.

for w, _ in SRC.vocab.stoi.items():
    # TODO: Also update the `freq`, although it is not likely to be used.
    if w not in TRG.vocab.stoi:
        TRG.vocab.stoi[w] = len(TRG.vocab.stoi)
TRG.vocab.itos = [None] * len(TRG.vocab.stoi)
for w, i in TRG.vocab.stoi.items():
    TRG.vocab.itos[i] = w
SRC.vocab.stoi = TRG.vocab.stoi
SRC.vocab.itos = TRG.vocab.itos
print('[Info] Get merged vocabulary size:', len(TRG.vocab))

[Info] Get merged vocabulary size: 10079

1.3 Data save

위의 과정을 통해 생성한 option과 vocab(src, trg), train, valid, test data를 저장해야 한다.

data = {
    'settings': opt,
    'vocab': {'src': SRC, 'trg': TRG},
    'train': train.examples,
    'valid': val.examples,
    'test': test.examples}

print('[Info] Dumping the processed data to pickle file', opt.save_data)
with open("m30k_deen_shr.pkl","wb") as f:
    pickle.dump(data,f)

[Info] Dumping the processed data to pickle file True

다음 포스팅에서는 Attention is all you need의 메인이 되는 Transformer 아키택쳐에 대해 리뷰해보겠다.

728x90

저작자표시

'딥러닝 > 코드리뷰' 카테고리의 다른 글

[코드리뷰] Attention is All You Need-Pytorch [3] 모델 학습 및 번역 (0)	2022.12.16
[코드리뷰] Attention is All You Need-Pytorch [2] Model Architecture (0)	2022.09.29

PREV 이전 1 NEXT 다음

dongdong's devlog