
Image Segmentation: Fully Convolutional Network (FCN) Inference


Image segmentation is the task of determining, for every pixel in an image, which class that pixel belongs to.

Several techniques exist, but as in the image classification example, we will use the FCN model that PyTorch provides out of the box.



Inference [Image]

As before, I will only go over the important parts of the big picture.

label_map = [
               (0, 0, 0),  # background
               (128, 0, 0), # aeroplane
               (0, 128, 0), # bicycle
               (128, 128, 0), # bird
               (0, 0, 128), # boat
               (128, 0, 128), # bottle
               (0, 128, 128), # bus
               (128, 128, 128), # car
               (64, 0, 0), # cat
               (192, 0, 0), # chair
               (64, 128, 0), # cow
               (192, 128, 0), # dining table
               (64, 0, 128), # dog
               (192, 0, 128), # horse
               (64, 128, 128), # motorbike
               (192, 128, 128), # person
               (0, 64, 0), # potted plant
               (128, 64, 0), # sheep
               (0, 192, 0), # sofa
               (128, 192, 0), # train
               (0, 64, 128) # tv/monitor
]

FCN์—์„œ ์‚ฌ์šฉํ•œ ๋ฐ์ดํ„ฐ์…‹์€ PASCAL-VOC ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค.

PASCAL-VOC ๋ฐ์ดํ„ฐ์…‹์€ ์ด๋ฏธ์ง€์—์„œ ํ”ฝ์…€๋ณ„ ์ด 21๊ฐœ ํด๋ž˜์Šค๋ฅผ ๊ตฌ๋ถ„ํ•ด๋‹ˆ๋‹ค.

utils.py์— ์žˆ๋Š” label_map์— ๊ฐ ํด๋ž˜์Šค ์ •๋ณด์™€, ์ถ”๋ก  ์ดํ›„ ๊ฐ€์‹œํ™”๋ฅผ ์œ„ํ•œ ์ƒ‰์ƒ ์ •๋ณด๊ฐ€ ๋‹ด๊ฒจ์žˆ์Šต๋‹ˆ๋‹ค.


from torchvision import transforms

def get_segment_labels(image, model, DEVICE):
    # ImageNet mean/std normalization, same as in the classification examples
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
    # add a batch dimension and move to the target device before inference
    image = transform(image).to(DEVICE).unsqueeze(0)
    outputs = model(image)
    return outputs

The networks used for image classification were trained on the ImageNet dataset, whereas, as mentioned, this FCN was trained on PASCAL-VOC.

Even so, the transform is used in exactly the same form.

Inference proceeds exactly as in image classification. The difference is the output: classification produces a single vector of class scores,

while segmentation produces one such vector per pixel (width x height x number of classes).
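To make that shape concrete, here is a small sketch using simulated random scores in place of the real model output (the 21 classes match PASCAL-VOC; the 640 x 360 resolution is just illustrative):

```python
import numpy as np

# simulated stand-in for the model's 'out' tensor: one 21-dim score
# vector per pixel, laid out as [batch, classes, height, width]
num_classes, H, W = 21, 360, 640
out = np.random.rand(1, num_classes, H, W)

# the per-pixel class is the argmax over the class dimension
labels = out.squeeze(0).argmax(axis=0)   # shape (360, 640), values in 0..20
```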


model = torchvision.models.segmentation.fcn_resnet50(pretrained=True)
model.to(DEVICE)
model.eval()

image = Image.open(args.input)

outputs = utils.get_segment_labels(image, model, DEVICE)
outputs = outputs['out'] # the model returns a dict: 'out' holds the per-class score map, 'aux' the auxiliary classifier's output.

overlay, segmented = utils.visualize(image, outputs)

ํŒŒ์ดํ† ์น˜์˜ pretrained ๋ชจ๋ธ, fcn_resnet50์„ ์‚ฌ์šฉํ• ํ…๋ฐ, ํ™•์ธํ•ด๋ณด๋‹ˆ ์ถ”๋ก  ๊ฒฐ๊ณผ๋กœ ํด๋ž˜์Šค์— ๋Œ€ํ•œ ํ™•๋ฅ ์„ ๋‹ด์€ ๋ฒกํ„ฐ out๊ณผ loss๊ฐ€ ๋‹ด๊ธด aux๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋„ค์š”.

๊ฐ€์‹œํ™”๋ฅผ ์œ„ํ•œ out๋งŒ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค.


import cv2
import numpy as np
import torch

def visualize(input, outputs):
    # per-pixel class index: argmax over the class dimension, moved to CPU as numpy
    labels = torch.argmax(outputs.squeeze(), dim=0).detach().cpu().numpy()
    # empty image with the same height/width as the label map
    segmented_image = np.zeros([len(labels), len(labels[0]), 3]).astype(np.uint8)

    # paint every pixel with the color assigned to its class
    for label_num in range(0, len(label_map)):
        index = labels == label_num
        segmented_image[index] = np.array(label_map)[label_num]

    # OpenCV works in BGR, so convert both images before blending
    image = np.array(input)
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    segmented_image = cv2.cvtColor(segmented_image, cv2.COLOR_RGB2BGR)
    # blend the segmentation map onto the original image (50/50 overlay)
    cv2.addWeighted(segmented_image, 0.5, image, 0.5, 0, image)
    return image, segmented_image

Visualization is one of segmentation's hurdles too. We paint every pixel with the color assigned to its class.

The function takes the input image input and the per-class score map outputs.

labels = torch.argmax(outputs.squeeze(), dim=0).detach().cpu().numpy() picks the most likely class for each pixel with argmax, detaches the result from the graph, moves it from the GPU to the CPU, and converts it to a numpy array.

Then segmented_image = np.zeros([len(labels), len(labels[0]), 3]).astype(np.uint8) creates an empty image of the same size as the output, and
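The same chain can be tried on a toy tensor, with random scores standing in for the model output:

```python
import torch

# toy stand-in for the squeezed model output: [classes, height, width]
outputs = torch.rand(21, 4, 5)

# class index per pixel, detached from autograd, moved to CPU, as numpy
labels = torch.argmax(outputs, dim=0).detach().cpu().numpy()
# labels is now a (4, 5) integer array usable with numpy boolean indexing
```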

for label_num in range(0, len(label_map)):
    index = labels == label_num
    segmented_image[index] = np.array(label_map)[label_num]

์—์„œ ๋ผ๋ฒจ์— ํ•ด๋‹นํ•˜๋Š” ์ƒ‰์ƒ ์ •๋ณด๋ฅผ ํ”ฝ์…€๋ณ„๋กœ ์ž…ํ˜€์ค๋‹ˆ๋‹ค.

[Images: input image, segmentation result, overlay visualization]



Inference [Video]

[Demo video]

์˜์ƒ์— ๋Œ€ํ•œ ์ถ”๋ก ๋„ ์‚ฌ์‹ค์ƒ ๋‹ค๋ฅผ ๊ฒŒ ์—†์Šต๋‹ˆ๋‹ค. ์ด์ •๋„ ์˜ˆ์ œ๋ฅผ ๊ตฌํ˜„ํ•˜์‹œ๋Š” ๋ถ„์ด๋ผ๋ฉด opencv์˜ ์˜์ƒ์— ๋Œ€ํ•ด์„  ์ œ๊ฐ€ ๋‹ค๋ฃฐ ์ด์œ ๊ฐ€ ์—†์„ ๊ฒƒ ๊ฐ™์•„ ์„ค๋ช…์€ ์ƒ๋žตํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

๋‹ค๋งŒ ํ•œ ๊ฐ€์ง€ ์งš๊ณ  ๋„˜์–ด๊ฐ„๋‹ค๋ฉด,

frame = cv2.resize(frame, (640, 360))

์œผ๋กœ ์˜์ƒ์˜ ํฌ๊ธฐ๋ฅผ ์ค„์—ฌ ๋„คํŠธ์›Œํฌ์˜ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•˜์˜€๋‹ค๋Š” ์  ์ด๊ฒ ๋„ค์š”. ์›๋ณธ ์˜์ƒ ๊ทธ๋Œ€๋กœ (FHD, 1920 x 1080) ์‚ฌ์šฉํ•˜์—ฌ๋„ ๋˜๋‚˜, ์ถ”๋ก  ์†๋„๊ฐ€ ๋งค์šฐ ๋А๋ ค์ง€๊ธฐ์— ์‚ฌ์ด์ฆˆ๋ฅผ ์กฐ์ ˆํ•ด์„œ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Also, if you have studied this properly the following is obvious, but you might wonder: the example images were [1280 x 853], so is it really fine to feed the network [1920 x 1080] or [640 x 360] video frames?

It is, and that is thanks to a defining property of the Fully Convolutional Network.

Unlike earlier classification networks, which used fixed-size Fully Connected Layers (FCL) for classification, FCN replaces every layer with a convolution, making training and inference independent of the input image size.
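The contrast can be sketched in plain numpy: a convolution kernel has a fixed number of weights regardless of input size, while a fully connected layer's weight matrix hard-codes the flattened input length (the sizes below are arbitrary, and the naive loop is only for illustration):

```python
import numpy as np

def conv2d_valid(x, k):
    # naive 'valid' 2-D cross-correlation; the 3x3 kernel works for any input size
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

k = np.random.rand(3, 3)                          # 9 weights, independent of input size
small = conv2d_valid(np.random.rand(10, 12), k)   # output (8, 10)
large = conv2d_valid(np.random.rand(36, 64), k)   # output (34, 62)

# a fully connected layer, by contrast, fixes the flattened input length:
W_fc = np.random.rand(5, 10 * 12)                 # only accepts flattened 10x12 inputs
```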

Also note that FCN performs semantic segmentation, so it only distinguishes which class a pixel belongs to (e.g., person).

Instance segmentation additionally distinguishes which object instance a pixel belongs to, so it requires different outputs and visualization than semantic segmentation (e.g., person 1, person 2, ...).



For reference, the video used in this example is the Shop Facade sequence from the Cambridge Landmarks dataset. [Link]

I don't work with image segmentation often and didn't have a dataset well suited to visualization at hand, so I grabbed a plausible-looking video!

Since it is a dataset for camera localization (visual localization), it is naturally not suitable for evaluating an FCN trained on PASCAL-VOC, so treat it as a reference only!