Introduction

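I'm Fukushima from the Service Innovation Department at NTT DOCOMO.

You want to display an image but have no viewer. You really need to check an image on the console. To cope with situations like these, let's build a program that represents an image with Unicode emoji, so that it can be viewed even in Notepad.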

Usage example: an emoji quiz

Which image was converted into these emoji? It is a well-known image. The answer is at the very bottom of this article.

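Approach

It looks like this can be done in the following two steps.

1. Run object detection on the image to find out what appears in it and where
2. For each detected object, place an emoji at the position corresponding to its coordinates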
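
The two steps above can be sketched without a detector by starting from hand-written detections. This is only a minimal sketch under stated assumptions: the `detections` list, the `EMOJI` table, and the function name `render` are hypothetical and not part of the actual program. Each box is given as (class name, xmin, ymin, xmax, ymax) in normalized 0-1 coordinates, and the center of the box selects a cell in a small grid.

```python
import math

# Hypothetical emoji table and grid size, for illustration only.
EMOJI = {"person": "🧑", "dog": "🐕"}
GRID_W, GRID_H = 6, 5


def render(detections, grid_w=GRID_W, grid_h=GRID_H):
    """Place one emoji per detection at the grid cell containing the box center."""
    grid = [[None] * grid_w for _ in range(grid_h)]
    for name, xmin, ymin, xmax, ymax in detections:
        if name not in EMOJI:
            continue  # no emoji is available for this class
        cx = (xmin + xmax) / 2
        cy = (ymin + ymax) / 2
        # Clamp so that a center on or beyond the image edge stays inside the grid.
        col = min(max(math.floor(cx * grid_w), 0), grid_w - 1)
        row = min(max(math.floor(cy * grid_h), 0), grid_h - 1)
        grid[row][col] = name
    # Empty cells become two half-width spaces.
    return "\n".join(
        "".join(EMOJI[cell] if cell else "  " for cell in line) for line in grid
    )


# Made-up detections: a person toward the left, a dog toward the lower right.
detections = [("person", 0.1, 0.2, 0.3, 0.8), ("dog", 0.7, 0.5, 0.9, 0.9)]
text = render(detections)
print(text)
```

When two objects fall into the same cell, the later one overwrites the earlier one in this sketch, so a coarse grid keeps at most one emoji per cell.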

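Preparation

For object detection, I used a model trained on the COCO dataset.

COCO labels objects in 80 classes such as people, cars, and dogs, so these 80 classes are the full list of objects that can be detected in an image. (Reference: list of COCO classes)

Two questions need to be settled in advance: which emoji should each object map to, and are there objects with no matching emoji at all?

To jump to the conclusion, I found similar emoji for 70 of the classes and built the conversion table below. Note: the emoji may not display in some environments.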

COCO class name	Unicode emoji	COCO class name	Unicode emoji
person	🧑	bicycle	🚲
car	🚔	motorcycle	🏍
airplane	✈	bus	🚌
train	🚆	truck	🚚
boat	🛥	trafficlight	🚥
stopsign	🛑	bench	🪑
bird	🐦	cat	🐈
dog	🐕	horse	🐎
sheep	🦌	cow	🐄
elephant	🐘	bear	🐻
zebra	🦓	giraffe	🦒
backpack	🎒	umbrella	☂
handbag	👜	tie	👔
suitcase	💼	frisbee	🥏
skis	🎿	snowboard	🏂
sportsball	⚾	kite	🪁
baseballbat	🏏	baseballglove	🧤
skateboard	🛹	surfboard	🏄
tennisracket	🎾	bottle	🍼
wineglass	🍷	cup	🥤
fork	🍴	knife	🍴
spoon	🥄	bowl	🥣
banana	🍌	apple	🍏
sandwich	🥪	orange	🟠
broccoli	🥦	carrot	🥕
hotdog	🌭	pizza	🍕
donut	🍩	cake	🎂
chair	🪑	couch	🛋
bed	🛏	toilet	🚽
tv	📺	laptop	💻
mouse	🐁	remote	🛰
keyboard	⌨	cellphone	📱
microwave	🔭	book	📘
clock	⏰	vase	⛲
scissors	✂	teddybear	🧸

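There are more than 1,000 Unicode emoji, and more than 5,000 when variants are counted, so matching them by eye was impractical. Instead, I built the table with the method below. Entries that were clearly wrong on visual inspection were removed, which is why the table covers 70 of the 80 classes.

1. Vectorize both the COCO class names and the emoji names (for example, 👍 is "THUMBS UP SIGN") with Word2vec
2. For each COCO class, find the emoji whose vector points in the closest direction and link the two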
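
"Closest vector direction" is usually measured with cosine similarity, so the matching step can be sketched as follows. This is a minimal sketch, not the code used to build the table: the 3-dimensional vectors are made-up stand-ins for real Word2vec embeddings, the function names are hypothetical, and how a multi-word emoji name is turned into a single vector is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(vec, candidates):
    """Return (name, score) of the candidate most similar to vec."""
    scored = ((name, cosine_similarity(vec, v)) for name, v in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vectors standing in for Word2vec embeddings (values are made up).
class_vectors = {"dog": [0.9, 0.1, 0.0], "car": [0.0, 0.2, 0.9]}
emoji_vectors = {
    "DOG": [1.0, 0.0, 0.1],
    "AUTOMOBILE": [0.1, 0.1, 1.0],
    "THUMBS UP SIGN": [0.3, 0.9, 0.2],
}

matches = {cls: best_match(vec, emoji_vectors) for cls, vec in class_vectors.items()}
for cls, (name, score) in matches.items():
    print(cls, name, round(score, 3))
```

Keeping the score next to each match makes the manual clean-up easier, since low-scoring pairs are the first candidates to check by eye.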

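The method is only semi-automatic, but looking at the table, the mapping seems to be mostly correct. The main slip is that mouse became the rodent rather than a computer mouse.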

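Implementation

The implementation is shown below. It is just under 100 lines.

The mapping between COCO class names and emoji described above is stored in a file placed in the same folder (dic_emoji.txt).
- It is a CSV with four columns per row: COCO class name, emoji name, Unicode emoji, similarity score. Only the first and third columns are used, so the second and fourth may be left empty.
For object detection, OpenCV loads MobileNet-SSD v2 (a model pretrained with TensorFlow) and runs inference. Many more accurate models have appeared in recent years, but this use case does not need much accuracy, so I prioritized a lightweight model.
- For loading and running models from other frameworks in OpenCV, I followed the OpenCV tutorial. If you need higher accuracy, the YOLOv3 model used in that tutorial is a good choice.
- I got the pretrained model from here. The pb and pbtxt files are placed under ./models.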

So the folder layout looks like this.

├ image2emoji.py
├ dic_emoji.txt
└ models
  ├ frozen_inference_graph.pb
  └ ssd_mobilenet_v2_coco_2018_03_29.pbtxt
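
To make the dic_emoji.txt format concrete, here is a small sketch that parses two sample rows with Python's csv module. The sample rows, the score value, and the function name `parse_dic_emoji` are assumptions made for illustration; the first row leaves the unused second and fourth columns empty.

```python
import csv
import io

# Hypothetical sample of dic_emoji.txt: class name, emoji name, emoji, score.
SAMPLE = "person,,🧑,\ndog,DOG,🐕,0.71\n"


def parse_dic_emoji(lines):
    """Build {class name: emoji} from rows of the 4-column CSV."""
    table = {}
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        table[row[0].strip()] = row[2].strip()
    return table


table = parse_dic_emoji(io.StringIO(SAMPLE))
print(table)
```

When reading the real file, opening it with an explicit `encoding='utf-8'` avoids decoding errors on systems whose default encoding is not UTF-8.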

import math
import sys
import cv2

H_SIZE = 5 # output height in emoji cells
W_SIZE = 6 # output width in emoji cells
TH_DETECT = 0.1 # score threshold for object detection
CLASS_NAMES = {0: 'background',
              1: 'person', 2: 'bicycle', 3: 'car', 4: 'motorcycle', 5: 'airplane', 6: 'bus',
              7: 'train', 8: 'truck', 9: 'boat', 10: 'trafficlight', 11: 'firehydrant',
              13: 'stopsign', 14: 'parkingmeter', 15: 'bench', 16: 'bird', 17: 'cat',
              18: 'dog', 19: 'horse', 20: 'sheep', 21: 'cow', 22: 'elephant', 23: 'bear',
              24: 'zebra', 25: 'giraffe', 27: 'backpack', 28: 'umbrella', 31: 'handbag',
              32: 'tie', 33: 'suitcase', 34: 'frisbee', 35: 'skis', 36: 'snowboard',
              37: 'sportsball', 38: 'kite', 39: 'baseballbat', 40: 'baseballglove',
              41: 'skateboard', 42: 'surfboard', 43: 'tennisracket', 44: 'bottle',
              46: 'wineglass', 47: 'cup', 48: 'fork', 49: 'knife', 50: 'spoon',
              51: 'bowl', 52: 'banana', 53: 'apple', 54: 'sandwich', 55: 'orange',
              56: 'broccoli', 57: 'carrot', 58: 'hotdog', 59: 'pizza', 60: 'donut',
              61: 'cake', 62: 'chair', 63: 'couch', 64: 'pottedplant', 65: 'bed',
              67: 'diningtable', 70: 'toilet', 72: 'tv', 73: 'laptop', 74: 'mouse',
              75: 'remote', 76: 'keyboard', 77: 'cellphone', 78: 'microwave', 79: 'oven',
              80: 'toaster', 81: 'sink', 82: 'refrigerator', 84: 'book', 85: 'clock',
              86: 'vase', 87: 'scissors', 88: 'teddybear', 89: 'hairdrier', 90: 'toothbrush'}

def detection(image):
    # Load the TensorFlow MobileNet-SSD v2 model through OpenCV's dnn module.
    model = cv2.dnn.readNetFromTensorflow('models/frozen_inference_graph.pb',
                                          'models/ssd_mobilenet_v2_coco_2018_03_29.pbtxt')
    image_height, image_width, _ = image.shape

    model.setInput(cv2.dnn.blobFromImage(image, size=(300, 300), swapRB=True))
    output = model.forward()

    objects = []
    # Each row is [image_id, class_id, confidence, xmin, ymin, xmax, ymax],
    # with box coordinates normalized to the 0-1 range.
    for det in output[0, 0, :, :]:
        confidence = det[2]
        if confidence > TH_DETECT:
            class_id = int(det[1])  # the network returns the id as a float
            class_name = CLASS_NAMES.get(class_id)
            if class_name is None:
                continue  # id with no entry in CLASS_NAMES
            xmin = int(det[3] * image_width)
            ymin = int(det[4] * image_height)
            xmax = int(det[5] * image_width)
            ymax = int(det[6] * image_height)
            objects.append([class_name, xmin, ymin, xmax, ymax])
    print(objects)
    return objects

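def load_dic_emoji(fn):
    # Each row: COCO class name, emoji name, Unicode emoji, similarity score.
    dic_emoji = {}
    with open(fn, encoding='utf-8') as f:
        for row in f:
            row = row.strip()
            if not row:
                continue  # skip blank lines
            class_name, _, emoji, _ = row.split(',')
            dic_emoji[class_name] = emoji
    return dic_emoji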

def draw_emoji_image(image_list, dic_emoji, dst):
    # Write one text line per grid row; cells without an emoji become two spaces.
    for row in image_list:
        for val in row:
            if val not in dic_emoji:
                dst.write('  ')
            else:
                dst.write(dic_emoji[val])
        dst.write('\n')

def main():
    args = sys.argv
    assert len(args) == 3 # this.py input_image output_txt
    dic_emoji = load_dic_emoji('./dic_emoji.txt')
    image = cv2.imread(args[1])
    if image is None:
        sys.exit('Could not read image: ' + args[1])

    objects = detection(image)
    image_height, image_width, _ = image.shape

    # Grid of class names; None marks an empty cell.
    image_list = [[None for i in range(W_SIZE)] for j in range(H_SIZE)]
    with open(args[2], 'w', encoding='utf-8') as dst:
        list_objects = []
        for obj in objects:
            class_name, xmin, ymin, xmax, ymax = obj
            if class_name in dic_emoji:
                # Box center, normalized to the 0-1 range.
                list_objects.append([class_name, (xmin+xmax)/(2*image_width), (ymin+ymax)/(2*image_height)])

        for obj in list_objects:
            class_name, x, y = obj
            # Clamp to the grid so a box reaching or crossing the image edge
            # does not raise IndexError or wrap around to the other side.
            x_round = min(max(math.floor(x * W_SIZE), 0), W_SIZE - 1)
            y_round = min(max(math.floor(y * H_SIZE), 0), H_SIZE - 1)
            image_list[y_round][x_round] = class_name
            print(x_round, y_round, class_name)
        draw_emoji_image(image_list, dic_emoji, dst)

if __name__ == '__main__':
    main()