This is the GitHub page for the 2022 ACL paper "Learning to Mediate Disparities Towards Pragmatic Communication".
Human communication is a collaborative process. Speakers, on top of conveying their own intent, adjust the content and language expressions by taking the listeners into account, including their knowledge background, personality, and physical capabilities. Towards building AI agents that have similar abilities in language communication, we propose a novel rational reasoning framework, Pragmatic Rational Speaker (PRS), where the speaker attempts to learn the speaker-listener disparity and adjust the speech accordingly, by adding a light-weighted disparity adjustment layer into working memory on top of speaker’s long-term memory system. By fixing the long-term memory, the PRS only needs to update its working memory to learn and adapt to different types of listeners. To validate our framework, we create a dataset that simulates different types of speaker-listener disparities in the context of referential games. Our empirical results demonstrate that the PRS is able to shift its output towards the language that listeners are able to understand, significantly improve the collaborative task outcome, and learn the disparity faster than joint training.
We modified the Abstract Scenes (Gilberto Mateos Ortiz et al., 2015) dataset for our experiments. It contains 10,020 images, each with 3 ground-truth captions and a median of 6 to 7 objects.
We assembled ∼35k pairs of images that differ by ≤ 4 objects as the Hard set (h), ∼25k pairs that differ by > 4 objects as the Easy set (s), and both together as the Combined set (b). The image pairs were split into training, validation, and test sets at a ratio of 8:1:1.
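The bucketing and split described above can be sketched as follows. This is a minimal illustration and not the repository's code: the function names, the fixed seed, and the lowercase split names are assumptions. Only the ≤ 4 threshold and the 8:1:1 ratio come from the text.

```python
import random

HARD_MAX_DIFF = 4  # pairs differing by <= 4 objects count as hard (from the text)


def difficulty(num_diff_objects):
    """Return 'h' (hard) or 's' (easy) for an image pair."""
    return "h" if num_diff_objects <= HARD_MAX_DIFF else "s"


def split_pairs(pairs, seed=0):
    """Shuffle pairs and split them 8:1:1 (hypothetical helper)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


# Toy pairs: (img1_idx, img2_idx, number of differing objects)
toy_pairs = [(i, i + 1, i % 8) for i in range(100)]
hard = [p for p in toy_pairs if difficulty(p[2]) == "h"]
easy = [p for p in toy_pairs if difficulty(p[2]) == "s"]
splits = split_pairs(toy_pairs)
print(len(hard), len(easy), {name: len(part) for name, part in splits.items()})
```

The Combined set (b) in this sketch is simply `hard + easy` before splitting.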
The paired image dataset can be found in the input folder, in files named [split]_[difficulty]_IDX.txt, e.g. TRAIN_s_IDX.txt. Each file includes three columns: {img1_idx, img2_idx, #diff_obj}.
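A small reader for these files might look like the sketch below. The column delimiter is not stated here, so the parser accepts commas or whitespace, and it assumes all three columns are integers. `parse_pair_line` and `load_pairs` are hypothetical names, not functions from this repository.

```python
import re


def parse_pair_line(line):
    """Parse one 'img1_idx img2_idx #diff_obj' row into three ints.

    Assumption: columns are separated by commas and/or whitespace.
    """
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    if len(fields) != 3:
        raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
    img1_idx, img2_idx, num_diff = (int(f) for f in fields)
    return img1_idx, img2_idx, num_diff


def load_pairs(path):
    """Read every non-empty row of an IDX file, e.g. input/TRAIN_s_IDX.txt."""
    with open(path, encoding="utf-8") as handle:
        return [parse_pair_line(line) for line in handle if line.strip()]
```

If the files turn out to have a header row or non-integer indices, the `int(...)` conversion raises a `ValueError`, which makes the mismatch visible early.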
To create disparities, run buildGT.py to create corresponding datasets and intermediate files.
Options:
-d, --disparity TEXT Disparity type: hypernym(knowledge), catog(limited visual)
-i, --inpath TEXT The input file path, default 'input/'
-o, --outpath TEXT The output file path, default 'input/'
-img, --imgpath TEXT Input image file folder, default 'AbstractScenes_v1.1/RenderedScenes/'
-l, --maxlen INTEGER max sentence length, default 25
--help Show this message and exit.
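To give an intuition for the hypernym (knowledge) disparity, the sketch below rewrites a caption so that specific object words are replaced by more general ones. It is one plausible reading of the disparity, not the logic in buildGT.py: the word mapping and the function name are made up for illustration, and tokenization is plain whitespace splitting.

```python
# Hypothetical mapping from specific words to hypernyms (illustration only).
HYPERNYMS = {
    "dog": "animal",
    "cat": "animal",
    "pizza": "food",
    "burger": "food",
}


def to_hypernym_caption(caption, hypernyms=HYPERNYMS):
    """Replace each known specific word with its hypernym."""
    tokens = caption.lower().split()
    return " ".join(hypernyms.get(token, token) for token in tokens)


print(to_hypernym_caption("Mike feeds the dog"))
```

A listener trained on captions rewritten this way would only see the general terms, which is the kind of vocabulary gap the speaker then has to adapt to.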
The Literal Speaker is an object detection based image captioning module that generates caption candidates for the target image.
- Object Detection
We retrained YOLO v3 from scratch using individual images in the Abstract Scenes dataset. The inference-time code can be found in yolo.py.
- Image Captioning
We adapted the Show, Attend, and Tell code and retrained the captioning module from scratch using individual images in the Abstract Scenes dataset. The inference-time code can be found in speaker.py.
- Internal Listener Simulation
Without disparity concerns, the Rational Speaker fulfills the task goal by simulating the Rational Listener’s behavior and ranking the candidate captions generated by the Literal Speaker according to how well they distinguish the target image from the distractors.
The Rational Listener picks out the image that it believes is the target. We reuse the same fixed, pre-trained, training-mode Transformer module to decide which image the caption grounds better in. The model can be found in listener.py.
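The ranking step can be illustrated with a toy example. In the sketch below, a simulated listener turns grounding scores into a probability over images, and candidate captions are sorted by the probability assigned to the target. The word-overlap scorer stands in for the Transformer module, and all names and data here are assumptions for illustration, not code from listener.py.

```python
import math


def listener_probs(caption, images, score):
    """Softmax over images of how well the caption grounds in each one."""
    logits = [score(caption, image) for image in images]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_captions(candidates, images, target_index, score):
    """Sort captions by the simulated listener's probability of the target."""
    scored = [
        (listener_probs(caption, images, score)[target_index], caption)
        for caption in candidates
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def overlap_score(caption, image_objects):
    """Toy grounding score: count caption words that name an object."""
    return sum(1 for word in caption.split() if word in image_objects)


target = {"boy", "dog", "tree"}
distractor = {"boy", "cat", "tree"}
candidates = ["the boy stands near the tree", "the boy plays with the dog"]
ranking = rank_captions(candidates, [target, distractor], 0, overlap_score)
print(ranking[0][1])  # the caption mentioning the dog separates the two images
```

The first caption fits both images equally well, so the listener is at chance (0.5), while the second mentions an object only the target contains and is ranked first.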
To create listeners with disparities, retrain the image captioning model from the previous step using the new dataset for each type of disparity.
On top of the Rational Speaker, the Pragmatic Rational Speaker incorporates a disparity adjustment layer to learn and accommodate the listener’s disparity through REINFORCE interactions. The model can be found in pragmatic.py.
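The idea of a frozen speaker plus a small trainable adjustment can be sketched with a toy REINFORCE loop. Here the base caption scores play the role of the fixed long-term memory, and an additive vector is the only thing updated from listener feedback. This is an illustrative simplification with made-up captions, rewards, and hyperparameters; it is not the model in pragmatic.py.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_adjustment(base_scores, reward_fn, steps=2000, lr=0.1, seed=0):
    """Learn additive adjustments with REINFORCE; base_scores stay frozen."""
    rng = random.Random(seed)
    adjust = [0.0] * len(base_scores)  # the only trainable parameters
    baseline = 0.0  # running-average reward, used to reduce variance
    for _ in range(steps):
        probs = softmax([b + a for b, a in zip(base_scores, adjust)])
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        reward = reward_fn(choice)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # d log pi(choice) / d logit_k = 1[k == choice] - probs[k]
        for k in range(len(adjust)):
            grad = (1.0 if k == choice else 0.0) - probs[k]
            adjust[k] += lr * advantage * grad
    return adjust


captions = ["mike pets the dog", "mike pets the animal"]
base_scores = [2.0, 0.0]  # frozen speaker prefers the specific word
understood = {1}  # toy listener only succeeds on the second caption


def reward_fn(index):
    return 1.0 if index in understood else 0.0


adjust = train_adjustment(base_scores, reward_fn)
final = softmax([b + a for b, a in zip(base_scores, adjust)])
print(captions[final.index(max(final))])
```

Because only `adjust` changes, the same frozen `base_scores` could be reused with a fresh adjustment vector for a different listener.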
Options:
-d, --disparityin TEXT Disparity type: hypernym, catog
-s, --simplicity TEXT Simplicity of the dataset, b: both, s: simple, h: hard. Default 'b'
-t, --testime BOOLEAN Train or Test mode, default Train
-r, --repeat INTEGER Number of tests to repeat, default 1
-i, --inpath TEXT The input file path, default 'input/'
-c, --ckptin TEXT Checkpoint for previously saved model, default None
--help Show this message and exit.
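For scripted runs, the options above can be assembled into a command line programmatically. The helper below is a hypothetical convenience and assumes the script is launched as `python pragmatic.py` from the repository root. It leaves out `-t` because the text does not say which value selects test mode, and the checkpoint path in the example is a placeholder.

```python
import shlex


def pragmatic_command(disparity, simplicity="b", repeat=1,
                      inpath="input/", ckpt=None):
    """Build an argument list for pragmatic.py from the documented options."""
    if disparity not in {"hypernym", "catog"}:
        raise ValueError(f"unknown disparity type: {disparity!r}")
    if simplicity not in {"b", "s", "h"}:
        raise ValueError(f"unknown simplicity: {simplicity!r}")
    cmd = ["python", "pragmatic.py",
           "-d", disparity,
           "-s", simplicity,
           "-r", str(repeat),
           "-i", inpath]
    if ckpt is not None:
        cmd += ["-c", ckpt]  # placeholder path supplied by the caller
    return cmd


cmd = pragmatic_command("hypernym", simplicity="h")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`.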
Distributed under the MIT License. See LICENSE.txt for more information.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For further questions, please contact [email protected].