Duplicate conversations across train/valid/test #2

Open
moyapchen opened this issue Mar 22, 2021 · 0 comments

Hi CMU_DoG dataset authors,

While doing some analysis of the dataset, we noticed that there were a few conversations duplicated between train/valid/test. For example, see

https://github.com/festvox/datasets-CMU_DoG/blob/master/Conversations/test/7747dbdeaeb5c9082abe54c0231fcbf1d9907d38.json

and

https://github.com/festvox/datasets-CMU_DoG/blob/master/Conversations/valid/7747dbdeaeb5c9082abe54c0231fcbf1d9907d38.json

which are the same conversation, except that one is in test and one is in valid (see the URLs).

The following are the 110 conversation IDs that show up as duplicated. None of them appears in all three splits, but they overlap across different pairs of splits: train/test, train/valid, and test/valid.

7747dbdeaeb5c9082abe54c0231fcbf1d9907d38
a7db00cab02b9513fa1d7172d35573e6f23630ab
017f651588118f8794349b3c9bd027c63d4226cc
01e70e7c454d15f3408e516cce788a4e8b24694a
02a3b61b613f7d2ba733881c4b732bbbad73ff7b
04d985b10ce191de275f9c4d1f9f4d809478b707
07a0a2126e37ea5ed83748483e5a2deed2bc120a
088b88b115140214c3e1b3d955c772a69613211c
09bbda4742d7603907c9e6ed27466d6745dffe31
0ecc4fb0efcc2362ecbe89cb2c3d6fc3012508f2
0f8ab05299aeed625e0304069ebcb3908d9430b4
0ff4a418d41d7755b992acd4b48c16ce536f90c1
116c5d7e7dd946a6eed95ff7838230656876761f
1359558ae032c547fac59406d33a449f6a338960
1381a18b60a35681a78620dc9479b5f019c72bb0
13b8d82a55192c0e48afe06caf1b398840fac5b5
16ea8e6ad0f90cc30fccde2106163305501bd1f7
17f5a0806eeddf1c865d255246194cc6c16ceab2
1a5666222c56bd5ca757ceebd3a0425757063b83
1bc2cda069820df5e8c1a50e9f687a21fe557c4e
20703fb140627f1bdfffa8d22f45dc9b70284327
20dc13f012d2ff943880f1f7b2a1364cc8805b76
26d5f0381a0b415201d84caff5bbe31a6746286d
2a7e8fcd644d0b4e38a9919d67898a05f4efcdd1
2f7ae28c28b287014f857391254d62251334ea2c
3001cb92b91e7ccad98b533f6998ada0bf8d2935
30a0fed0f23bae0b04110f3cb25e7a960a12ba21
30f5ee0dbb86b2ad38b8ff648a261f8c759730f7
33815de08497bb9e2dbb5d1799fc9e6747f153b4
34a28dcaca5730a4f2046f509315d802616a2ca9
36183c3c2f123e5dc3d8ac126d02a118d1fe1f75
3a823ace51029edf277e620c039045deb9b8afb7
3c9e09be88afdd52fd96538ec0cbaae6667f8117
42397f53285165f64e51b932334ce24cbd73c992
4665d98463d996b96c5436f331e361b51014adb0
4abe690df9089bb3673ff1233c3acbf53a29ff76
4c6ef11a1411d94af7abfe56547bed5139f80df2
549e28fd4e2f48c5a8919e87e883b388937973a4
56c4f87acf58a8d2454a6a814a0d463f6100502c
574a0d6263e6bcf555b6121a20cecd4aa35ba331
5d1a22f369b4d4edf10522f0dce98ba3fe4ba7b8
5e59f8bcb3f6b14c0ab462c2fba0f393e8dfa153
60d21f582f3707b37e616ee859e4ed08a814f918
62059e7d014539546fc8af24f72b9c67a2cb50be
6205d01434cd688417331a822c3206ed96abaeda
623c99e0a14e05afc92e0ef717955f6db5e9540f
637e22cb9527ca9dc29f45a8ac63934889c46bf1
64096861b9834df2eb307aa585b929bf1d047147
6483a0c5154147823d9dd06d45204a43e1c84c68
653b3d41abc2a261b6b52cb055721e56c44ffb9b
6876ce7a8fda3da5f889de7647037a1c200d5f6f
6bfbef62ca380281cc074eb69be092f315239083
6e4d4181d03021379822e2fc05df6c7294652bfe
70a119f7dcc93b5503eb2a9bf2fa8bc81e1b20fb
726ff4eb9abddd3963f06cd6ae980cf9370ce283
78f08dac1ce14021c8b3159c3d0c81cc55771ff6
7df0e0d70733082d5f18738086f92599750861f2
80f367e76c4e3c7dcc8a1004fdcd261b5a2f13ce
8281f7c60a2cabab0231a08554a57900a4ea3a49
84fd06134e332f4cdc8db3f868d5828d4cacb0f3
852b189a5da869ae04ff5c6de05d8a8bb51ef126
855d496060a757e64bc6a7d267432e91b5e61ad3
8d6c8ba49709839220139c9db1fe1a217a8ec298
9edc25e2b1a930aa2eb6e70332ea3cbf4862f583
a283e5ab4fea6aac663a12ed377b3005cf316c2e
a56156a210ca64e0ece5d4bb70eddb6702ba33d8
a592659327e51d654fcddd2763b63b5072620c6e
af408e2d131f34d20779c44ae83b39a02646f9a8
af52c68e6fb510a5192017a30c432e4e77f6043c
af8108609c557500687c6665afdae4835c7aff74
af8b3886d7f2b22b0880fd26135b05226c2138bf
afd8d2f054f068cf17e9dd2f70424114810d9e33
b0a42297255fb0abfca1e72004a1320fb0d8bac2
b301692297dda701377420861b26b3840a2d41dd
b30491ded83cc6aac65d905a01f8dbecb51bc60d
b7909659ab1157476bb62ba0998910f431792d9d
b890a435f85d9385d8092a75caa53a0af35f345d
bb460128bb7440368fcf78389e1d1bd7ef32cad4
bb60032589c61c8f866b88831f5171b1a026c7ed
bc3e8e30e47e3192f7ae41ff4bca8f2de221cb20
bcdd24739c3da8bd147e9ac52581f59370ff6722
bdd812d5582b2fa57d833678780ed23b649ebcc5
be1a6cdb20115526c9b11aae15b7a20af66d3319
be534aad2c42a8ca1c8a0a0c0dcaa5ee061dcda1
c0c0b679ea13cce1ddfd674b6bd9bba07e81c421
c3c6d8d44a5c79576344f41268b049812764fb9f
c896c54da7fd56ae14f66387a58f91b250ddea71
ca84f2086537b5cbe4c7ae68b4b30e6b8539dbe2
cdb1dd3bf587a66afe44ae1c80837389dfc5d528
d0f5e4da2a4115eef595310121a74762f89ff959
d1484f4c978275de43f27f72c5dca18b42e1ea1c
d2baca41b51ed0dbe5a6b88f2087a05d7fe3c081
d3d7c3ddca11ed89a8d4474eea6bc7ef7bab84d5
d4e9d17dc51fdbf4e8fba842b0007781b99660c2
d84c1088c8de7faf0fb73ee1d725aab8e6881946
d95df65e15b78100b4bea60c39532a1ff80ebc3d
da8546a7c874693a4083147ab86ac8921a9e38d5
db3ab055421f0e8b81987db3e46f9100c69e36b4
e791c0984063fcb07ea69a3d074a3eb52c033f52
e83a9ec6538046cc00f58a00cb3582de6c660def
e88e40589b4bebedad9b2bca7d174a8d894635ba
e8a57b826a25f1a163b771c14be4be0a2a46cb44
ed9a1995fff41fc6accf7e816633e1c0b5a19905
f080bc4443de70e3bf9441a98bbbde4224a1dcbb
f1bb15df63c0520602f3787f6c1541a1df6a1753
f2e29d17eb5d85a21b5704c59ffb20ad92ecdd83
f5a754e4d923d271de4e7f6ddc4ecd39c280ca4c
f86f196f304d52050e1a3786ca6b49c9a6e13d9e
f8d9ed8a56714098567c10107c29d73fd2fe805a
fd698fb98d1eb6436d2e5f2155d1332f494ebecc
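
For anyone who wants to reproduce a list like the one above, here is a minimal sketch of one way to check for overlap. It assumes a local checkout laid out like the URLs above (`Conversations/<split>/<id>.json`) and treats two files with the same name in different splits as the same conversation; it does not compare file contents. The function name `find_cross_split_duplicates` is hypothetical and is not part of the dataset repository.

```python
from collections import defaultdict
from pathlib import Path


def find_cross_split_duplicates(root, splits=("train", "valid", "test")):
    """Map each conversation ID found in more than one split to those splits.

    Assumption: each conversation is stored as <root>/<split>/<id>.json,
    so the file name without its extension serves as the conversation ID.
    """
    seen = defaultdict(list)
    for split in splits:
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            # A missing split directory is skipped rather than treated as an error.
            continue
        for path in sorted(split_dir.glob("*.json")):
            seen[path.stem].append(split)
    # Keep only the IDs that occur in at least two splits.
    return {cid: found for cid, found in seen.items() if len(found) > 1}


if __name__ == "__main__":
    # "Conversations" is the directory name from the URLs above; adjust to your checkout.
    duplicates = find_cross_split_duplicates("Conversations")
    for cid, found in sorted(duplicates.items()):
        print(cid, "/".join(found))
    print(len(duplicates), "conversation IDs appear in more than one split")
```

Matching on file names is only a first pass. Comparing the parsed JSON of each matching pair would confirm that the contents are identical as well.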