This repository has been archived by the owner on Oct 26, 2022. It is now read-only.

Data preparation #130

Open
zwx8981 opened this issue Sep 17, 2018 · 2 comments

Comments


zwx8981 commented Sep 17, 2018

Hi, thank you for your great work. I have a question about data preparation. Specifically, if I want to use the CNN-based sequence encoder and decoder as standalone modules that can be inserted into other translation models, how should I prepare a source dictionary file that can be loaded successfully by the fairseq.data.Dictionary.load() method? I read the source code and found this docstring in the Dictionary.load() method:

    """Loads the dictionary from a text file with the format:

    ```
    <symbol0> <count0>
    <symbol1> <count1>
    ...
    ```
    """

What does count0 mean?

@mls1999725

I want to know it, too

@jgehring
Contributor

jgehring commented May 4, 2020

I'm not sure which section of the code you're referring to here, but, generally speaking, the dictionary contains an index-to-symbol mapping as well as frequencies of symbols (in the form of raw counts over the respective source corpus).
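To make the format concrete, here is a minimal sketch of producing such a file by hand, assuming whitespace tokenization and a tiny in-memory corpus. The helper names (`build_dictionary_lines`, `write_dictionary`) and the output path are hypothetical and not part of fairseq; the sketch only follows the `<symbol> <count>` layout quoted in the docstring above and has not been checked against any particular fairseq version.

```python
from collections import Counter


def build_dictionary_lines(sentences):
    """Count whitespace-separated tokens and format them as '<symbol> <count>' lines."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    # Most frequent symbols first; ties are broken alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{symbol} {count}" for symbol, count in ordered]


def write_dictionary(path, sentences):
    """Write one '<symbol> <count>' pair per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for line in build_dictionary_lines(sentences):
            f.write(line + "\n")


# Toy corpus: each count is the raw number of occurrences of the symbol.
corpus = ["the cat sat", "the dog sat", "the end"]
lines = build_dictionary_lines(corpus)
print("\n".join(lines))
```

Each count here is simply how often the symbol occurs in the corpus, which matches the "raw counts" described above. Calling `write_dictionary("dict.src.txt", corpus)` would write the same lines to a file of that (hypothetical) name.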

@jgehring jgehring closed this as completed May 4, 2020
@jgehring jgehring reopened this May 4, 2020