- Requirement
  The LLM module of DeepKE calls the EasyInstruct toolkit (An Easy-to-use Framework to Instruct Large Language Models).
  >> pip install git+https://github.com/zjunlp/EasyInstruct
  >> pip install hydra-core
- Data
  The data here refers to the example data used for in-context learning, which is stored in the `data` folder. The `.json` files in it are the default examples for the various tasks. Users can customize these examples, but they must follow the given data format.
- Configuration
  The `conf` folder stores the parameter settings. The parameters required to call the GPT-3 interface are passed in through the files in this folder.
  - In the Named Entity Recognition (`ner`) task, `text_input` is the text to predict on; `domain` is the domain of that text and can be empty; `labels` is the entity label set and can also be empty.
  - In the Relation Extraction (`re`) task, `text_input` is the text; `domain` indicates the domain the text belongs to and can be empty; `labels` is the set of relation type labels and can be empty if there is no custom label set; `head_entity` and `tail_entity` are the head and tail entities of the relation to be predicted; `head_type` and `tail_type` are the types of those head and tail entities.
  - In the Event Extraction (`ee`) task, `text_input` is the text to predict on; `domain` is the domain of that text and can be empty.
  - In the Relational Triple Extraction (`rte`) task, `text_input` is the text to predict on; `domain` is the domain of that text and can be empty.
  - The other parameters have the following meanings:
    - `task` specifies the task type: `ner` for named entity recognition, `re` for relation extraction, `ee` for event extraction, and `rte` for triple extraction.
    - `language` indicates the language of the task: `en` for English extraction tasks and `ch` for Chinese extraction tasks.
    - `engine` is the name of the large model to use; it should match the model name specified by the OpenAI API.
    - `api_key` is the user's API key.
    - `zero_shot` indicates whether the zero-shot setting is used. When it is `True`, only the instruction is used to prompt the model for information extraction; when it is `False`, in-context examples are used as well.
    - `instruction` specifies a user-defined prompt instruction; the default instruction is used when it is empty.
    - `data_path` is the directory where the in-context examples are stored; the default is the `data` folder.
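As a quick sanity check before running anything, the parameter descriptions above can be turned into a small validation routine. The sketch below is only an illustration: `check_config` and the plain-dict representation are hypothetical and not part of DeepKE or EasyInstruct; the allowed values are taken from the descriptions in this document (including the `da` task used for data augmentation).

```python
# Hypothetical helper, not part of DeepKE: the allowed values mirror the
# parameter descriptions in this README.
ALLOWED_TASKS = {"ner", "re", "ee", "rte", "da"}
ALLOWED_LANGUAGES = {"en", "ch"}
RE_REQUIRED = ("head_entity", "tail_entity", "head_type", "tail_type")


def check_config(cfg):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    task = cfg.get("task")
    if task not in ALLOWED_TASKS:
        problems.append(f"unknown task: {task!r}")
    if cfg.get("language") not in ALLOWED_LANGUAGES:
        problems.append(f"unknown language: {cfg.get('language')!r}")
    if not cfg.get("text_input"):
        problems.append("text_input is empty")
    if not isinstance(cfg.get("zero_shot"), bool):
        problems.append("zero_shot should be True or False")
    if task == "re":
        # The re task additionally needs the entity pair and its types.
        for key in RE_REQUIRED:
            if not cfg.get(key):
                problems.append(f"{key} is required for the re task")
    return problems


ner_cfg = {
    "task": "ner",
    "language": "en",
    "engine": "gpt-3.5-turbo",
    "api_key": "YOUR_API_KEY",  # placeholder, replace with a real key
    "zero_shot": True,
    "text_input": "Japan began the defence of their Asian Cup title.",
    "domain": "",
    "labels": "",
}
print(check_config(ner_cfg))
```

Switching `task` to `re` in the same dictionary would report the four missing entity parameters, which is the kind of mistake this check is meant to catch early.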
We use the EasyInstruct tool, a user-friendly framework for instructing large language models, to complete this task. Please refer to Chapter 1 for the environment and data.
Once the parameters are set, you can run `run.py` directly:
>> python run.py
Below are input and output examples for different tasks:
Task | Input | Output |
---|---|---|
NER | Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday. | [{'E': 'Country', 'W': 'Japan'}, {'E': 'Country', 'W': 'Syria'}, {'E': 'Competition', 'W': 'Asian Cup'}, {'E': 'Competition', 'W': 'Group C championship'}] |
RE | The Dutch newspaper Brabants Dagblad said the boy was probably from Tilburg in the southern Netherlands and that he had been on safari in South Africa with his mother Trudy , 41, father Patrick, 40, and brother Enzo, 11. | parents |
EE | In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel. | event_list: [ event_type: [arguments: [role: "cameraman", argument: "Baghdad"], [role: "American tank", argument: "Palestine Hotel"]] ] |
RTE | The most common audits were about waste and recycling. | [['audit', 'type', 'waste'], ['audit', 'type', 'recycling']] |
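The NER output in the table above uses Python-literal syntax, so it can be post-processed with the standard library. The following is a minimal sketch that assumes the model's answer arrives as a string in exactly that shape (real model output may be malformed and would need error handling); `group_entities` is a hypothetical helper name.

```python
import ast
from collections import defaultdict


def group_entities(ner_output):
    """Parse a NER output string and group entity mentions by entity type."""
    # literal_eval only accepts Python literals, so it is safe on untrusted text.
    items = ast.literal_eval(ner_output)
    grouped = defaultdict(list)
    for item in items:
        grouped[item["E"]].append(item["W"])  # 'E' = entity type, 'W' = mention
    return dict(grouped)


ner_output = (
    "[{'E': 'Country', 'W': 'Japan'}, {'E': 'Country', 'W': 'Syria'}, "
    "{'E': 'Competition', 'W': 'Asian Cup'}, "
    "{'E': 'Competition', 'W': 'Group C championship'}]"
)
entities = group_entities(ner_output)
print(entities)
```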
To compensate for the lack of labeled data in few-shot scenarios for relation extraction, we have designed prompts with data style descriptions to guide large language models to automatically generate more labeled data based on existing few-shot data.
- Set `task` to `da`;
- Set `text_input` to the relation label to be augmented, such as `org:founded_by`;
- Set `zero_shot` to `False` and put the few-shot examples in the corresponding file for the `da` task under the `data` folder;
- The range of entity labels can be specified in `labels`.
We use the EasyInstruct tool, a user-friendly framework for instructing large language models, to complete this task. Please refer to Chapter 1 for the environment and data.
Once the parameters are set, you can run `run.py` directly:
>> python run.py
Here is an example of a data augmentation prompt:
'''
One sample in relation extraction datasets consists of a relation, a context, a pair of head and tail entities in the context and their entity types.
The head entity has the relation with the tail entity and entities are pre-categorized as the following types: URL, LOCATION, IDEOLOGY, CRIMINAL CHARGE, TITLE, STATE OR PROVINCE, DATE, PERSON, NUMBER, CITY, DURATION, CAUSE OF DEATH, COUNTRY, NATIONALITY, RELIGION, ORGANIZATION, MISCELLANEOUS.
Here are some samples for relation 'org:founded_by':
Relation: org:founded_by. Context: Talansky is also the US contact for the New Jerusalem Foundation , an organization founded by Olmert while he was Jerusalem 's mayor . Head Entity: New Jerusalem Foundation. Head Type: ORGANIZATION. Tail Entity: Olmert. Tail Type: PERSON.
Relation: org:founded_by. Context: Sharpton has said he will not endorse any candidate until hearing more about their views on civil rights and other issues at his National Action Network convention next week in New York City . Head Entity: National Action Network. Head Type: ORGANIZATION. Tail Entity: his. Tail Type: PERSON.
Relation: org:founded_by. Context: `` We believe that we can best serve our clients by offering a single multistrategy hedge fund platform , '' wrote John Havens , who was a founder of Old Lane with Pandit and is president of the alternative investment group . Head Entity: Old Lane. Head Type: ORGANIZATION. Tail Entity: John Havens. Tail Type: PERSON.
Generate more samples for the relation 'org:founded_by'.
'''
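The prompt above follows a regular template: a fixed description, the list of entity types, the few-shot samples, and a closing request. A sketch of how such a prompt could be assembled is shown below. `build_da_prompt` and the sample dictionary keys (`context`, `head_entity`, and so on) are assumptions made for illustration; the module's actual prompt-building code and example file format may differ.

```python
def build_da_prompt(relation, entity_types, samples):
    """Assemble a data augmentation prompt in the style of the example above."""
    lines = [
        "One sample in relation extraction datasets consists of a relation, "
        "a context, a pair of head and tail entities in the context and "
        "their entity types.",
        "The head entity has the relation with the tail entity and entities "
        "are pre-categorized as the following types: "
        + ", ".join(entity_types) + ".",
        f"Here are some samples for relation '{relation}':",
    ]
    for s in samples:
        # One line per few-shot sample, mirroring the fields in the example prompt.
        lines.append(
            f"Relation: {relation}. Context: {s['context']} "
            f"Head Entity: {s['head_entity']}. Head Type: {s['head_type']}. "
            f"Tail Entity: {s['tail_entity']}. Tail Type: {s['tail_type']}."
        )
    lines.append(f"Generate more samples for the relation '{relation}'.")
    return "\n".join(lines)


samples = [
    {
        "context": "Talansky is also the US contact for the New Jerusalem "
                   "Foundation , an organization founded by Olmert while he "
                   "was Jerusalem 's mayor .",
        "head_entity": "New Jerusalem Foundation",
        "head_type": "ORGANIZATION",
        "tail_entity": "Olmert",
        "tail_type": "PERSON",
    }
]
prompt = build_da_prompt("org:founded_by", ["ORGANIZATION", "PERSON"], samples)
print(prompt)
```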
The following describes the ChatGPT/GPT-4 baseline for the Instruction-based Knowledge Graph Construction task in the CCKS2023 Open Environment Knowledge Graph Construction and Completion Evaluation competition.
Extract relevant entities and relations according to user input instructions to construct a knowledge graph. This task may include knowledge graph completion, where the model is required to complete missing triples while extracting entity-relation triples.
Below is an example of a knowledge graph construction task. Given an input text `input` and an `instruction` (including the desired entity types and relation types), output all relation triples `output` found within the `input`, in the form `(ent1, rel, ent2)`:
instruction="使用自然语言抽取三元组,已知下列句子,请从句子中抽取出可能的实体、关系,抽取实体类型为{'专业','时间','人类','组织','地理地区','事件'},关系类型为{'体育运动','包含行政领土','参加','国家','邦交国','夺得','举办地点','属于','获奖'},你可以先识别出实体再判断实体之间的关系,以(头实体,关系,尾实体)的形式回答"
input="2006年,弗雷泽出战中国天津举行的女子水球世界杯,协助国家队夺得冠军。2008年,弗雷泽代表澳大利亚参加北京奥运会女子水球比赛,赢得铜牌。"
output="(弗雷泽,获奖,铜牌)(女子水球世界杯,举办地点,天津)(弗雷泽,属于,国家队)(弗雷泽,国家,澳大利亚)(弗雷泽,参加,北京奥运会女子水球比赛)(中国,包含行政领土,天津)(中国,邦交国,澳大利亚)(北京奥运会女子水球比赛,举办地点,北京)(女子水球世界杯,体育运动,水球)(国家队,夺得,冠军)"
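For downstream use, an `output` string in the `(ent1, rel, ent2)` form can be split back into triples. The sketch below is one possible approach using a regular expression; `parse_triples` is a hypothetical helper, and it assumes that entities and relations contain no commas or parentheses. It accepts both ASCII and full-width punctuation, since model output may use either.

```python
import re

# Matches "(head,relation,tail)" with ASCII or full-width brackets and commas.
TRIPLE_PATTERN = re.compile(
    r"[(（]([^,，()（）]+)[,，]([^,，()（）]+)[,，]([^,，()（）]+)[)）]"
)


def parse_triples(output):
    """Return a list of (head, relation, tail) tuples found in the output string."""
    return [
        tuple(part.strip() for part in match.groups())
        for match in TRIPLE_PATTERN.finditer(output)
    ]


triples = parse_triples("(弗雷泽,获奖,铜牌)(女子水球世界杯,举办地点,天津)")
print(triples)
```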
Here are some readily processed datasets:
Name | Download | Quantity | Description |
---|---|---|---|
KnowLM-IE.json | Google drive, HuggingFace | 281,860 | Dataset mentioned in InstructIE |
train.json, valid.json | Google drive | 5,000 | Preliminary training set and test set for the task "Instruction-Driven Adaptive Knowledge Graph Construction" in CCKS2023 Open Knowledge Graph Challenge, randomly selected from instruct_train.json |
`KnowLM-IE.json`: Contains the fields `'id'` (unique identifier), `'cate'` (text category), `'instruction'` (extraction instruction), `'input'` (input text), `'output'` (output text) and `'relation'` (triples). Extraction instructions and outputs can be flexibly constructed from `'relation'`. `'instruction'` has 16 formats (4 prompts * 4 output formats), and `'output'` is generated according to the output format specified in `'instruction'`.

`train.json`: Same fields as `KnowLM-IE.json`. `'instruction'` and `'output'` have only one format, but extraction instructions and outputs can also be freely constructed from `'relation'`.

`valid.json`: Same fields as `train.json`, but with more accurate annotations obtained through crowdsourcing.
Here is an explanation of each field:
Field | Description |
---|---|
id | Unique identifier |
cate | Text topic of the input (12 topics in total) |
input | Model input text (all triples involved within it need to be extracted) |
instruction | Instruction for the model to perform the extraction task |
output | Expected model output |
relation | Relation triples (head, relation, tail) involved in the input |
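Because the `relation` field carries the triples themselves, an `output` string in a chosen format can be rebuilt from it. The sketch below illustrates the idea under a loud assumption: it treats `relation` as a list of dictionaries with `head`, `relation` and `tail` keys, which is a guess at the layout rather than the documented schema, so check the actual files before relying on it. `triples_to_output` is a hypothetical helper.

```python
def triples_to_output(relation):
    """Render triples as "(head,relation,tail)(head,relation,tail)...".

    Assumes each triple is a dict with 'head', 'relation' and 'tail' keys;
    this layout is an assumption, not the documented dataset schema.
    """
    return "".join(
        f"({t['head']},{t['relation']},{t['tail']})" for t in relation
    )


# Illustrative record containing only the fields needed here.
record = {
    "input": "2008年,弗雷泽代表澳大利亚参加北京奥运会女子水球比赛,赢得铜牌。",
    "relation": [
        {"head": "弗雷泽", "relation": "国家", "tail": "澳大利亚"},
        {"head": "弗雷泽", "relation": "获奖", "tail": "铜牌"},
    ],
}
print(triples_to_output(record["relation"]))
```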
For more information on data processing and data formats, please refer to ../InstructKGC/kg2instruction
This evaluation task is essentially a triple extraction (rte) task. Detailed parameters and configuration for using this module can be found in the Environment and Data section above. The main parameter settings are as follows:
- Set `task` to `rte`, indicating a triple extraction task;
- Set `language` to `ch`, indicating that the task is based on Chinese data;
- Set `engine` to the desired OpenAI model name (since the OpenAI GPT-4 API is not fully open, this module currently does not support the GPT-4 API);
- Set `text_input` to the `text` field in the dataset;
- Set `zero_shot` as needed; if set to `False`, the in-context learning examples must be provided in the `/data/rte_ch.json` file in the required format;
- Set `instruction` to the `instruction` field in the dataset; if set to `None`, the module's default instruction is used;
- Set `labels` to the entity types, or leave it empty.
Other parameters can be left at their default values.
We have provided a conversion script for the CCKS2023 competition data format, LLMICL/ccks2023_convert.py
We use the EasyInstruct tool, a user-friendly framework for instructing large language models, to complete this task. Please refer to Chapter 1 for the environment and data.
After setting the parameters, simply run the `run.py` file:
>> python run.py
Input and output examples for making predictions using ChatGPT:
Input | Output |
---|---|
task="rte" language="ch" engine="gpt-3.5-turbo" text_input="2006年,弗雷泽出战中国天津举行的女子水球世界杯,协助国家队夺得冠军。2008年,弗雷泽代表澳大利亚参加北京奥运会女子水球比赛,赢得铜牌。" instruction="使用自然语言抽取三元组,已知下列句子,请从句子中抽取出可能的实体、关系,抽取实体类型为{'专业','时间','人类','组织','地理地区','事件'},关系类型为{'体育运动','包含行政领土','参加','国家','邦交国','夺得','举办地点','属于','获奖'},你可以先识别出实体再判断实体之间的关系,以(头实体,关系,尾实体)的形式回答" | [[弗雷泽,获奖,铜牌],[女子水球世界杯,举办地点,天津],[弗雷泽,属于,国家队],[弗雷泽,国家,澳大利亚],[弗雷泽,参加,北京奥运会女子水球比赛],[中国,包含行政领土,天津],[中国,邦交国,澳大利亚],[北京奥运会女子水球比赛,举办地点,北京],[女子水球世界杯,体育运动,水球],[国家队,夺得,冠军]] |
We conducted a simple 5-shot in-context learning evaluation on the CCKS dataset using ChatGPT, and the results are shown in the table below:
Metric | Result |
---|---|
F1 | 0.3995 |
Rougen_2 | 0.7730 |
score (0.5*F1+0.5*Rougen_2) | 0.5863 |
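The final score is an equal-weight average of the two metrics, following the formula in the table. As a small worked check (the function name `combined_score` is just illustrative):

```python
def combined_score(f1, rouge_2):
    """Equal-weight average of F1 and Rouge-2, as in the table's score formula."""
    return 0.5 * f1 + 0.5 * rouge_2


score = combined_score(0.3995, 0.7730)
print(f"{score:.5f}")
```

With the reported values this gives roughly 0.58625, which matches the 0.5863 shown in the table after rounding to four decimal places.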