
LogicInference Dataset

This repository contains the Python code used to generate the LogicInference dataset. LogicInference is a dataset designed to evaluate the ability of models to perform logical inference. The dataset focuses on inference using propositional logic and a small subset of first-order logic, represented both in semi-formal logical notation, and in natural language. LogicInference has two main long-term goals: (1) to evaluate the ability of models to perform logical inference, and the degree to which inference chains are real or hallucinated, and (2) to assess whether learning logical inference abilities in the abstract (e.g., getting better in this dataset) would then transfer to other real-world tasks.

Note: to run this code you also need the other files from the original LogicInference project (linked below). The generate_dataset script in this directory is a drop-in replacement for the original generate_dataset script; it outputs data in the Open Assistant instruct format.

For a detailed description of the dataset, please check the following paper: https://openreview.net/pdf?id=HAGeIS_Lcg9 (arXiv preprint: https://arxiv.org/abs/2203.15099 )

Please cite as:

@inproceedings{ontanon2022logicinference,
  url = {https://openreview.net/pdf?id=HAGeIS_Lcg9},
  author = {Onta\~{n}\'{o}n, Santiago and Ainslie, Joshua and Cvicek, Vaclav and Fisher, Zachary},
  title = {{LogicInference}: A New Dataset for Teaching Logical Inference to seq2seq Models},
  booktitle={Proceedings of ICLR 2022 workshop on Objects, Structure and Causality},
  year={2022}
}

This is a reproduction of the dataset from the LogicInference paper: https://openreview.net/pdf?id=HAGeIS_Lcg9.

The GitHub page of the original LogicInference dataset: https://github.com/google-research/google-research/tree/master/logic_inference_dataset.

This dataset aims to provide additional data for the Open Assistant project. Following its format requirements, there are three columns: INSTRUCTION, RESPONSE, and SOURCE.
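
For illustration, a single record might look like the sketch below. This is a hypothetical example: the INSTRUCTION/RESPONSE wording and the SOURCE value are made up for illustration, not verbatim output of generate_dataset.py.

```python
# Hypothetical record in the three-column format described above.
example_record = {
    "INSTRUCTION": (
        "Consider the following premises: p -> q. p. "
        "Can we infer q from them?"
    ),
    "RESPONSE": "Yes, q can be inferred via modus ponens.",
    "SOURCE": "LogicInference",  # illustrative source tag
}
```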

The results in this dataset differ slightly from those reported in the original paper:

1. Of the three splits (IID/OOD/length), only the IID split is used. In the original paper, models seemed to reach better performance with data generated by this split method.

2. In the original paper, there are two forms of responses: LOGICINFERENCEb (with the answer at the beginning) and LOGICINFERENCEe (with the answer at the end). This dataset uses LOGICINFERENCEe, meaning that for every question the model first works through the logical inference and only gives the final answer at the end.
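
To make the LOGICINFERENCEe ordering concrete, a pair could look like the following (contents are made up for illustration, not actual dataset output):

```python
# Illustrative LOGICINFERENCEe-style pair: the inference chain comes first,
# and the final answer appears at the very end of the response.
instruction = (
    "Consider the following premises: p -> q. q -> r. p. "
    "Can we infer r from them?"
)
response = (
    "From p -> q and p we can infer q via modus ponens. "
    "From q -> r and q we can infer r via modus ponens. "
    "Therefore, the answer is yes, we can infer r."
)
```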

3. In the original paper, some parameters in generate_dataset.py are:

N_INFERENCE_PROBLEMS = 5000

N_VARIATIONS = 25

N_EXAMPLES = 200000

TRAIN_RATIO = 0.9

LENGTH_SPLIT_THRESHOLD = 4

RANDOM_SEED = 0

I chose some new parameters (see the sketch after this list for what TRAIN_RATIO = 1 implies):

N_INFERENCE_PROBLEMS = 10000

N_VARIATIONS = 25

N_EXAMPLES = 55000

TRAIN_RATIO = 1

LENGTH_SPLIT_THRESHOLD = 4

RANDOM_SEED = 1111
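
The following is a minimal sketch of what TRAIN_RATIO = 1 implies, assuming a simple shuffle-and-slice split; the actual logic lives in generate_dataset.py and may differ in detail.

```python
import random

# Assumed shuffle-and-slice split; the real generate_dataset.py may differ.
N_EXAMPLES = 55000
TRAIN_RATIO = 1.0   # with ratio 1, every example lands in the training split
RANDOM_SEED = 1111

examples = list(range(N_EXAMPLES))  # stand-in for generated pairs
random.Random(RANDOM_SEED).shuffle(examples)

n_train = int(len(examples) * TRAIN_RATIO)
train, test = examples[:n_train], examples[n_train:]
print(len(train), len(test))  # -> 55000 0 (no held-out split)
```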

The original script generated 4,814 distinct inference problems and extended them to around 200,000 Q-A pairs (roughly 40 variations per problem). My settings generated 5,491 distinct inference problems and extended them to around 54,607 Instruction-Response pairs (roughly 10 variations per problem). I think that for the Open Assistant project the number of distinct inference problems matters more; generating many similar Instruction-Response pairs would only add training time without adding much value.

4. Only the generate_dataset.py file is kept in this directory, because the code style of the original project does not fit the OA project, which requires flake8-compliant code. Only the formatting of generate_dataset.py was changed; its logic is unchanged.