Project author: BYU-PCCL

Project description:
Conversational dataset from the Chit-Chat Challenge
Language: Python
Repository: git://github.com/BYU-PCCL/chitchat-dataset.git
Created: 2019-05-30T01:25:20Z
Project community: https://github.com/BYU-PCCL/chitchat-dataset

License: MIT License


chitchat-dataset

Open-domain conversational dataset from the BYU
Perception, Control & Cognition lab’s Chit-Chat Challenge.

install

    pip3 install chitchat_dataset

or simply download the raw dataset:

    curl -LO https://raw.githubusercontent.com/BYU-PCCL/chitchat-dataset/master/chitchat_dataset/dataset.json
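
If you take the raw-download route, the file can be read with nothing but the standard library. This is a minimal sketch, assuming dataset.json was saved to the current directory and is UTF-8 encoded JSON; `load_raw_dataset` is a hypothetical helper name, not part of the package.

```python
import json
from pathlib import Path


def load_raw_dataset(path="dataset.json"):
    """Read the raw JSON file into a plain dict (hypothetical helper)."""
    # Assumption: the file is UTF-8 encoded JSON.
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_raw_dataset()` should then return an ordinary dict keyed by conversation UUID, which can be iterated like any other mapping.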

usage

More formal docs should be coming soon, but for now, see chitchat_dataset/__init__.py for more options.

    import chitchat_dataset as ccc

    dataset = ccc.Dataset()

    # Dataset is a subclass of dict()
    for convo_id, convo in dataset.items():
        print(convo_id, convo)

Or get the messages in a flat list:

    messages = list(ccc.MessageDataset())

See examples/ for other languages.

stats

  • 7,168 conversations
  • 258,145 utterances
  • 1,315 unique participants
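
Counts like these can be recomputed from the mapping itself. The sketch below assumes that an "utterance" is a single message object and that a "participant" is a distinct sender value; the README does not say how the published numbers were derived, so treat this as one plausible reading. `dataset_stats` is a hypothetical helper, and the tiny inline sample is made up and only stands in for the real data.

```python
def dataset_stats(dataset):
    """Count conversations, messages, and distinct senders (hypothetical helper)."""
    n_messages = 0
    senders = set()
    for convo in dataset.values():
        for turn in convo["messages"]:  # each turn is a list of messages
            for message in turn:
                n_messages += 1
                senders.add(message["sender"])
    return {
        "conversations": len(dataset),
        "utterances": n_messages,  # assumption: one utterance per message
        "participants": len(senders),
    }


# Made-up sample in the documented layout, not real data.
sample = {
    "convo-1": {
        "messages": [
            [{"text": "Hello", "sender": "a"}],
            [
                {"text": "Hi", "sender": "b"},
                {"text": "How are you?", "sender": "b"},
            ],
        ]
    }
}
print(dataset_stats(sample))
```

If the published figures count turns rather than individual messages, the inner loop would need to increment once per turn instead.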

format

The dataset is a mapping from conversation UUID to a conversation. A single conversation looks like this:

    {
      "prompt": "What's the most interesting thing you've learned recently?",
      "ratings": { "witty": "1", "int": 5, "upbeat": 5 },
      "start": "2018-04-20T01:57:41",
      "messages": [
        [
          {
            "text": "Hello",
            "timestamp": "2018-04-19T19:57:51",
            "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
          }
        ],
        [
          {
            "text": "I learned that the Queen of England's last corgi died",
            "timestamp": "2018-04-19T19:58:14",
            "sender": "bebad07e-15df-48c3-a04f-67db828503e3"
          }
        ],
        [
          {
            "text": "Wow that sounds so sad",
            "timestamp": "2018-04-19T19:58:18",
            "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
          },
          {
            "text": "was it a cardigan welsh corgi",
            "timestamp": "2018-04-19T19:58:22",
            "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
          },
          {
            "text": "?",
            "timestamp": "2018-04-19T19:58:24",
            "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
          }
        ]
      ]
    }

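Each entry in "messages" is a turn: a list of one or more consecutive messages from the same sender. This makes it convenient to represent multi-message conversational turns while preserving the structure and flow of the conversation.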
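
Because the messages are grouped by turn, a conversation can be collapsed into one line per turn. This is a minimal sketch, assuming all messages within a turn share a sender, as they do in the example above; `turns_as_text` is a hypothetical helper, not part of the package, and the sample is a trimmed copy of that example with timestamps omitted.

```python
def turns_as_text(convo, sep=" "):
    """Collapse each turn into a (sender, joined_text) pair (hypothetical helper)."""
    result = []
    for turn in convo["messages"]:
        if not turn:  # skip empty turns defensively
            continue
        sender = turn[0]["sender"]  # assumption: one sender per turn
        text = sep.join(message["text"] for message in turn)
        result.append((sender, text))
    return result


# Trimmed copy of the example conversation (timestamps omitted).
A = "22578ac2-6317-44d5-8052-0a59076e0b96"
B = "bebad07e-15df-48c3-a04f-67db828503e3"
convo = {
    "messages": [
        [{"text": "Hello", "sender": A}],
        [{"text": "I learned that the Queen of England's last corgi died", "sender": B}],
        [
            {"text": "Wow that sounds so sad", "sender": A},
            {"text": "was it a cardigan welsh corgi", "sender": A},
            {"text": "?", "sender": A},
        ],
    ]
}

for sender, text in turns_as_text(convo):
    print(sender[:8], text)
```

Joining with a space is only one choice; passing a different `sep` (a newline, for instance) keeps the individual messages visually distinct.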

how to cite

If you extend or use this work, please cite the paper where it was introduced:

    @article{myers2020conversational,
      title={Conversational Scaffolding: An Analogy-Based Approach to Response Prioritization in Open-Domain Dialogs},
      author={Myers, Will and Etchart, Tyler and Fulda, Nancy},
      year={2020}
    }