Conversation
Add type signatures for JMRD class and its nested Struct types (Dialogue, Questionnaire, QuestionnaireAnswers, Knowledge, Utterance, CheckedKnowledge)
| Knowledge = Struct.new( | ||
| :title, | ||
| :year, | ||
| :director_name, | ||
| :director_description, | ||
| :cast_names, | ||
| :cast_descriptions, | ||
| :genres, | ||
| :reviews, | ||
| :synopsis | ||
| ) |
There was a problem hiding this comment.
As I see the data in JSON, there is data like [知識なし]": "[知識なし]" in the knowledge section.
But I feel that it has no meaning. So I skipped it.
There was a problem hiding this comment.
It seems that [知識なし] has a special meaning:
https://github.com/ku-nlp/JMRD?tab=readme-ov-file#format
knowledge: External knowledge used in the dialogue. Only the recommender has access to this knowledge.[知識なし]indicates a special option chosen when the recommender does not use any knowledge, such as greetings.
It seems that knowledge has all options in checked_knowledge. So it seems that we should keep it.
It seems that the following should be satisfied:
dialogue.utterances.all? do |utterance|
dialog.knowledge.respond_to?(utterance.checked_knowledge.type)
end| speaker_label = utterance.speaker == "recommender" ? "[R]" : "[S]" | ||
| puts " #{speaker_label} #{utterance.text}" | ||
|
|
||
| if utterance.checked_knowledge && !utterance.checked_knowledge.empty? |
There was a problem hiding this comment.
We can simplify this:
| if utterance.checked_knowledge && !utterance.checked_knowledge.empty? | |
| if utterance.checked_knowledge&.any? |
| yield json_data | ||
| end | ||
|
|
||
| def parse_dialogue(data) |
There was a problem hiding this comment.
How about using other word than parse because this doesn't parse anything? It just builds an object from data. build_dialogue?
| checked_knowledge = nil | ||
| if data["checked_knowledge"] | ||
| checked_knowledge = data["checked_knowledge"].map do |ck| | ||
| CheckedKnowledge.new(ck["type"], ck["content"]) | ||
| end | ||
| end |
There was a problem hiding this comment.
How about creating parse_checked_knowledge like others?
| Knowledge = Struct.new( | ||
| :title, | ||
| :year, | ||
| :director_name, | ||
| :director_description, | ||
| :cast_names, | ||
| :cast_descriptions, | ||
| :genres, | ||
| :reviews, | ||
| :synopsis | ||
| ) |
There was a problem hiding this comment.
It seems that [知識なし] has a special meaning:
https://github.com/ku-nlp/JMRD?tab=readme-ov-file#format
knowledge: External knowledge used in the dialogue. Only the recommender has access to this knowledge.[知識なし]indicates a special option chosen when the recommender does not use any knowledge, such as greetings.
It seems that knowledge has all options in checked_knowledge. So it seems that we should keep it.
It seems that the following should be satisfied:
dialogue.utterances.all? do |utterance|
dialog.knowledge.respond_to?(utterance.checked_knowledge.type)
end| assert_equal(4575, dialogues.size) | ||
|
|
||
| first_dialogue = dialogues[0] | ||
| assert_equal("01884", first_dialogue.dialog_id) | ||
| assert_equal("時をかける少女", first_dialogue.movie_title) | ||
| assert_equal("recommender", first_dialogue.first_speaker) | ||
|
|
||
| # Check questionnaire | ||
| assert_not_nil(first_dialogue.questionnaire) | ||
| assert_equal(5, first_dialogue.questionnaire.recommender.q1) | ||
| assert_equal(4, first_dialogue.questionnaire.seeker.q1) | ||
|
|
||
| # Check knowledge | ||
| assert_not_nil(first_dialogue.knowledge) | ||
| assert_equal("時をかける少女", first_dialogue.knowledge.title) | ||
| assert_equal("2006年", first_dialogue.knowledge.year) | ||
| assert_equal("細田守", first_dialogue.knowledge.director_name) | ||
|
|
||
| # Check utterances | ||
| assert_equal(26, first_dialogue.utterances.size) | ||
| assert_equal("01884_00", first_dialogue.utterances[0].utterance_id) | ||
| assert_equal("recommender", first_dialogue.utterances[0].speaker) | ||
| assert_equal("こんにちは", first_dialogue.utterances[0].text) | ||
| assert_not_nil(first_dialogue.utterances[0].checked_knowledge) | ||
| assert_equal(1, first_dialogue.utterances[0].checked_knowledge.size) | ||
| assert_equal("[知識なし]", first_dialogue.utterances[0].checked_knowledge[0].type) |
There was a problem hiding this comment.
Could you use one assertion like other test
red-datasets/test/test-adult.rb
Lines 12 to 29 in aa9f29b
If we use the one assertion style, all related values are shown in failure message.
If we use the multiple assertions style, one value is only shown in failure message.
The former is easier to debug because we can see all related information.
| first_dialogue = dialogues[0] | ||
| assert_not_nil(first_dialogue.dialog_id) | ||
| assert_not_nil(first_dialogue.movie_title) | ||
| assert(["recommender", "seeker"].include?(first_dialogue.first_speaker)) |
There was a problem hiding this comment.
Could you use power assert style or assert_include for easy to debug?
assert do
["recommender", "seeker"].include?(first_dialogue.first_speaker)
endassert_include(["recommender", "seeker"], first_dialogue.first_speaker)The former is recommended in test-unit.
There was a problem hiding this comment.
Pull request overview
Adds a new Datasets::JMRD dataset (Japanese Movie Recommendation Dialogue Dataset) to the red-datasets collection, along with tests, an example script, and Steep/RBS type definitions.
Changes:
- Implement
Datasets::JMRDto download/parse JMRD JSON splits (train/valid/test) into structured dialogue objects. - Add a comprehensive Test::Unit suite validating sizes and representative parsed fields.
- Add an example script, lazy-loader registration, README entry, and an RBS signature for the dataset.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
lib/datasets/jmrd.rb |
Implements the JMRD dataset loader/parser and metadata. |
lib/datasets/lazy.rb |
Registers Datasets::JMRD for lazy loading. |
test/test-jmrd.rb |
Adds dataset behavior and metadata tests for all splits. |
example/jmrd.rb |
Demonstrates iterating dialogues and printing key fields. |
sig/datasets/jmrd.rbs |
Adds RBS types for JMRD structs and dataset methods. |
README.md |
Lists JMRD among available datasets. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| class Dialogue < Struct[untyped] | ||
| attr_accessor dialog_id(): String? | ||
| attr_accessor movie_title(): String? | ||
| attr_accessor first_speaker(): String? | ||
| attr_accessor questionnaire(): Questionnaire? | ||
| attr_accessor knowledge(): Knowledge? | ||
| attr_accessor utterances(): Array[Utterance] | ||
|
|
||
| def initialize: ( | ||
| ?String? dialog_id, | ||
| ?String? movie_title, | ||
| ?String? first_speaker, | ||
| ?Questionnaire? questionnaire, | ||
| ?Knowledge? knowledge, | ||
| ?Array[Utterance] utterances | ||
| ) -> void | ||
| end |
There was a problem hiding this comment.
In the RBS for Dialogue, utterances() is declared as non-nil Array[Utterance], but the class inherits from Struct and its initializer signature marks utterances as optional. With Struct’s positional initializer, omitting the last argument results in utterances being nil, so the current type definition is inconsistent with runtime behavior. Consider either (1) making utterances nilable in both the accessor and initializer types, or (2) changing the Ruby implementation to ensure utterances is always required (e.g., by not exposing the raw Struct initializer / using a custom initializer) so the non-nil type is accurate.
otegami
left a comment
There was a problem hiding this comment.
Thank you for your PR. I left a few comments on the parts that caught my attention.
|
|
||
| unless [:train, :valid, :test].include?(type) | ||
| raise ArgumentError, ":type must be one of [:train, :valid, :test]: #{type.inspect}" | ||
| end |
There was a problem hiding this comment.
nit: How about validating "type" before calling super() to avoid unnecessary setup if the argument is
invalid?
| end | ||
|
|
||
| def parse_questionnaire(data) | ||
| return nil if data.nil? |
There was a problem hiding this comment.
nit: Maybe we don't have to specify here.
| return nil if data.nil? | |
| return if data.nil? |
In #240, there are a lot of dialogs.
I chose JMRD to add as a new dataset.
This PR also includes type definitions with RBS.