Add support for JMRD by ericgpks · Pull Request #256 · red-data-tools/red-datasets

ericgpks · 2026-04-05T14:03:48Z

In #240, there are a lot of dialogs.
I chose JMRD to add as a new dataset.
This PR also includes type definitions with RBS.

Add type signatures for JMRD class and its nested Struct types (Dialogue, Questionnaire, QuestionnaireAnswers, Knowledge, Utterance, CheckedKnowledge)

ericgpks · 2026-04-05T14:05:40Z

+    Knowledge = Struct.new(
+      :title,
+      :year,
+      :director_name,
+      :director_description,
+      :cast_names,
+      :cast_descriptions,
+      :genres,
+      :reviews,
+      :synopsis
+    )


As I see the data in JSON, there is data like [知識なし]": "[知識なし]" in the knowledge section.
But I feel that it has no meaning. So I skipped it.

It seems that [知識なし] has a special meaning:

https://github.com/ku-nlp/JMRD?tab=readme-ov-file#format

knowledge: External knowledge used in the dialogue. Only the recommender has access to this knowledge. [知識なし] indicates a special option chosen when the recommender does not use any knowledge, such as greetings.

It seems that knowledge has all options in checked_knowledge. So it seems that we should keep it.

It seems that the following should be satisfied:

dialogue.utterances.all? do |utterance| dialog.knowledge.respond_to?(utterance.checked_knowledge.type) end

kou · 2026-04-06T00:39:22Z

+    speaker_label = utterance.speaker == "recommender" ? "[R]" : "[S]"
+    puts "  #{speaker_label} #{utterance.text}"
+
+    if utterance.checked_knowledge && !utterance.checked_knowledge.empty?


We can simplify this:

Suggested change

if utterance.checked_knowledge && !utterance.checked_knowledge.empty?

if utterance.checked_knowledge&.any?

kou · 2026-04-06T06:09:20Z

+      yield json_data
+    end
+
+    def parse_dialogue(data)


How about using other word than parse because this doesn't parse anything? It just builds an object from data. build_dialogue?

kou · 2026-04-06T06:11:05Z

+      checked_knowledge = nil
+      if data["checked_knowledge"]
+        checked_knowledge = data["checked_knowledge"].map do |ck|
+          CheckedKnowledge.new(ck["type"], ck["content"])
+        end
+      end


How about creating parse_checked_knowledge like others?

kou · 2026-04-06T06:22:17Z

+    Knowledge = Struct.new(
+      :title,
+      :year,
+      :director_name,
+      :director_description,
+      :cast_names,
+      :cast_descriptions,
+      :genres,
+      :reviews,
+      :synopsis
+    )


It seems that [知識なし] has a special meaning:

https://github.com/ku-nlp/JMRD?tab=readme-ov-file#format

knowledge: External knowledge used in the dialogue. Only the recommender has access to this knowledge. [知識なし] indicates a special option chosen when the recommender does not use any knowledge, such as greetings.

It seems that knowledge has all options in checked_knowledge. So it seems that we should keep it.

It seems that the following should be satisfied:

dialogue.utterances.all? do |utterance| dialog.knowledge.respond_to?(utterance.checked_knowledge.type) end

kou · 2026-04-06T06:25:28Z

+      assert_equal(4575, dialogues.size)
+
+      first_dialogue = dialogues[0]
+      assert_equal("01884", first_dialogue.dialog_id)
+      assert_equal("時をかける少女", first_dialogue.movie_title)
+      assert_equal("recommender", first_dialogue.first_speaker)
+
+      # Check questionnaire
+      assert_not_nil(first_dialogue.questionnaire)
+      assert_equal(5, first_dialogue.questionnaire.recommender.q1)
+      assert_equal(4, first_dialogue.questionnaire.seeker.q1)
+
+      # Check knowledge
+      assert_not_nil(first_dialogue.knowledge)
+      assert_equal("時をかける少女", first_dialogue.knowledge.title)
+      assert_equal("２００６年", first_dialogue.knowledge.year)
+      assert_equal("細田守", first_dialogue.knowledge.director_name)
+
+      # Check utterances
+      assert_equal(26, first_dialogue.utterances.size)
+      assert_equal("01884_00", first_dialogue.utterances[0].utterance_id)
+      assert_equal("recommender", first_dialogue.utterances[0].speaker)
+      assert_equal("こんにちは", first_dialogue.utterances[0].text)
+      assert_not_nil(first_dialogue.utterances[0].checked_knowledge)
+      assert_equal(1, first_dialogue.utterances[0].checked_knowledge.size)
+      assert_equal("[知識なし]", first_dialogue.utterances[0].checked_knowledge[0].type)


Could you use one assertion like other test

red-datasets/test/test-adult.rb

Lines 12 to 29 in aa9f29b

assert_equal({

:age => 39,

:work_class => "State-gov",

:final_weight => 77516,

:education => "Bachelors",

:n_education_years => 13,

:marital_status => "Never-married",

:occupation => "Adm-clerical",

:relationship => "Not-in-family",

:race => "White",

:sex => "Male",

:capital_gain => 2174,

:capital_loss => 0,

:hours_per_week => 40,

:native_country => "United-States",

:label => "<=50K"

},

@dataset.each.next.to_h)

for easy to debug?

If we use the one assertion style, all related values are shown in failure message.
If we use the multiple assertions style, one value is only shown in failure message.

The former is easier to debug because we can see all related information.

kou · 2026-04-06T06:29:06Z

+      first_dialogue = dialogues[0]
+      assert_not_nil(first_dialogue.dialog_id)
+      assert_not_nil(first_dialogue.movie_title)
+      assert(["recommender", "seeker"].include?(first_dialogue.first_speaker))


Could you use power assert style or assert_include for easy to debug?

assert do ["recommender", "seeker"].include?(first_dialogue.first_speaker) end

assert_include(["recommender", "seeker"], first_dialogue.first_speaker)

The former is recommended in test-unit.

Copilot

Pull request overview

Adds a new Datasets::JMRD dataset (Japanese Movie Recommendation Dialogue Dataset) to the red-datasets collection, along with tests, an example script, and Steep/RBS type definitions.

Changes:

Implement Datasets::JMRD to download/parse JMRD JSON splits (train/valid/test) into structured dialogue objects.
Add a comprehensive Test::Unit suite validating sizes and representative parsed fields.
Add an example script, lazy-loader registration, README entry, and an RBS signature for the dataset.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`lib/datasets/jmrd.rb`	Implements the JMRD dataset loader/parser and metadata.
`lib/datasets/lazy.rb`	Registers `Datasets::JMRD` for lazy loading.
`test/test-jmrd.rb`	Adds dataset behavior and metadata tests for all splits.
`example/jmrd.rb`	Demonstrates iterating dialogues and printing key fields.
`sig/datasets/jmrd.rbs`	Adds RBS types for JMRD structs and dataset methods.
`README.md`	Lists JMRD among available datasets.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-06T06:34:24Z

+    class Dialogue < Struct[untyped]
+      attr_accessor dialog_id(): String?
+      attr_accessor movie_title(): String?
+      attr_accessor first_speaker(): String?
+      attr_accessor questionnaire(): Questionnaire?
+      attr_accessor knowledge(): Knowledge?
+      attr_accessor utterances(): Array[Utterance]
+
+      def initialize: (
+        ?String? dialog_id,
+        ?String? movie_title,
+        ?String? first_speaker,
+        ?Questionnaire? questionnaire,
+        ?Knowledge? knowledge,
+        ?Array[Utterance] utterances
+      ) -> void
+    end


In the RBS for Dialogue, utterances() is declared as non-nil Array[Utterance], but the class inherits from Struct and its initializer signature marks utterances as optional. With Struct’s positional initializer, omitting the last argument results in utterances being nil, so the current type definition is inconsistent with runtime behavior. Consider either (1) making utterances nilable in both the accessor and initializer types, or (2) changing the Ruby implementation to ensure utterances is always required (e.g., by not exposing the raw Struct initializer / using a custom initializer) so the non-nil type is accurate.

otegami

Thank you for your PR. I left a few comments on the parts that caught my attention.

otegami · 2026-04-06T06:31:45Z

+
+      unless [:train, :valid, :test].include?(type)
+        raise ArgumentError, ":type must be one of [:train, :valid, :test]: #{type.inspect}"
+      end


nit: How about validating "type" before calling super() to avoid unnecessary setup if the argument is
invalid?

https://github.com/red-data-tools/red-datasets/blob/main/lib/datasets/cifar.rb#L28-L36

otegami · 2026-04-06T06:54:09Z

+    end
+
+    def parse_questionnaire(data)
+      return nil if data.nil?


nit: Maybe we don't have to specify here.

Suggested change

return nil if data.nil?

return if data.nil?

ericgpks added 6 commits November 5, 2025 12:19

feat: add JMRD datasets

e6ae201

feat: add new JMRD datasets to references

27913d3

feat: add RBS type definitions for JMRD dataset

a4f72fa

Add type signatures for JMRD class and its nested Struct types (Dialogue, Questionnaire, QuestionnaireAnswers, Knowledge, Utterance, CheckedKnowledge)

fix: change some type to appropriate

762411f

fix: change error message to align with others

a0742a6

Merge branch 'red-data-tools:main' into add-jmrd-dataset

9a59b2c

ericgpks commented Apr 5, 2026

View reviewed changes

ericgpks marked this pull request as ready for review April 5, 2026 14:05

kou reviewed Apr 6, 2026

View reviewed changes

kou requested a review from Copilot April 6, 2026 06:30

Copilot started reviewing on behalf of kou April 6, 2026 06:30 View session

Copilot AI reviewed Apr 6, 2026

View reviewed changes

otegami reviewed Apr 6, 2026

View reviewed changes

	if utterance.checked_knowledge && !utterance.checked_knowledge.empty?
	if utterance.checked_knowledge&.any?

	assert_equal({
	:age => 39,
	:work_class => "State-gov",
	:final_weight => 77516,
	:education => "Bachelors",
	:n_education_years => 13,
	:marital_status => "Never-married",
	:occupation => "Adm-clerical",
	:relationship => "Not-in-family",
	:race => "White",
	:sex => "Male",
	:capital_gain => 2174,
	:capital_loss => 0,
	:hours_per_week => 40,
	:native_country => "United-States",
	:label => "<=50K"
	},
	@dataset.each.next.to_h)

Conversation

ericgpks commented Apr 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

otegami left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants