Skip to content

Add support for JMRD#256

Open
ericgpks wants to merge 6 commits intored-data-tools:mainfrom
ericgpks:add-jmrd-dataset
Open

Add support for JMRD#256
ericgpks wants to merge 6 commits intored-data-tools:mainfrom
ericgpks:add-jmrd-dataset

Conversation

@ericgpks
Copy link
Copy Markdown
Contributor

@ericgpks ericgpks commented Apr 5, 2026

In #240, there are a lot of dialogs.
I chose JMRD to add as a new dataset.
This PR also includes type definitions with RBS.

Comment thread lib/datasets/jmrd.rb
Comment on lines +29 to +39
Knowledge = Struct.new(
:title,
:year,
:director_name,
:director_description,
:cast_names,
:cast_descriptions,
:genres,
:reviews,
:synopsis
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see the data in JSON, there is data like [知識なし]": "[知識なし]" in the knowledge section.
But I feel that it has no meaning. So I skipped it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that [知識なし] has a special meaning:

https://github.com/ku-nlp/JMRD?tab=readme-ov-file#format

knowledge: External knowledge used in the dialogue. Only the recommender has access to this knowledge. [知識なし] indicates a special option chosen when the recommender does not use any knowledge, such as greetings.

It seems that knowledge has all options in checked_knowledge. So it seems that we should keep it.

It seems that the following should be satisfied:

dialogue.utterances.all? do |utterance|
  dialog.knowledge.respond_to?(utterance.checked_knowledge.type)
end

@ericgpks ericgpks marked this pull request as ready for review April 5, 2026 14:05
Comment thread example/jmrd.rb
speaker_label = utterance.speaker == "recommender" ? "[R]" : "[S]"
puts " #{speaker_label} #{utterance.text}"

if utterance.checked_knowledge && !utterance.checked_knowledge.empty?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify this:

Suggested change
if utterance.checked_knowledge && !utterance.checked_knowledge.empty?
if utterance.checked_knowledge&.any?

Comment thread lib/datasets/jmrd.rb
yield json_data
end

def parse_dialogue(data)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using other word than parse because this doesn't parse anything? It just builds an object from data. build_dialogue?

Comment thread lib/datasets/jmrd.rb
Comment on lines +163 to +168
checked_knowledge = nil
if data["checked_knowledge"]
checked_knowledge = data["checked_knowledge"].map do |ck|
CheckedKnowledge.new(ck["type"], ck["content"])
end
end
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating parse_checked_knowledge like others?

Comment thread lib/datasets/jmrd.rb
Comment on lines +29 to +39
Knowledge = Struct.new(
:title,
:year,
:director_name,
:director_description,
:cast_names,
:cast_descriptions,
:genres,
:reviews,
:synopsis
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that [知識なし] has a special meaning:

https://github.com/ku-nlp/JMRD?tab=readme-ov-file#format

knowledge: External knowledge used in the dialogue. Only the recommender has access to this knowledge. [知識なし] indicates a special option chosen when the recommender does not use any knowledge, such as greetings.

It seems that knowledge has all options in checked_knowledge. So it seems that we should keep it.

It seems that the following should be satisfied:

dialogue.utterances.all? do |utterance|
  dialog.knowledge.respond_to?(utterance.checked_knowledge.type)
end

Comment thread test/test-jmrd.rb
Comment on lines +7 to +32
assert_equal(4575, dialogues.size)

first_dialogue = dialogues[0]
assert_equal("01884", first_dialogue.dialog_id)
assert_equal("時をかける少女", first_dialogue.movie_title)
assert_equal("recommender", first_dialogue.first_speaker)

# Check questionnaire
assert_not_nil(first_dialogue.questionnaire)
assert_equal(5, first_dialogue.questionnaire.recommender.q1)
assert_equal(4, first_dialogue.questionnaire.seeker.q1)

# Check knowledge
assert_not_nil(first_dialogue.knowledge)
assert_equal("時をかける少女", first_dialogue.knowledge.title)
assert_equal("2006年", first_dialogue.knowledge.year)
assert_equal("細田守", first_dialogue.knowledge.director_name)

# Check utterances
assert_equal(26, first_dialogue.utterances.size)
assert_equal("01884_00", first_dialogue.utterances[0].utterance_id)
assert_equal("recommender", first_dialogue.utterances[0].speaker)
assert_equal("こんにちは", first_dialogue.utterances[0].text)
assert_not_nil(first_dialogue.utterances[0].checked_knowledge)
assert_equal(1, first_dialogue.utterances[0].checked_knowledge.size)
assert_equal("[知識なし]", first_dialogue.utterances[0].checked_knowledge[0].type)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use one assertion like other test

assert_equal({
:age => 39,
:work_class => "State-gov",
:final_weight => 77516,
:education => "Bachelors",
:n_education_years => 13,
:marital_status => "Never-married",
:occupation => "Adm-clerical",
:relationship => "Not-in-family",
:race => "White",
:sex => "Male",
:capital_gain => 2174,
:capital_loss => 0,
:hours_per_week => 40,
:native_country => "United-States",
:label => "<=50K"
},
@dataset.each.next.to_h)
for easy to debug?

If we use the one assertion style, all related values are shown in failure message.
If we use the multiple assertions style, one value is only shown in failure message.

The former is easier to debug because we can see all related information.

Comment thread test/test-jmrd.rb
first_dialogue = dialogues[0]
assert_not_nil(first_dialogue.dialog_id)
assert_not_nil(first_dialogue.movie_title)
assert(["recommender", "seeker"].include?(first_dialogue.first_speaker))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use power assert style or assert_include for easy to debug?

assert do
  ["recommender", "seeker"].include?(first_dialogue.first_speaker)
end
assert_include(["recommender", "seeker"], first_dialogue.first_speaker)

The former is recommended in test-unit.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Datasets::JMRD dataset (Japanese Movie Recommendation Dialogue Dataset) to the red-datasets collection, along with tests, an example script, and Steep/RBS type definitions.

Changes:

  • Implement Datasets::JMRD to download/parse JMRD JSON splits (train/valid/test) into structured dialogue objects.
  • Add a comprehensive Test::Unit suite validating sizes and representative parsed fields.
  • Add an example script, lazy-loader registration, README entry, and an RBS signature for the dataset.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
lib/datasets/jmrd.rb Implements the JMRD dataset loader/parser and metadata.
lib/datasets/lazy.rb Registers Datasets::JMRD for lazy loading.
test/test-jmrd.rb Adds dataset behavior and metadata tests for all splits.
example/jmrd.rb Demonstrates iterating dialogues and printing key fields.
sig/datasets/jmrd.rbs Adds RBS types for JMRD structs and dataset methods.
README.md Lists JMRD among available datasets.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sig/datasets/jmrd.rbs
Comment on lines +74 to +90
class Dialogue < Struct[untyped]
attr_accessor dialog_id(): String?
attr_accessor movie_title(): String?
attr_accessor first_speaker(): String?
attr_accessor questionnaire(): Questionnaire?
attr_accessor knowledge(): Knowledge?
attr_accessor utterances(): Array[Utterance]

def initialize: (
?String? dialog_id,
?String? movie_title,
?String? first_speaker,
?Questionnaire? questionnaire,
?Knowledge? knowledge,
?Array[Utterance] utterances
) -> void
end
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the RBS for Dialogue, utterances() is declared as non-nil Array[Utterance], but the class inherits from Struct and its initializer signature marks utterances as optional. With Struct’s positional initializer, omitting the last argument results in utterances being nil, so the current type definition is inconsistent with runtime behavior. Consider either (1) making utterances nilable in both the accessor and initializer types, or (2) changing the Ruby implementation to ensure utterances is always required (e.g., by not exposing the raw Struct initializer / using a custom initializer) so the non-nil type is accurate.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@otegami otegami left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR. I left a few comments on the parts that caught my attention.

Comment thread lib/datasets/jmrd.rb

unless [:train, :valid, :test].include?(type)
raise ArgumentError, ":type must be one of [:train, :valid, :test]: #{type.inspect}"
end
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about validating "type" before calling super() to avoid unnecessary setup if the argument is
invalid?

Comment thread lib/datasets/jmrd.rb
Comment thread lib/datasets/jmrd.rb
end

def parse_questionnaire(data)
return nil if data.nil?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe we don't have to specify here.

Suggested change
return nil if data.nil?
return if data.nil?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants