Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/#23 青空文庫のスクレイプのテストの作成 #35

Draft
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

TakenPt
Copy link

@TakenPt TakenPt commented Apr 20, 2024

No description provided.

Epub/KoeBook.Epub/Services/ScrapingAozoraService.cs Outdated Show resolved Hide resolved
Comment on lines 54 to 58
Assert.Single(document.Chapters);
Assert.Single(document.Chapters[^1].Sections);
Assert.Single(document.Chapters[^1].Sections);
Assert.IsType<Paragraph>(document.Chapters[^1].Sections[^1].Elements[^1]);
if (document.Chapters[^1].Sections[^1].Elements[^1] is Paragraph paragraph)
Copy link

@miyaji255 miyaji255 Apr 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Assert.Single(document.Chapters);
Assert.Single(document.Chapters[^1].Sections);
Assert.Single(document.Chapters[^1].Sections);
Assert.IsType<Paragraph>(document.Chapters[^1].Sections[^1].Elements[^1]);
if (document.Chapters[^1].Sections[^1].Elements[^1] is Paragraph paragraph)
var chapter = Assert.Single(document.Chapters);
var section = Assert.Single(chapter.Sections);
var paragraph = Assert.IsType<Paragraph>(section.Elements[^1]);

Assert.Single(document.Chapters);
Assert.Single(document.Chapters[^1].Sections);
Assert.Equal(expectedParagraphs.Count, document.Chapters[^1].Sections[^1].Elements.Count);
foreach ((var expectedParagraph, var actualElement) in expectedParagraphs.Zip(document.Chapters[^1].Sections[^1].Elements))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.Allを使ってください

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert.Allって順番までチェックできますか?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

引数に取るのはIEnumerable<T>なので大丈夫です

KoeBook.Test/Epub/ScrapingAozoraServiceTest.cs Outdated Show resolved Hide resolved
KoeBook.Test/Epub/ScrapingAozoraServiceTest.cs Outdated Show resolved Hide resolved
KoeBook.Test/Epub/ScrapingAozoraServiceTest.cs Outdated Show resolved Hide resolved
}

[Theory]
[MemberData(nameof(ProcessChildrenlayout1TestCases))]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

html, expectedPragraphText, expectedScriptText, expectedClassNameにすればInlineDataでいけると思います

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProcessChildrenlayout2TestCasesInlineData にした方が良いですか?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

できればそのほうがわかりやすいです

Comment on lines 25 to 27
{
get { return new EpubDocument("", "", "", Guid.NewGuid()) { Chapters = [new Chapter() { Sections = [new Section("") { Elements = [new Paragraph()] }] }] }; }
}
Copy link

@miyaji255 miyaji255 Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ラムダ式を使うと簡単にかけます

Suggested change
{
get { return new EpubDocument("", "", "", Guid.NewGuid()) { Chapters = [new Chapter() { Sections = [new Section("") { Elements = [new Paragraph()] }] }] }; }
}
=> new("", "", "", Guid.NewGuid())
{
Chapters = [
new() {
Sections = [new("") { Elements = [new Paragraph()] }]
}
]
};

Comment on lines +116 to +118
var section = Assert.Single(chapter.Sections);
Assert.Equal(expectedParagraphs.Count, document.Chapters[^1].Sections[^1].Elements.Count);
Assert.All(expectedParagraphs.Zip(document.Chapters[^1].Sections[^1].Elements), v =>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document.Chapters[^1].Sections[^1] == sectionだと思います

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

classのメンバの値が等しいことを == で確認できますか?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

そもそもsectionsの要素がsectionだけであることをAssert.Singleで示しているのでいちいちインデクサを使う必要はないということです。

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

勘違いしてました。理解しました。

Comment on lines +135 to +144
internal class httpClientFactory : IHttpClientFactory
{
public HttpClient CreateClient(string name)
{
return httpClient;
}

private static readonly HttpClient httpClient = new HttpClient();

}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不要だと思います

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

消し忘れです。消します

Comment on lines +57 to +62
var config = Configuration.Default.WithDefaultLoader();
using var context = BrowsingContext.New(config);
var doc = await context.OpenAsync(request => request.Content(html));
var mainText = doc.DocumentElement.LastElementChild?.LastElementChild as IHtmlDivElement;
if (mainText == null)
Assert.Fail();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

こんな感じはどうでしょうか?ここでは準備段階なのでnullになるかどうかはこちらの入力値で変更できます。よってnullチェックは必要ありません。

        var parser = new HtmlParser();
        var docs = parser.ParseDocument(html).Body!.Children.Single();

Comment on lines +80 to +94
// レイアウト2.1 1行だけの字下げ
(@"<div class=""jisage_3"" style=""margin-left: 3em"">text<br /></div><br>", [new Paragraph() { Text = "text", ClassName = "jisage_3", ScriptLine = new ScriptLine("text", "", "") }], [("jisage", (1, 3))]),
// レイアウト2.2 ブロックでの字下げ
(@"<div class=""jisage_3"" style=""margin-left: 3em"">text1<br />text2<br /></div><br>", [new Paragraph() { Text = "text1", ClassName = "jisage_3", ScriptLine = new ScriptLine("text1", "", "") }, new Paragraph() { Text = "text2", ClassName = "jisage_3", ScriptLine = new ScriptLine("text2", "", "") },], [("jisage", (1, 3))]),
// レイアウト2.3 凹凸の複雑な字下げ
(@"<div class=""burasage"" style=""margin-left: 3em; text_indent: -1em;"">Long Text</div>", [new Paragraph() { Text = "Long Text", ClassName = "jisage_3 text_indent_-1" }], [("jisage", (1, 3)), ("text_indent", (-1, 0))]),
// レイアウト2.4 は特定の書き方について述べていないので省略。
// レイアウト2.5 地付き
(@"<div class=""chitsuki_0"" style=""text-align:right; margin-right: 0em"">text</div>", [new Paragraph() { Text = "text", ClassName = "chitsuki_0", ScriptLine = new ScriptLine("text", "", "") }], [("chitsuki", (0, 0))]),


// </div>の後の<br />がないパターン
(@"<div class=""jisage_3"" style=""margin-left: 3em"">text<br /></div>", [new Paragraph() { Text = "text", ClassName = "jisage_3", ScriptLine = new ScriptLine("text", "", "") }], [("jisage", (1, 3))]),
// </div>の前の<br />がないパターン
(@"<div class=""burasage"" style=""margin-left: 1em; text_indent: -1em;"">text</div>", [new Paragraph() { Text = "text", ClassName = "jisage_3 text_indent_-1", ScriptLine = new ScriptLine("text", "", "") }], [("jisage", (1, 3)), ("text_indent", (-1, 0))]),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

見づらいので改行を増やしてください。new()を使えばより短くできると思います。

Comment on lines +104 to +111
var config = Configuration.Default.WithDefaultLoader();
using var context = BrowsingContext.New(config);
var doc = await context.OpenAsync(request => request.Content(html));
var mainText = doc.QuerySelector(".main_text") as IHtmlDivElement;
if (mainText == null)
Assert.Fail();
var document = EmptySingleParagraph;
_scrapingAozoraService._Classes().Clear();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

上と同様の変更を入れてください


[Theory]
[MemberData(nameof(ProcessChildrenlayout2TestCases))]
public async void ProcessChildrenLayout2Test(string html, IReadOnlyCollection<Paragraph> expectedParagraphs, IEnumerable<(string key, (int min, int max) value)> expectedDictionary)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

確かに厳密にはReadonlyにするほうが良いですが、範囲が限定的なので配列にしたほうが見やすいと思います。

{
using var context = BrowsingContext.New(Configuration.Default);
using var doc = await context.OpenAsync(req => req.Content(input));
Assert.NotNull(doc.ParentElement);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

今見返すとミスだったので上と同様に修正してほしいです。

/// </summary>
/// <param name="document">書き込むEpubDocument</param>
/// <param name="mainText">class = "main_text" なdiv要素</param>
internal void ProcessMainText(EpubDocument document, IHtmlDivElement mainText)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

divである必要はありますか?キャストが増えてしまう割に対した制約が増えないのでいらないと思います

/// <param name="element">処理を行う要素</param>
/// <param name="appliedClasses">適用されるclassのリスト</param>
/// <param name="scrapingInfo"></param>
internal void ProcessChildren(EpubDocument document, IElement element, string appliedClasses, ref int headingId, SplittedLineBuilder paragraphLineBuilder, SplittedLineBuilder scriptLineLineBuilder, Dictionary<string, (int min, int max)> classes)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

headingIdは戻り値にすることで対応できませんか?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

青空文庫のスクレイピングにおける読み込みの方法の変更
2 participants