Issue #2
- Sergiy Moskalenko
- First of all, you did great research work.
But in practice we have a few problems:
1) If we remove the author's name from an article, part of the essential content becomes unclear.
If we remove the part of an article in square brackets, we remove essential information about the author.
If we keep the author's name in square brackets, the information is duplicated unnecessarily.
2) Sometimes the position of the author's name does not match your pattern. In that case, the name is also missing from your IV page. For example:
- http://star.mk.co.kr/v2/view.php?sc=41000021&cm=%B9%E6%BC%DB&year=2017&no=442391&relatedcode=&mc=
- http://star.mk.co.kr/v2/view.php?sc=41000021&cm=%B9%E6%BC%DB&year=2017&no=442459&relatedcode=&mc=
- http://star.mk.co.kr/v2/view.php?sc=41000021&cm=%B9%E6%BC%DB&year=2017&no=442704&relatedcode=&mc=
- http://star.mk.co.kr/v2/view.php?sc=41000021&cm=%B9%E6%BC%DB&year=2017&no=441824&relatedcode=&mc=
- http://star.mk.co.kr/v2/view.php?sc=42300040&cm=%B3%AF%B0%B3_%C8%AD%C1%A6&year=2017&
- Declined by admin
- Type of issue
- IV page is missing essential content
- Reported
- Jun 17, 2017
There is a recurring pattern on this site: if an article begins with a string in square brackets, that string contains the author's name.
Let's examine this one [매일경제 스타투데이 황승빈 인턴기자]:
• "매일경제 스타투데이" is the name of the source, translating as "Daily Economic Star Today"
• "황승빈" is the author's name itself ("Hwang Seung-bin")
• Finally, "인턴기자" translates as "intern reporter", which is the title of the author (sometimes that last bit just says "reporter")
Add this to the fact that 99% (not kidding, look it up) of Korean names have only three syllables and you've got yourself a neat regex.
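
To make the idea concrete, here is a rough sketch of such a regex in Python. It is only an illustration of the pattern described above, not the actual IV template rule: the names BYLINE_RE and parse_byline are made up for this example, and it assumes the bracketed string always has the shape "[source name title]" with a name of exactly three Hangul syllables and a title of either "기자" or "인턴기자".

```python
import re

# Assumed byline shape: "[<source> <three-syllable name> <title>]"
# at the very start of the article text.
BYLINE_RE = re.compile(
    r"^\[(?P<source>[^\]]+?)\s+"    # source name, e.g. "매일경제 스타투데이"
    r"(?P<name>[가-힣]{3})\s+"       # exactly three Hangul syllables
    r"(?P<title>(?:인턴)?기자)\]"     # "기자" (reporter) or "인턴기자" (intern reporter)
)

def parse_byline(text):
    """Return a dict with source, name and title, or None if nothing matches."""
    match = BYLINE_RE.match(text.lstrip())
    if match is None:
        return None
    return match.groupdict()

byline = parse_byline("[매일경제 스타투데이 황승빈 인턴기자] article text")
print(byline["name"], "/", byline["title"])
```

When the byline sits somewhere else or has a different shape, parse_byline returns None, so an article like that would need separate handling rather than being silently trimmed.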