THE TEMPLATE DETECTION AND CONTENT EXTRACTION BENCHMARK SUITE README
====================================================================

1.Introduction
==============

Template extraction and content extraction techniques extract information from real webpages. They need to be tested continually in order to improve both the quality of their results and their performance. This testing is done with sets of webpages prepared for the purpose. A benchmark suite is therefore an important requirement for measuring the performance of these techniques.

2.How to obtain TECO 5.0
========================

TECO 5.0 can be downloaded from the following URL:
https://mist.dsic.upv.es/teco

3.Structure
===========

TECO 5.0 was created by downloading 150 websites from the Internet. Once all the webpages were downloaded, four different engineers explored the key page of each website and the webpages accessible from it to decide which part of the webpage is the template and which part is the main content. Using the results of this experiment, each website was prepared for template extraction, content detection and menu detection.

On the one hand, all elements of the key page that do not belong to the template were assigned to an HTML class called TECO_notTemplate. This way, a template extraction tool can compare its output to the nodes that do not belong to the TECO_notTemplate class. On the other hand, all elements belonging to the main content were assigned to an HTML class called TECO_mainContent, so a content extraction tool can compare its output directly to the nodes belonging to this class. In addition, the main menu of the key page was assigned to an HTML class called TECO_mainMenu.

The suite covers different kinds of websites, such as blogs, company websites, forums, personal websites, sports websites and newspapers. Some of the websites are well known, like the BBC website or the Unicef website, and others are less known, like personal blogs or small company websites.
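The class annotations described above can be read back with a few lines of code. The following is a minimal sketch, not part of TECO itself: it assumes that the annotations appear as values of the HTML class attribute and that the markup closes its non-void elements explicitly. The names ClassTextCollector and text_in_class are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text found inside elements that carry a given class."""

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)
        self.class_name = class_name
        self.depth = 0      # nesting depth inside a matching element
        self.chunks = []    # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a matching element, or go one level deeper inside one.
        if self.depth > 0 or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def text_in_class(html, class_name):
    """Return the text fragments inside elements of class `class_name`."""
    parser = ClassTextCollector(class_name)
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = ('<div>Menu</div>'
              '<div class="TECO_mainContent"><p>Hello<br>world</p></div>')
    print(text_in_class(sample, "TECO_mainContent"))
```

A content extraction tool could compare its own output against the fragments returned for TECO_mainContent. For template extraction the comparison is inverted: the reference template consists of the nodes outside TECO_notTemplate. Pages that omit closing tags would need a more tolerant parser than this sketch.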
TECO 5.0 is organized in directories. The main directory, called pages, contains 149 directories, one for each website domain. There are 149 directories rather than 150 because two websites share the same domain.

4.How to use TECO
=================

Installation is simple: extract the zip file onto a hard drive, USB drive or other storage medium. Extraction creates a directory called pages. It is recommended to extract the file on Linux or OS X systems, because Windows-based systems do not allow the directory structure used to store the benchmarks.

5.Key pages
===========

The following list gives the path to the key page of each benchmark:

web.mit.edu/institute-events/visitor
www.museodelprado.es/index.html
www.u-tokyo.ac.jp/en/about/history.html
www.savethechildren.net/what-we-do/our-humanitarian-work.html
college.harvard.edu/financial-aid.html
www.linuxfoundation.org/about.1.html
clinicaltrials.gov/ct2/search/index/index.html
cordis.europa.eu/fp7/ict/fire.html
parents.berkeley.edu/advice/babies/laundry.html
www.mit.edu/campus-life.1.html
cpoepalencia.es/federaciones-y-asociaciones-confederadas-asociaciones/index.html
www.icann.org/history.html
www.gip-jci-justice.fr/en/about-us/support-council/index.html
www.einstein.yu.edu/leadership/index.html
www.americanacademy.de/about/index.html
www.mensa.es/cms/pages/¿qué-es-mensa.html
www.bcrf.org/breast-cancer-research.html
www.ielts.org/what-is-ielts/ielts-introduction.html
fr.unesco.org/about-us/introducing-unesco.html
www.ccbe.eu/about/who-we-are/index.html
www.fraud.org/get involved.html
www.jdi.org.za/index.html
www.premiere-urgence.org/qui-sommes-nous/index.html
www.indiangaming.org/index.html
hispalinux.es/QuienesSomos.html
www.gktw.org/about/index.html
www.apnic.net/about-apnic/organization/vision-mission-objectives/index.html
www.unicef.org/where-we-work.html
www.klimabuendnis.org/home.html
www.isoc-es.org
edition.cnn.com/index.html
www.neoteo.com/star-wars-the-force-awakens-el-regreso-de-viejos-personajes/
riotimesonline.com/index.html
www.turfparadise.com/index.html
www.cleanclothes.org/index.html
www.afp.com/es/contact.html
www.history.com/index.html
detroit.cbslocal.com/2018/12/04/high-school-newspaper-suspended-after-publishing-disruptive-investigation/index.html
www.rocklists.com/91x-1983.html
www.lashorasperdidas.com/index.html
www.journalism.org/2014/03/13/social-search-direct/index.html
www.socialmediatoday.com/news/facebook-adds-new-features-for-instant-articles-including-links-to-more-pu/569786/index.html
www.diariodeburgos.es/Noticia/Z1C5D6DE9-D1E6-B03A-61236AF21520B8B2/202002/Un-programa-verde-dedicado-a-Felix-Rodriguez-de-la-Fuente.html
wordofmouthmendo.com/word-of-mouth-stories/2018/5/31/travellers-fare.html
www.usine-digitale.fr/article/la-start-up-americaine-clearview-ai-illustre-deja-les-derives-de-la-reconnaissance-faciale.N921119.html
1015fm.com.au/2020/02/steve-mickenbecker-interest-rates-on-hold-2020-02-07/index.html
www.dw.com/de/lebron-james-vom-pflegekind-zum-basketball-superstar/a-52088565.html
www.theday.com/movies–tv/20200203/super-bowl-ads-dialed-up-fun-as-antidote-to-politics.html
nltimes.nl/2019/12/16/chocolate-spread-babies-wins-misleading-product-award.html
www.bbc.co.uk/news/index.html
techcrunch.com/gadgets
biztechmagazine.com/article/2019/12/why-byod-makes-endpoint-security-crucial-small-businesses.html
www.eeo.com.cn/2022/0506/533366.shtml.html
www.wishtv.com/news/flu-is-widespread-across-the-us/index.html
news.mit.edu/2021/grand-decoding-data-0909.html
asia.nikkei.com/Spotlight/Sharing-Economy/New-Tokyo-homes-ditch-parking-spaces-but-offer-car-sharing.html
www.rcnky.com/articles/2021/09/12/ft-mitchell-reflects-life-age-104.html
news.discovery.com/tech/robotics/artificial-intelligences-hawkings-fears-stir-debate-141206.htm
www.kathimerini.gr/society/561833251/koronoios-arsi-metron-i-megali-prova-kanonikotitas-enopsei-toy-kalokairioy/index.html
news.un.org/en/content/navigate-news.html
es.sharelatex.com/learn/Uploading a project
github.com/DawidStankiewicz/forum.1
en.citizendium.org/index.html
www.filmaffinity.com/es/main.html
www.meneame.net/faq-es.html
www.accountkiller.com/en/delete-activision-account.html
study.com/learn/science-questions-and-answers.html
c.mi.com/it/index.html
alumni.harvard.edu/help/message-board.html
www.spacetimestudios.com/forumdisplay.php?29-Websites-and-Forum-Discussion.html
www.gimpforum.de/index.html
www.emaildiscussions.com/index.html
forums.debian.net/viewforum.php?f=5.html
forums.mozillazine.org/viewforum.php?f=23.html
forums.tomsguide.com/forums/laptop-general-discussion.15/index.html
forums.mysql.com/list.php?21.html
lawstudents.ca/forums.html
www.japanesepod101.com/forum/viewforum.php?f=26.html
forum.skyscraperpage.com/index.html
forums.opera.com/index.html
forums.linuxmint.com/viewforum.php?f=72.html
frances.forosactivos.net/index.html
www.wysiwygwebbuilder.com/forum/viewforum.php?f=10.html
www.3dprintforums.com/index.html
www.strangehorizons.com/2004/20040906/greenglass-f.shtml.html
communities.apple.com/es/community/mac os/os x el capitan.html
www.sloweurope.com/community/index.html
community.ricksteves.com/travel-forum/spain.html
hackercombat.com/forum/index.html
www.scbwi.org/boards/index.php?board=62.0.html
www.cocinaconmarta.com/2015/04/empanadillas-chinas-de-gambas-y-verduras.html
www.trendencias.com
googleblog.blogspot.com.es
www.robyncarr.com/qa.html
users.dsic.upv.es/~jsilva/wwv2013/index2.html
www.folj.com/puzzles/difficult-logic-problems.htm
oneminutelist.com/16-browser-alternatives-to-desktop-programs/index.html
artsonline.uwaterloo.ca/jburbidg/index.html
benjamincongdon.me/blog.html
michael.tsikerdekis.com/index.html
www.beeorganisee.com/reprendre-en-main-le-nettoyage/index.html
www.danielgrindrod.com/about.html
ofdollarsanddata.com/index.html
blog.mint.com/updates/enter-our-newdecadenewyou-meme-sweepstakes-for-a-chance-to-win-5000/index.html
elainesir.com/best-korean-beauty-blogs-bloggers-follow/index.html
www.vindame.com.br/semana-riesling/uva-riesling/index.html
www.rosamontero.es/obra-rosa-montero.html
www.almezzer.com/libros/literatura-infantil/a-partir-de-4-anos/index.html
markahall.blogspot.com.es
johnboyne.com/about/index.html
users.dsic.upv.es/~dinsa/en/index.html
johngardnerathome.info/index.htm
www.annmalaspina.com/index.html
foodsense.is/a-list.html
sites.google.com/a/ciencias.unam.mx/pagina-ana-meda/index.html
whatever.scalzi.com/about/interviews-appearances-articles-and-etc/index.html
www.javiercelaya.es/index.html
diarium.usal.es/lguich/pagina-personal-de-luis-arturo-guichard
www.jameslovelock.org/scientific-papers/index.html
www.cipri.info/index.html
today.java.net/pub/a/today/2004/07/06/3ddesktop.html
clotheshor.se/index.html
www.raspberrypi.org/resources/teach/index.html
doodle.com/online-calendar.html
www.newprosoft.com/web-content-extractor.htm
worryfreelabs.com/about.1.html
www.intelligencetest.com/index.htm
www.ikea.com/gb/en.html
www.nubbeo.com.ar/index.html
www.mulberry.com/es/shop/sale/sale-mens-accessories.html
www.tous.com/es-es/novedades/relojes/c/59.html
preferenceweb.com/collections/all-sneakers.html
www.trekbikes.com/us/en US/bikes/mountain-bikes/electric-mountain-bikes/c/B512/index.html
addons.prestashop.com/es/2-modulos.html
us.pandora.net/en/charm-bracelets/pandora-moments/pandora-moments-bracelets/index.html
kawaiipenshop.com/index.html
www.vam.ac.uk/shop/lindsay-philip-butterfield-blue-flower-silk-scarf.html
shop.fendt.com/kids-toys/clothing/shirts.html
www.euroholds.com/it/29-prese-arrampicata.html
www.emmaclothes.com/index.html
www.arduino.cc/en/Main/Software.html
naranjascarcaixent.com/tienda.html
www.technicalbookstoreonline.com/new-arrivals.php.html
www.floridarealestatecollege.com/index.html
www.basf.com/nl/nl/who-we-are/BASF-in-Nederland.html
www.mcphersonoil.com/index.html
www.thirteenhou.com/menu.php.html
www.embalajesterra.com/precintadoras-manuales-168.html
www.crypto.ch/en/about.html
www.shopbookshop.com/index.htm
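
The key-page paths listed above can be checked against an extracted copy of the suite. The following is a minimal sketch that assumes each path is relative to the pages directory created by extraction. The function names key_page_path and missing_key_pages are hypothetical, and the extraction root passed to them is whatever location was chosen when unpacking the zip file.

```python
from pathlib import Path


def key_page_path(root, relative):
    """Build the location of a key page under <root>/pages (assumed layout)."""
    return Path(root) / "pages" / relative


def missing_key_pages(root, relatives):
    """Return the listed key-page paths that do not exist under <root>/pages."""
    return [rel for rel in relatives if not key_page_path(root, rel).exists()]


if __name__ == "__main__":
    # Two entries taken from the list above; "." stands for the extraction root.
    listed = ["www.ikea.com/gb/en.html", "www.crypto.ch/en/about.html"]
    for rel in missing_key_pages(".", listed):
        print("not found:", rel)
```

An empty result means every listed path was found. Entries reported as missing may indicate an incomplete extraction, for example on a file system that rejects some of the characters used in the paths.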