Commit a46d9212 authored by Renán Sosa Guillen

crawlers

parent 962bbeec
Crawlers for national online news media, based on [Scrapy](http://scrapy.org/).
The following outlets are included:
* [Al Chile](http://alchile.com.mx)
Access by day: http://alchile.com.mx/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
```
No articles were found before 2014.06.01.
* [Desde el Balcón](http://www.desdeelbalcon.com)
Access by day: http://www.desdeelbalcon.com/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=9 -a day=13
```
No articles were found before 2014.01.11.
* [Diario del Yaqui](http://diariodelyaqui.mx)
Access by day: http://diariodelyaqui.mx/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2016 -a month=12 -a day=24
```
No articles were found before 2016.12.30.
* [Diario de Yucatán](http://yucatan.com.mx)
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2016 -a month=12 -a day=24
```
No articles were found before 2012.04.02.
* [El Grillo Porteño](http://grilloporteno.com)
Access by day: http://grilloporteno.com/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=26
```
No articles were found before 2013.11.04.
* [La Jornada](http://www.jornada.unam.mx)
Access by day: http://www.jornada.unam.mx/2017/09/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=4 -a day=23
```
No articles were found before 2005.02.01.
* [La Jornada Aguascalientes](http://www.lja.mx)
Access by day: http://www.lja.mx/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=9 -a day=13
```
* [La Jornada Baja California](http://jornadabc.mx)
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=2 -a day=21
```
No articles were found before 2015.02.09.
This version downloads by date; it is recommended mainly for recent dates.
* [La Jornada Guerrero](http://www.lajornadaguerrero.com.mx)
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=4 -a day=25
```
No articles were found before 2015.03.12.
* [La Jornada de Oriente](http://www.lajornadadeoriente.com.mx)
Access by day: http://www.lajornadadeoriente.com.mx/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=9 -a day=13
```
No articles were found before 2013.06.01.
* [La Jornada San Luis](http://lajornadasanluis.com.mx)
Access by day: http://lajornadasanluis.com.mx/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=9 -a day=13
```
No articles were found before 2014.10.08.
* [La Jornada Veracruz](http://www.jornadaveracruz.com.mx)
Access by day: http://www.jornadaveracruz.com.mx/Archive.aspx?date=13/09/2017
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=9 -a day=13
```
No articles were found before 2009.05.11.
* [La Jornada Zacatecas](http://ljz.mx)
Access by day: http://ljz.mx/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=26
```
No articles were found before 2013.06.10.
* [La Verdad Yucatán](http://laverdadnoticias.com)
Access by day: https://laverdadnoticias.com/2017/9/16/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=24
```
No articles were found before 2015.12.01.
* [Lector MX](http://lectormx.com)
Access by day: http://lectormx.com/2017/9/16/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=30
```
No articles were found before 2015.10.23.
* [Mi Punto de Vista](http://www.mipuntodevista.com.mx)
Access by day: http://www.mipuntodevista.com.mx/2017/9/16/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=28
```
No articles were found before 2008.10.04.
* [Notirivas](http://gruporivas.com.mx/notirivas)
Access by day: http://gruporivas.com.mx/notirivas/2017/9/16/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=30
```
No articles were found before 2016.11.22.
* [Notisureste](http://www.notisureste.com)
Access by day: http://www.notisureste.com/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
```
No articles were found before 2011.11.28.
* [Proceso](http://www.proceso.com.mx)
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3
```
No articles were found before Nov. 1976.
* [Punto Medio](https://www.puntomedio.mx)
Access by day: https://www.puntomedio.mx/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=27
```
No articles were found before 2015.08.02.
* [Sona 89.3](http://sona893.fm)
Access by day: http://sona893.fm/2017/09/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=24
```
No articles were found before 2012.04.09.
* [Yucatán a la Mano](http://www.yucatanalamano.com)
Access by day: http://www.yucatanalamano.com/2017/9/13/
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=25
```
No articles were found before 2015.05.11.
* [Yucatán al Minuto](http://www.yucatanalminuto.com)
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=03 -a day=22
```
No articles were found before 2017.01.17.
* [Yucatán en Corto](http://florcastillo.mx/noticias)
Access by day: http://florcastillo.mx/noticias/2017/9/13
Usage:
```bash
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=9 -a day=13
```
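All of the crawlers emit items with the same fields (title, text, date, location, author, topic, url). A minimal sketch for consuming the output, assuming a `noticias.json` produced by any of the commands above:

```python
import json

# Scrapy's -t json exporter writes a single JSON array containing every item.
with open('noticias.json') as f:
    noticias = json.load(f)

for nota in noticias:
    # 'text' holds the tag-stripped paragraphs joined with newlines.
    print(nota['date'], nota['title'], nota['url'])
```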
    url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
    name = "noticias"
        self.date = date(int(year), int(month), int(day))
        self.parsing_month = { 'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4, 'mayo': 5, 'junio': 6, 'julio': 7,
                               'agosto': 8, 'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12 }
        self.pages = 100
        for s in section_list:
            yield scrapy.Request(url=self.baseURL+s, callback=self.parse)
            yield scrapy.Request(url=response.url, callback=self.parse_pagination, dont_filter=True)
    def parse_pagination(self, response):
        pagination = response.xpath('//*[@class="pagination"]/a/@href').extract()
        if ( len(pagination) > 0 ):
            p = 1
            while p <= self.pages:
                if ( p == 1 ):
                    yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
                elif ( p > 1 ):
                    # request /page/p directly; '/page/'+str(p+1) would skip page 2
                    yield scrapy.Request(url=response.url+'/page/'+str(p), callback=self.parse_link)
                p += 1
            if news_date == self.date and link is not None:
                yield scrapy.Request(url=link, callback=self.parse_item)
    def parse_item(self, response):
        text = ''
        item = NoticiasItem()
        item['title'] = response.css('h1.entry-title::text').extract_first()
        item['date'] = response.css('div.base-box').css('span.entry-date::attr(datetime)').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//meta[@itemprop="datePublished"]/@content').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//time[@class="updated"]/@datetime').extract_first()
        for paragraph in response.css('div.entry-content').css('p').extract():
            text += remove_tags(paragraph) + '\n'
        item['text'] = text
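The date extraction above falls back through three selectors, taking the first that matches. The same cascade can be phrased once as a small helper (a sketch; `first_match` is a hypothetical name, not part of this codebase):

```python
def first_match(response, *xpaths):
    # Return the first non-None result among the given XPath queries.
    for query in xpaths:
        value = response.xpath(query).extract_first()
        if value is not None:
            return value
    return None

# item['date'] = first_match(response,
#     '//div[@class="base-box"]//span[contains(@class, "entry-date")]/@datetime',
#     '//meta[@itemprop="datePublished"]/@content',
#     '//time[@class="updated"]/@datetime')
```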
import scrapy

class Diarioyucatan2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
from scrapy import signals

class Diarioyucatan2SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class Diarioyucatan2Pipeline(object):
    def process_item(self, item, spider):
        return item
# -*- coding: utf-8 -*-

# Scrapy settings for diarioYucatan2 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'diarioYucatan2'

SPIDER_MODULES = ['diarioYucatan2.spiders']
NEWSPIDER_MODULE = 'diarioYucatan2.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'diarioYucatan2 (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'diarioYucatan2.middlewares.Diarioyucatan2SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'diarioYucatan2.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'diarioYucatan2.pipelines.Diarioyucatan2Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
import scrapy
## scrapy crawl noticias -t json --nolog -o noticias.json
import re
# from datetime import datetime, date, timedelta
# from scrapy.spidermiddlewares.httperror import HttpError
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
section_list = ['merida', 'yucatan', 'mexico', 'internacional', 'deportes',
'espectaculos', 'imagen', 'economia', 'tecnologia', 'salud']
# year = getattr(self, 'year', None)
# month = getattr(self, 'month', None)
# day = getattr(self, 'day', None)
self.baseURL='http://yucatan.com.mx/seccion/'
# self.date = date(int(year), int(month), int(day))
self.parsing_month = { 'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4, 'mayo': 5, 'junio': 6, 'julio': 7,
'agosto': 8, 'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12 }
# self.pages = 100
for s in section_list:
yield scrapy.Request(url=self.baseURL+s, callback=self.parse)
def parse(self, response):
if ( len(response.xpath('//a[@class="show-more-link"]/@href').extract()) > 0 ):
for link in response.xpath('//a[@class="show-more-link"]/@href').extract():
yield scrapy.Request(url=link, callback=self.parse_pagination)
elif ( len(response.xpath('//a[@class="show-more-link"]/@href').extract()) == 0 ):
yield scrapy.Request(url=response.url, callback=self.parse_pagination, dont_filter=True)
def parse_pagination(self, response):
pagination = response.xpath('//*[@class="pagination"]/a/@href').extract()
if ( len(pagination) > 0 ):
pagination = pagination[-1]
pages = int(pagination[pagination.rfind('/')+1:])
p = 1
while p <= pages:
if ( p == 1 ):
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
elif ( p > 1 ):
                    # request /page/p directly; '/page/'+str(p+1) would skip page 2 and overshoot the last page
                    yield scrapy.Request(url=response.url+'/page/'+str(p), callback=self.parse_link)
p += 1
else:
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
def parse_link(self, response):
for entry in response.xpath('//*[@class="bp-entry"]'):
# entry_date = entry.xpath('./*[@class="bp-head"]/div/span/text()').extract_first()
# entry_date = entry_date[entry_date.rfind(',')+2:][:entry_date[entry_date.rfind(',')+2:].rfind('-')-2]
# news_date = date(int(entry_date[-4:]), self.parsing_month[entry_date[:-8][entry_date[:-8].rfind(' ')+1:]], int(entry_date[:entry_date.find(' ')]))
link = entry.xpath('./*[@class="bp-head"]/h2/a/@href').extract_first()
if link is not None:
yield scrapy.Request(url=link, callback=self.parse_item)
def parse_item(self, response):
text = ''
item = NoticiasItem()
item['title'] = response.css('h1.entry-title::text').extract_first()
item['date'] = response.css('div.base-box').css('span.entry-date::attr(datetime)').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//meta[@itemprop="datePublished"]/@content').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//time[@class="updated"]/@datetime').extract_first()
for paragraph in response.css('div.entry-content').css('p').extract():
text += remove_tags(paragraph) + '\n'
item['text'] = text
item['topic'] = [response.xpath('//*[@class="breadcrumbs-plus"]/span/a/span/text()').extract()[1]]
item['url'] = response.url
# print item['title']
yield item
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = diarioYucatan2.settings
[deploy]
#url = http://localhost:6800/
project = diarioYucatan2
import scrapy
from datetime import datetime, date, timedelta
"""
USAGE:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=2 -a day=21
"""
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo',
'cultura', 'espectaculos', 'deportes']
year = getattr(self, 'year', None)
month = getattr(self, 'month', None)
day = getattr(self, 'day', None)
self.baseURL = 'http://jornadabc.mx/'
self.date = date(int(year), int(month), int(day))
self.pages = 100
for s in section_list:
yield scrapy.Request(url=self.baseURL+'seccion/'+s, callback=self.parse_pagination)
def parse_pagination(self, response):
pagination = response.xpath('//li[@class="pager-last odd last"]/a/@href').extract()
if len(pagination) > 0:
p = 1
while p <= self.pages:
if p == 1:
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
elif p > 1:
                    # Drupal paging is 0-based (the base URL is page 0), so request ?page=p-1
                    yield scrapy.Request(url=response.url+'?page='+str(p-1), callback=self.parse_link)
p += 1
else:
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
def parse_link(self, response):
section = response.url[response.url.rfind('/')+1:]
if section == 'espectaculos' or section == 'deportes':
path = '//*[@class="region region-soft-first"]'
else:
path = '//*[@class="region region-hard-first"]'
link_list = response.xpath(path).css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
if len(link_list) > 0:
for link in link_list:
news_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
if news_date == self.date:
yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item)
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = response.xpath('//meta[@property="article:published_time"]/@content').extract_first()
title = response.xpath('//*[@class="block-content"]/h1/a/text()').extract()
if len(title) > 0:
item['title'] = title[0]
else:
item['title'] = response.xpath('//*[@class="block-content"]/h1/text()').extract_first()
item['topic'] = response.xpath('//span[@class="section"]/text()').extract_first()
for paragraph in response.xpath('//*[@class="field-item even"]/p/text()').extract():
text += paragraph
item['text'] = text
item['url'] = response.url
# print item['title']
yield item
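parse_link above digs the dd-mm-YYYY stamp out of each article URL with nested rfind slices. The same extraction reads more clearly as a regex; a minimal sketch (the helper name `link_date` is hypothetical, not part of the project):

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'(\d{2}-\d{2}-\d{4})')

def link_date(link):
    # jornadabc.mx article paths embed the publish date as .../dd-mm-YYYY/slug
    match = DATE_RE.search(link)
    return datetime.strptime(match.group(1), '%d-%m-%Y').date() if match else None
```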
import scrapy, re
""" """
USO: USO:
scrapy crawl noticias -t json --nolog -o noticias.json
"""
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
    title = scrapy.Field()
    text = scrapy.Field()
    date = scrapy.Field()
    location = scrapy.Field()
    author = scrapy.Field()
    topic = scrapy.Field()
    url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
    name = 'noticias'

    def start_requests(self):
        self.baseURL = 'http://jornadabc.mx'
        section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo',
                        'cultura', 'espectaculos', 'deportes']

        for section in section_list:
            yield scrapy.Request(url=self.baseURL+'/seccion/'+section, callback=self.parse)
    def parse(self, response):
        pagination = response.xpath('//ul[@class="pager"]/li[@class="pager-last odd last"]/a/@href').extract()
        if ( len(pagination) > 0 ):
            section = response.url[response.url.rfind('/')+1:]
            pagination = pagination[0]
            pages = int(pagination[pagination.rfind('=')+1:])
            for page in range(0, pages):
                if ( page == 0 ):
                    yield scrapy.Request(url=response.url, callback=self.parse_page, dont_filter=True)
                else:
                    yield scrapy.Request(url=self.baseURL+'/seccion/'+section+'?page='+str(page), callback=self.parse_page, dont_filter=True)
        else:
            yield scrapy.Request(url=response.url, callback=self.parse_page, dont_filter=True)
    def parse_page(self, response):
        section = response.url[response.url.rfind('/')+1:]
        if '?' in section:
            # strip the ?page=N suffix; slicing on rfind('?') unconditionally would chop the last character of plain section URLs
            section = section[:section.rfind('?')]
        if ( section == 'espectaculos' or section == 'deportes' ):
            link_list = response.xpath('//*[@class="region region-soft-first"]').css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
        else:
            link_list = response.xpath('//*[@class="region region-hard-first"]').css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
        if len(link_list) <= 0:
            link_list = response.xpath('//*[@class="view-content"]').css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()

        for link in link_list:
            yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item)
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
    title = scrapy.Field()
    text = scrapy.Field()
    date = scrapy.Field()
    location = scrapy.Field()
    author = scrapy.Field()
    topic = scrapy.Field()
    url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
    name = "noticias"

    def start_requests(self):
        # self.found = False
        # self.flag = False
        else:
            yield scrapy.Request(url=self.baseURL+'/notas?opinion', callback=self.parse_page)
    def parse_2(self, response):  # for the earliest articles
        path_list = ['//h1[@class="title"]/a/@href', '//h2[@class="title"]/a/@href']
        link_list = []
                self.flag = True
                break
    def parse_item_2(self, response):  # for the earliest articles
        item = response.meta['item']
        # item = NoticiasItem()
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
    title = scrapy.Field()
    text = scrapy.Field()
    date = scrapy.Field()
    location = scrapy.Field()
    author = scrapy.Field()
    topic = scrapy.Field()
    url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
    name = "noticias"

    def start_requests(self):
        self.baseURL = 'https://www.lajornadamaya.mx'
        # section_list = [('yucatan',123,'Yucatan'), ('quintana-roo',52,'Quintana Roo'),
import scrapy
from datetime import date
# scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
import re
        year = getattr(self, 'year', None)
        month = getattr(self, 'month', None)
        day = getattr(self, 'day', None)
self.date = date(int(year), int(month), int(day))
        self.baseURL='http://diariodelyaqui.mx/'+year+'/'+month+'/'+day
        urls = [
            self.baseURL,
    def parse_item(self, response):
        item = NoticiasItem()
        text = ''
        item['date'] = self.date
        item['title'] = response.xpath('//h1[@class="entry-title"]/text()').extract_first()
        item['topic'] = response.xpath('//ul[@class="post-categories"]/li/a/text()').extract()
        for paragraph in response.xpath('//div[@class="clearfix"]/p').extract():
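Diario del Yaqui no longer exposes a usable published-time meta tag, so the spider now stamps each item with the date it was asked to crawl. `self.date` is a `datetime.date`; Scrapy's JSON exporter can serialize it, but storing the ISO string keeps the field shaped like the other spiders' output (a sketch):

```python
# inside parse_item, assuming self.date was built in start_requests
item['date'] = self.date.isoformat()  # e.g. '2017-03-22'
```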
import scrapy
from datetime import datetime, date, timedelta
"""
USAGE:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=2 -a day=21
"""
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = 'noticias'
def start_requests(self):
self.section = ''
self.page = 0
self.count = 0
self.found = False
self.not_found = False
self.next_section = False
self.year = getattr(self, 'year', None)
self.month = getattr(self, 'month', None)
self.day = getattr(self, 'day', None)
self.date = date(int(self.year), int(self.month), int(self.day))
self.data_list = []
self.baseURL = 'http://jornadabc.mx'
# section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo', 'cultura']
section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo', 'cultura', 'espectaculos', 'deportes']
for section in section_list:
self.section = section
# print self.section
self.page = 0
self.count = 0
self.found = False
self.not_found = False
self.next_section = False
self.data_list = []
page = -1
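            # NOTE: Scrapy consumes start_requests lazily, so parse() callbacks can
            # set self.found / self.not_found while pages are still being yielded;
            # those flags are what eventually break out of this loop.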
while True:
if ( self.found ):
self.found = False
break
if ( self.not_found ):
self.not_found = False
self.next_section = True
break
page += 1
# print 'page '+str(page)
yield scrapy.Request(url=self.baseURL+'/seccion/'+section+'?page='+str(page), callback=self.parse, dont_filter=True)
# if not ( self.next_section ):
# self.next_section = False
# self.page -= self.count
# if ( self.page > 0 ):
# self.page -= 1
# for pag in range(self.page, self.page+25):
# yield scrapy.Request(url=self.baseURL+'/seccion/'+section+'?page='+str(pag), callback=self.parse_page, dont_filter=True)
if ( len(self.data_list) > 0 ):
for link in self.data_list:
# link_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
# if ( link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:] == self.day.zfill(2)+'-'+self.month.zfill(2)+'-'+self.year ):
yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item, dont_filter=True)
def parse(self, response):
# print response.url
count = 0
this_page = int(response.url[response.url.rfind('=')+1:])
if ( self.section == 'espectaculos' or self.section == 'deportes' ):
path = '//*[@class="region region-soft-first"]'
else:
path = '//*[@class="region region-hard-first"]'
link_list = response.xpath(path).css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
if ( len(link_list) > 0 ):
if ( this_page > 0 ):
del link_list[0]
# print link_list
for link in link_list:
link_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
# print self.section+' '+link_date.isoformat()+' ? '+self.date.isoformat()
if ( link_date.month == self.date.month and link_date.year == self.date.year ):
# self.page = int(response.url[response.url.rfind('=')+1:])
# self.count += 1
# self.found = True
# print '************founddddd********'
# break
self.data_list.append(link)
if ( this_page >= 300 ):
self.found = True
break
if ( this_page > 300 and len(link_list) == 0 ):
self.not_found = True
# if ( link_date < stop_date ):
# # count += 1
# # print count
# # if ( count > 3 ):
# self.not_found = True
# break
def parse_page(self, response):
if ( self.section == 'espectaculos' or self.section == 'deportes' ):
path = '//*[@class="region region-soft-first"]'
else:
path = '//*[@class="region region-hard-first"]'
link_list = response.xpath(path).css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
if ( len(link_list) > 0 ):
for link in link_list:
link_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
if ( link_date == self.date ):
yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item)
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = response.xpath('//meta[@property="article:published_time"]/@content').extract_first()
title = response.xpath('//*[@class="block-content"]/h1/a/text()').extract()
if ( len(title) > 0 ):
item['title'] = title[0]
else:
item['title'] = response.xpath('//*[@class="block-content"]/h1/text()').extract_first()
item['topic'] = response.xpath('//span[@class="section"]/text()').extract_first()
for paragraph in response.xpath('//*[@class="field-item even"]/p/text()').extract():
text += paragraph
item['text'] = text
item['url'] = response.url
# print item['title']
yield item
import scrapy
# scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
self.link_list = []
self.year = getattr(self, 'year', None)
self.month = getattr(self, 'month', None)
self.day = getattr(self, 'day', None)
self.baseURL='http://www.lajornadaguerrero.com.mx/'+self.year+'/'+self.month.zfill(2)+'/'+self.day.zfill(2)
urls = [
self.baseURL,
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
paths = ['//div[@class="viewlet"]/h1/a/@href', '//div[@class="viewlet"]/h2/a/@href',
'//div[@class="viewlet"]/h3/a/@href', '//div[@class="viewlet image"]/h1/a/@href',
'//div[@class="viewlet image"]/h2/a/@href', '//div[@class="viewlet image"]/h3/a/@href',
'//div[@class="text_block_200"]/p/a/@href']
for path in paths:
links = response.xpath(path).extract()
if ( len(links) > 0 ):
for link in links:
if not ( link in self.link_list ):
self.link_list.append(link)
yield scrapy.Request(url=self.baseURL+'/'+link, callback=self.parse_item)
for link in response.xpath('//*[@class="text_block_200"]/h2/a/@href').extract():
yield scrapy.Request(url=self.baseURL+'/'+link, callback=self.parse_page)
def parse_page(self, response):
paths = ['//*[@id="article_list"]/h2/a/@href',
'//*[@id="article_list"]/h3/a/@href']
for path in paths:
for link in response.xpath(path).extract():
if not ( link in self.link_list ):
yield scrapy.Request(url=self.baseURL+'/'+link, callback=self.parse_item)
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = self.year+'-'+self.month.zfill(2)+'-'+self.day.zfill(2)
item['title'] = response.xpath('//*[@class="documentContent"]/h1/text()').extract_first()
item['topic'] = response.xpath('//*[@id="portal-breadcrumbs"]/a[2]/text()').extract()
for paragraph in response.xpath('//*[@class="documentContent"]/p/text()').extract():
text += paragraph
item['text'] = text
item['url'] = response.url
# print item['title']
yield item
    def parse_item(self, response):
        item = NoticiasItem()
        text = ''
        item['date'] = response.xpath('//meta[@property="article:published_time"]/@content').extract_first()
if item['date'] is None or item['date'] == '':
item['date'] = response.xpath('//meta[@property="DC.date.issued"]/@content').extract_first()
        item['title'] = response.xpath('//*[@class="page_title_inner"]/h1/text()').extract_first()
        item['topic'] = response.xpath('//*[@class="post_info_cat"]/a/text()').extract_first()
        for paragraph in response.xpath('//*[@class="post_content_wrapper"]/p').extract():