Commit a46d9212 authored by Renán Sosa Guillen

crawlers

parent 962bbeec
Crawlers for national online news media, based on [Scrapy](http://scrapy.org/)
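All of the spiders below follow the same invocation pattern: change into the spider's directory and run `scrapy crawl noticias`, optionally passing `year`, `month`, and `day` arguments. A minimal sketch of driving that pattern from Python (the directory name and date values here are illustrative, not taken from this repository):
```python
# Hedged sketch: run one of the spiders below through its CLI from Python.
# Assumes Scrapy is installed; "alchile" and the date values are illustrative.
import subprocess

subprocess.run(
    ["scrapy", "crawl", "noticias",
     "-t", "json", "--nolog", "-o", "noticias.json",
     "-a", "year=2017", "-a", "month=3", "-a", "day=22"],
    cwd="alchile",  # the directory of whichever spider you want to run
    check=True,
)
```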
The following media are included:
* [Al Chile](http://alchile.com.mx/)
* [Al Chile](http://alchile.com.mx)
Access by day: http://alchile.com.mx/2017/9/13/
Usage:
```bash
......@@ -11,9 +11,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
```
No articles were found before 2014.06.01.
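Regardless of the medium, every record in noticias.json carries the fields declared by NoticiasItem in the spider code further down in this commit (title, text, date, topic, author, location, url). A hedged sketch of a single record, with made-up values:
```python
# Illustrative record only: field names come from NoticiasItem, values are invented.
example_record = {
    "title": "Example headline",
    "text": "First paragraph...\nSecond paragraph...\n",
    "date": "2017-03-22T10:15:00-06:00",
    "topic": ["Yucatán"],
    "author": None,      # not every spider fills author/location
    "location": None,
    "url": "http://alchile.com.mx/example-article/",
}
```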
* [Desde el Balcón](http://www.desdeelbalcon.com/)
* [Desde el Balcón](http://www.desdeelbalcon.com)
Access by day: http://www.desdeelbalcon.com/2017/9/13/
Usage:
```bash
......@@ -23,7 +23,7 @@ Se incluyen los siguientes medios:
No articles were found before 2014.01.11.
* [Diario del Yaqui](http://diariodelyaqui.mx)
Access by day: http://diariodelyaqui.mx/2017/9/13/
Usage:
```bash
......@@ -31,7 +31,7 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2016 -a month=12 -a day=24
```
No articles were found before 2016.12.30.
* [Diario de Yucatán](http://yucatan.com.mx/)
* [Diario de Yucatán](http://yucatan.com.mx)
Usage:
......@@ -41,9 +41,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2016 -a month=12 -a day=24
```
No articles were found before 2012.04.02.
* [El Grillo Porteño](http://grilloporteno.com/)
* [El Grillo Porteño](http://grilloporteno.com)
Access by day: http://grilloporteno.com/2017/9/13/
Usage:
```bash
......@@ -51,9 +51,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=26
```
No articles were found before 2013.11.04.
* [La Jornada (Nueva versión)](http://www.jornada.unam.mx)
* [La Jornada](http://www.jornada.unam.mx)
New version of the La Jornada crawler.
Access by day: http://www.jornada.unam.mx/2017/09/13/
Usage:
```bash
......@@ -61,9 +61,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=4 -a day=23
```
No articles were found before 2005.02.01.
* [La Jornada Aguascalientes](http://www.lja.mx/)
* [La Jornada Aguascalientes](http://www.lja.mx)
Access by day: http://www.lja.mx/2017/9/13/
Usage:
```bash
......@@ -82,16 +82,6 @@ Se incluyen los siguientes medios:
```
No articles were found before 2015.02.09.
This version downloads by date. It is recommended mainly for recent dates.
* [La Jornada Baja California 2](http://jornadabc.mx)
Usage:
```bash
cd laJornadaBC2
scrapy crawl noticias -t json --nolog -o noticias.json
```
This version downloads all the news from every category.
* [La Jornada Guerrero](http://www.lajornadaguerrero.com.mx)
......@@ -112,9 +102,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=4 -a day=25
```
No articles were found before 2015.03.12.
* [La Jornada de Oriente](http://www.lajornadadeoriente.com.mx/)
* [La Jornada de Oriente](http://www.lajornadadeoriente.com.mx)
Access by day: http://www.lajornadadeoriente.com.mx/2017/9/13/
Usage:
```bash
......@@ -124,7 +114,7 @@ Se incluyen los siguientes medios:
No articles were found before 2013.06.01.
* [La Jornada San Luis](http://lajornadasanluis.com.mx)
Access by day: http://lajornadasanluis.com.mx/2017/9/13/
Usage:
```bash
......@@ -134,7 +124,7 @@ Se incluyen los siguientes medios:
No articles were found before 2014.10.08.
* [La Jornada Veracruz](http://www.jornadaveracruz.com.mx)
Access by day: http://www.jornadaveracruz.com.mx/Archive.aspx?date=13/09/2017
Usage:
```bash
......@@ -144,7 +134,7 @@ Se incluyen los siguientes medios:
No articles were found before 2009.05.11.
* [La Jornada Zacatecas](http://ljz.mx)
Access by day: http://ljz.mx/2017/9/13/
Usage:
```bash
......@@ -152,9 +142,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=26
```
No articles were found before 2013.06.10.
* [La Verdad Yucatán](http://laverdadnoticias.com/)
* [La Verdad Yucatán](http://laverdadnoticias.com)
Access by day: https://laverdadnoticias.com/2017/9/16/
Usage:
```bash
......@@ -162,18 +152,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=24
```
No articles were found before 2015.12.01.
* [La Jornada](http://www.jornada.unam.mx)
Usage:
```bash
cd lajornada
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2016 -a month=12 -a day=24
```
* [Lector MX](http://lectormx.com/)
* [Lector MX](http://lectormx.com)
Access by day: http://lectormx.com/2017/9/16/
Usage:
```bash
......@@ -181,9 +162,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=30
```
No articles were found before 2015.10.23.
* [Mi Punto de Vista](http://www.mipuntodevista.com.mx/)
* [Mi Punto de Vista](http://www.mipuntodevista.com.mx)
Access by day: http://www.mipuntodevista.com.mx/2017/9/16/
Usage:
```bash
......@@ -191,9 +172,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=28
```
No articles were found before 2008.10.04.
* [Notirivas](http://gruporivas.com.mx/notirivas/)
* [Notirivas](http://gruporivas.com.mx/notirivas)
Access by day: http://gruporivas.com.mx/notirivas/2017/9/16/
Usage:
```bash
......@@ -201,9 +182,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=30
```
No articles were found before 2016.11.22.
* [Notisureste](http://www.notisureste.com/)
* [Notisureste](http://www.notisureste.com)
Access by day: http://www.notisureste.com/2017/9/13/
Usage:
```bash
......@@ -211,7 +192,7 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
```
No articles were found before 2011.11.28.
* [Proceso](http://www.proceso.com.mx/)
* [Proceso](http://www.proceso.com.mx)
Usage:
......@@ -221,9 +202,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3
```
No articles were found before Nov. 1976.
* [Punto Medio](https://www.puntomedio.mx/)
* [Punto Medio](https://www.puntomedio.mx)
Access by day: https://www.puntomedio.mx/2017/9/13/
Usage:
```bash
......@@ -231,9 +212,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=27
```
No articles were found before 2015.08.02.
* [Sona 89.3](http://sona893.fm/)
* [Sona 89.3](http://sona893.fm)
Access by day: http://sona893.fm/2017/09/13/
Usage:
```bash
......@@ -241,9 +222,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=24
```
No articles were found before 2012.04.09.
* [Yucatán a la Mano](http://www.yucatanalamano.com/)
* [Yucatán a la Mano](http://www.yucatanalamano.com)
Access by day: http://www.yucatanalamano.com/2017/9/13/
Usage:
```bash
......@@ -251,7 +232,7 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=25
```
No articles were found before 2015.05.11.
* [Yucatán al Minuto](http://www.yucatanalminuto.com/)
* [Yucatán al Minuto](http://www.yucatanalminuto.com)
Usage:
......@@ -261,9 +242,9 @@ Se incluyen los siguientes medios:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=03 -a day=22
```
No articles were found before 2017.01.17.
* [Yucatán en Corto](http://florcastillo.mx/noticias/)
* [Yucatán en Corto](http://florcastillo.mx/noticias)
Access by day: http://florcastillo.mx/noticias/2017/9/13
Usage:
```bash
......
......@@ -23,53 +23,6 @@ class NoticiasItem(scrapy.Item):
url = scrapy.Field()
# class QuotesSpider(scrapy.Spider):
# name = "noticias"
# def start_requests(self):
# year = getattr(self, 'year', None)
# month = getattr(self, 'month', None)
# day = getattr(self, 'day', None)
# self.baseURL='http://yucatan.com.mx/'+year+'/'+month+'/'+day
# urls = [
# self.baseURL,
# ]
# for url in urls:
# yield scrapy.Request(url=url, callback=self.parse)
# def parse(self, response):
# pagination = response.css('div.pagination').css('a::attr(href)').extract()
# if ( len(pagination) > 0 ):
# pagination = pagination[-1]
# pages = int(pagination[pagination.rfind('/')+1:])
# for page in range(0,pages):
# if ( page == 0 ):
# yield scrapy.Request(url=response.url, callback=self.parse_page, dont_filter=True)
# else:
# yield scrapy.Request(url=self.baseURL+'/page/'+str(page+1), callback=self.parse_page)
# else:
# yield scrapy.Request(url=response.url, callback=self.parse_page, dont_filter=True)
# def parse_page(self, response):
# for link in response.css('div.bp-head').css('h2').css('a::attr(href)').extract():
# yield scrapy.Request(url=link, callback=self.parse_item)
# def parse_item(self, response):
# text = ''
# item = NoticiasItem()
# item['title'] = response.css('h1.entry-title::text').extract_first()
# item['date'] = response.css('div.base-box').css('span.entry-date::attr(datetime)').extract_first()
# for paragraph in response.css('div.entry-content').css('p').extract():
# text += remove_tags(paragraph) + '\n'
# item['text'] = text
# item['topic'] = response.xpath('//span[@itemprop="title"]/text()').extract()
# item['url'] = response.url
# # print item['title']
# yield item
class QuotesSpider(scrapy.Spider):
name = "noticias"
......@@ -84,10 +37,8 @@ class QuotesSpider(scrapy.Spider):
self.date = date(int(year), int(month), int(day))
self.parsing_month = { 'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4, 'mayo': 5, 'junio': 6, 'julio': 7,
'agosto': 8, 'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12 }
self.stop = False
urls = [
self.baseURL,
]
self.pages = 100
for s in section_list:
yield scrapy.Request(url=self.baseURL+s, callback=self.parse)
......@@ -102,24 +53,12 @@ class QuotesSpider(scrapy.Spider):
yield scrapy.Request(url=response.url, callback=self.parse_pagination, dont_filter=True)
def errback_http(self, failure):
if failure.check(HttpError):
response = failure.value.response
self.logger.error('HttpError on %s', response.url)
self.stop = True
def parse_pagination(self, response):
pagination = response.xpath('//*[@class="pagination"]/a/@href').extract()
if ( len(pagination) > 0 ):
pagination = pagination[-1]
pages = int(pagination[pagination.rfind('/')+1:])
p = 1
while p <= pages:
if ( self.stop ):
p = pages+1
else:
while p <= self.pages:
if ( p == 1 ):
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
elif ( p > 1 ):
......@@ -142,16 +81,17 @@ class QuotesSpider(scrapy.Spider):
if news_date == self.date and link is not None:
yield scrapy.Request(url=link, callback=self.parse_item)
elif news_date < self.date:
self.stop = True
def parse_item(self, response):
text = ''
item = NoticiasItem()
item['title'] = response.css('h1.entry-title::text').extract_first()
item['date'] = response.css('div.base-box').css('span.entry-date::attr(datetime)').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//meta[@itemprop="datePublished"]/@content').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//time[@class="updated"]/@datetime').extract_first()
for paragraph in response.css('div.entry-content').css('p').extract():
text += remove_tags(paragraph) + '\n'
item['text'] = text
......
......@@ -8,7 +8,7 @@
import scrapy
class LajornadagroItem(scrapy.Item):
class Diarioyucatan2Item(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
......@@ -8,7 +8,7 @@
from scrapy import signals
class LajornadagroSpiderMiddleware(object):
class Diarioyucatan2SpiderMiddleware(object):
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
......
......@@ -6,6 +6,6 @@
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
class LajornadagroPipeline(object):
class Diarioyucatan2Pipeline(object):
def process_item(self, item, spider):
return item
# -*- coding: utf-8 -*-
# Scrapy settings for laJornadaGro project
# Scrapy settings for diarioYucatan2 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
......@@ -9,17 +9,17 @@
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'laJornadaGro'
BOT_NAME = 'diarioYucatan2'
SPIDER_MODULES = ['laJornadaGro.spiders']
NEWSPIDER_MODULE = 'laJornadaGro.spiders'
SPIDER_MODULES = ['diarioYucatan2.spiders']
NEWSPIDER_MODULE = 'diarioYucatan2.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'laJornadaGro (+http://www.yourdomain.com)'
#USER_AGENT = 'diarioYucatan2 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
......@@ -33,7 +33,7 @@ ROBOTSTXT_OBEY = True
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
......@@ -47,13 +47,13 @@ ROBOTSTXT_OBEY = True
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'laJornadaGro.middlewares.LajornadagroSpiderMiddleware': 543,
# 'diarioYucatan2.middlewares.Diarioyucatan2SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'laJornadaGro.middlewares.MyCustomDownloaderMiddleware': 543,
# 'diarioYucatan2.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
......@@ -65,7 +65,7 @@ ROBOTSTXT_OBEY = True
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'laJornadaGro.pipelines.LajornadagroPipeline': 300,
# 'diarioYucatan2.pipelines.Diarioyucatan2Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
......
import scrapy
## scrapy crawl noticias -t json --nolog -o noticias.json
import re
# from datetime import datetime, date, timedelta
# from scrapy.spidermiddlewares.httperror import HttpError
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
section_list = ['merida', 'yucatan', 'mexico', 'internacional', 'deportes',
'espectaculos', 'imagen', 'economia', 'tecnologia', 'salud']
# year = getattr(self, 'year', None)
# month = getattr(self, 'month', None)
# day = getattr(self, 'day', None)
self.baseURL='http://yucatan.com.mx/seccion/'
# self.date = date(int(year), int(month), int(day))
self.parsing_month = { 'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4, 'mayo': 5, 'junio': 6, 'julio': 7,
'agosto': 8, 'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12 }
# self.pages = 100
for s in section_list:
yield scrapy.Request(url=self.baseURL+s, callback=self.parse)
def parse(self, response):
if ( len(response.xpath('//a[@class="show-more-link"]/@href').extract()) > 0 ):
for link in response.xpath('//a[@class="show-more-link"]/@href').extract():
yield scrapy.Request(url=link, callback=self.parse_pagination)
elif ( len(response.xpath('//a[@class="show-more-link"]/@href').extract()) == 0 ):
yield scrapy.Request(url=response.url, callback=self.parse_pagination, dont_filter=True)
def parse_pagination(self, response):
pagination = response.xpath('//*[@class="pagination"]/a/@href').extract()
if ( len(pagination) > 0 ):
pagination = pagination[-1]
pages = int(pagination[pagination.rfind('/')+1:])
p = 1
while p <= pages:
if ( p == 1 ):
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
elif ( p > 1 ):
yield scrapy.Request(url=response.url+'/page/'+str(p+1), callback=self.parse_link)
p += 1
else:
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
def parse_link(self, response):
for entry in response.xpath('//*[@class="bp-entry"]'):
# entry_date = entry.xpath('./*[@class="bp-head"]/div/span/text()').extract_first()
# entry_date = entry_date[entry_date.rfind(',')+2:][:entry_date[entry_date.rfind(',')+2:].rfind('-')-2]
# news_date = date(int(entry_date[-4:]), self.parsing_month[entry_date[:-8][entry_date[:-8].rfind(' ')+1:]], int(entry_date[:entry_date.find(' ')]))
link = entry.xpath('./*[@class="bp-head"]/h2/a/@href').extract_first()
if link is not None:
yield scrapy.Request(url=link, callback=self.parse_item)
def parse_item(self, response):
text = ''
item = NoticiasItem()
item['title'] = response.css('h1.entry-title::text').extract_first()
item['date'] = response.css('div.base-box').css('span.entry-date::attr(datetime)').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//meta[@itemprop="datePublished"]/@content').extract_first()
if item['date'] is None:
item['date'] = response.xpath('//time[@class="updated"]/@datetime').extract_first()
for paragraph in response.css('div.entry-content').css('p').extract():
text += remove_tags(paragraph) + '\n'
item['text'] = text
item['topic'] = [response.xpath('//*[@class="breadcrumbs-plus"]/span/a/span/text()').extract()[1]]
item['url'] = response.url
# print item['title']
yield item
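In parse_pagination above, the total page count is recovered by slicing the last pagination href after its final '/'. A standalone sketch of that slice, on an illustrative href:
```python
# Illustrative: how parse_pagination derives the page count from the last pagination link.
pagination_href = "http://yucatan.com.mx/seccion/merida/page/37"  # made-up example href
pages = int(pagination_href[pagination_href.rfind('/') + 1:])
print(pages)  # 37
```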
......@@ -4,8 +4,8 @@
# https://scrapyd.readthedocs.org/en/latest/deploy.html
[settings]
default = laJornadaGro.settings
default = diarioYucatan2.settings
[deploy]
#url = http://localhost:6800/
project = laJornadaGro
project = diarioYucatan2
import scrapy
from datetime import datetime, date, timedelta
"""
USAGE:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=2 -a day=21
"""
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo',
'cultura', 'espectaculos', 'deportes']
year = getattr(self, 'year', None)
month = getattr(self, 'month', None)
day = getattr(self, 'day', None)
self.baseURL = 'http://jornadabc.mx/'
self.date = date(int(year), int(month), int(day))
self.pages = 100
for s in section_list:
yield scrapy.Request(url=self.baseURL+'seccion/'+s, callback=self.parse_pagination)
def parse_pagination(self, response):
pagination = response.xpath('//li[@class="pager-last odd last"]/a/@href').extract()
if len(pagination) > 0:
p = 1
while p <= self.pages:
if p == 1:
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
elif p > 1:
yield scrapy.Request(url=response.url+'?page='+str(p+1), callback=self.parse_link)
p += 1
else:
yield scrapy.Request(url=response.url, callback=self.parse_link, dont_filter=True)
def parse_link(self, response):
section = response.url[response.url.rfind('/')+1:]
if section == 'espectaculos' or section == 'deportes':
path = '//*[@class="region region-soft-first"]'
else:
path = '//*[@class="region region-hard-first"]'
link_list = response.xpath(path).css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
if len(link_list) > 0:
for link in link_list:
news_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
if news_date == self.date:
yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item)
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = response.xpath('//meta[@property="article:published_time"]/@content').extract_first()
title = response.xpath('//*[@class="block-content"]/h1/a/text()').extract()
if len(title) > 0:
item['title'] = title[0]
else:
item['title'] = response.xpath('//*[@class="block-content"]/h1/text()').extract_first()
item['topic'] = response.xpath('//span[@class="section"]/text()').extract_first()
for paragraph in response.xpath('//*[@class="field-item even"]/p/text()').extract():
text += paragraph
item['text'] = text
item['url'] = response.url
# print item['title']
yield item
import scrapy
import scrapy, re
"""
USAGE:
......@@ -6,8 +6,6 @@ USO:
scrapy crawl noticias -t json --nolog -o noticias.json
"""
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
......@@ -25,38 +23,45 @@ class NoticiasItem(scrapy.Item):
class QuotesSpider(scrapy.Spider):
name = 'noticias'
def start_requests(self):
self.section = ''
self.baseURL = 'http://jornadabc.mx'
section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo',
'cultura', 'espectaculos', 'deportes']
for section in section_list:
self.section = section
yield scrapy.Request(url=self.baseURL+'/seccion/'+section, callback=self.parse)
def parse(self, response):
pagination = response.xpath('//ul[@class="pager"]/li[@class="pager-last odd last"]/a/@href').extract()
if ( len(pagination) > 0 ):
section = response.url[response.url.rfind('/')+1:]
pagination = pagination[0]
pages = int(pagination[pagination.rfind('=')+1:])
for page in range(0, pages):
if ( page == 0 ):
yield scrapy.Request(url=response.url, callback=self.parse_page, dont_filter=True)
else:
yield scrapy.Request(url=self.baseURL+'/seccion/'+self.section+'?page='+str(page), callback=self.parse_page, dont_filter=True)
yield scrapy.Request(url=self.baseURL+'/seccion/'+section+'?page='+str(page), callback=self.parse_page, dont_filter=True)
else:
yield scrapy.Request(url=response.url, callback=self.parse_page, dont_filter=True)
def parse_page(self, response):
if ( self.section == 'espectaculos' or self.section == 'deportes' ):
path = '//*[@class="region region-soft-first"]'
section = response.url[response.url.rfind('/')+1:]
section = section[:section.rfind('?')]
if ( section == 'espectaculos' or section == 'deportes' ):
link_list = response.xpath('//*[@class="region region-soft-first"]').css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
else:
path = '//*[@class="region region-hard-first"]'
link_list = response.xpath('//*[@class="region region-hard-first"]').css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
if len(link_list) <= 0:
link_list = response.xpath('//*[@class="view-content"]').css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
for link in response.xpath(path).css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract():
for link in link_list:
yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item)
......
......@@ -24,6 +24,7 @@ class NoticiasItem(scrapy.Item):
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
# self.found = False
# self.flag = False
......@@ -68,6 +69,7 @@ class QuotesSpider(scrapy.Spider):
else:
yield scrapy.Request(url=self.baseURL+'/notas?opinion', callback=self.parse_page)
def parse_2(self, response): # para las primeras noticias
path_list = ['//h1[@class="title"]/a/@href', '//h2[@class="title"]/a/@href']
link_list = []
......@@ -106,6 +108,7 @@ class QuotesSpider(scrapy.Spider):
self.flag = True
break
def parse_item_2(self, response): # para las primeras noticias
item = response.meta['item']
# item = NoticiasItem()
......
......@@ -23,6 +23,7 @@ class NoticiasItem(scrapy.Item):
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
self.baseURL = 'https://www.lajornadamaya.mx'
# section_list = [('yucatan',123,'Yucatan'), ('quintana-roo',52,'Quintana Roo'),
......
import scrapy
from datetime import date
#scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
import re
......@@ -26,6 +26,7 @@ class QuotesSpider(scrapy.Spider):
year = getattr(self, 'year', None)
month = getattr(self, 'month', None)
day = getattr(self, 'day', None)
self.date = date(int(year), int(month), int(day))
self.baseURL='http://diariodelyaqui.mx/'+year+'/'+month+'/'+day
urls = [
self.baseURL,
......@@ -56,7 +57,7 @@ class QuotesSpider(scrapy.Spider):
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = response.xpath('//meta[@property="article:published_time"]/@content').extract_first()
item['date'] = self.date
item['title'] = response.xpath('//h1[@class="entry-title"]/text()').extract_first()
item['topic'] = response.xpath('//ul[@class="post-categories"]/li/a/text()').extract()
for paragraph in response.xpath('//div[@class="clearfix"]/p').extract():
......
import scrapy
from datetime import datetime, date, timedelta
"""
USAGE:
scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=2 -a day=21
"""
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = 'noticias'
def start_requests(self):
self.section = ''
self.page = 0
self.count = 0
self.found = False
self.not_found = False
self.next_section = False
self.year = getattr(self, 'year', None)
self.month = getattr(self, 'month', None)
self.day = getattr(self, 'day', None)
self.date = date(int(self.year), int(self.month), int(self.day))
self.data_list = []
self.baseURL = 'http://jornadabc.mx'
# section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo', 'cultura']
section_list = ['baja-california', 'chihuahua', 'mexico', 'mundo', 'cultura', 'espectaculos', 'deportes']
for section in section_list:
self.section = section
# print self.section
self.page = 0
self.count = 0
self.found = False
self.not_found = False
self.next_section = False
self.data_list = []
page = -1
while True:
if ( self.found ):
self.found = False
break
if ( self.not_found ):
self.not_found = False
self.next_section = True
break
page += 1
# print 'page '+str(page)
yield scrapy.Request(url=self.baseURL+'/seccion/'+section+'?page='+str(page), callback=self.parse, dont_filter=True)
# if not ( self.next_section ):
# self.next_section = False
# self.page -= self.count
# if ( self.page > 0 ):
# self.page -= 1
# for pag in range(self.page, self.page+25):
# yield scrapy.Request(url=self.baseURL+'/seccion/'+section+'?page='+str(pag), callback=self.parse_page, dont_filter=True)
if ( len(self.data_list) > 0 ):
for link in self.data_list:
# link_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
# if ( link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:] == self.day.zfill(2)+'-'+self.month.zfill(2)+'-'+self.year ):
yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item, dont_filter=True)
def parse(self, response):
# print response.url
count = 0
this_page = int(response.url[response.url.rfind('=')+1:])
if ( self.section == 'espectaculos' or self.section == 'deportes' ):
path = '//*[@class="region region-soft-first"]'
else:
path = '//*[@class="region region-hard-first"]'
link_list = response.xpath(path).css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
if ( len(link_list) > 0 ):
if ( this_page > 0 ):
del link_list[0]
# print link_list
for link in link_list:
link_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
# print self.section+' '+link_date.isoformat()+' ? '+self.date.isoformat()
if ( link_date.month == self.date.month and link_date.year == self.date.year ):
# self.page = int(response.url[response.url.rfind('=')+1:])
# self.count += 1
# self.found = True
# print '************founddddd********'
# break
self.data_list.append(link)
if ( this_page >= 300 ):
self.found = True
break
if ( this_page > 300 and len(link_list) == 0 ):
self.not_found = True
# if ( link_date < stop_date ):
# # count += 1
# # print count
# # if ( count > 3 ):
# self.not_found = True
# break
def parse_page(self, response):
if ( self.section == 'espectaculos' or self.section == 'deportes' ):
path = '//*[@class="region region-soft-first"]'
else:
path = '//*[@class="region region-hard-first"]'
link_list = response.xpath(path).css('div.views-row').xpath('./*[@class="views-field views-field-title"]/span[@class="field-content"]/a/@href').extract()
if ( len(link_list) > 0 ):
for link in link_list:
link_date = datetime.strptime(link[:link.rfind('/')][link[:link.rfind('/')].rfind('/')+1:], '%d-%m-%Y').date()
if ( link_date == self.date ):
yield scrapy.Request(url=self.baseURL+link, callback=self.parse_item)
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = response.xpath('//meta[@property="article:published_time"]/@content').extract_first()
title = response.xpath('//*[@class="block-content"]/h1/a/text()').extract()
if ( len(title) > 0 ):
item['title'] = title[0]
else:
item['title'] = response.xpath('//*[@class="block-content"]/h1/text()').extract_first()
item['topic'] = response.xpath('//span[@class="section"]/text()').extract_first()
for paragraph in response.xpath('//*[@class="field-item even"]/p/text()').extract():
text += paragraph
item['text'] = text
item['url'] = response.url
# print item['title']
yield item
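The date filter in the La Jornada BC spiders extracts a 'dd-mm-yyyy' segment from each article link with two nested rfind slices before comparing it against the requested date. A standalone sketch of that extraction, on an illustrative link:
```python
# Illustrative: how parse_link recovers the dd-mm-yyyy segment from an article URL.
from datetime import datetime

link = "/deportes/22-03-2017/titulo-de-la-nota"  # made-up link in the site's URL format
segment = link[:link.rfind('/')]                 # "/deportes/22-03-2017"
segment = segment[segment.rfind('/') + 1:]       # "22-03-2017"
news_date = datetime.strptime(segment, "%d-%m-%Y").date()
print(news_date)  # 2017-03-22
```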
import scrapy
# scrapy crawl noticias -t json --nolog -o noticias.json -a year=2017 -a month=3 -a day=22
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
return TAG_RE.sub('', text)
class NoticiasItem(scrapy.Item):
title = scrapy.Field()
text = scrapy.Field()
date = scrapy.Field()
location = scrapy.Field()
author = scrapy.Field()
topic = scrapy.Field()
url = scrapy.Field()
class QuotesSpider(scrapy.Spider):
name = "noticias"
def start_requests(self):
self.link_list = []
self.year = getattr(self, 'year', None)
self.month = getattr(self, 'month', None)
self.day = getattr(self, 'day', None)
self.baseURL='http://www.lajornadaguerrero.com.mx/'+self.year+'/'+self.month.zfill(2)+'/'+self.day.zfill(2)
urls = [
self.baseURL,
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
paths = ['//div[@class="viewlet"]/h1/a/@href', '//div[@class="viewlet"]/h2/a/@href',
'//div[@class="viewlet"]/h3/a/@href', '//div[@class="viewlet image"]/h1/a/@href',
'//div[@class="viewlet image"]/h2/a/@href', '//div[@class="viewlet image"]/h3/a/@href',
'//div[@class="text_block_200"]/p/a/@href']
for path in paths:
links = response.xpath(path).extract()
if ( len(links) > 0 ):
for link in links:
if not ( link in self.link_list ):
self.link_list.append(link)
yield scrapy.Request(url=self.baseURL+'/'+link, callback=self.parse_item)
for link in response.xpath('//*[@class="text_block_200"]/h2/a/@href').extract():
yield scrapy.Request(url=self.baseURL+'/'+link, callback=self.parse_page)
def parse_page(self, response):
paths = ['//*[@id="article_list"]/h2/a/@href',
'//*[@id="article_list"]/h3/a/@href']
for path in paths:
for link in response.xpath(path).extract():
if not ( link in self.link_list ):
yield scrapy.Request(url=self.baseURL+'/'+link, callback=self.parse_item)
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = self.year+'-'+self.month.zfill(2)+'-'+self.day.zfill(2)
item['title'] = response.xpath('//*[@class="documentContent"]/h1/text()').extract_first()
item['topic'] = response.xpath('//*[@id="portal-breadcrumbs"]/a[2]/text()').extract()
for paragraph in response.xpath('//*[@class="documentContent"]/p/text()').extract():
text += paragraph
item['text'] = text
item['url'] = response.url
# print item['title']
yield item
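La Jornada Guerrero builds its day URL by zero-padding the month and day arguments with zfill. A tiny sketch with illustrative argument values:
```python
# Illustrative: how start_requests assembles baseURL from the year/month/day arguments.
year, month, day = "2017", "4", "25"  # made-up command-line arguments
baseURL = ("http://www.lajornadaguerrero.com.mx/"
           + year + "/" + month.zfill(2) + "/" + day.zfill(2))
print(baseURL)  # http://www.lajornadaguerrero.com.mx/2017/04/25
```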
......@@ -106,6 +106,8 @@ class QuotesSpider(scrapy.Spider):
def parse_item(self, response):
item = NoticiasItem()
text = ''
item['date'] = response.xpath('//meta[@property="article:published_time"]/@content').extract_first()
if item['date'] is None or item['date'] == '':
item['date'] = response.xpath('//meta[@property="DC.date.issued"]/@content').extract_first()
item['title'] = response.xpath('//*[@class="page_title_inner"]/h1/text()').extract_first()
item['topic'] = response.xpath('//*[@class="post_info_cat"]/a/text()').extract_first()
......