The crawler for the 'Hemeroteca' section of the 'Proceso' site required the following (in addition to scrapy):
-docker
-splash
-scrapy-splash

=================================================
DOCKER installation, Ubuntu 16.04
=================================================
	$ sudo apt-get update
	$ sudo apt-get upgrade
	
	$ sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D

	$ sudo apt-add-repository 'deb https://apt.dockerproject.org/repo ubuntu-xenial main'
	$ sudo apt-get update
	$ sudo apt-get install docker-engine


> Start/stop Docker:
	$ sudo service docker start
	$ sudo service docker stop

> Version:
	$ docker --version


Reference: https://thishosting.rocks/install-docker-on-ubuntu/





=================================================
SPLASH installation
=================================================
	$ sudo docker pull scrapinghub/splash


> Start the container:
	$ sudo docker run -p 8050:8050 scrapinghub/splash     ## this makes Splash available over HTTP on port 8050;
	                                                      ## open localhost:8050 in a browser to check it
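
Once the container is up, Splash can also be queried directly over HTTP. The following is a minimal sketch, not part of the original setup, that builds a request URL for Splash's `render.html` endpoint using only the Python standard library. The helper names `splash_render_url` and `fetch_rendered_html`, the default `wait` of 0.5 seconds, and the target `http://example.com` are illustrative assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(target_url, wait=0.5, splash_url='http://localhost:8050/'):
    """Build the URL that asks Splash to render target_url (hypothetical helper)."""
    # 'url' and 'wait' are arguments of Splash's render.html endpoint.
    query = urlencode({'url': target_url, 'wait': wait})
    return splash_url.rstrip('/') + '/render.html?' + query


def fetch_rendered_html(target_url, timeout=30):
    """Return the JavaScript-rendered HTML; requires the Splash container to be running."""
    with urlopen(splash_render_url(target_url), timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


# Building the URL needs no running container; example.com is a placeholder target.
print(splash_render_url('http://example.com'))
```

Calling `fetch_rendered_html(...)` is a quick way to confirm that the container answers on port 8050 before wiring Splash into a Scrapy project.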


Reference: http://splash.readthedocs.io/en/latest/install.html





=================================================
scrapy-splash installation
=================================================
	$ pip install scrapy-splash


> Settings in settings.py:

	* To use scrapy-splash in a project, first enable its downloader middlewares:

		DOWNLOADER_MIDDLEWARES = {
			'scrapy_splash.SplashCookiesMiddleware': 723,
			'scrapy_splash.SplashMiddleware': 725,
			'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
		}


	* Add the SPLASH_URL:

		SPLASH_URL = 'http://localhost:8050/'


	* Enable SplashDeduplicateArgsMiddleware, which saves disk space by not storing duplicate Splash arguments multiple times in the request queue:

		SPIDER_MIDDLEWARES = {
			'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
		}


	* Configure DUPEFILTER_CLASS and the cache storage (the latter matters when Scrapy's HTTP cache is in use):

		DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
		HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
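
The priority numbers in DOWNLOADER_MIDDLEWARES decide where the Splash middlewares sit relative to Scrapy's built-in ones. The sketch below illustrates the resulting order. It assumes that Scrapy's defaults for the three built-in middlewares shown are 590, 700 and 750 (these may differ between Scrapy versions) and that project settings simply override the defaults before sorting by priority. `effective_order` is a hypothetical helper, not a Scrapy function.

```python
# Subset of Scrapy's default downloader middleware priorities.
# ASSUMPTION: these values match the installed Scrapy version.
DEFAULTS = {
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

# The values from settings.py above.
PROJECT = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


def effective_order(defaults, overrides):
    """Merge overrides into defaults and return names sorted by priority (hypothetical helper)."""
    merged = dict(defaults)
    merged.update(overrides)
    return [name for name, _ in sorted(merged.items(), key=lambda item: item[1])]


ORDER = effective_order(DEFAULTS, PROJECT)
for name in ORDER:
    print(name)
```

Under these assumptions the two Splash middlewares land between the cookies and proxy middlewares, and HttpCompressionMiddleware moves from before them to after them.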


References: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/
					https://github.com/scrapy-plugins/scrapy-splash