How to web scrape: get links and download files from those links
The project for this article is on my GitHub profile: https://github.com/rodrigomariamorgao/scrapy_and_download.
Objective and Structure
My objective: I needed to download several files, each hosted at its own URL. But first I had to collect those links, so I used web scraping on the site. For this article I used the Flickr site, just as an example.
This project uses Python 3 and the Selenium framework. Inside the execute folder there are two files: scrapy.py and download.py.
The file scrapy.py scrapes the site https://www.flickr.com as an example. The file download.py downloads the files hosted at the URLs in urls.csv, created in the previous step.
There are two other folders as well: chromedriver_linux64 and downloaded_files. The first holds the driver used by the scraping process, and the second receives the files downloaded afterwards.
For this project, I recommend creating a virtual environment, following the instructions in the README.MD file.
scrapy.py
In this file, we create the urls.csv file, which receives the URLs to download and is used in the next step.
Next, we create the Chrome instance with the headless argument, so it runs in the background.
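A minimal sketch of this setup, assuming the file name above and a standard Selenium headless configuration (the exact code in the repository may differ):

```python
import csv

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Create urls.csv with a header row; download.py reads this file later
with open("urls.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["URLS"])

# Start Chrome in headless mode so the scraping runs in the background
options = Options()
options.add_argument("--headless")
# The project likely points this at the driver kept in chromedriver_linux64
driver = webdriver.Chrome(options=options)
```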
We access the Flickr page defined in the ADDRESS constant and look for the divs with the classes view photo-list-photo-view requiredToShowOnServer awake, present on each photo.
Inside each of those divs there is a style attribute that contains the URL we want.
We need to do some string treatment to keep only the URL itself, then save it to the CSV file and, finally, quit the driver instance.
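Continuing the sketch above, this is roughly how that scraping step could look; the ADDRESS value and the exact string treatment are illustrative assumptions, not the project's literal code:

```python
from selenium.webdriver.common.by import By

ADDRESS = "https://www.flickr.com/search/?text=nature"  # example search page

driver.get(ADDRESS)

# Each photo is rendered as a div with these classes; the image URL sits in its style attribute
photo_divs = driver.find_elements(
    By.CSS_SELECTOR, "div.view.photo-list-photo-view.requiredToShowOnServer.awake"
)

with open("urls.csv", "a", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for div in photo_divs:
        style = div.get_attribute("style")
        # style looks like: '... background-image: url("//live.staticflickr.com/...jpg"); ...'
        start = style.find("url(") + len("url(")
        end = style.find(")", start)
        url = style[start:end].strip('"')
        if url.startswith("//"):  # scheme-relative URL, so prepend https:
            url = "https:" + url
        writer.writerow([url])

driver.quit()
```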
download.py
After acquiring the URLs, we need to download the files. Using urllib.request, we loop over each row of the CSV and call the urlretrieve method to download each photo.
I read the CSV rows as dicts to eliminate an if check, because the first line of the CSV contains only the header word URLS and the rest of the file contains our URLs.
To generate a unique name, we use Python's enumerate() and include its counter in each filename.
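A minimal sketch of download.py under these assumptions; the csv.DictReader approach, filename pattern, and .jpg extension are mine, not necessarily the project's exact choices:

```python
import csv
import urllib.request

with open("urls.csv", newline="") as csv_file:
    # DictReader consumes the URLS header row, so no `if` is needed to skip it
    reader = csv.DictReader(csv_file)
    for index, row in enumerate(reader):
        url = row["URLS"]
        # enumerate() provides the counter used to build a unique filename
        filename = f"downloaded_files/photo_{index}.jpg"
        urllib.request.urlretrieve(url, filename)
        print(f"saved {url} -> {filename}")
```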
Thanks to Vinicius Ferreira for the review and tips. Contact me on LinkedIn if you have any questions. See you later!