Class 7 WGET Assignment Explained

After visiting the website of the Perseus Collection of the Richmond Times Dispatch we found out that the URLs of the XML versions of the individual issues are of the following format:

http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3atext%3a2006.05.0001

After checking out the source of the main page, we found out that links to the individual issues are written in this format:

<a href=”text?doc=Perseus%3atext%3a2006.05.0001” class=”aResultsHeader”>The Daily Dispatch: November 1, 1860., [Electronic resource]</a>.

Therefore, I opened the source HTML file in Sublime and made a Find All query using the following RegEx to sort out the relevant part of the links:

text\?doc.+(\.\d\d\d\d)

By this I received the following format: text?doc=Perseus%3atext%3a2006.05.0001, pasting all of it into a new text file.

After this I made the following Replace All query using RegExes:

Find All: ^text Replace All: http://www.perseus.tufts.edu/hopper/dltext

I saved it in .txt format and used the file in the WGET scheme wget -i links.txt -P ./folderYouWantToSaveTo/ -nc.