Want to download? Build your own Linux "network ant"



August 04, 2002 23:53:28

LinuxAid

Have you ever had to download a file so huge that your web browser had to stay open for hours, or even days? What if a page has links to 40 files you need — are you really willing to click every one of them by hand until you are exhausted? And what if the browser crashes before the job is done? Linux already has a range of command-line tools for exactly this situation, so you don't need a browser at all. They support resuming interrupted downloads, mirroring, scheduled downloads and more — everything the Windows "network ant" download tools offer :). Cool? Then follow me!

The interactive way

A web browser is meant for interactive use of the web: click on a link and hope the result appears within a few seconds. But even over a fast line, downloading a large batch of files can take a very long time — the ISO images used for GNU/Linux CD-ROM distributions are a typical example. Some web browsers, especially the more simply coded ones, do not behave well over such long periods; they may leak memory or crash partway through. Although some browsers have been combined with file managers, they still cannot handle multiple downloads or bundled transfers (grouping several files together so they are easier to move). You also have to stay logged in until the whole file has finished. Finally, you have to be at the office to click the link that starts the download — and your colleagues, who share the same bandwidth, will not be pleased.

Downloading large files is a job better suited to a different set of tools. This article shows how to combine various GNU/Linux applications — lynx, wget, at, crontab and others — to solve a variety of file-transfer problems. A few simple scripts are used along the way, so a little knowledge of the Bash shell will help.

Using wget

All major Linux distributions include the wget download tool. Basic use is very simple:

bash$ wget http://place.your.url/here
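If you are not sure whether wget is present on your system at all, asking it for its version number is a quick check; the exact output depends on your distribution:

bash$ wget --version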

It can also handle FTP, timestamps, and recursive mirroring of an entire web site's directory tree — and if you are not careful, it will happily mirror the whole site plus every other site it links to:

bash$ wget -m http://target.web.site/subdirectory

Because of the potentially high load this places on a server, wget obeys the "robots.txt" convention when mirroring. Several command-line options control exactly what gets downloaded and which links are followed. For example, to follow only relative links and skip GIF images:

bash$ wget -m -L --reject=gif http://target.web.site/subdirectory

And of course wget supports resuming. Given an incomplete file to append the remaining data to, wget can resume an interrupted download with the "-c" option — provided the server supports it:

bash$ wget -c http://the.url.of/incomplete/file

Resuming can be combined with mirroring, so a very large file can be downloaded over several separate sessions. How to automate this is described later.

If, like me, you find your downloads interrupted all the time, you can tell wget to retry several times:

bash$ wget -t 5 http://place.your.url/here

This gives up after five attempts. You can also use "-t inf" to keep trying forever, never giving up until the file arrives.
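For example, with the same placeholder URL as above:

bash$ wget -t inf http://place.your.url/here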

So how do you download through a firewall proxy? Set the http_proxy environment variable, or the proxy settings in the .wgetrc configuration file (a sketch of both follows below), and wget will download through the proxy. Resuming through a proxy can fail, however. Because the proxy caches files, it may hold only an incomplete copy of the one you want; when you run "wget -c" to fetch the remainder, the proxy checks its cache and wrongly reports that you already have the whole file. You can trick the proxy into bypassing its cache by adding a special header to the download request:

bash$ wget -c --header="Pragma: no-cache" http://place.your.url/here
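As for pointing wget at the proxy in the first place, either of these will do; the proxy address here is only a placeholder for your own:

bash$ export http_proxy=http://proxy.example.com:8080/

or, permanently, in ~/.wgetrc:

http_proxy = http://proxy.example.com:8080/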

"--Header" options can join the number of any header information or the agreed string so we can modify the performance of the web server and proxy. Some sites refuse to provide services to files from the source link, only the file content can be transferred to the browser when derived from other pages that have been agreed. You can send files by adding a "Referr:" message:

Bash $ wget --header = "Referer: http://coming.from.this/page" http://surfing.to.this/page

Some particularly unfriendly, non-public web sites will only send content to certain types of browser. Work around this with a "User-Agent:" header:

Bash $ wget --header = "User-agent: mozilla / 4.0 (compatible; msie 5.0; windows nt; Digext) http: //msie.only.url/here

(Note: the tricks above should only be used where the site's content-licensing terms allow it. Using them to defeat such mechanisms may break the rules, or even the law.)

Scheduling downloads

If you want to download a large file at the office while sharing the line with colleagues, imagine how annoyed they will be when a connection that once flowed like a clear stream suddenly slows to a crawl. You should consider shifting the transfer to off-peak hours. There is no need to stay late until everyone else has left, nor to log in again from home after dinner. Simply schedule the job with the at command:

bash$ at 23:00
warning: commands will be executed using /bin/sh
at> wget http://place.your.url/here
at> press Ctrl-D

We have scheduled the download to start at 23:00. All that remains is to make sure the atd scheduling daemon is running.
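One quick, if crude, way to check that atd is running — the exact service or init-script name can vary between distributions:

bash$ ps ax | grep atd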

When downloads take several days

When you are downloading one or more files containing a great deal of data, and your machine's bandwidth makes the transfer about as fast as a carrier pigeon, you will often arrive at work the next morning to find the scheduled download unfinished. You kill the job and submit another at job, this time using "wget -c", and repeat this every day. It is better to automate the routine with crontab. Create a plain-text file called "crontab.txt" with the following contents:

0 23 * * 1-5 wget -c -N http://place.your.url/here
0 6 * * 1-5 killall wget

This is a crontab file, which specifies which jobs to run and when. The first five columns of each line say when to run the command; the rest of the line says what to run. The first two columns give the time of day: start wget at 23:00, and killall wget at 06:00. The * in the third and fourth columns means the job runs on every day of every month. The fifth column selects the day of the week — "1-5" means Monday through Friday. So the download starts at 23:00 every working day and is forcibly stopped at 06:00 the next morning. To activate this crontab schedule, type:

bash$ crontab crontab.txt

"-N" parameter will check the timetric timmark of the target file, when it finds a matching timestamp, it will terminate the download because it indicates that the entire file has been transmitted. "Crontab -r" can cancel this schedule. I use this way to download many ISO files over the dial-up.

Downloading dynamic web pages

Some web pages are generated dynamically on demand and may change frequently. Since the target is technically not a file, it has no meaningful file length, and resuming with "-c" makes no sense — the option simply cannot work. For example, the PHP-generated "big page" at Linux Weekly News:

bash$ wget http://lwn.net/bigpage.php3

If the download is interrupted and you try to continue it, it has to start again from the beginning. The net connection at my office is sometimes so poor that I wrote a simple script to detect when a download of the HTML page has been cut short:

#!/bin/bash

# create it if absent
touch bigpage.php3

# check if we got the whole thing
while ! grep -qi '</html>' bigpage.php3
do
  rm -f bigpage.php3

  # download LWN in one big page
  wget http://lwn.net/bigpage.php3
done

This Bash script keeps re-downloading the document until the string "</html>", which marks the end of a complete page, is found in it.

SSL and cookies

Remote files served over SSL (Secure Sockets Layer) have URLs beginning with "https://". For these you will need another download tool called curl, which turns out to be very handy in several situations.
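As a minimal sketch, fetching a page over HTTPS with curl looks like this — the URL is made up, and this assumes your curl build has SSL support compiled in:

bash$ curl -o secure-page.html https://secure.example.site/secure-page.html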

Some web sites insist on feeding the browser a pile of cookies before they will serve the content you want. You have to add a "Cookie:" header with the correct value, which can be extracted from your browser's cookie file:

bash$ cookie=$(grep nytimes ~/.lynx_cookies | awk '{printf("%s=%s;", $6, $7)}')

This builds the cookie needed to download material from http://www.nytimes.com/ — assuming, of course, that you have already registered with the site using that browser (Lynx). w3m uses a slightly different cookie file format:

bash$ cookie=$(grep nytimes ~/.w3m/cookie | awk '{printf("%s=%s;", $2, $3)}')

The download itself can then be done with wget:

bash$ wget --header="Cookie: $cookie" http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html

Or with the curl tool:

bash$ curl -v -b $cookie -o supercomp.html http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html

Lists of URLs

So far we have downloaded single files or mirrored entire web site directory trees. Sometimes, though, we want to download just a handful of files whose URLs appear on a web page, without storing the whole site. A simple example: out of the 100 music files listed on a site, we only want the top 20. Here the "--accept" and "--reject" options are no help, since they operate only on file extensions. Instead, we use "lynx -dump".

bash$ lynx -dump ftp://ftp.ssc.com/pub/lg/ | grep 'gz$' | tail -10 | awk '{print $2}' > urllist.txt

The output from lynx can be filtered with the usual GNU text-processing tools. In this example we pull out the URLs ending in "gz" and keep only the last ten of them in a file. A small Bash loop then downloads every URL listed in the file:

bash$ for x in $(cat urllist.txt)
> do
>   wget $x
> done

We have now successfully downloaded the last ten issues of the Linux Gazette.
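As an aside, wget can also read a list of URLs directly from a file with its "-i" option, so the same batch could have been fetched without the loop:

bash$ wget -i urllist.txt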

How to deal with bandwidth issues

If bandwidth is not your problem, and your downloads are slow because of bottlenecks at the web server's end, this trick can help keep the data flowing. It requires curl and several mirror sites that each hold a copy of the target file. For example, suppose you want to download the Mandrake 8.0 ISO from the following three sites:

url1=http://ftp.eecs.umich.edu/pub/linux/mandrake/iso/mandrake80-inst.iso
url2=http://ftp.rpmfind.net/linux/mandrake/iso/mandrake80-inst.iso
url3=http://ftp.wayne.edu/linux/mandrake/iso/mandrake80-inst.iso

The file is 677,281,792 bytes long, so we use curl's "--range" option to start three simultaneous downloads:

bash$ curl -r 0-199999999 -o mdk-iso.part1 $url1 &
bash$ curl -r 200000000-399999999 -o mdk-iso.part2 $url2 &
bash$ curl -r 400000000- -o mdk-iso.part3 $url3 &

This starts three background download processes, each transferring a different part of the ISO image from a different server. The "-r" option specifies the byte range to fetch from the target file. When they finish, the three parts are simply concatenated — cat mdk-iso.part? > mdk-80.iso. (Checking the MD5 hash before burning to CD-R is strongly recommended.) Running each curl with the "--verbose" option in its own terminal window lets you follow the progress of every transfer.
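Spelled out, the reassembly and checksum steps might look like this; the expected MD5 value to compare against would come from the mirror site itself:

bash$ cat mdk-iso.part? > mdk-80.iso
bash$ md5sum mdk-80.iso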

Summary

Don't be afraid to fetch remote files non-interactively. Web designers may try to force us to browse their sites interactively, but there are plenty of free tools to automate the job — and they save us a great deal of trouble.

