PHP 抓網頁內容－風影淚的部落格

PHP抓網頁內容

1.file_get_contents

<?
$url = http://www.xxx.com/;
$contents = file_get_contents($url);
//如果出現中文亂碼使用下面代碼
//$getcontent = iconv("gb2312", "utf-8",file_get_contents($url));
//echo $getcontent;
echo $contents;
?>

2.curl
我們必須先建立一個「curl」的連線，也因此，必須使用到「$ch = curl_init()」這個函式。而為了怕建立連線忘了關閉。因此，必須先寫好關閉的函式，「curl_close($ch)」。

接下來，你可以設定他截取網頁的選項，一般來說常用的有「CURLOPT_RETURNTRANSFER」、「CURLOPT_URL」、「CURLOPT_HEADER」、「CURLOPT_FOLLOWLOCATION」、「CURLOPT_USERAGENT」這幾個選項。而這幾個選項分別代表「將結果回傳成字串」、「設定截取網址」、「是否截取header的資訊」、「是否抓取轉址」及「瀏覽器的user agent」。最後，再執行「curl_exec($ch)」以取出結果就可以了。

<?
$url = "http://www.xxx.com/";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
//在需要用戶檢測的網頁裡需要增加下面兩行
//curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
//curl_setopt($ch, CURLOPT_USERPWD, US_NAME.":".US_PWD);
$contents = curl_exec($ch);
curl_close($ch);
echo $contents;
?>

以抓取yahoo為例，若我們要偽裝成google bot去抓取，那麼我們可以寫成下列的樣子：

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, "www.yahoo.com.tw");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_USERAGENT, "Google Bot");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$output = curl_exec($ch);
curl_close($ch);
echo $output;

也可以將選項們設定一個陣列，以增加設定時的閱讀度。這時就得動用「curl_setopt_array()」這個函式了：

$ch = curl_init();
$options = array(CURLOPT_URL => 'www.yahoo.com.tw',
                 CURLOPT_HEADER => false,
   CURLOPT_RETURNTRANSFER => true,
   CURLOPT_USERAGENT => "Google Bot",
   CURLOPT_FOLLOWLOCATION => true
           );
curl_setopt_array($ch, $options);
$output = curl_exec($ch);
curl_close($ch);
echo $output;

3.fopen->fread->fclose

<?
$handle = fopen ("http://www.xxx.com/", "rb");
$contents = "";
do {
   $data = fread($handle, 8192);
   if (strlen($data) == 0) {
   break;
   }
   $contents .= $data;
} while(true);
fclose ($handle);
echo $contents;
?>

風影淚

風影淚的部落格

風影淚發表在痞客邦留言(0) 人氣()

E-mail轉寄

風影淚的部落格

歡迎光臨風影淚在痞客邦的小天地

PHP 抓網頁內容

歷史上的今天

留言列表

文章分類

電玩相關 (1)

流行資訊 (1)

電腦相關 (8)

熱門文章

最新文章

站方公告

活動快報

網路脈...

我的好友

最新留言

動態訂閱

文章精選

文章搜尋

新聞交換(RSS)

誰來我家

參觀人氣

QR Code

POWERED BY