
PHP scraping returns incomplete page data (how to fix incomplete page data when scraping with PHP)

Date: 2023-12-24 12:05:25  Views: 320263  Author: BSXS


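PHP fetches only part of a web page

cURL can fetch this page. Your connection is probably slow enough that the request times out before the whole response arrives, which leaves the content cut short. Try raising the timeout with curl_setopt($ch, CURLOPT_TIMEOUT, 360).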
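The timeout advice above can be sketched as follows. This is a minimal illustration, not the original poster's code: the function names `build_timeout_options` and `fetch_checked` and the default values are assumptions. It sets a connect timeout and a total timeout separately, and reports whether a failure was actually a timeout, so a truncated body is not mistaken for a complete one.

```php
<?php
// Hypothetical helper: cURL options with separate connect and total timeouts.
function build_timeout_options(string $url, int $connectTimeout = 10, int $totalTimeout = 360): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds allowed for the whole transfer
    ];
}

// Hypothetical helper: run the request and say whether it timed out.
function fetch_checked(string $url, int $connectTimeout = 10, int $totalTimeout = 360): array
{
    $ch = curl_init();
    curl_setopt_array($ch, build_timeout_options($url, $connectTimeout, $totalTimeout));

    $body  = curl_exec($ch);
    $errno = curl_errno($ch);
    $error = curl_error($ch);

    return [
        'body'      => $body === false ? null : $body,
        'timed_out' => $errno === CURLE_OPERATION_TIMEDOUT,
        'error'     => $errno === 0 ? null : $error,
    ];
}
```

If `timed_out` is true, the body (if any) should be treated as incomplete and the request retried or the timeout raised.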

PHP does not retrieve the full page content; is there a way to get all of it?

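cURL retrieves the markup exactly as the server returns it, that is, the raw response.

cURL cannot parse or execute JavaScript, so what it gives you is always that raw source.

A browser, on receiving the same response, runs the page's JavaScript as the page loads.

That JavaScript dynamically adds, modifies, or removes nodes in the DOM.

What you see under "Inspect Element" is the HTML after JavaScript has processed it, not the original response.

So cURL on its own cannot capture the page "as you see it".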
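One way to confirm that missing content is produced by JavaScript is to check whether the node exists in the raw response at all. The sketch below uses PHP's DOM extension on a made-up HTML string; the function name `raw_html_has_element` and the sample markup are illustrative assumptions.

```php
<?php
// Hypothetical helper: does the raw (unexecuted) HTML contain an element with this id?
function raw_html_has_element(string $html, string $id): bool
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    return $xpath->query(sprintf('//*[@id="%s"]', $id))->length > 0;
}

// Sample page: the "price" node is created by a script, so it exists only in a browser.
$raw = '<html><body><div id="app"></div>'
     . '<script>var p = document.createElement("p"); p.id = "price"; p.textContent = "42";'
     . 'document.getElementById("app").appendChild(p);</script>'
     . '</body></html>';

var_dump(raw_html_has_element($raw, 'app'));   // bool(true): present in the raw markup
var_dump(raw_html_has_element($raw, 'price')); // bool(false): only added by JavaScript
```

When the check returns false for a node the browser shows, the data usually has to come from whatever request the page's script makes, or from a tool that executes JavaScript.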

Why does cURL return incomplete data in PHP?

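The cause is the Expect: 100-continue handshake. When a POST body exceeds a size threshold (1024 bytes in the libcurl versions this answer describes), libcurl adds an Expect: 100-continue header and waits for the server to reply 100 Continue before sending the body. A server that does not handle that handshake properly can leave the transfer stalled or cut short. Sending an empty Expect header switches the behavior off, so add this option to your code:

curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'));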
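As a small sketch of where that option sits in a POST request, the helper below assembles the cURL options in one array. The function name `build_post_options` and the sample field names are assumptions made for illustration.

```php
<?php
// Hypothetical helper: options for a form POST with "Expect: 100-continue" suppressed.
function build_post_options(string $url, array $fields): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        // Encode the fields as application/x-www-form-urlencoded.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        // An empty "Expect:" header stops libcurl from sending "Expect: 100-continue".
        CURLOPT_HTTPHEADER     => ['Expect:'],
    ];
}

// Usage sketch (no request is sent until curl_exec is called):
// $ch = curl_init();
// curl_setopt_array($ch, build_post_options('https://example.com/form', ['q' => 'test']));
// $body = curl_exec($ch);
```

Note that CURLOPT_HTTPHEADER replaces the whole custom header list, so any other custom headers need to go into the same array.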

Here is a more complete wrapper function:

// Fetch $url with cURL and return the response body (false on failure).
// $status is passed by reference and receives the HTTP status code.
function req_curl($url, &$status = null, $options = array())
{
    $res = '';
    // Defaults; any key can be overridden through $options.
    $options = array_merge(array(
        'follow_local'    => true,
        'timeout'         => 30,
        'max_redirects'   => 4,
        'binary_transfer' => false,  // not used below
        'include_header'  => false,
        'no_body'         => false,
        'cookie_location' => dirname(__FILE__) . '/cookie',
        'useragent'       => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'post'            => array(),
        'referer'         => null,
        'ssl_verifypeer'  => 0,      // 0 disables certificate verification
        'ssl_verifyhost'  => 0,      // 0 disables host name verification
        'headers'         => array(
            'Expect:'                // suppress "Expect: 100-continue"
        ),
        'auth_name'       => '',
        'auth_pass'       => '',
        'session'         => false
    ), $options);
    $options['url'] = $url;

    $s = curl_init();
    if (!$s) return false;

    curl_setopt($s, CURLOPT_URL, $options['url']);
    curl_setopt($s, CURLOPT_HTTPHEADER, $options['headers']);
    curl_setopt($s, CURLOPT_SSL_VERIFYPEER, $options['ssl_verifypeer']);
    curl_setopt($s, CURLOPT_SSL_VERIFYHOST, $options['ssl_verifyhost']);
    curl_setopt($s, CURLOPT_TIMEOUT, $options['timeout']);
    curl_setopt($s, CURLOPT_MAXREDIRS, $options['max_redirects']);
    curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($s, CURLOPT_FOLLOWLOCATION, $options['follow_local']);
    // Read and write cookies in the same file so they persist across calls.
    curl_setopt($s, CURLOPT_COOKIEJAR, $options['cookie_location']);
    curl_setopt($s, CURLOPT_COOKIEFILE, $options['cookie_location']);

    // HTTP authentication, only when a user name is supplied.
    if (!empty($options['auth_name']) && is_string($options['auth_name']))
    {
        curl_setopt($s, CURLOPT_USERPWD, $options['auth_name'] . ':' . $options['auth_pass']);
    }

    // Send a POST request when form fields are supplied.
    if (!empty($options['post']))
    {
        curl_setopt($s, CURLOPT_POST, true);
        curl_setopt($s, CURLOPT_POSTFIELDS, $options['post']);
        //curl_setopt($s, CURLOPT_POSTFIELDS, array('username' => 'aeon', 'password' => '111111'));
    }

    if ($options['include_header'])
    {
        curl_setopt($s, CURLOPT_HEADER, true);
    }

    if ($options['no_body'])
    {
        curl_setopt($s, CURLOPT_NOBODY, true);
    }

    // 'session' carries a cookie string such as "PHPSESSID=...".
    if ($options['session'])
    {
        curl_setopt($s, CURLOPT_COOKIESESSION, true);
        curl_setopt($s, CURLOPT_COOKIE, $options['session']);
    }

    curl_setopt($s, CURLOPT_USERAGENT, $options['useragent']);
    if ($options['referer'] !== null)
    {
        curl_setopt($s, CURLOPT_REFERER, $options['referer']);
    }

    $res = curl_exec($s);
    $status = curl_getinfo($s, CURLINFO_HTTP_CODE);
    curl_close($s);

    return $res;
}
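The wrapper above returns whatever curl_exec yields and does not check whether the transfer finished. A possible complement, sketched here under assumptions, compares the number of bytes received with the Content-Length the server announced. The function names `transfer_looks_complete` and `fetch_and_verify` are hypothetical; the array keys are the ones curl_getinfo returns when called without an option argument.

```php
<?php
// Hypothetical helper: judge completeness from the cURL error code and transfer info.
function transfer_looks_complete(int $errno, array $info): bool
{
    if ($errno !== 0) {
        return false; // any cURL error, e.g. a timeout or CURLE_PARTIAL_FILE
    }
    $expected = (float) ($info['download_content_length'] ?? -1);
    $received = (float) ($info['size_download'] ?? 0);
    if ($expected < 0) {
        return true;  // no Content-Length announced (e.g. chunked), nothing to compare
    }
    return $received >= $expected;
}

// Hypothetical helper: return the body only when the transfer looks complete.
function fetch_and_verify(string $url): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $ok = transfer_looks_complete(curl_errno($ch), curl_getinfo($ch));
    return ($ok && is_string($body)) ? $body : null;
}
```

Returning null instead of a partial body makes the failure visible to the caller, which can then retry or log the cURL error.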

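Why does cURL return incomplete data in PHP when file_get_contents returns all of it?

The cause is the Expect: 100-continue handshake. When a POST body exceeds a size threshold (1024 bytes in the libcurl versions this answer describes), libcurl adds an Expect: 100-continue header and waits for the server to reply 100 Continue before sending the body. A server that does not handle that handshake properly can leave the transfer stalled or cut short. Sending an empty Expect header switches the behavior off, so add this option to your code:

curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'));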
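For comparison, the file_get_contents route mentioned in the question goes through PHP's HTTP stream wrapper rather than libcurl. The sketch below shows one way to give it a timeout and a user agent through a stream context. The function names and the user agent string are assumptions, and the approach requires allow_url_fopen to be enabled.

```php
<?php
// Hypothetical helper: a stream context with a timeout and a user agent.
function build_http_context(int $timeoutSeconds = 30, string $userAgent = 'example-fetcher/1.0')
{
    return stream_context_create([
        'http' => [
            'method'        => 'GET',
            'timeout'       => $timeoutSeconds, // read timeout in seconds
            'user_agent'    => $userAgent,
            'ignore_errors' => true,            // still return the body on 4xx/5xx responses
        ],
    ]);
}

// Hypothetical helper: fetch a URL through the stream wrapper, null on failure.
function fetch_with_streams(string $url, int $timeoutSeconds = 30): ?string
{
    $body = @file_get_contents($url, false, build_http_context($timeoutSeconds));
    return $body === false ? null : $body;
}
```

This covers simple fetches only; cookies, redirect limits, and authentication are easier to control with cURL.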

