Scraping JavaScript-rendered web page content


Step 1: view the page source and find the URL of the AJAX request.

For example, suppose the JS code is:

$.ajax({
    url: 'ajax.php?id=100',
    data: {ad_num: num, ad_str: str, cart_update_time: cart_update_time},
    type: 'POST',
    dataType: 'text',
    async: false,
    success: function(data) {
    }
});

Here, ajax.php?id=100 is the URL of the AJAX request.
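When the page source is long, the `url` option can be located programmatically. The following is a minimal sketch, assuming the script passes the URL as a quoted string literal in a `url:` option, as in the snippet above. The class name `AjaxUrlFinder` and the regular expression are illustrative assumptions, and a regex like this is only a rough heuristic that will miss URLs built by string concatenation.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AjaxUrlFinder {
    // Matches url: '...' or url: "..." and captures the quoted value (heuristic, hypothetical helper).
    private static final Pattern URL_OPTION =
            Pattern.compile("url\\s*:\\s*['\"]([^'\"]+)['\"]");

    // Returns the first quoted url option found in the script source, if any.
    static Optional<String> findAjaxUrl(String source) {
        Matcher m = URL_OPTION.matcher(source);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String js = "$.ajax({\n    url: 'ajax.php?id=100',\n    type: 'POST'\n});";
        System.out.println(findAjaxUrl(js).orElse("(no url option found)"));
    }
}
```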

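Step 2: build the full URL by prepending the site's domain to the request path you found.

For example, the full URL is the site's domain followed by /ajax.php?id=100 (a relative path like this one resolves against the directory of the page that issues the request).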
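The joining step can be sketched with `java.net.URI.resolve`, which applies the same relative-reference rules a browser uses. The page URL `https://example.com/shop/cart.html` and the class name `AjaxUrlJoin` are hypothetical and stand in for the real site.

```java
import java.net.URI;

public class AjaxUrlJoin {
    // Resolves a path found in the page's script against the URL of the page that contains it.
    static String resolve(String pageUrl, String ajaxPath) {
        return URI.create(pageUrl).resolve(ajaxPath).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/shop/cart.html"; // hypothetical page URL
        // A relative path is resolved against the page's directory.
        System.out.println(resolve(page, "ajax.php?id=100"));
        // A path with a leading slash is resolved against the site root.
        System.out.println(resolve(page, "/ajax.php?id=100"));
    }
}
```

Resolving against the page URL rather than the bare domain avoids a wrong result when the page itself lives in a subdirectory.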

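Step 3: fetch the URL built in step 2 with PHP. Because the original request is a POST, send the same form fields (ad_num, ad_str, cart_update_time) along with it.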
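The text suggests PHP for this step. The same request can be sketched in Java with the standard `java.net.http` client (Java 11 or later). This is a sketch under assumptions: the class name `AjaxPost`, the placeholder field values, and the example URL are all hypothetical, and the body is form-encoded because that is jQuery's default for a POST with a `data` object.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class AjaxPost {
    // Encodes the fields as application/x-www-form-urlencoded.
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds (but does not send) a POST request that mimics the page's AJAX call.
    static HttpRequest buildRequest(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("ad_num", "1");            // placeholder values, not from the original
        fields.put("ad_str", "abc");
        fields.put("cart_update_time", "0");
        System.out.println(formEncode(fields));

        // Only send when a real URL (the one built in step 2) is passed as the first argument.
        if (args.length > 0) {
            HttpRequest request = buildRequest(args[0], fields);
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }
}
```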

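There are two common ways to scrape dynamic pages. The first is to reverse-engineer the JavaScript to find the data interface (the real request URL) behind the dynamic content. The second is to use the selenium library to drive a real browser and read the content after JavaScript has rendered it. selenium is more cumbersome to use and scrapes more slowly, so the first approach is used more often in day-to-day work.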

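Request the JSON-returning URL directly with net/http. Some data may require cookies, which you can copy from a browser session or obtain by simulating a login. When the content only appears after JavaScript runs, the page can instead be loaded in a headless browser; the code below does this with HtmlUnit in Java: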

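import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.List;

public class DynamicPageScraper {
    public static void main(String[] args) throws Exception {
        String url = ""; // target page URL (left blank in the original)

        // Uses the older HtmlUnit 2.x API, which still ships INTERNET_EXPLORER_8.
        // To go through a proxy: new WebClient(BrowserVersion.INTERNET_EXPLORER_8, "127.0.0.1", 28089)
        final WebClient client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
        final HtmlPage page = client.getPage(url);

        // Wait up to 120 seconds for background JavaScript (such as AJAX calls) to finish.
        client.waitForBackgroundJavaScript(120 * 1000);

        // Get a list of all divs.
        final List<?> divs = page.getByXPath("//div");

        // Get the div that has a 'name' attribute of 'John'.
        final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@name='John']").get(0);

        // Get the element with id "dealList" and print it and its first child.
        HtmlElement he = page.getElementById("dealList");
        System.out.println(he.asXml());
        System.out.println(he.getFirstChild());
        System.out.println(he.getFirstChild().asXml());

        client.closeAllWindows();
    }
}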
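For the direct-request route, where the JSON endpoint requires a cookie, one option is to copy the `Cookie` header from a logged-in browser session and attach it to the request. The sketch below assumes Java 11 or later; the class name `CookieRequest`, the endpoint URL, and the cookie value `session=abc` are hypothetical placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CookieRequest {
    // Builds a GET request that carries a Cookie header copied from a browser session.
    static HttpRequest withCookie(String url, String cookieHeader) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", cookieHeader)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and cookie; replace both with real values before sending.
        HttpRequest request = withCookie("https://example.com/api/list", "session=abc");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());

        // Send only when explicitly asked to, so the sketch runs without network access.
        if (args.length > 0 && args[0].equals("send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }
}
```

Copied cookies expire, so for long-running scrapers simulating the login and storing the cookies the server returns is usually the more durable choice.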