mica-spider is a crawler tool built on mica-http. Its main feature at present is HTML parsing based on cglib and jsoup; more features will be added over time.
Usage
Maven
<dependency>
    <groupId>net.dreamlu</groupId>
    <artifactId>mica-spider</artifactId>
    <version>${version}</version>
</dependency>
Gradle
compile("net.dreamlu:mica-spider:${version}")
Note: please read the mica-http documentation first.
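DomMapper utility
DomMapper combines cglib dynamic proxies with Jsoup HTML parsing. In fewer than 200 lines of code it maps HTML to Java Beans, which makes it an essential tool for crawlers.
Main methods: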
- DomMapper.asDocument
- DomMapper.readDocument
- DomMapper.readValue
- DomMapper.readList
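CssQuery annotation
public @interface CssQuery {
    // CSS selector used to locate the element(s)
    String value();
    // what to read from the element; the examples below use "text", "title" and "href"
    String attr() default "";
    // optional regular expression applied to the value that was read
    String regex() default "";
    int DEFAULT_REGEX_GROUP = 0;
    // which regex group to keep; defaults to DEFAULT_REGEX_GROUP
    int regexGroup() default DEFAULT_REGEX_GROUP;
    // set to true for nested models, as in the List fields of the Oschina model below
    boolean inner() default false;
}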
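To illustrate how a mapper can discover rules declared with an annotation such as CssQuery, here is a minimal sketch that uses plain reflection. It is not the DomMapper implementation (which the text above says relies on cglib); the Query annotation, the Page class and the scan method are hypothetical names introduced only for this example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

class AnnotationScanSketch {

    // Hypothetical stand-in for CssQuery, reduced to two members.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Query {
        String value();
        String attr() default "";
    }

    // Hypothetical model, annotated in the same style as the models in this document.
    static class Page {
        @Query(value = "head > title", attr = "text")
        String title;

        @Query(value = "a", attr = "href")
        String href;

        String ignored; // no annotation, so the scan skips this field
    }

    // Collects "selector|attr" for every annotated field, keyed by field name.
    static Map<String, String> scan(Class<?> type) {
        Map<String, String> rules = new TreeMap<>();
        for (Field field : type.getDeclaredFields()) {
            Query query = field.getAnnotation(Query.class);
            if (query != null) {
                rules.put(field.getName(), query.value() + "|" + query.attr());
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        scan(Page.class).forEach((name, rule) -> System.out.println(name + " -> " + rule));
    }
}
```

A real mapper would go on to run each selector against the parsed document and write the result into the field; the sketch stops at collecting the rules.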
Example code
Crawling the OSChina home page
// fetch the page and map the response body onto the Oschina model
Oschina oschina = HttpRequest.get("https://www.oschina.net")
    .execute()
    .onSuccess(responseSpec -> responseSpec.asDomValue(Oschina.class));
if (oschina == null) {
    return;
}
System.out.println(oschina.getTitle());

System.out.println("Hot news");
List<VNews> vNews = oschina.getVNews();
for (VNews vNew : vNews) {
    System.out.println("title:\t" + vNew.getTitle());
    System.out.println("href:\t" + vNew.getHref());
    System.out.println("date:\t" + vNew.getDate());
}

System.out.println("Hot blogs");
List<VBlog> vBlogList = oschina.getVBlogList();
for (VBlog vBlog : vBlogList) {
    System.out.println("title:\t" + vBlog.getTitle());
    System.out.println("href:\t" + vBlog.getHref());
    System.out.println("reads:\t" + vBlog.getRead());
    System.out.println("comments:\t" + vBlog.getPing());
    System.out.println("likes:\t" + vBlog.getZhan());
}
Model 1
@Getter
@Setter
public class Oschina {

    @CssQuery(value = "head > title", attr = "text")
    private String title;

    @CssQuery(value = "#v_news .page .news", inner = true)
    private List<VNews> vNews;

    @CssQuery(value = ".blog-container .blog-list div", inner = true)
    private List<VBlog> vBlogList;

}
Model 2
@Setter
@Getter
public class VNews {

    @CssQuery(value = "a", attr = "title")
    private String title;

    @CssQuery(value = "a", attr = "href")
    private String href;

    @CssQuery(value = ".news-date", attr = "text")
    @DateTimeFormat(pattern = "MM/dd")
    private Date date;

}
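The date field above is parsed with the pattern "MM/dd", which carries no year. As a rough illustration of what that implies, the sketch below parses a made-up value with java.text.SimpleDateFormat. It assumes the conversion behaves like SimpleDateFormat, which this document does not confirm; MonthDayParseSketch and parseMonthDay are hypothetical names.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

class MonthDayParseSketch {

    // Hypothetical helper: parses text such as "03/15" with the pattern used on VNews.date.
    static Calendar parseMonthDay(String text) {
        try {
            Date date = new SimpleDateFormat("MM/dd").parse(text);
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(date);
            return calendar;
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a MM/dd value: " + text, e);
        }
    }

    public static void main(String[] args) {
        Calendar calendar = parseMonthDay("03/15"); // made-up sample value
        // The pattern has no year field, so SimpleDateFormat falls back to 1970.
        System.out.println(calendar.get(Calendar.YEAR));
        System.out.println(calendar.get(Calendar.MONTH) + 1);
        System.out.println(calendar.get(Calendar.DAY_OF_MONTH));
    }
}
```

If the real year matters, it has to come from elsewhere, for example by setting the current year on the parsed value after mapping.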
Model 3
@Getter
@Setter
public class VBlog {

    @CssQuery(value = "a", attr = "title")
    private String title;

    @CssQuery(value = "a", attr = "href")
    private String href;

    // read count: the leading digits of the span text
    @CssQuery(value = "span", attr = "text", regex = "^\\d+")
    private Integer read;

    // comment count: second number in the slash-separated span text
    @CssQuery(value = "span", attr = "text", regex = "(\\d*).*/(\\d*).*/(\\d*).*", regexGroup = 2)
    private Integer ping;

    // like count: third number in the slash-separated span text
    @CssQuery(value = "span", attr = "text", regex = "(\\d*).*/(\\d*).*/(\\d*).*", regexGroup = 3)
    private Integer zhan;

}
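The regex and regexGroup values in the VBlog model can be tried out in isolation with java.util.regex. The sketch below is only an approximation: it assumes the expression is applied with find() to the element text, the sample text is made up, and RegexGroupSketch.extract is a hypothetical helper rather than a mica-spider API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGroupSketch {

    // Hypothetical helper: returns the requested group of the first match, or null if nothing matches.
    static String extract(String text, String regex, int group) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(group) : null;
    }

    public static void main(String[] args) {
        String text = "1024 views/16 comments/8 likes"; // made-up sample text
        String counts = "(\\d*).*/(\\d*).*/(\\d*).*";

        System.out.println(extract(text, "^\\d+", 0)); // prints 1024 (group 0 is the whole match)
        System.out.println(extract(text, counts, 2));  // prints 16
        System.out.println(extract(text, counts, 3));  // prints 8
    }
}
```

Note that each (\\d*) group must start right after a slash. With text such as "1024 views / 16 comments / 8 likes", groups 2 and 3 match the empty string, so the expression needs adjusting if the page puts whitespace there.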