A Jsoup-Based Web Novel Crawler

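1. Regular Expressions

Purpose: check whether a string conforms to a given rule (for example, to find specific content), and validate the format of input data.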

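Character classes (each matches exactly one character)

[abc]	a, b, or c
[^abc]	any character except a, b, or c
[a-zA-Z]	a through z, or A through Z
[a-d[m-p]]	a through d, or m through p (union)
[a-z&&[^bc]]	a through z, except b and c (intersection)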

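Predefined character classes (each matches exactly one character)

.	any character (line terminators are excluded by default)
\d	a digit: [0-9]
\D	a non-digit: [^0-9]
\s	a whitespace character: [ \t\n\x0B\f\r]
\S	a non-whitespace character: [^\s]
\w	a word character (English letter, digit, or underscore): [a-zA-Z_0-9]
\W	a non-word character: [^\w]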

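Quantifiers

X?	X, once or not at all
X*	X, zero or more times
X+	X, one or more times
X{n}	X, exactly n times
X{n,}	X, at least n times
X{n,m}	X, at least n but not more than m times

Example: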

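// Check whether the entered URL is valid
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the URL of the novel to crawl");
        String url = sc.next();
        // matches() must match the whole string; literal dots are escaped as \\.
        if (url.matches("https?://www\\.aixiaxsw\\.com/\\S*")) {
            System.out.println("1");
        } else {
            System.out.println("0");
        }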
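
The character classes, predefined classes, and quantifiers listed above can be tried out with String.matches, which succeeds only when the whole string matches the pattern. The following is a minimal sketch; the class name RegexDemo, the helper isNovelUrl, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class RegexDemo {
    // Illustrative pattern for a URL on the site used in this post.
    private static final Pattern NOVEL_URL =
            Pattern.compile("https?://www\\.aixiaxsw\\.com/\\S*");

    // Hypothetical helper: true only if the whole string matches the pattern.
    static boolean isNovelUrl(String s) {
        return NOVEL_URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("b".matches("[abc]"));         // true: simple class
        System.out.println("d".matches("[^abc]"));        // true: negation
        System.out.println("n".matches("[a-d[m-p]]"));    // true: union
        System.out.println("b".matches("[a-z&&[^bc]]"));  // false: b is excluded
        System.out.println("2024".matches("\\d{4}"));     // true: exactly four digits
        System.out.println("ab_1".matches("\\w+"));       // true: one or more word characters
        System.out.println(isNovelUrl("https://www.aixiaxsw.com/105/105503/")); // true
    }
}
```

Compiling the pattern once with Pattern.compile avoids recompiling it on every call, which String.matches would do.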

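2. try, catch and throw

try: defines a block of code that is checked for errors while it runs. When an error occurs, an exception object of the corresponding type is thrown; the remaining statements in the try block are skipped and control transfers to the first catch block whose type matches the exception.
catch: catches the exception thrown in the try block according to its type and runs the code in the catch block. If no exception occurs, the try block finishes normally, every catch block is skipped, and execution continues with the first statement after the last catch block.
finally: runs after try…catch regardless of whether an exception occurred.

Example: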

try {
            document = Jsoup.connect(menuUrl).get();
        } catch (IllegalArgumentException e) {                         // the input is not a URL
            e.printStackTrace();
            System.out.println("Invalid input, please enter a URL");
            System.exit(1);
        } catch (UnknownHostException e) {                             // the host cannot be reached
            e.printStackTrace();
            System.out.println("Network problem, please check your connection");
            System.exit(1);
        }

Note: a try block can be followed by several catch blocks, one for each exception type to be handled.
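
To see the order in which try, catch, and finally run, here is a small sketch that records each step in a list. The class TryCatchDemo and its run method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TryCatchDemo {
    // Records which blocks ran, so the control flow is visible.
    static List<String> run(String input) {
        List<String> trace = new ArrayList<>();
        try {
            trace.add("try:start");
            int n = Integer.parseInt(input);          // may throw NumberFormatException
            trace.add("try:result=" + (100 / n));     // may throw ArithmeticException
        } catch (NumberFormatException e) {
            trace.add("catch:NumberFormatException");
        } catch (ArithmeticException e) {
            trace.add("catch:ArithmeticException");
        } finally {
            trace.add("finally");                     // runs in every case
        }
        trace.add("after");
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run("5"));   // [try:start, try:result=20, finally, after]
        System.out.println(run("abc")); // [try:start, catch:NumberFormatException, finally, after]
        System.out.println(run("0"));   // [try:start, catch:ArithmeticException, finally, after]
    }
}
```

In the two failing cases the rest of the try block is skipped, exactly one catch block runs, and finally runs either way.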

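throws: when a method can raise an exception that it does not handle itself, the exception has to be declared in the method header so that it is passed on to the caller for handling (this is mandatory for checked exceptions such as IOException). A method declared with throws does not handle that exception itself, for example: public static void main(String[] args) throws IOException { (from zxgg's source code)
throw: a throw statement raises an exception directly; it is followed by an exception object, that is, an instance of a Throwable subclass.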

I don't fully understand throws and throw yet; I only know how to use them a little ~~what a copy-paste master~~
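
The difference between the two keywords may be easier to see side by side. This is a minimal sketch, not taken from the crawler: the class ThrowDemo, the method fetch, and its rules for rejecting an address are all invented for the example.

```java
import java.io.IOException;

class ThrowDemo {
    // `throws` in the header: fetch does not handle IOException itself,
    // so every caller must catch it or declare it in turn.
    static String fetch(String url) throws IOException {
        if (url == null || url.isEmpty()) {
            // `throw` statement: raise an exception object right here.
            // IllegalArgumentException is unchecked, so it needs no `throws`.
            throw new IllegalArgumentException("url must not be empty");
        }
        if (!url.startsWith("http")) {
            throw new IOException("unsupported address: " + url);
        }
        return "fetched " + url;
    }

    public static void main(String[] args) {
        try {
            System.out.println(fetch("https://example.com/")); // fetched https://example.com/
            System.out.println(fetch("ftp://example.com/"));   // throws IOException
        } catch (IOException e) {
            System.out.println("caller handles it: " + e.getMessage());
        }
    }
}
```

In short, throw is a statement that raises one exception object, while throws is part of a method signature and only announces what the method may let escape.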

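Some exception types in Java

Arithmetic exception: ArithmeticException
Null pointer exception: NullPointerException
Class cast exception: ClassCastException
File not found exception: FileNotFoundException
Array index out of bounds exception: ArrayIndexOutOfBoundsException

Because different failures raise different exception types, the crawler can tell a network problem on the user's side (UnknownHostException) from an invalid URL (IllegalArgumentException), among others. (See the code above.)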
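
Each of the exception types listed above can be triggered on purpose, which is a quick way to get familiar with them. A small sketch under stated assumptions: the class ExceptionTypesDemo, the Action interface, and the helper thrownBy are hypothetical, and the file path is assumed not to exist.

```java
import java.io.FileInputStream;

class ExceptionTypesDemo {
    // Like Runnable, but allowed to throw checked exceptions.
    interface Action {
        void run() throws Exception;
    }

    // Runs the action and returns the simple name of the exception it threw.
    static String thrownBy(Action action) {
        try {
            action.run();
            return "none";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Integer division by zero
        System.out.println(thrownBy(() -> { int zero = 0; System.out.println(1 / zero); }));
        // Calling a method on a null reference
        System.out.println(thrownBy(() -> { String s = null; s.length(); }));
        // Casting a String to Integer
        System.out.println(thrownBy(() -> { Object o = "text"; Integer n = (Integer) o; }));
        // Opening a file that is assumed not to exist
        System.out.println(thrownBy(() -> new FileInputStream("no-such-dir/no-such-file.txt").close()));
        // Index 2 in an array of length 2
        System.out.println(thrownBy(() -> { int[] a = new int[2]; a[2] = 1; }));
    }
}
```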

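3. A Few Notes on Jsoup.connect

What Jsoup is: Jsoup is an HTML parser for Java, used mainly to parse HTML.

org.jsoup.Jsoup turns the input HTML into an org.jsoup.nodes.Document object, from which the desired elements can then be extracted. The Jsoup.connect method returns an org.jsoup.Connection object; calling get or post on that object executes the request and returns the resulting Document.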
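
The two entry points described above can be sketched as follows, assuming the jsoup library is on the classpath. The class JsoupBasics, the helper firstHeading, the sample HTML, and the user-agent and timeout values are illustrative choices, not part of the crawler; running main requires network access.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

class JsoupBasics {
    // Parse an HTML string into a Document and read its first <h1>.
    static String firstHeading(String html) {
        Document doc = Jsoup.parse(html);
        return doc.selectFirst("h1").text();
    }

    public static void main(String[] args) throws IOException {
        // Offline: HTML string -> Document -> element
        System.out.println(firstHeading("<html><body><h1>Title</h1><p>Author</p></body></html>"));

        // Online: connect() only builds the Connection; nothing is sent yet.
        Connection conn = Jsoup.connect("https://www.aixiaxsw.com/105/105503/")
                .userAgent("Mozilla/5.0")   // assumed value, for illustration
                .timeout(10_000);           // milliseconds
        // get() executes the HTTP GET request and parses the response.
        Document page = conn.get();
        System.out.println(page.title());
    }
}
```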

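4. Simple CSS Selectors

Selector	Example	Description
.class (class selector)	.intro	selects all elements with class="intro"
#id (id selector)	#firstname	selects the element with id="firstname"
* (universal selector)	*	selects all elements
element (type selector)	p	selects all <p> elements
element,element,.. (grouping selector)	div,p	selects all <div> elements and all <p> elements

Type selectors and grouping selectors need only the tag names; a grouping selector saves code by covering several tags in one query.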
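
Jsoup accepts these selectors through its select method, so they can be tried on a small HTML string without any network access. A minimal sketch, assuming jsoup is on the classpath; the class SelectorDemo, the helper count, and the sample HTML are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorDemo {
    // Sample markup: one <div> and two <p> elements.
    static final String HTML =
            "<div class='intro'>hello</div>"
            + "<p id='firstname'>Ann</p>"
            + "<p class='intro'>world</p>";

    // Number of elements inside <body> that match the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.body().select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count(".intro"));     // 2: the div and the second p
        System.out.println(count("#firstname")); // 1
        System.out.println(count("p"));          // 2
        System.out.println(count("div,p"));      // 3
    }
}
```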

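5. Product Overview

The program can crawl any novel on the 爱小说 website. At startup, entering 2 crawls the sample novel 《天灾之龙》; entering 1 asks the user for the URL of any novel on the site and then crawls it.
While crawling the table of contents, the crawler automatically adds a hyperlink to each entry, so that the novel's title, table of contents, and body text are laid out sensibly in the .md output.
It handles three error cases (an incorrect URL, input that is not a URL, and a poor network connection) and shows the user a clear message for each.
Code: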
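
The hyperlinked table of contents mentioned above boils down to writing one Markdown link per chapter. Here is a minimal sketch of that formatting step in isolation; the class MarkdownToc, its methods, and the chapter names and paths are hypothetical and only mirror the line format used by the crawler.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MarkdownToc {
    // One table-of-contents entry: a level-4 heading that links to the chapter.
    static String tocLine(String baseUrl, String chapterName, String subLink) {
        return "#### [" + chapterName + "](" + baseUrl + subLink + ")\n";
    }

    // Builds the whole table of contents from chapter name -> relative link.
    static String buildToc(String baseUrl, Map<String, String> chapters) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : chapters.entrySet()) {
            sb.append(tocLine(baseUrl, e.getKey(), e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the chapters in insertion order.
        Map<String, String> chapters = new LinkedHashMap<>();
        chapters.put("Chapter 1", "/105/105503/1.html"); // made-up paths
        chapters.put("Chapter 2", "/105/105503/2.html");
        System.out.print(buildToc("https://www.aixiaxsw.com", chapters));
    }
}
```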

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.*;
import java.net.UnknownHostException;
import java.util.Scanner;

/* Author:
 * 沈怀瑾
 */

public class spider {
    public static void main(String[] args) throws IOException {

        String menuUrl = "https://www.aixiaxsw.com/105/105503/";
        String menuUrl2 = "https://www.aixiaxsw.com";
        final String fileAddr = "./";
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter 1 to crawl the catalogue URL you type next, or 2 to crawl the sample novel 《天灾之龙》");
        if(sc.nextInt()==1){
            menuUrl = sc.next();
        }

        // Fetch the catalogue page to get the novel title, author, chapter names and chapter links
        Document document = null;
        try {
            document = Jsoup.connect(menuUrl).get();
        } catch (IllegalArgumentException e) {                         // the input is not a URL
            e.printStackTrace();
            System.out.println("Invalid input, please enter a URL");
            System.exit(1);
        } catch (UnknownHostException e) {                             // the host cannot be reached
            e.printStackTrace();
            System.out.println("Network problem, please check your connection");
            System.exit(1);
        }
        String title = null;
        String author = null;
        try {
            title = document.body().selectFirst("h1").text();
            author = document.body().selectFirst("p").text();
            System.out.println("Start crawling: " + title);
        } catch (Exception e) {
            // The page lacks the expected <h1> or <p>, so it is not a catalogue page
            System.out.println("Wrong URL, please enter a correct one!!");
            System.exit(1);
        }
        Elements menu = document.body().select("dl dd");
        Elements as = menu.select("a[href]");

        // Create the output file and open a UTF-8 writer on it

        System.out.println("The novel will be saved to: " + fileAddr + title + ".md");
        File file = new File(fileAddr + title + ".md");
        // try-with-resources closes the file even if a write fails;
        // a FileNotFoundException propagates through main's throws clause
        try (Writer fileOut = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), "UTF-8"))) {
            fileOut.write("# " + title + "\n\n");
            fileOut.write("##### " + author + "\n\n");

            // Both loops skip the first 9 links, as the original counters did
            final int skip = 9;

            // Write the table of contents, one Markdown link per chapter
            for (int i = skip; i < as.size(); i++) {
                Element a = as.get(i);
                String chapterName = a.text();
                String subLink = a.attr("href");
                fileOut.write("#### [" + chapterName + "](" + menuUrl2 + subLink + ")\n");
            }

            // Write the chapter text
            for (int i = skip; i < as.size(); i++) {
                Element a = as.get(i);
                String subLink = a.attr("href");
                String chapterName = a.text();
                System.out.println("Crawling chapter: " + chapterName);
                Document chapter;
                try {
                    chapter = Jsoup.connect(menuUrl2 + subLink).get();
                } catch (IOException e) {
                    // Skip a chapter that cannot be downloaded instead of failing on null
                    e.printStackTrace();
                    continue;
                }
                Element chapterContent = chapter.selectFirst("#content");
                if (chapterContent == null) {
                    continue; // the page has no #content element
                }
                fileOut.write("\n\n### " + chapterName + "\n\n " + chapterContent.text());
            }
        }

        System.out.println("Finished crawling the novel");
    }
}

沈怀瑾

软件园学生在线

[Backend 2] 沈怀瑾