What I learned

Regular expressions

A regular expression defines a pattern for strings. Regular expressions can be used to search, edit, or process text.
Online regex tester: https://regex101.com/
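Quantifiers I learned:
d? d occurs 0 or 1 times
a* a occurs 0 or more times
a+ a occurs 1 or more times
a{6} a occurs exactly 6 times
a{2,} a occurs 2 or more times
a{2,6} a occurs 2 to 6 times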
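The quantifiers above can be checked with a small sketch. It assumes Java's standard `java.util.regex` package; the class name `QuantifierDemo` and the helper `fullMatch` are made up for this illustration. `Pattern.matches` succeeds only when the whole input matches, which makes the counting rules easy to see.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("d?", ""));            // true: zero occurrences are allowed
        System.out.println(fullMatch("d?", "dd"));          // false: at most one d
        System.out.println(fullMatch("a*", ""));            // true: zero or more
        System.out.println(fullMatch("a+", ""));            // false: needs at least one a
        System.out.println(fullMatch("a{6}", "aaaaaa"));    // true: exactly six
        System.out.println(fullMatch("a{2,}", "a"));        // false: fewer than two
        System.out.println(fullMatch("a{2,6}", "aaaaaaa")); // false: seven is more than six
    }
}
```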
Matching a group of characters:
(ab)+ ab occurs 1 or more times
Alternation:
a (cat|dog) matches "a cat" or "a dog"
a cat|dog matches "a cat" or "dog"
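The difference between the two alternation patterns comes from where the parentheses limit the `|`. A minimal sketch, again assuming `java.util.regex` and a hypothetical helper named `fullMatch`:

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // True only when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Grouping: the quantifier applies to the whole group "ab".
        System.out.println(fullMatch("(ab)+", "ababab"));      // true
        System.out.println(fullMatch("(ab)+", "aba"));         // false: trailing a is not a full "ab"

        // With parentheses the alternatives are "cat" and "dog", both after "a ".
        System.out.println(fullMatch("a (cat|dog)", "a dog")); // true
        System.out.println(fullMatch("a (cat|dog)", "dog"));   // false

        // Without parentheses the alternatives are "a cat" and "dog".
        System.out.println(fullMatch("a cat|dog", "dog"));     // true
        System.out.println(fullMatch("a cat|dog", "a dog"));   // false
    }
}
```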
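Character classes:
[abc]+ one or more characters drawn from a, b, and c; matches abc and aabbcc
[a-zA-Z0-9] any single ASCII letter or digit, e.g. each character of ABCabc123
^ negates a class: [^0-9] matches any character other than 0-9 (including newlines)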
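A short sketch of the character classes above, assuming `java.util.regex`; the class name `CharClassDemo` and the helper `fullMatch` are hypothetical. The last two lines show that a negated class really does accept a newline.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True only when the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[abc]+", "aabbcc"));          // true
        System.out.println(fullMatch("[abc]+", "abcd"));            // false: d is not in the class
        System.out.println(fullMatch("[a-zA-Z0-9]+", "ABCabc123")); // true
        System.out.println(fullMatch("[a-zA-Z0-9]+", "abc_123"));   // false: underscore is not in the class
        System.out.println(fullMatch("[^0-9]", "\n"));              // true: newline is "not a digit"
        System.out.println(fullMatch("[^0-9]", "7"));               // false
    }
}
```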
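Metacharacters:
\d a digit; \d+ matches one or more digits
\D a non-digit character
\w a word character: letters, digits, and underscore
\W a non-word character
\s a whitespace character, including spaces, tabs, and newlines
\S a non-whitespace character
\b a word boundary: the start or end of a word, i.e. the position between a word character and a non-word character
\B a non-word boundary: a position between two word characters or between two non-word characters
. any character except a newline
\. a literal dot; the backslash escapes the metacharacter
^ matches the start of a line, $ matches the end of a line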
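The metacharacters can be tried out with the sketch below. It assumes `java.util.regex`; `MetacharDemo` and `findAll` are names invented for the example. Two Java-specific details: each backslash has to be doubled inside a string literal, and `^` and `$` only work line by line when the MULTILINE flag is set (here through the inline `(?m)`), otherwise they anchor to the whole input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetacharDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "tel 123 ext 45"));       // [123, 45]
        System.out.println(findAll("\\w+", "snake_case, x2!"));      // [snake_case, x2]
        System.out.println(findAll("\\S+", "a b\tc\nd"));            // [a, b, c, d]
        System.out.println(findAll("\\bcat\\b", "cat concat cat.")); // [cat, cat]
        System.out.println(findAll("\\Bcat", "cat concat"));         // [cat], the one inside "concat"
        System.out.println(findAll("a.b", "a.b axb a\nb"));          // [a.b, axb], "." skips the newline
        System.out.println(findAll("a\\.b", "a.b axb"));             // [a.b]
        System.out.println(findAll("(?m)^\\w+$", "one\ntwo words\nthree")); // [one, three]
    }
}
```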
*, +, and {} are greedy by default.
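A greedy quantifier matches as much as it can, so a pattern meant for a single tag ends up matching the whole string.
To match only the two tags, put ? after the quantifier (for example +?) to make it lazy.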
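Greedy versus lazy matching is easiest to see side by side. The input string `<b>bold</b>` and the names `GreedyLazyDemo` and `findAll` are assumptions chosen for this sketch, not taken from the original notes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLazyDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        // Greedy: .+ runs to the last '>' it can reach, so the whole string is one match.
        System.out.println(findAll("<.+>", html));  // [<b>bold</b>]
        // Lazy: .+? stops at the first '>' it can reach, so each tag is its own match.
        System.out.println(findAll("<.+?>", html)); // [<b>, </b>]
    }
}
```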

CSS selectors

CSS selectors are used to "find" (select) the HTML elements you want to style.
So far I only understand them in theory.

Building a jar in IDEA and running it

Problems encountered

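1. Only novels from the aixiaxsw.com (爱下书) site can be crawled

I wanted to make the crawler more general, but the URL entered by the user overlaps with the path extracted from the links on the page.
For example, https://www.aixiaxsw.com//114/114038/43214204.html becomes https://www.aixiaxsw.com//114/114038//114/114038/43214204.html
I have not found a fix yet, so the base address is hard-coded to the aixiaxsw.com site.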
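One possible way around the duplicated path is to resolve each link against the page URL instead of concatenating strings. The sketch below uses the standard `java.net.URI.resolve`; the table-of-contents URL `https://www.aixiaxsw.com/114/114038/` is inferred from the example above and is an assumption, and `UrlResolveDemo` and `resolve` are hypothetical names. This has not been tried against the real site.

```java
import java.net.URI;

class UrlResolveDemo {
    // Resolves href against the URL of the page it was found on (RFC 3986 rules).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed table-of-contents URL, inferred from the chapter URL in the text.
        String menuUrl = "https://www.aixiaxsw.com/114/114038/";

        // A root-relative href replaces the path of the page rather than being appended to it.
        System.out.println(resolve(menuUrl, "/114/114038/43214204.html"));
        // https://www.aixiaxsw.com/114/114038/43214204.html

        // A page-relative href is appended to the directory of the page.
        System.out.println(resolve(menuUrl, "43214204.html"));
        // https://www.aixiaxsw.com/114/114038/43214204.html
    }
}
```

jsoup offers the same resolution on elements through `absUrl("href")`, which uses the URL the document was fetched from as the base.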

2. Crawling starts from chapter 4

This is the strangest one: why are chapters 1 to 3 skipped automatically?
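A possible explanation, offered as a guess: the loop in the code skips the first 12 links of the list. If the page starts with a "latest chapters" block that holds fewer than 12 links, the fixed skip would also swallow the first real chapters. The sketch below assumes exactly that layout (links in the leading block appear again later in the full list) and drops only those repeated leading links. The hrefs, `ChapterListDemo`, and `dropLeadingRepeats` are all made up for illustration.

```java
import java.util.List;

class ChapterListDemo {
    // Returns hrefs without the leading links that appear again later in the list.
    static List<String> dropLeadingRepeats(List<String> hrefs) {
        int start = 0;
        while (start < hrefs.size()
                && hrefs.subList(start + 1, hrefs.size()).contains(hrefs.get(start))) {
            start++;
        }
        return hrefs.subList(start, hrefs.size());
    }

    public static void main(String[] args) {
        // Hypothetical page: two "latest" links, then the full list of five chapters.
        List<String> hrefs = List.of(
                "/5.html", "/4.html",
                "/1.html", "/2.html", "/3.html", "/4.html", "/5.html");

        // A fixed skip that is too large loses real chapters.
        System.out.println(hrefs.subList(4, hrefs.size())); // [/3.html, /4.html, /5.html]

        // Skipping only the repeated leading links keeps every chapter.
        System.out.println(dropLeadingRepeats(hrefs));
        // [/1.html, /2.html, /3.html, /4.html, /5.html]
    }
}
```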

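3. Paragraphs

I only add full-width spaces at the start of each chapter; I don't yet know how to restore the paragraph breaks within the text.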
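One way the paragraphs might be restored: jsoup's `text()` flattens everything into a single line, while `html()` keeps the markup. The sketch below assumes the site separates paragraphs with `<br>` tags (not verified) and works on the HTML string with the regular expressions from the notes above. `ParagraphDemo`, `toParagraphs`, and the sample HTML are hypothetical.

```java
class ParagraphDemo {
    // Converts an HTML fragment into plain-text paragraphs with a full-width indent.
    static String toParagraphs(String html) {
        String text = html
                .replaceAll("(?i)<br\\s*/?>", "\n") // each <br> becomes a line break
                .replaceAll("<[^>]+>", "")          // drop any remaining tags
                .replace("&nbsp;", " ");            // decode the most common entity
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            String paragraph = line.strip();
            if (!paragraph.isEmpty()) {
                // Two full-width spaces as indent, then a blank line between paragraphs.
                out.append("\u3000\u3000").append(paragraph).append("\n\n");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up stand-in for chapterContent.html().
        String html = "&nbsp;&nbsp;First paragraph.<br><br>&nbsp;&nbsp;Second paragraph.<br />";
        System.out.print(toParagraphs(html));
    }
}
```

In the crawler this would mean writing `toParagraphs(chapterContent.html())` to the file instead of `chapterContent.text()`.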

Code


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class spider {
    public static void main(String[] args) throws IOException {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the table-of-contents URL of the novel to crawl (aixiaxsw.com only)");
        String menuUrl = sc.next();
        // Site root that the chapter links are appended to.
        String baseUrl = "https://www.aixiaxsw.com/";
        final String fileAddr = "./";

        // Let a failed fetch end the program instead of continuing with a null document.
        Document document = Jsoup.connect(menuUrl).get();
        String title = document.body().selectFirst("h1").text();
        System.out.println("Start crawling: " + title);
        // Chapter links live in <dd> entries of the table of contents.
        Elements menu = document.body().select("dl dd");
        Elements as = menu.select("a[href]");

        System.out.println("The novel will be saved to: " + fileAddr + title + ".md");
        File file = new File(fileAddr + title + ".md");

        // try-with-resources closes the file even if a write fails midway.
        try (OutputStream fileOut = new FileOutputStream(file)) {
            fileOut.write(("# " + title + "\n\n").getBytes(StandardCharsets.UTF_8));
            int count = 1;

            for (Element a : as) {
                // Skip the first 12 links in the list.
                if (count <= 12) {
                    count++;
                    continue;
                }
                String subLink = a.attr("href");
                String chapterName = a.text();
                System.out.println("Crawling chapter: " + chapterName);
                Document chapter;
                try {
                    chapter = Jsoup.connect(baseUrl + subLink).timeout(10000).get();
                } catch (IOException e) {
                    // Report the failed chapter and move on to the next one.
                    e.printStackTrace();
                    continue;
                }
                Element chapterContent = chapter.selectFirst("#content");
                if (chapterContent == null) {
                    continue; // the page has no #content element
                }
                // Chapter heading, then the text prefixed with two full-width spaces.
                fileOut.write(("\n\n" + "## " + chapterName + "\n" + "\n\n " + "　　" + chapterContent.text())
                        .getBytes(StandardCharsets.UTF_8));
            }
        }
        System.out.println("Crawling finished");
    }

}

guocaogejunjing

Software Park Student Online (软件园学生在线)

[Backend 2] Ge Junjing (葛军靖)