Let's learn about web crawlers today...

Mad_Fish, 2022-10-23 22:12:52

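I've had a bad cold these past few days...

So I'm afraid I don't have the energy to write as much in this post as I usually do...

I'm really sorry about that...

That said,

I did study the course material carefully...

The crawling part of the code is, naturally, copied straight from the course.

On top of that, I made a few improvements: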

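1. Markdown layout ~~(which I think turned out reasonably readable)~~
2. ~~(Very crude)~~ URL validation
3. Custom selection of the chapters to crawl
4. Regular expressions to find and remove HTML tags, so the crawled text meets the requirements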

These are all quite basic things, really.

I didn't extend it any further... but at least I've built a crawler that meets the standard!
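For what it's worth, the URL check in item 2 could go a little beyond comparing a prefix. The sketch below is hypothetical and not part of the program that follows: `UrlCheck`, `isValidBookUrl`, and the assumed path shape `/<digits>/<digits>/` (inferred only from the default URL) are illustrative assumptions. It uses only `java.net.URI` from the standard library.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original program.
final class UrlCheck {
    static final String EXPECTED_HOST = "www.aixiaxsw.com";
    // Assumption: a book index page looks like /105/105503/ (two numeric segments).
    static final Pattern BOOK_PATH = Pattern.compile("^/\\d+/\\d+/?$");

    /** Returns true if the input parses as an https URL on the expected host with a book-like path. */
    static boolean isValidBookUrl(String input) {
        if (input == null) return false;
        try {
            URI uri = new URI(input.trim());
            String path = uri.getPath();
            return "https".equalsIgnoreCase(uri.getScheme())
                    && EXPECTED_HOST.equalsIgnoreCase(uri.getHost())
                    && path != null
                    && BOOK_PATH.matcher(path).matches();
        } catch (URISyntaxException e) {
            // Malformed input (for example, containing spaces) is simply rejected.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidBookUrl("https://www.aixiaxsw.com/105/105503/"));
        System.out.println(isValidBookUrl("http://www.aixiaxsw.com/105/105503/"));
        System.out.println(isValidBookUrl("not a url"));
    }
}
```

Parsing the URL first means the scheme, host, and path are checked separately, which is a bit stricter than a plain prefix comparison.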

Here's the code:

package org.example;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) throws IOException {
        Scanner sc=new Scanner(System.in);
        System.out.print("Enter a URL (enter 1 to crawl the default URL):\n");
        String url=null;

        //First, check that the URL is valid...
        while(true) {
            url = sc.next();
            final String URL="https://www.aixiaxsw.com/105/105503/";
            if (url.equals("1")) {
                url = URL;
                break;
            }
            else {
                //startsWith also covers inputs shorter than the prefix
                if(!url.startsWith("https://www.aixiaxsw.com/"))System.out.print("Please enter a valid URL\n");
                else break;//admittedly, this only checks that the prefix matches...
            }
        }

        //Next, choose which chapters to crawl
        System.out.print("Enter the chapters to crawl (two positive integers giving an inclusive range):\n");
        int a=0,b=0;
        while(true){
            a=sc.nextInt();
            b=sc.nextInt();
            if(a<1||b<1||a>b)System.out.print("Please try again\n");
            else break;
        }

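        //Next comes the usual crawling of the title and the author's name
        Document doc=Jsoup.connect(url).get();
        String title=doc.body().selectFirst("h1").text();
        String writer=doc.body().selectFirst("p").text();
        Elements tag=doc.body().select("dl dd a[href]");
        File f=new File("./"+(title+" (crawler)")+".md");
        OutputStream op=new FileOutputStream(f);
        //Write as UTF-8 explicitly so the file does not depend on the platform default charset
        op.write(("# ["+title+"]("+url+")\n#### "+writer+"\n").getBytes(java.nio.charset.StandardCharsets.UTF_8));
        //The output is formatted as Markdown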

        //Then start crawling the chapter contents...
        int ct=0;
        for(Element e:tag){

            ct++;
            if(ct-9<a)continue;//These two lines look odd. Can you guess why they are needed?
            if(ct-9>b)break;
            System.out.printf("Crawling chapter %d...\n",ct-9);

            String surl=e.attr("href");
            String stitle=e.text();
            Document docc=Jsoup.connect("https://www.aixiaxsw.com/"+surl).timeout(10000).get();
            Elements txt=docc.select("#content");
            String s=txt.toString();

            //Use regular expressions to strip HTML tags and HTML spaces (&nbsp;) in a simple way
            Pattern pa=Pattern.compile("<br>");
            Matcher ma=pa.matcher(s);
            s=ma.replaceAll("\n");
            pa=Pattern.compile("<[^>]+>");
            ma=pa.matcher(s);
            s=ma.replaceAll("");
            pa=Pattern.compile("(&nbsp;)+");
            ma=pa.matcher(s);
            s=ma.replaceAll(" ");

            //The output is formatted as Markdown
            op.write(("## ["+stitle+"]("+"https://www.aixiaxsw.com/"+surl+")"+"\n"+s+"\n---\n").getBytes(java.nio.charset.StandardCharsets.UTF_8));
        }
        System.out.print("Crawling finished\n");
        op.close();
    }
}
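The `ct-9` arithmetic in the loop above could also be written as a small helper that skips a fixed number of leading links and returns an inclusive chapter range. This is only a sketch under assumptions: `ChapterRange`, `select`, and the `skip` parameter are hypothetical names, and the value 9 is taken from the code as-is, since the post does not say what those leading links are.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original program.
final class ChapterRange {
    /**
     * Returns chapters first..last (1-based, inclusive) from links,
     * ignoring the first `skip` entries. Bounds past the end are clamped.
     */
    static <T> List<T> select(List<T> links, int skip, int first, int last) {
        if (skip < 0 || first < 1 || last < first) {
            throw new IllegalArgumentException("need skip >= 0 and 1 <= first <= last");
        }
        // Use long arithmetic so very large inputs cannot overflow before clamping.
        int from = (int) Math.min((long) links.size(), (long) skip + first - 1);
        int to = (int) Math.min((long) links.size(), (long) skip + last);
        return new ArrayList<>(links.subList(from, to));
    }

    public static void main(String[] args) {
        // Two leading non-chapter links followed by four chapters (made-up data).
        List<String> links = Arrays.asList("n1", "n2", "c1", "c2", "c3", "c4");
        System.out.println(select(links, 2, 2, 3));
    }
}
```

With `skip` set to 9, `select(tag, 9, a, b)` would pick the same elements that the `continue`/`break` pair keeps, and a request past the last chapter simply yields a shorter list.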

Hmm... I hope the evaluation won't be too harsh...

It took me a while to figure out how to use regular expressions...
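To make the regular-expression step easier to try on its own, here is a minimal standalone sketch of the same three replacements. `HtmlStripper` and `strip` are hypothetical names; the patterns are slightly widened (case-insensitive `<br>` with an optional slash, plus the literal non-breaking space character), which is an assumption beyond what the program does. Regex-based stripping like this is a simplification and will not cope with every piece of HTML.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original program.
final class HtmlStripper {
    // Line breaks: <br>, <br/>, <BR />, and so on.
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Any remaining tag: '<', one or more non-'>' characters, '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Runs of the &nbsp; entity or the raw non-breaking space character.
    private static final Pattern NBSP = Pattern.compile("(?:&nbsp;|\\u00A0)+");

    static String strip(String html) {
        String s = BR.matcher(html).replaceAll("\n"); // keep line structure
        s = TAG.matcher(s).replaceAll("");            // drop all other tags
        s = NBSP.matcher(s).replaceAll(" ");          // collapse HTML spaces
        return s;
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\">&nbsp;&nbsp;Line one<br>Line two</div>";
        System.out.println(strip(html));
    }
}
```

The order matters: `<br>` has to be converted to a newline before the generic tag pattern runs, otherwise the line breaks would be deleted along with every other tag.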

Mad_Fish

软件园学生在线 (Software Park Students Online)

[Backend Group 2] 冯羽
