
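I find writing crawler tutorials fun, so from now on I will update this blog from time to time and share the results of my work with you. If you also enjoy web scraping, feel free to get in touch! The tentative goal is to build flight-query crawlers for as many low-cost airline websites as possible. Along the way, you can learn from the source code how I got what I wanted starting from nothing. One reminder before we begin: some sites inevitably deploy anti-scraping measures, and in this article I also provide the scripts I used to get past them, but those scripts are for study and research only!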

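This article targets AirAsia (AK), one of the most popular carriers right now. Headquartered in Kuala Lumpur, Malaysia, it is Malaysia's second international airline and the largest low-cost carrier in Asia. Its main hub is Terminal 2 of Kuala Lumpur International Airport. The group's affiliates Thai AirAsia, Indonesia AirAsia, and Philippines AirAsia use Don Mueang International Airport in Bangkok, Soekarno-Hatta International Airport, and Clark International Airport as their respective hubs. Across the group, scheduled domestic and international flights serve more than 165 destinations in 25 countries. (Source: Wikipedia)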

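We will use Python 3 as the main language for the crawler. Only the query is implemented here: although I have implemented everything from query through seat holding, for commercial and copyright reasons I no longer provide the seat-holding source code or any related tutorial material. The query interface requires reversing one encryption step; only the script is provided here, without explanation. If this infringes on your rights, please contact me and I will cooperate in removing it!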

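Implementing this crawler requires the following three interfaces. Please set up your own Python development environment and install the modules listed below.
- https://k.airasia.com/basecurrency/{currency} fetches the currency (can be skipped for departures from mainland China);
- https://k.airasia.com/ssr/getssrdata fetches the encrypted information (on the new site, all later header encryption comes from this response; Cookie encryption is the exception);
- https://k.airasia.com/shopprice-pwa/1/0/{dep}/{arr}/{date}/{adt}/{chd}/{inf}/prom returns the flight data;
- Libraries/modules needed (Python 3): requests, random, threading, execjs (installed as PyExecJS), re, time, and so on. The proxy used in this article is socks5; switch to http/https yourself if needed.
- Because the interface uses a pseudo-member protocol, the returned prices differ slightly from what the browser shows when not logged in, generally about 10 yuan lower.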
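As a rough illustration of how the three endpoints fit together, here is a minimal sketch that only builds the URLs for one one-way query. The function name `build_urls` and the `suffix` parameter are hypothetical; the list above shows `prom` as the last path segment while the source below uses `40PRCT`, so it is left configurable. Following the source code, the `basecurrency` placeholder is filled with the departure airport code.

```python
BASE = "https://k.airasia.com"


def build_urls(dep, arr, date, adt=1, chd=0, inf=0, suffix="prom"):
    """Return the three endpoint URLs for one one-way query (sketch only)."""
    dep, arr = dep.upper(), arr.upper()
    return {
        # The source fills this placeholder with the departure airport code.
        "currency": f"{BASE}/basecurrency/{dep}",
        "ssr": f"{BASE}/ssr/getssrdata",
        "search": (
            f"{BASE}/shopprice-pwa/1/0/"
            f"{dep}/{arr}/{date}/{adt}/{chd}/{inf}/{suffix}"
        ),
    }


urls = build_urls("hkg", "kul", "2019-10-10")
for name, url in urls.items():
    print(name, url)
```

No request is sent here; the sketch only shows which values go into which path segment.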
Source: fetching the currency

    def Basecurrency(self):
        # Departure airports outside the known list get an empty result right away.
        if self.start_place not in self.place:
            dict_data = {
                "sessionId": self.sessionId,
                "status": "SUCCESS",
                "msg": "This airlines has not scheduled flight from {} to {} The current number of threads as:{} The Query Proxy as:{} The timeout as:{}".format(
                    self.start_place, self.end_place, threading.active_count(),
                    self.proxyInfo, round(time.perf_counter() - self.time_start, 2)),
                "pricedItineraries": [],
                "validTime": None,
                "needAdjust": True,
                "needPushPrice": False
            }
            return dict_data
        url = f'https://k.airasia.com/basecurrency/{self.start_place}'
        self.headers['Referer'] = 'https://www.airasia.com'
        self.headers["Origin"] = 'https://www.airasia.com'
        # Go through the proxy only when one is configured.
        if self.proxy is not None:
            response = self.Session.get(url, headers=self.headers, proxies=self.proxy, timeout=self.timeout / 2)
        else:
            response = self.Session.get(url, headers=self.headers, timeout=self.timeout / 2)
        response.encoding = 'utf-8'
        return response.text
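The "no result" payload with the same seven keys is built by hand in several places in the source. As a sketch of one way to avoid the repetition, a small helper could build it from a session id and a message. The names `empty_result` and `route_is_served` are hypothetical and do not appear in the original script.

```python
def empty_result(session_id, msg):
    """Build the 'no flights' payload used throughout the script (sketch)."""
    return {
        "sessionId": session_id,
        "status": "SUCCESS",
        "msg": msg,
        "pricedItineraries": [],
        "validTime": None,
        "needAdjust": True,
        "needPushPrice": False,
    }


def route_is_served(dep_code, served_airports):
    """Mirror the membership check done at the top of Basecurrency."""
    return dep_code.upper() in served_airports


served = {"HKG", "KUL", "DMK"}  # tiny stand-in for the full airport list
if not route_is_served("xyz", served):
    print(empty_result("demo-session", "No scheduled flight from XYZ"))
```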
Source: fetching the encrypted data

    def GetSsrData(self):
        url = 'https://k.airasia.com/ssr/getssrdata'
        self.headers["channel_hash"] = "ac9b6baea0f0cbd04b974fee6e2f3f2c29beac2e5814d66555191736"
        # Temporary header; main() removes it again before the search request.
        self.headers["ContentType"] = 'Content-Type'
        if self.proxy is not None:
            response = self.Session.post(url, headers=self.headers, proxies=self.proxy, json={}, timeout=self.timeout / 2)
        else:
            response = self.Session.post(url, headers=self.headers, json={}, timeout=self.timeout / 2)
        response.encoding = 'utf-8'
        return response.text
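In the complete source, `main` passes this response to the JS script and then scans the decoded text line by line, taking the second `|`-separated field of the first line that contains `search` as the `Authorization` header. A minimal sketch of that step follows. The sample text is entirely made up (the real decoded format is not documented in this article), and `pick_authorization` is a hypothetical name.

```python
def pick_authorization(decoded, keyword="search"):
    """Return the second '|'-separated field of the first line containing keyword."""
    for line in decoded.split("\n"):
        if keyword in line:
            parts = line.split("|")
            # Guard against a matching line that has no '|' separator.
            if len(parts) > 1:
                return parts[1]
    return None


# Hypothetical decoded text, for illustration only.
sample = "booking|token-a\nsearch|token-b\ncheckin|token-c"
token = pick_authorization(sample)
print(token)
```

Unlike the original loop, this sketch returns `None` instead of raising when no usable line is found, which makes the failure easier to report.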
Source: fetching the flight data

    def Search(self):
        url = f'https://k.airasia.com/shopprice-pwa/1/0/{self.start_place}/{self.end_place}/{self.depDate}/{self.adtCount}/{self.chdCount}/{self.infCount}/40PRCT'
        if self.proxy is not None:
            response = self.Session.get(url, headers=self.headers, proxies=self.proxy, timeout=self.timeout / 2)
        else:
            response = self.Session.get(url, headers=self.headers, timeout=self.timeout / 2)
        response.encoding = 'utf-8'
        return response.text
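All three request methods repeat the same proxy/no-proxy branch. One possible way to remove the duplication, and to make the socks5 versus http/https choice a single parameter, is to build the keyword arguments once. This is only a sketch: `request_kwargs` and the address `127.0.0.1:1080` are hypothetical, and socks5 proxies in requests need the optional SOCKS extra (`pip install requests[socks]`).

```python
def request_kwargs(headers, timeout, proxy_addr=None, scheme="socks5"):
    """Build keyword arguments for Session.get/post, with an optional proxy."""
    kwargs = {"headers": headers, "timeout": timeout / 2}
    if proxy_addr:
        # Use scheme="http" to go through a plain HTTP proxy instead.
        kwargs["proxies"] = {"https": f"{scheme}://{proxy_addr}"}
    return kwargs


direct = request_kwargs({"Accept": "*/*"}, 15)
proxied = request_kwargs({"Accept": "*/*"}, 15, proxy_addr="127.0.0.1:1080")
print(direct)
print(proxied)
```

A call would then look like `session.get(url, **request_kwargs(headers, timeout, proxy_addr))`, with no branching at the call site.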
Complete source

    import json
    import random
    import re
    import threading
    import time

    import execjs
    import requests


    class AK(object):
        def __init__(self, depCode, arrCode, depDate, adtCount, chdCount, infCount, sessionId, Code, Author):
            # Mainland China departure airports: priced in CNY, so the currency lookup is skipped.
            self.Side = [
                "PEK", "CAN", "PVG", "CTU", "CKG", "KWL", "HGH", "SWA", "KMG", "LHW", "KHN",
                "NKG", "NNG", "NGB", "JJN", "SZX", "SHE", "TSN", "WUH", "XIY", "CSX"]
            # Every station code the airline serves.
            self.place = [
                'M1B', 'M1A', 'AMD', 'ATQ', 'MFM', 'PDG', 'IXB', '3AA', 'BCD', 'DPS',
                'BTJ', 'BLR', 'UTP', '3AK', 'PKU', 'PEK', 'PEN', 'BBI', 'BNE', 'IXC',
                'CSX', 'MPH', 'CTU', 'CKG', 'CJM', 'KIX', 'DAC', 'DVO', 'TGG', 'TRZ',
                'NRT', 'HND', 'TST', 'TWU', 'FUK', 'PQC', 'PUS', 'KHH', 'KBR', '1AI',
                'KUA', 'CAN', 'KWL', 'KCH', 'GOI', 'GAU', 'HYD', 'HGH', 'HDY', 'HAN',
                'ROI', 'OOL', 'HHQ', 'SGN', 'CCU', '3AD', 'KBV', 'GAY', 'JED', 'SWA',
                'KUL', 'PNH', 'CJU', 'PLM', 'CGY', 'KLO', '3AG', 'CRK', 'CMB', 'COK',
                'KKC', 'PNK', 'KMG', 'LBJ', 'UNN', '1AD', 'LPQ', 'LGK', 'IXR', 'LHW',
                'LOE', '1AE', 'LOP', 'NST', '3AR', 'KJT', '3AE', 'MLE', 'MKZ', 'MDL',
                'DMK', 'MNL', 'MYY', 'BOM', 'KNO', 'BTU', 'NGO', 'AVV', '3AC', 'NAG',
                'KOP', 'NAW', 'LBU', 'KHN', 'NNT', 'NKG', 'NNG', '1AF', 'NGB', '1AB',
                'PHS', '1AC', 'PER', 'HKT', 'PPS', 'PNQ', '3AJ', 'CEI', 'CNX', 'VCA',
                'MAA', 'JJN', 'JOG', 'YIA', 'SYX', 'SNO', '1AJ', 'SRG', 'SDK', 'PVG',
                'SHE', 'SZX', 'SBW', 'ICN', 'SXR', 'SUB', '3AB', 'STV', 'URT', '3AI',
                '1AA', 'SOQ', 'SOC', 'CEB', 'TAG', 'TPE', 'TAC', 'HNL', '1AH', 'TRV',
                'TSN', 'UPG', 'BDO', 'VTZ', 'BWN', 'WUH', 'TJQ', 'BFV', '3AV', 'UTH',
                'UBP', 'XIY', 'DAD', 'HKG', 'REP', 'SDJ', '1AG', 'KOS', 'DTB', 'DEL',
                'SYD', 'SIN', 'JHB', 'BKI', 'CGK', 'AOR', 'RGN', 'CXR', 'IPH', 'ILO',
                'IDR', 'IMF', 'VTE', 'CTS', 'JAI']
            self.timeout = 15
            self.time_start = time.perf_counter()
            self.Session = requests.session()
            self.Code = execjs.compile(Code)
            self.Author = execjs.compile(Author)
            self.currency = 'CNY'
            # Pool of User-Agent strings; one is picked at random per instance.
            self.ua_list = [
                "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 5 Build/LMY48B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.78 Mobile Safari/537.36",
                "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1",
                "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Mobile Safari/537.36",
                "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Mobile Safari/537.36",
                "Mozilla/5.0 (Linux; Android 8.0.0; Pixel 2 XL Build/OPD1.170816.004) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Mobile Safari/537.36",
                "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
                "Mozilla/5.0 (iPad; CPU OS 11_0 like Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) Version/11.0 Mobile/15A5341f Safari/604.1",
                "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36",
                "Mozilla/5.0 (Linux; U; Android 8.0.0; zh-CN; MHA-AL00 Build/HUAWEIMHA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 UCBrowser/12.1.4.994 Mobile Safari/537.36",
                "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
                "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
                "Mozilla/5.0(iPad;U;CPUOS4_3_3likeMacOSX;en-us)AppleWebKit/533.17.9(KHTML,likeGecko)Version/5.0.2Mobile/8J2Safari/6533.18.5",
                "Mozilla/5.0(iPhone;U;CPUiPhoneOS4_3_3likeMacOSX;en-us)AppleWebKit/533.17.9(KHTML,likeGecko)Version/5.0.2Mobile/8J2Safari/6533.18.5",
                "MQQBrowser/26Mozilla/5.0(Linux;U;Android2.3.7;zh-cn;MB200Build/GRJ22;CyanogenMod-7)AppleWebKit/533.1(KHTML,likeGecko)Version/4.0MobileSafari/533.1",
                "Opera/9.80(Android2.3.4;Linux;OperaMobi/build-1107180945;U;en-GB)Presto/2.8.149Version/11.10",
                "Mozilla/5.0 (Mobile; rv:18.0) Gecko/18.0 Firefox/18.0",
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
                "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
                "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
                "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
                "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) ",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)"]
            self.proxyInfo = None
            # Default to no proxy so the later "is not None" checks never hit a missing attribute.
            self.proxy = None
            index = random.randint(1, 10)
            if index > 0:
                socks5 = ['0.0.0.0:0000', '0.0.0.0:0000']  # Placeholder: supply your own proxies
                self.proxyInfo = random.choice(socks5)
                self.proxy = {'https': f'socks5://{self.proxyInfo}@{self.proxyInfo}'}
            self.ua = random.choice(self.ua_list)
            self.headers = {
                'Content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Accept': '*/*',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'user-agent': self.ua,
                'X-Forwarded-For': '{}.{}.{}.{}'.format(
                    random.randint(1, 200), random.randint(0, 255),
                    random.randint(0, 255), random.randint(0, 248))
            }
            self.start_place = depCode.upper()
            self.end_place = arrCode.upper()
            self.depDate = depDate
            self.adtCount = int(adtCount)
            self.chdCount = int(chdCount)
            self.infCount = int(infCount)
            self.sessionId = sessionId

        def Basecurrency(self):
            # Departure airports outside the known list get an empty result right away.
            if self.start_place not in self.place:
                dict_data = {
                    "sessionId": self.sessionId,
                    "status": "SUCCESS",
                    "msg": "This airlines has not scheduled flight from {} to {} The current number of threads as:{} The Query Proxy as:{} The timeout as:{}".format(
                        self.start_place, self.end_place, threading.active_count(),
                        self.proxyInfo, round(time.perf_counter() - self.time_start, 2)),
                    "pricedItineraries": [],
                    "validTime": None,
                    "needAdjust": True,
                    "needPushPrice": False
                }
                return dict_data
            url = f'https://k.airasia.com/basecurrency/{self.start_place}'
            self.headers['Referer'] = 'https://www.airasia.com'
            self.headers["Origin"] = 'https://www.airasia.com'
            if self.proxy is not None:
                response = self.Session.get(url, headers=self.headers, proxies=self.proxy, timeout=self.timeout / 2)
            else:
                response = self.Session.get(url, headers=self.headers, timeout=self.timeout / 2)
            response.encoding = 'utf-8'
            return response.text

        def GetSsrData(self):
            url = 'https://k.airasia.com/ssr/getssrdata'
            self.headers["channel_hash"] = "ac9b6baea0f0cbd04b974fee6e2f3f2c29beac2e5814d66555191736"
            # Temporary header; main() removes it again before the search request.
            self.headers["ContentType"] = 'Content-Type'
            if self.proxy is not None:
                response = self.Session.post(url, headers=self.headers, proxies=self.proxy, json={}, timeout=self.timeout / 2)
            else:
                response = self.Session.post(url, headers=self.headers, json={}, timeout=self.timeout / 2)
            response.encoding = 'utf-8'
            return response.text

        def Search(self):
            url = f'https://k.airasia.com/shopprice-pwa/1/0/{self.start_place}/{self.end_place}/{self.depDate}/{self.adtCount}/{self.chdCount}/{self.infCount}/40PRCT'
            if self.proxy is not None:
                response = self.Session.get(url, headers=self.headers, proxies=self.proxy, timeout=self.timeout / 2)
            else:
                response = self.Session.get(url, headers=self.headers, timeout=self.timeout / 2)
            response.encoding = 'utf-8'
            return response.text

        def FormatFlight(self, sellKey):
            # Pull the flight designator, the dates and the times out of the sell key.
            sellKeyMatch = re.findall(r'(\w{2}~{1,}\s?\d{3,4}|\d{2}/\d{2}/\d{4}|\d{2}:\d{2})', sellKey)
            if len(sellKeyMatch) > 0:
                return sellKeyMatch
            else:
                return []

        def ParseDate(self, date, isFormat):
            # isFormat=True returns an ISO-style string, otherwise a local-time timestamp.
            es = time.strptime(date, '%m-%d-%Y %H:%M:%S')
            if isFormat:
                return time.strftime('%Y-%m-%dT%H:%M:%S', es)
            else:
                return time.mktime(es)

        def Prase(self, data):
            try:
                jsons = json.loads(data)
                if len(jsons["GetAvailability"]) == 0:
                    dict_data = {
                        "sessionId": self.sessionId,
                        "status": "SUCCESS",
                        "msg": "The choose days has not any scheduled flight. Please re-select,The Query Proxy as:{},The timeout as:{}".format(
                            self.proxyInfo, round(time.perf_counter() - self.time_start, 2)),
                        "pricedItineraries": [],
                        "validTime": None,
                        "needAdjust": True,
                        "needPushPrice": False
                    }
                    return dict_data
                flights = jsons["GetAvailability"][0]["FaresInfo"]
                itemlist = []
                for i in flights:
                    # Prefer the low fare; fall back to the premium flex fare.
                    Flight = i["BrandedFares"]["LowFare"]
                    if Flight is None or len(Flight) == 0:
                        Flight = i["BrandedFares"]["PremiumFlex"]
                    JourneySellKey = i["JourneySellKey"]
                    info = self.FormatFlight(JourneySellKey)
                    # Exactly five matches means a direct flight: designator, dep date/time, arr date/time.
                    if len(info) == 5:
                        AKCode = info[0].strip().split('~')[0]
                        FlightNum = info[0].strip().split('~')[1]
                        depData = info[1].replace('/', '-') + ' ' + info[2] + ":00"
                        arrData = info[3].replace('/', '-') + ' ' + info[4] + ":00"
                        item = {}
                        item["depCode"] = self.start_place
                        item["arrCode"] = self.end_place
                        item["airline_code"] = AKCode
                        item["flight_number"] = FlightNum.strip()
                        item["departureDateTime"] = self.ParseDate(depData, True)
                        item["arrivalDateTime"] = self.ParseDate(arrData, True)
                        item["elapsedTime"] = int((self.ParseDate(arrData, False) - self.ParseDate(depData, False)) / 60)
                        item["total_fare"] = float(Flight['TotalPrice'])
                        item["tax"] = float(Flight["FareItems"][-1]['tax'])
                        item["base_fare"] = round(item["total_fare"] - item["tax"], 2)
                        item["booking_classes"] = Flight["FareClassOfService"]
                        item["seatsRemaining"] = 9 if "AvailableCount" not in Flight else int(Flight["AvailableCount"])
                        item["currency"] = self.currency
                        item_data = {
                            "gdsSource": None,
                            "ipcc": None,
                            "currencyCode": item["currency"],
                            "pricingInfos": [
                                {
                                    "baseFare": item["base_fare"],
                                    "baseFareCurrency": item["currency"],
                                    "equivFare": item["base_fare"],
                                    "equivFareCurrency": item["currency"],
                                    "verifiedFare": None,
                                    "verifiedFareCurrency": None,
                                    "originalFare": item["base_fare"],
                                    "originalFareCurrency": item["currency"],
                                    "fareRuleFare": None,
                                    "fareRuleFareCurrency": None,
                                    "fareRuleId": None,
                                    "disablePriceCheck": None,
                                    "supplierMarkupFare": None,
                                    "supplierMarkupCurrency": None,
                                    "supplierMarkupInfo": None,
                                    "tax": item["tax"],
                                    "taxCurrency": item["currency"],
                                    "verifiedTax": None,
                                    "verifiedTaxCurrency": None,
                                    "originalTax": item["tax"],
                                    "originalTaxCurrency": item["currency"],
                                    "fareRuleTax": None,
                                    "fareRuleTaxCurrency": None,
                                    "totalFare": item["total_fare"],
                                    "totalFareCurrency": item["currency"],
                                    "passengerType": "ADULT",
                                    "passengerQuantity": self.adtCount,
                                    "airlineCode": None,
                                    "fareType": None,
                                    "changeFare": None,
                                    "changePercentage": None,
                                    "changeFareCurrency": None,
                                    "refundFare": None,
                                    "refundFareCurrency": None,
                                    "refundPercentage": None,
                                    "fareBasisCodes": [],
                                    "baggageInfos": [],
                                    "replBgs": None,
                                }],
                            "airItinerary": {
                                "airTripType": "ONE_WAY",
                                "originDestinationOptions": [
                                    {
                                        "elapsedTime": None,
                                        "flightSegments": [
                                            {
                                                "departureCode": self.start_place,
                                                "departureName": None,
                                                "departureTerminal": None,
                                                "departureDateTime": item["departureDateTime"],
                                                "departureTimeZone": None,
                                                "arrivalCode": self.end_place,
                                                "arrivalName": None,
                                                "arrivalTerminal": None,
                                                "arrivalDateTime": item["arrivalDateTime"],
                                                "arrivalTimeZone": None,
                                                "elapsedTime": item["elapsedTime"],
                                                "cabin": item["booking_classes"],
                                                "replCabin": None,
                                                "replClass": None,
                                                "cabinClass": None,
                                                "airEquipType": None,
                                                "marketingAirlineCode": item["airline_code"],
                                                "marketingAirlineName": None,
                                                "marketingFlightNumber": item["flight_number"],
                                                "operatingAirlineCode": item["airline_code"],
                                                "operatingFlightNumber": item["flight_number"],
                                                "mealCode": None,
                                                "stopQuantity": None,
                                                "stopLocationCode": None,
                                                "seatsRemaining": item["seatsRemaining"],
                                                "codeShare": None,
                                                "eTicket": None,
                                                "onTimePercent": None,
                                                "onTimeRate": None,
                                                "marriageGrp": None,
                                                "availabilitySource": None,
                                                "eticket": None
                                            }]
                                    }]
                            },
                            "validatingCarrier": item["airline_code"],
                            "validatingCarrierName": None,
                            "lastTicketingDate": None,
                            "data": None,
                            "createTime": time.strftime("%m-%d %H:%M:%S", time.localtime()),
                            "sk": None,
                            "st": None
                        }
                        itemlist.append(item_data)
                    else:
                        continue
                if len(itemlist) == 0:
                    dict_data = {
                        "sessionId": self.sessionId,
                        "status": "SUCCESS",
                        "msg": "The choose days has not any scheduled flight. Please re-select,The Query Proxy as:{},The timeout as:{}".format(
                            self.proxyInfo, round(time.perf_counter() - self.time_start, 2)),
                        "pricedItineraries": [],
                        "validTime": None,
                        "needAdjust": True,
                        "needPushPrice": False
                    }
                    return dict_data
                dict_data = {
                    "sessionId": self.sessionId,
                    "status": "SUCCESS",
                    "msg": "None,The Query Proxy as:{},The timeout as:{}".format(
                        self.proxyInfo, round(time.perf_counter() - self.time_start, 2)),
                    "pricedItineraries": itemlist,
                    "validTime": None,
                    "needAdjust": True,
                    "needPushPrice": False
                }
                return dict_data
            except Exception:
                # Any parsing failure is reported as "no flights".
                dict_data = {
                    "sessionId": self.sessionId,
                    "status": "SUCCESS",
                    "msg": "The choose days has not any scheduled flight. Please re-select,The Query Proxy as:{},The timeout as:{}".format(
                        self.proxyInfo, round(time.perf_counter() - self.time_start, 2)),
                    "pricedItineraries": [],
                    "validTime": None,
                    "needAdjust": True,
                    "needPushPrice": False
                }
                return dict_data

        def main(self):
            # Departures outside mainland China need their currency looked up first.
            if self.start_place not in self.Side:
                JsonData = self.Basecurrency()
                if not isinstance(JsonData, str):
                    return JsonData
                self.currency = JsonData
            JsonData = self.GetSsrData()
            try:
                # Decode the response with the JS script and pick the token for the search call.
                Authorization = self.Author.call('A', JsonData)
                encyle = Authorization.split('\n')
                for item in encyle:
                    if 'search' in item:
                        self.headers["Authorization"] = item.split('|')[1]
                        break
                self.headers["X-Custom-Flag"] = '1'
                self.headers["channel_hash"] = 'ac9b6baea0f0cbd04b974fee6e2f3f2c29beac2e5814d66555191736'
                self.headers.pop('ContentType')
                JsonData = self.Search()
                JsonData = self.Prase(JsonData)
                return JsonData
            except Exception:
                # Failures are swallowed here, so main() returns None.
                return None


    if __name__ == '__main__':
        with open('Authorization.js', 'r') as df:
            Author = df.read()
        with open('DotRezSignature.js', 'r') as df:
            Code = df.read()
        ak = AK("HKG", "KUL", "2019-10-10", 1, 0, 0, "dsgdsgsdg-dsgdsgds", Code, Author)
        bb = ak.main()
        print(bb)
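To make the `FormatFlight` and `ParseDate` steps easier to follow, here is a standalone sketch that applies the same regular expression to a made-up sell key. The sample string is an assumption about the key's shape, chosen only so that the regex yields the five fields `Prase` expects; `parse_sell_key` is a hypothetical name. The sketch swaps `time.mktime` for `datetime` arithmetic, so the duration does not depend on the local time zone of the machine running it.

```python
import re
from datetime import datetime

# Same pattern as FormatFlight: flight designator, dates, and times.
SELL_KEY_RE = re.compile(r'(\w{2}~{1,}\s?\d{3,4}|\d{2}/\d{2}/\d{4}|\d{2}:\d{2})')


def parse_sell_key(sell_key):
    """Extract carrier, flight number, times and duration from a sell key."""
    parts = SELL_KEY_RE.findall(sell_key)
    if len(parts) != 5:
        return None  # the original skips anything that is not a single segment
    pieces = parts[0].split("~")
    dep = datetime.strptime(f"{parts[1]} {parts[2]}", "%m/%d/%Y %H:%M")
    arr = datetime.strptime(f"{parts[3]} {parts[4]}", "%m/%d/%Y %H:%M")
    return {
        "airline_code": pieces[0].strip(),
        "flight_number": pieces[-1].strip(),
        "departureDateTime": dep.strftime("%Y-%m-%dT%H:%M:%S"),
        "arrivalDateTime": arr.strftime("%Y-%m-%dT%H:%M:%S"),
        # Minutes between the two wall-clock times (time zones are ignored).
        "elapsedTime": int((arr - dep).total_seconds() // 60),
    }


# Hypothetical sell key, for illustration only.
sample = "AK~ 123~ ~~HKG~10/10/2019 08:00~KUL~10/10/2019 12:05~~"
flight = parse_sell_key(sample)
print(flight)
```

Like the original, this treats departure and arrival as plain wall-clock times, so the duration is only exact when both airports share a time zone.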
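The attachment contains an executable JS script. Running this program returns JSON data; how you use it is up to you. To repeat: the JS script in the attachment is for study and research only, so do not abuse it. Next time I will cover another popular airline, Thai Lion Air (SL). Thanks for reading!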
Attachment download link
Original article link
it猫之家 official blog: Python Crawler Application Development Tutorial Series: AirAsia Flight Query Crawler