Issue Description
I am trying to build and run Sparkler from source, following the example given in the README. I have injected a URL and it is visible in Solr.
I run into a problem while crawling and see the error below:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ip-172-31-39-218.ap-southeast-1.compute.internal executor driver): org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
How to reproduce it
- Clone the main branch.
- Build sparkler-core.
- Modify /home/ubuntu/sparkler/sparkler-core/build/conf/sparkler-default.yaml:
crawldb.backend: solr # "solr" is default until "elasticsearch" becomes usable.
solr.uri: http://localhost:8983/solr/crawldb
- Run the following command to inject:
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main inject -id sjob-1 -su https://news.bbc.co.uk
- Run the following command to crawl:
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1
Additional changes: I have modified Crawler.scala and added the code below at line 171:
conf.set("spark.io.compression.codec", "snappy")
Please let me know how to pass Spark configuration at runtime so that I can avoid this code change.
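For what it's worth, SparkConf built with its no-arg constructor (loadDefaults = true) automatically copies in every JVM system property whose key starts with "spark.". So, assuming Sparkler constructs its SparkConf that way (I have not verified this in Crawler.scala), adding -Dspark.io.compression.codec=snappy to the java command above might make the code change unnecessary. A minimal sketch of that SparkConf behavior (the object name is mine):

import org.apache.spark.SparkConf

object SparkConfFromSysProps extends App {
  // Simulate launching the JVM with -Dspark.io.compression.codec=snappy
  System.setProperty("spark.io.compression.codec", "snappy")

  // The no-arg constructor uses loadDefaults = true, which copies every JVM
  // system property whose key starts with "spark." into the conf.
  val conf = new SparkConf()

  println(conf.get("spark.io.compression.codec")) // prints "snappy"
}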
Environment and Version Information
- Java Version:
  openjdk version "1.8.0_312"
  OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
  OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
- Spark Version: 3.0.3, Scala version 2.12.10
- Operating System name and version: AWS instance based on Ubuntu 20.04.1
- Solr: 8.5.0 (in local mode)
I see the ContentHash object in the sparkler-core code, but I do not see it being injected into Solr, so why is it expected while fetching? I see the same error in solr.log:
2022-01-31 04:44:54.871 ERROR (qtp1984990929-17) [ x:crawldb] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:109)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:977)
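A possible workaround I am considering, assuming the crawldb core uses Solr's managed schema and its schema simply predates the ContentHash feature, is to register the missing field via Solr's Schema API. A SolrJ sketch (the field type and flags are my guesses, not Sparkler's official schema definition):

import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.client.solrj.request.schema.SchemaRequest
import scala.collection.JavaConverters._

object AddContentHashField extends App {
  // Hypothetical workaround: add the 'contenthash' field to the crawldb core
  // via the Schema API. The field name comes from the error message; the
  // type/flags below are assumptions.
  val client = new HttpSolrClient.Builder("http://localhost:8983/solr/crawldb").build()
  try {
    val attrs: java.util.Map[String, AnyRef] = Map[String, AnyRef](
      "name"    -> "contenthash",
      "type"    -> "string",
      "indexed" -> java.lang.Boolean.TRUE,
      "stored"  -> java.lang.Boolean.TRUE
    ).asJava
    new SchemaRequest.AddField(attrs).process(client)
  } finally {
    client.close()
  }
}

Alternatively, recreating the crawldb core from the Solr configuration shipped with the current source tree, if one is provided, should pick up the full schema.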
Stack trace from the Sparkler crawl:
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManagerMaster - Updated info of block rdd_7_0
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManager - Told master about block rdd_7_0
04:44:54.880 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:665)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResource(SolrProxy.scala:121)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:158)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:37)
at scala.collection.Iterator.toStream(Iterator.scala:1417)
at scala.collection.Iterator.toStream$(Iterator.scala:1416)
at edu.usc.irds.sparkler.pipeline.FairFetcher.toStream(FairFetcher.scala:37)
at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:336)
at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:336)
at edu.usc.irds.sparkler.pipeline.FairFetcher.toSeq(FairFetcher.scala:37)
at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$3(Crawler.scala:258)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1418)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1345)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1409)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1230)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
04:44:54.908 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.executor.ExecutorMetricsPoller - removing (3, 0) from stageTCMP