Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=<>] unknown field 'contenthash' #247

@ravindrabajpai

Issue Description

I am trying to build and run Sparkler from source, following the example in the README. I have injected a URL, and it is visible in Solr.
Crawling then fails with the following error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ip-172-31-39-218.ap-southeast-1.compute.internal executor driver): org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'

How to reproduce it

  1. git clone the main branch.
  2. Build sparkler-core.
  3. Modify /home/ubuntu/sparkler/sparkler-core/build/conf/sparkler-default.yaml:
    crawldb.backend: solr  # "solr" is default until "elasticsearch" becomes usable.
    solr.uri: http://localhost:8983/solr/crawldb
  4. Run the following command to inject:
    java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main inject -id sjob-1 -su https://news.bbc.co.uk
  5. Run the following command to crawl:
    java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1

Additional changes: I modified Crawler.scala and added the following line at line 171:
conf.set("spark.io.compression.codec", "snappy")
Please let me know how to pass Spark configuration at runtime so that I can avoid this code change.
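
One possible route, not verified against Sparkler itself: a SparkConf created with its default constructor also loads JVM system properties whose names start with "spark.". Assuming Crawler.scala builds its SparkConf that way, the codec could be passed as a -D flag on the java command line instead of being hard-coded. The helper below is a hypothetical sketch (the name spark_jvm_flags is not part of Sparkler or Spark) that renders such flags from a dictionary of settings:

```python
from typing import Dict, List


def spark_jvm_flags(settings: Dict[str, str]) -> List[str]:
    """Render Spark settings as -Dkey=value JVM arguments.

    Hypothetical helper: it only formats strings and does not call Spark.
    """
    flags = []
    for key, value in sorted(settings.items()):
        # Only spark.* system properties are loaded by a default SparkConf.
        if not key.startswith("spark."):
            raise ValueError(f"not a Spark property: {key}")
        flags.append(f"-D{key}={value}")
    return flags
```

For the setting above this yields -Dspark.io.compression.codec=snappy, which would go before the main class name (edu.usc.irds.sparkler.Main) in the crawl command.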

Environment and Version Information

Please indicate relevant versions, including, if relevant:

  • Java Version
    openjdk version "1.8.0_312"
    OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
    OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

  • Spark Version - 3.0.3, Scala version 2.12.10

  • Operating System name and version - AWS Instance based on 20.04.1-Ubuntu

  • Solr - 8.5.0 (in local mode)

I see the content hash object in the sparkler-core code, but I do not see it being injected into Solr, so why is it expected while fetching? The same error appears in solr.log:

2022-01-31 04:44:54.871 ERROR (qtp1984990929-17) [   x:crawldb] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:109)
        at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:977)
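
The "unknown field" message comes from Solr's document builder, which suggests the crawldb schema does not define contenthash. A minimal sketch for confirming that through Solr's Schema API follows. The core URL is taken from solr.uri above; the function names are hypothetical, and only explicitly defined fields are checked, so a matching dynamic field pattern would not be detected.

```python
import json
from urllib.request import urlopen

# Assumption: same core URL as solr.uri in sparkler-default.yaml.
SOLR_CORE_URL = "http://localhost:8983/solr/crawldb"


def schema_fields_url(core_url: str) -> str:
    """Build the Schema API URL that lists explicitly defined fields."""
    return core_url.rstrip("/") + "/schema/fields?wt=json"


def has_field(schema_response: dict, field_name: str) -> bool:
    """Return True if the Schema API response lists field_name."""
    fields = schema_response.get("fields", [])
    return any(f.get("name") == field_name for f in fields)


def fetch_schema_fields(core_url: str) -> dict:
    """Fetch and decode the field list from a running Solr core."""
    with urlopen(schema_fields_url(core_url), timeout=10) as resp:
        return json.load(resp)
```

With Solr running, has_field(fetch_schema_fields(SOLR_CORE_URL), "contenthash") returning False would be consistent with the error above, and would point to the core having been created from a schema that lacks this field.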

Stack trace from the Sparkler crawl:

04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManagerMaster - Updated info of block rdd_7_0
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManager - Told master about block rdd_7_0
04:44:54.880 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:665)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
	at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResource(SolrProxy.scala:121)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:158)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:37)
	at scala.collection.Iterator.toStream(Iterator.scala:1417)
	at scala.collection.Iterator.toStream$(Iterator.scala:1416)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.toStream(FairFetcher.scala:37)
	at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:336)
	at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:336)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.toSeq(FairFetcher.scala:37)
	at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$3(Crawler.scala:258)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1418)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1345)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1409)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1230)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
04:44:54.908 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.executor.ExecutorMetricsPoller - removing (3, 0) from stageTCMP
