Apache PIG can only register Jars using local path on the head node - by John Diss

Status: External

This item may be valid but belongs to an external system out of the direct control of this product team. A more detailed explanation for the resolution of this particular item may have been provided in the comments section.

ID: 782898
Status: Closed
Type: Bug
Repros: 0
Opened: 4/3/2013 4:17:28 AM
Access Restriction: Public


From the spec: "Use the REGISTER statement inside a Pig script to specify a JAR file or a Python/JavaScript module. Pig supports JAR files and modules stored in local file systems as well as remote, distributed file systems such as HDFS and Amazon S3 (see Pig Scripts)."

When attempting to register a JAR residing on HDFS or ASV, Pig throws an error, e.g.:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: Local file 'hdfs://namenodehost:9000/Jar/piggybank-jd.jar' does not exist.
Posted by Microsoft on 5/17/2013 at 11:48 AM
Hi John, I was unable to reproduce this on an HDInsight cluster that I created. Here's what I did:

1. Copy a local JAR into ASV using the hadoop command line:
hadoop fs -copyFromLocal D:\temp\myudfs.jar asv://<container name>@<account name>.blob.core.windows.net/myudfs.jar

2. Register the JAR in Pig:
REGISTER 'asv://<container name>@<account name>.blob.core.windows.net/myudfs.jar'

3. Run a Pig query using the JAR:
A = LOAD '/user/test.log' USING PigStorage('\t') AS (name: chararray, number: int);
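
The LOAD statement in step 3 does not itself call anything from the JAR, so a fuller check that the registration worked might look like the sketch below. It assumes, purely for illustration, that myudfs.jar contains a UDF class named myudfs.UPPER that takes a chararray; that class name is hypothetical and does not appear in this report, so substitute whatever UDF the JAR actually provides.

```pig
-- Register the JAR from ASV (same placeholder path as in step 2).
REGISTER 'asv://<container name>@<account name>.blob.core.windows.net/myudfs.jar';

-- Load the tab-separated test data, as in step 3.
A = LOAD '/user/test.log' USING PigStorage('\t') AS (name: chararray, number: int);

-- Hypothetical UDF call: myudfs.UPPER is an assumed class name, not taken from this report.
B = FOREACH A GENERATE myudfs.UPPER(name), number;

DUMP B;
```

If the JAR could not be resolved, the failure would be expected at the REGISTER line or when the UDF is resolved, rather than at the LOAD.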

Notice that I put the JAR in ASV, not HDFS. When Pig starts, you can see that the file system it connects to is ASV, not HDFS:
2013-05-17 18:31:45,730 [main] INFO org.apache.pig.Main - Logging error messages to: C:\apps\dist\hadoop-1.1.0-SNAPSHOT\logs\pig_1368815505728.log
2013-05-17 18:31:46,021 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: asv://<container name>@<account name>.blob.core.windows.net

Given that Azure VMs can be reimaged or migrated, HDFS is better suited to temporary scratch space used internally by Hadoop components, while ASV should be used for persistent storage. Please let me know if you have any other questions.