
Apache Pig can only register JARs using a local path on the head node (reported by John Diss)


Status: Closed as External


Type: Bug
ID: 782898
Opened: 4/3/2013 4:17:28 AM
Access Restriction: Public
Workarounds: 1
Users who can reproduce this bug: 0

Description

From the Apache Pig spec: "Use the REGISTER statement inside a Pig script to specify a JAR file or a Python/JavaScript module. Pig supports JAR files and modules stored in local file systems as well as remote, distributed file systems such as HDFS and Amazon S3 (see Pig Scripts)."

When attempting to register a JAR residing on HDFS or ASV, Pig throws an error, e.g.:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: Local file 'hdfs://namenodehost:9000/Jar/piggybank-jd.jar' does not exist.
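For reference, the statement that triggers this error is a REGISTER with a remote URI (path taken from the error message above):

REGISTER 'hdfs://namenodehost:9000/Jar/piggybank-jd.jar';

Per the spec quoted above this should be accepted, but Pig instead resolves the URI against the local file system and fails.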
Details
Posted by Microsoft on 5/17/2013 at 11:48 AM
Hi John, I was unable to reproduce this on an HDInsight cluster that I created. Here's what I did:

1. Copy a local jar into asv using hadoop command-line:
hadoop fs -copyFromLocal D:\temp\myudfs.jar asv://<container name>@<account name>.blob.core.windows.net/myudfs.jar

2. Register the jar in pig:
REGISTER 'asv://<container name>@<account name>.blob.core.windows.net/myudfs.jar'

3. Run a pig query using the jar:
A = LOAD '/user/test.log' USING PigStorage('\t') AS (name: chararray, number: int);
DUMP A;
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Notice that I put the jar in ASV, not HDFS. When pig starts, you can see that the file system that is loaded is ASV, not HDFS:
c:\apps\dist\pig-0.9.3-SNAPSHOT\bin>pig
2013-05-17 18:31:45,730 [main] INFO org.apache.pig.Main - Logging error messages to: C:\apps\dist\hadoop-1.1.0-SNAPSHOT\logs\pig_1368815505728.log
2013-05-17 18:31:46,021 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: asv://<container name>@<account name>.blob.core.windows.net

Given that Azure VMs can be reimaged or migrated, HDFS is best treated as temporary scratch space for Hadoop components' internal use, while ASV should be used for persistent storage. Please let me know if you have any other questions.
Posted by John Diss on 4/3/2013 at 4:22 AM
Use hadoop fs -copyToLocal to copy the file from the remote file system to local storage, then reference the local path in your script.
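Concretely, the workaround above might look like this on the head node (the local destination D:\temp\piggybank-jd.jar is an illustrative placeholder; the HDFS path is the one from the error message):

hadoop fs -copyToLocal hdfs://namenodehost:9000/Jar/piggybank-jd.jar D:\temp\piggybank-jd.jar

Then, in the Pig script, register the JAR via its local path:

REGISTER 'D:\temp\piggybank-jd.jar';

Note this only works where the script runs on the head node itself, since the copied file must be on that machine's local disk.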
Attachment: HDInsightRegisterBug.txt (restricted), submitted 4/3/2013