Clang for fuzzy parsing C++

Posted 2020-06-07 08:29

Is it at all possible to parse C++ with incomplete declarations using clang's existing libclang API? That is, to parse a .cpp file without including all the headers, deducing declarations on the fly. For example, given the following text:

A B::Foo(){return stuff();}

Clang would detect the unknown symbol A and call my callback, which deduces that A is a class using my magic heuristic, then call the callback the same way for B, Foo and stuff. In the end I want to be able to infer that I saw a member Foo of class B returning A, and that stuff is a function, or something to that effect. Context: I want to see whether I can do sensible syntax highlighting and on-the-fly code analysis very quickly, without parsing all the headers.

[EDIT] To clarify, I'm looking for very heavily restricted C++ parsing, possibly with some heuristic to lift some of the restrictions.

C++ grammar is full of context dependencies. Is Foo() a function call or the construction of a temporary of class Foo? Is Foo<Bar> stuff; an instantiation of the template Foo<Bar> and a declaration of the variable stuff, or is it two weird-looking calls to overloaded operator< and operator>? It's only possible to tell in context, and the context often comes from parsing the headers.

What I'm looking for is a way to plug in my own convention rules. E.g. I know that I don't overload Win32 symbols, so I can safely assume that CreateFile is always a function, and I even know its signature. I also know that all my classes start with a capital letter and are nouns, and that functions are usually verbs, so I can reasonably guess that Foo and Bar are class names. In a more complex scenario, I know I don't write side-effect-free expressions like a < b > c; so I can assume that a is always a template instantiation. And so on.
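To make the kind of rule meant here concrete, below is a minimal sketch of such a convention-based guesser. It is independent of Clang, and everything in it is an assumption made for illustration: the names SymbolKind, knownFunctions and guessKind are hypothetical, and the table entries stand in for whatever project-specific list one would actually maintain.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical categories a heuristic might assign to an undeclared name.
enum class SymbolKind { Type, Function, Unknown };

// Hypothetical table of names that are functions by project convention
// (for example, Win32 entry points that are never overloaded).
inline const std::unordered_set<std::string>& knownFunctions() {
  static const std::unordered_set<std::string> names = {"CreateFile",
                                                        "CloseHandle"};
  return names;
}

// Guess what an undeclared identifier denotes.
// followedByParen says whether the next token is '('.
inline SymbolKind guessKind(const std::string& name, bool followedByParen) {
  assert(!name.empty());
  // Rule 1: the convention table wins over everything else.
  if (knownFunctions().count(name) != 0) return SymbolKind::Function;
  // Rule 2: class names start with a capital letter.
  if (std::isupper(static_cast<unsigned char>(name[0])) != 0)
    return SymbolKind::Type;
  // Rule 3: a lower-case name followed by '(' is taken to be a call.
  if (followedByParen) return SymbolKind::Function;
  return SymbolKind::Unknown;
}
```

Note that the capital-letter rule alone would classify the member Foo in B::Foo() as a type, so a usable heuristic would also need to look at the surrounding tokens.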

So, the question is whether it's possible to have the Clang API call back every time it encounters an unknown symbol, and to give it an answer using my own non-C++ heuristic. If my heuristic fails, then the parse fails, obviously. And I'm not talking about parsing the Boost library :) I'm talking about very simple C++, probably without templates, restricted to some minimum that clang can handle in this case.

4 Answers
乱世女痞
#2 · 2020-06-07 08:30

I know the question is fairly old, but have a look here:

LibFuzzy is a library for heuristically parsing C++ based on Clang's Lexer. The fuzzy parser is fault-tolerant, works without knowledge of the build system and on incomplete source files. As the parser necessarily makes guesses, the resulting syntax tree may be partially wrong.

It is a sub-project from clang-highlight, an (experimental?) tool which seems to be no longer developed.

I'm only interested in the fuzzy parsing part, so I forked it on my GitHub page, where I fixed several minor issues and made the tool autonomous (it can be compiled outside clang's source tree). Don't try to compile it with C++14 (which is G++ 6's default mode), because there will be conflicts with make_unique.

According to this page, clang-format has its own fuzzy parser (and is actively developed), but that parser was (is?) more tightly coupled to the tool.

家丑人穷心不美
#3 · 2020-06-07 08:36

OP doesn't want "fuzzy parsing". What he wants is full context-free parsing of the C++ source code, without any requirement for name and type resolution. He plans to make educated guesses about the types based on the result of the parse.

Clang proper tangles parsing and name/type resolution, which means it must have all that background type information available when it parses. Other answers suggest a LibFuzzy that produces incorrect parse trees, and some fuzzy parser for clang-format about which I know nothing. If one insists on producing a classic AST, none of these solutions will produce the "right" tree in the face of ambiguous parses.

Our DMS Software Reengineering Toolkit with its C++ front end can parse C++ source without the type information, and produces accurate "ASTs"; these are actually abstract syntax dags where forks in trees represent different possible interpretations of the source code according to a language-precise grammar (ambiguous (sub)parses).

What Clang tries to do is avoid producing these multiple sub-parses by using type information as it parses. What DMS does is produce the ambiguous parses and then, in an (optional) post-parsing (attribute-grammar evaluation) pass, collect symbol table information and eliminate the sub-parses that are inconsistent with the types; for well-formed programs, this produces a plain AST with no ambiguities left.
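The parse-then-prune idea can be illustrated with a toy model. The sketch below is not DMS's API or implementation; the types Interpretation and SymbolTable and the functions prune and onlyReading are hypothetical names invented for this illustration. It keeps both readings of Foo<Bar> stuff; and discards whichever one contradicts what is known about the names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical name categories; a real front end distinguishes many more.
enum class NameKind { Template, Variable };

using SymbolTable = std::map<std::string, NameKind>;

// One candidate reading of an ambiguous fragment, plus the name kinds
// that reading needs in order to be valid.
struct Interpretation {
  std::string label;
  SymbolTable needs;
};

// The two readings of `Foo<Bar> stuff;` discussed in the question.
inline std::vector<Interpretation> readingsOfFooBarStuff() {
  return {
      {"declare variable stuff of type Foo<Bar>",
       {{"Foo", NameKind::Template}}},
      {"evaluate (Foo < Bar) > stuff",
       {{"Foo", NameKind::Variable},
        {"Bar", NameKind::Variable},
        {"stuff", NameKind::Variable}}},
  };
}

// Keep only the readings that do not contradict the symbol table.
// A name missing from the table rules nothing out.
inline std::vector<Interpretation> prune(
    const std::vector<Interpretation>& readings, const SymbolTable& symbols) {
  std::vector<Interpretation> kept;
  for (const Interpretation& reading : readings) {
    bool consistent = true;
    for (const auto& need : reading.needs) {
      const auto found = symbols.find(need.first);
      if (found != symbols.end() && found->second != need.second) {
        consistent = false;
        break;
      }
    }
    if (consistent) kept.push_back(reading);
  }
  return kept;
}

// With enough symbol information, exactly one reading should survive.
inline const Interpretation& onlyReading(
    const std::vector<Interpretation>& kept) {
  assert(kept.size() == 1);
  return kept.front();
}
```

With an empty symbol table both readings survive, which is exactly the set of possibilities a heuristic would have to choose from; once Foo is known to be a template, only the declaration reading remains.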

If OP wants to make heuristic guesses about the type information, he will need to know these possible interpretations. If they are eliminated in advance, he cannot straightforwardly guess what types might be needed. An interesting possibility is the idea of modifying the attribute grammar (provided in source form as part of the DMS's C++ front end), which already knows all of C++ type rules, to do this with partial information. That would be an enormous head start over building the heuristic analyzer from scratch, given that it has to know about 600 pages of arcane name and type resolution rules from the standard.

You can see examples of the (dag) produced by DMS's parser.

我想做一个坏孩纸
#4 · 2020-06-07 08:38

Unless you heavily restrict the code that people are allowed to write, it is basically impossible to do a good job of parsing C++ (and hence syntax highlighting beyond keywords/regular expressions) without parsing all the headers. The pre-processor is particularly good at screwing things up for you.

There are some thoughts on the difficulties of fuzzy parsing (in the context of visual studio) here which might be of interest: http://blogs.msdn.com/b/vcblog/archive/2011/03/03/10136696.aspx

Juvenile、少年°
#5 · 2020-06-07 08:52

Here is another solution, which I think will suit the OP better than fuzzy parsing.

When parsing, clang maintains semantic information through the Sema part of the analyzer. When it encounters an unknown symbol, Sema falls back to an ExternalSemaSource to get information about that symbol. Through this hook, you could implement what you want.

Here is a quick example of how to set it up. It is not entirely functional (I'm not doing anything in the LookupUnqualified method), so you may need to investigate further, but I think it is a good start.

// Declares clang::SyntaxOnlyAction.
#include <clang/Frontend/FrontendActions.h>
#include <clang/Tooling/CommonOptionsParser.h>
#include <clang/Tooling/Tooling.h>
#include <llvm/Support/CommandLine.h>
#include <clang/AST/AST.h>
#include <clang/AST/ASTConsumer.h>
#include <clang/AST/RecursiveASTVisitor.h>
#include <clang/Frontend/ASTConsumers.h>
#include <clang/Frontend/CompilerInstance.h>
#include <clang/Rewrite/Core/Rewriter.h>
#include <llvm/Support/raw_ostream.h>
#include <clang/Sema/ExternalSemaSource.h>
#include <clang/Sema/Sema.h>
#include <clang/Sema/Lookup.h>
#include "clang/Basic/DiagnosticOptions.h"
#include "clang/Frontend/TextDiagnosticPrinter.h"
#include "clang/Basic/TargetOptions.h"
#include "clang/Basic/TargetInfo.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Lex/Preprocessor.h"
#include "clang/Basic/Diagnostic.h"
#include "clang/AST/ASTContext.h"
#include "clang/Parse/Parser.h"
#include "clang/Parse/ParseAST.h"

// std::make_shared, std::shared_ptr
#include <memory>

#include <iostream>
using namespace clang;
using namespace clang::tooling;
using namespace llvm;

class ExampleVisitor : public RecursiveASTVisitor<ExampleVisitor> {
private:
  ASTContext *astContext;

public:
  explicit ExampleVisitor(CompilerInstance *CI, StringRef file)
      : astContext(&(CI->getASTContext())) {}

  virtual bool VisitVarDecl(VarDecl *d) {
    std::cout << d->getNameAsString() << "@\n";
    return true;
  }
};

class ExampleASTConsumer : public ASTConsumer {
private:
  ExampleVisitor visitor;

public:
  explicit ExampleASTConsumer(CompilerInstance *CI, StringRef file)
      : visitor(CI, file) {}
  virtual void HandleTranslationUnit(ASTContext &Context) {
    // This applies the visitor to the entire translation unit.
    visitor.TraverseDecl(Context.getTranslationUnitDecl());
  }
};

class DynamicIDHandler : public clang::ExternalSemaSource {
public:
  DynamicIDHandler(clang::Sema *Sema)
      : m_Sema(Sema), m_Context(Sema->getASTContext()) {}
  ~DynamicIDHandler() = default;

  /// \brief Provides last resort lookup for failed unqualified lookups
  ///
  /// If there is failed lookup, tell sema to create an artificial declaration
  /// which is of dependent type. So the lookup result is marked as dependent
  /// and the diagnostics are suppressed. After that it's the interpreter's
  /// responsibility to fix all these fake declarations and lookups.
  /// It is done by the DynamicExprTransformer.
  ///
  /// @param[out] R The recovered symbol.
  /// @param[in] S The scope in which the lookup failed.
  virtual bool LookupUnqualified(clang::LookupResult &R, clang::Scope *S) {
    DeclarationName Name = R.getLookupName();
    std::cout << Name.getAsString() << "\n";
    // IdentifierInfo *II = Name.getAsIdentifierInfo();
    // SourceLocation Loc = R.getNameLoc();
    // VarDecl *Result =
    //     // VarDecl::Create(m_Context, R.getSema().getFunctionLevelDeclContext(),
    //     //                 Loc, Loc, II, m_Context.DependentTy,
    //     //                 /*TypeSourceInfo*/ 0, SC_None, SC_None);
    // if (Result) {
    //   R.addDecl(Result);
    //   // Say that we can handle the situation. Clang should try to recover
    //   return true;
    // } else{
    //   return false;
    // }
    return false;
  }

private:
  clang::Sema *m_Sema;
  clang::ASTContext &m_Context;
};

// *****************************************************************************/

LangOptions getFormattingLangOpts(bool Cpp03 = false) {
  LangOptions LangOpts;
  LangOpts.CPlusPlus = 1;
  LangOpts.CPlusPlus11 = Cpp03 ? 0 : 1;
  LangOpts.CPlusPlus14 = Cpp03 ? 0 : 1;
  LangOpts.LineComment = 1;
  LangOpts.Bool = 1;
  LangOpts.ObjC1 = 1;
  LangOpts.ObjC2 = 1;
  return LangOpts;
}

int main() {
  using clang::CompilerInstance;
  using clang::TargetOptions;
  using clang::TargetInfo;
  using clang::FileEntry;
  using clang::Token;
  using clang::ASTContext;
  using clang::ASTConsumer;
  using clang::Parser;
  using clang::DiagnosticOptions;
  using clang::TextDiagnosticPrinter;

  CompilerInstance ci;
  ci.getLangOpts() = getFormattingLangOpts(false);
  DiagnosticOptions diagnosticOptions;
  ci.createDiagnostics();

  std::shared_ptr<clang::TargetOptions> pto = std::make_shared<clang::TargetOptions>();
  pto->Triple = llvm::sys::getDefaultTargetTriple();

  TargetInfo *pti = TargetInfo::CreateTargetInfo(ci.getDiagnostics(), pto);

  ci.setTarget(pti);
  ci.createFileManager();
  ci.createSourceManager(ci.getFileManager());
  ci.createPreprocessor(clang::TU_Complete);
  ci.getPreprocessorOpts().UsePredefines = false;
  ci.createASTContext();

  ci.setASTConsumer(
      llvm::make_unique<ExampleASTConsumer>(&ci, "../src/test.cpp"));

  ci.createSema(TU_Complete, nullptr);
  auto &sema = ci.getSema();
  sema.Initialize();
  DynamicIDHandler handler(&sema);
  sema.addExternalSource(&handler);

  const FileEntry *pFile = ci.getFileManager().getFile("../src/test.cpp");
  ci.getSourceManager().setMainFileID(ci.getSourceManager().createFileID(
      pFile, clang::SourceLocation(), clang::SrcMgr::C_User));
  ci.getDiagnosticClient().BeginSourceFile(ci.getLangOpts(),
                                           &ci.getPreprocessor());
  clang::ParseAST(sema,true,false);
  ci.getDiagnosticClient().EndSourceFile();

  return 0;
}

The idea and the DynamicIDHandler class come from the cling project, where unknown symbols are treated as variables (hence the comments and the commented-out code).
